From c174c0c8c946a7fb9f7f4047cc2e34343131d76a Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Sun, 15 Mar 2026 19:39:50 +0000
Subject: [PATCH] auto-fix: strip 2 broken wiki links

Pipeline auto-fixer: removed [[ ]] brackets from links that don't
resolve to existing claims in the knowledge base.
---
 inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md b/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
index fd65dd02e..f33859d1d 100644
--- a/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
+++ b/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
@@ -41,7 +41,7 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford,
 
 ## Agent Notes
 
-**Why this matters:** This is the formal impossibility result our KB has been gesturing at. Our claim [[RLHF and DPO both fail at preference diversity]] is an informal version of this trilemma. The formal result is stronger — it's not just that current implementations fail, it's that NO RLHF system can simultaneously achieve all three properties. This is analogous to the CAP theorem for distributed systems.
+**Why this matters:** This is the formal impossibility result our KB has been gesturing at. Our claim RLHF and DPO both fail at preference diversity is an informal version of this trilemma. The formal result is stronger — it's not just that current implementations fail, it's that NO RLHF system can simultaneously achieve all three properties. This is analogous to the CAP theorem for distributed systems.
 
 **What surprised me:** The paper does NOT directly reference Arrow's theorem despite the structural similarity. The trilemma is proven through complexity theory rather than social choice theory. This is an independent intellectual tradition arriving at a compatible impossibility result — strong convergent evidence.
 
@@ -50,7 +50,7 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford,
 
 **KB connections:**
 - [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — this paper FORMALIZES our existing claim
 - [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — independent confirmation from complexity theory
-- [[scalable oversight degrades rapidly as capability gaps grow]] — the trilemma shows degradation is mathematically necessary
+- scalable oversight degrades rapidly as capability gaps grow — the trilemma shows degradation is mathematically necessary
 
 **Extraction hints:** Claims about (1) the formal alignment trilemma as impossibility result, (2) preference collapse / sycophancy / bias amplification as computational necessities, (3) the 10^3 vs 10^8 representation gap in current RLHF.
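
For reference, a minimal sketch of the bracket-stripping rule the commit message describes, assuming the fixer resolves a link by exact title match against the set of existing claim titles. The names `strip_broken_wiki_links` and `existing_claims` are illustrative assumptions, not the pipeline's actual code.

```python
import re

# Matches a [[wiki link]]; capture group 1 is the claim title.
WIKI_LINK = re.compile(r"\[\[([^\[\]]+)\]\]")

def strip_broken_wiki_links(text: str, existing_claims: set[str]) -> str:
    """Unwrap [[title]] to plain title when the title does not resolve
    to an existing claim (assumed resolution rule: exact title match)."""
    def fix(match: re.Match) -> str:
        title = match.group(1)
        # Keep the brackets only if the link resolves.
        return match.group(0) if title in existing_claims else title
    return WIKI_LINK.sub(fix, text)

# Example: the second link is broken, so only its brackets are removed.
claims = {"RLHF and DPO both fail at preference diversity"}
note = "[[RLHF and DPO both fail at preference diversity]] vs [[a missing claim]]"
print(strip_broken_wiki_links(note, claims))
# -> [[RLHF and DPO both fail at preference diversity]] vs a missing claim
```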