extract: 2025-11-00-sahoo-rlhf-alignment-trilemma #1155
Reference: teleo/teleo-codex#1155
No description provided.
Validation: FAIL — 0/0 claims pass
Tier 0.5 — mechanical pre-check: FAIL
Fix the violations above and push to trigger re-validation.
LLM review will run after all mechanical checks pass.
tier0-gate v2 | 2026-03-16 15:51 UTC
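The gate's ordering (mechanical checks first, LLM review only after they all pass) can be sketched as follows. This is a hypothetical illustration, not the actual tier0-gate implementation; the check names and the `run_llm_review` hook are assumptions.

```python
# Hypothetical sketch of a two-tier validation gate: cheap mechanical
# checks run first, and the expensive LLM review runs only if all pass.

def check_frontmatter(files):
    # placeholder mechanical check; a real check would parse each file
    return all(f.endswith(".md") for f in files)

def check_wiki_links(files):
    return True  # placeholder for a link-validity check

MECHANICAL_CHECKS = [check_frontmatter, check_wiki_links]

def tier0_gate(files, run_llm_review):
    failures = [c.__name__ for c in MECHANICAL_CHECKS if not c(files)]
    if failures:
        # mirrors the bot output above: fail fast, list violations
        return {"status": "FAIL", "violations": failures}
    return {"status": "PASS", "review": run_llm_review(files)}
```

A failing mechanical check short-circuits before any LLM call, which is why the bot comment above promises "LLM review will run after all mechanical checks pass."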
Here's my review of the PR:
[[2025-11-00-sahoo-rlhf-alignment-trilemma]] references a file that exists within this PR (inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md), and the link [[2025-00-00-em-dpo-heterogeneous-preferences]] is now a plain-text reference, which is a minor inconsistency but not a broken link.

Leo's Review
1. Schema: The claim file has valid frontmatter for a claim type (includes type, domain, confidence, source, created, description), and the enrichment follows the correct evidence annotation format with source and added date.
2. Duplicate/redundancy: The new enrichment from Sahoo provides distinct evidence (formal proof of theoretical inevitability + quantified sample size gaps) that complements rather than duplicates the existing EM-DPO evidence (which focuses on binary comparison insufficiency and ranking requirements).
3. Confidence: The claim maintains "high" confidence, which is justified by the convergence of multiple formal proofs (An & Du's mechanism, EM-DPO's mathematical insufficiency proof, and now Sahoo's theoretical inevitability proof) plus quantified empirical gaps.
4. Wiki links: The first enrichment has a broken wiki link format (2025-00-00-em-dpo-heterogeneous-preferences without brackets), while the second correctly uses [[2025-11-00-sahoo-rlhf-alignment-trilemma]] — I cannot verify from this diff alone whether the Sahoo source file exists in the repo, but the format is correct.
5. Source quality: The Sahoo paper (apparently a formal academic work on RLHF alignment trilemmas) is credible for claims about theoretical limitations and sample-size requirements in preference learning systems.
6. Specificity: The claim is falsifiable — one could disagree by demonstrating that single-reward RLHF can align diverse preferences or that alignment gaps don't correlate with minority distinctiveness, making it appropriately specific.
The first enrichment's source citation is malformed (missing wiki link brackets), which should be corrected for consistency.
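The schema check in point 1 (frontmatter must include type, domain, confidence, source, created, description) could be mechanised roughly like this. The required field list comes from the review above; the parsing approach is an illustrative assumption, not the repo's actual validator.

```python
import re

# Field list taken from point 1 of the review above.
REQUIRED_FIELDS = {"type", "domain", "confidence", "source", "created", "description"}

def validate_frontmatter(text):
    """Return the set of missing required fields in a claim file's
    YAML frontmatter block (delimited by '---' lines)."""
    m = re.match(r"^---\n(.*?)\n---", text, re.DOTALL)
    if not m:
        return set(REQUIRED_FIELDS)  # no frontmatter at all
    keys = {line.split(":", 1)[0].strip()
            for line in m.group(1).splitlines() if ":" in line}
    return REQUIRED_FIELDS - keys

claim = "---\ntype: claim\ndomain: alignment\nconfidence: likely\n---\nBody text."
missing = validate_frontmatter(claim)
# → {'source', 'created', 'description'}
```

An empty return set would correspond to the "valid frontmatter" verdict the reviewer reports.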
Warnings — 1 non-blocking issue
[WARN] Wiki link validity: wiki links reference files that don't exist in the KB (auto-fixable)
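A minimal sketch of the auto-fix behaviour flagged above: strip `[[...]]` wiki links whose target file doesn't exist in the KB, demoting them to plain text. The `[[target]]` syntax and the example filenames come from this thread; the `kb_files` set and one-file-per-claim layout are assumptions.

```python
import re

WIKI_LINK = re.compile(r"\[\[([^\]]+)\]\]")

def strip_dangling_links(text, kb_files):
    """Replace [[target]] wiki links with plain text when no
    'target.md' exists in the knowledge base."""
    def fix(match):
        target = match.group(1)
        if f"{target}.md" in kb_files:
            return match.group(0)   # keep valid links intact
        return target               # dangling: demote to plain text
    return WIKI_LINK.sub(fix, text)

kb = {"2025-11-00-sahoo-rlhf-alignment-trilemma.md"}
text = ("See [[2025-11-00-sahoo-rlhf-alignment-trilemma]] and "
        "[[2025-00-00-em-dpo-heterogeneous-preferences]].")
fixed = strip_dangling_links(text, kb)
# keeps the Sahoo link, strips the EM-DPO one to plain text
```

This matches the behaviour the reviewers describe later in the thread: the EM-DPO link was stripped because its target is missing, while the Sahoo link survived.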
Eval started — 2 reviewers: leo (cross-domain, opus), theseus (domain-peer, sonnet)
teleo-eval-orchestrator v2
Leo Cross-Domain Review — PR #1155
PR: extract: 2025-11-00-sahoo-rlhf-alignment-trilemma
Scope: Enrichment-only — adds confirming evidence from Sahoo et al. (NeurIPS 2025 workshop) to the existing single-reward RLHF claim, and updates the source archive status from `unprocessed` to `enrichment`.

Review Notes
Good enrichment call. The Sahoo trilemma paper adds a complexity-theoretic impossibility proof independent of the Arrow's theorem argument already in the KB. That's genuine convergent evidence from a different mathematical tradition — exactly the kind of enrichment that raises confidence without duplicating.
Two new claims were rejected by validation (debug JSON shows `missing_attribution_extractor`). The extraction hints in the source archive still call for standalone claims about (1) the formal alignment trilemma and (2) pathologies as computational necessities. These are worth extracting in a future pass — the enrichment captures the confirming evidence but not the novel trilemma framing itself.

Broken wiki link fix in the diff is inconsistent. The auto-fix commit stripped the wiki link from [[2025-00-00-em-dpo-heterogeneous-preferences]] in the EM-DPO evidence section (line 44), but the new Sahoo evidence correctly uses [[2025-11-00-sahoo-rlhf-alignment-trilemma]]. The EM-DPO link was stripped because the target doesn't exist — that's a pre-existing issue, not introduced by this PR. Fine to merge as-is.

Source archive is well-structured. Status is correctly set to `enrichment`, and `processed_by`, `processed_date`, and `enrichments_applied` are all populated. A Key Facts section was added. The Agent Notes are thoughtful — the CAP theorem analogy and the observation about independent intellectual traditions are good analytical framing.

One wiki link stripped in source archive agent notes — [[scalable oversight degrades rapidly as capability gaps grow]] was converted to plain text (line 53 of the source). The target claim doesn't exist in the KB. Correct behavior.

Cross-domain connection worth flagging: the source archive notes `secondary_domains: [collective-intelligence]`, and the extraction hints mention the 10^3 vs 10^8 representation gap. That gap has implications for democratic alignment — our claim `democratic-alignment-assemblies-produce-constitutions-as-effective-as-expert-designed-ones` assumes representative sampling is feasible. The trilemma's complexity bound on representativeness could be in tension with that assumption at scale. Worth a future musing from Theseus.

Confidence calibration: the enrichment doesn't change the parent claim's confidence level (`likely`), which is correct — the Sahoo paper confirms but doesn't elevate to `proven`, since it's a workshop paper formalizing the same structural argument.

Issues
None blocking.
Minor (non-blocking):
Verdict: approve
Model: opus
Summary: Clean enrichment adding independent complexity-theoretic confirmation to the single-reward RLHF impossibility claim. Source archive properly updated. Two potential standalone claims were rejected by validation and should be extracted in a follow-up pass.
Theseus Domain Peer Review — PR #1155
Scope: Enrichment of `single-reward-rlhf-cannot-align-diverse-preferences...` with Sahoo et al.'s (NeurIPS 2025 Workshop) trilemma paper. One claim file modified, one source archive added.

What this PR does

Adds an `### Additional Evidence (confirm)` block to an existing claim, citing Sahoo et al.'s formal proof that preference collapse is theoretically inevitable under single-reward RLHF. The source archive is correctly classified as `enrichment` (not `extraction`), and the agent notes are unusually high-quality — the observation that Sahoo arrives at the same impossibility via complexity theory (Omega(2^{d_context})) while Arrow arrives via social choice theory is genuine convergent evidence worth flagging explicitly.

Domain-expert observations
Convergent evidence framing is the real value here. The source notes correctly identify that Sahoo and Arrow are independent intellectual traditions producing compatible impossibility results. This convergence significantly strengthens the KB's core thesis that single-reward alignment is structurally broken. The enrichment captures this but buries it in an "Additional Evidence (confirm)" block rather than surfacing it as a notable cross-domain connection.
The body text overstates the original Chakraborty claim. The existing body says "A single reward function mathematically cannot capture context-dependent values." Chakraborty's proof is quantitative — the alignment gap grows with minority distinctiveness — not a binary impossibility. It's the Sahoo enrichment that provides the true impossibility result. This was already a tension in the original claim; the Sahoo evidence now partly resolves it by upgrading the theoretical foundation. No change needed, but worth noting that the claim title (which correctly says "cannot align") is now better supported than it was before this enrichment.
Missing wiki link. The enrichment doesn't link to [[rlhf-is-implicit-social-choice-without-normative-scrutiny]], which is the closest mechanistic companion claim: one explains that RLHF makes social choice decisions without normative scrutiny, the other provides the formal impossibility result showing why those unreflective choices produce structural failure. These two claims should cross-reference. Worth adding to the Relevant Notes section.

The Sahoo trilemma may warrant its own claim. The source archive itself flags: "EXTRACTION HINT: The trilemma is the key claim." The trilemma (representativeness + tractability + robustness are mutually exclusive) is a distinct impossibility result from the Chakraborty alignment-gap result, and it covers ground not captured by any existing claim. The three-way tradeoff structure — analogous to the CAP theorem for distributed systems — is a conceptual contribution that deserves standalone indexing. The PR correctly chose not to extract it (status is `enrichment`, not `extraction`), but I'd flag this as a clear candidate for future extraction.

Confidence calibration: `likely` is correct. The Chakraborty result is ICML main conference; Sahoo is a NeurIPS workshop paper (stronger scrutiny than typical workshops given the venue and author affiliations, but still workshop-tier). The mathematical proofs are sound, but the practical implications for real RLHF systems involve interpretation. `likely` is the right call.

Sample-gap statistic is technically precise. The 10^3-10^4 vs 10^7-10^8 orders-of-magnitude gap is correctly cited and is a particularly useful empirical anchor for the otherwise abstract impossibility claim.
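The numbers in that last point can be put side by side. The 10^3-10^4 vs 10^7-10^8 range and the Omega(2^{d_context}) form come from this thread; the concrete `d` values below are illustrative assumptions, not figures from the paper.

```python
import math

# Dataset scales cited in the review: what gets collected in practice
# vs. what the trilemma argument says representativeness would require.
collected = 10**4   # upper end of the cited 10^3-10^4 range
required = 10**8    # upper end of the cited 10^7-10^8 range

gap_orders = math.log10(required / collected)
# → 4.0 orders of magnitude between the two upper bounds

# Illustrative: a lower bound of the form Omega(2^d_context) overtakes
# practical dataset sizes almost immediately as context dimension grows.
for d in (10, 14, 27):
    print(d, 2**d)
# 2^14 = 16384 already exceeds the collected range;
# 2^27 ≈ 1.3e8 exceeds even the cited required range
```

The point of the sketch is only the growth shape: an exponential bound in context dimension crosses any fixed collection budget at small `d`, which is what makes the gap an "empirical anchor" for the impossibility claim.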
Verdict: approve
Model: sonnet
Summary: Technically sound enrichment. Convergent evidence framing (complexity theory and social choice theory arriving at the same impossibility) is the domain contribution. One missing wiki link to [[rlhf-is-implicit-social-choice-without-normative-scrutiny]], and a downstream recommendation to extract the Sahoo trilemma as a standalone claim.

Approved by theseus (automated eval)
Approved by clay (automated eval)
Auto-merged — all 2 reviewers approved.
teleo-eval-orchestrator v2