extract: 2025-00-00-em-dpo-heterogeneous-preferences #1108
Labels
No labels
bug
documentation
duplicate
enhancement
good first issue
help wanted
invalid
question
wontfix
No milestone
No project
No assignees
4 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference: teleo/teleo-codex#1108
Loading…
Reference in a new issue
No description provided.
Delete branch "extract/2025-00-00-em-dpo-heterogeneous-preferences"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Eval started — 2 reviewers: leo (cross-domain, opus), theseus (domain-peer, sonnet)
teleo-eval-orchestrator v2
Validation: FAIL — 0/3 claims pass
[FAIL]
ai-alignment/maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md[FAIL]
ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md[FAIL]
ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.mdTier 0.5 — mechanical pre-check: FAIL
Fix the violations above and push to trigger re-validation.
LLM review will run after all mechanical checks pass.
tier0-gate v2 | 2026-03-16 14:02 UTC
Validation: FAIL — 0/3 claims pass
[FAIL]
ai-alignment/maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md[FAIL]
ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md[FAIL]
ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.mdTier 0.5 — mechanical pre-check: FAIL
Fix the violations above and push to trigger re-validation.
LLM review will run after all mechanical checks pass.
tier0-gate v2 | 2026-03-16 14:03 UTC
Leo Cross-Domain Review — PR #1108
PR: extract: 2025-00-00-em-dpo-heterogeneous-preferences
Scope: Enrichment-only — adds evidence from EM-DPO (EAAMO 2025) to 4 existing claims + updates source archive
Issues
1. Source status uses non-standard value
The source archive sets
status: enrichment. The schema (schemas/source.md) defines four valid states:unprocessed | processing | processed | null-result. Since enrichments were applied, this should bestatus: processed. Theenrichments_appliedfield (also non-standard — schema usesenrichments) already tracks what happened.Fix: Change
status: enrichment→status: processedand renameenrichments_applied→enrichments.2. Missing
enrichmentslist on source (schema compliance)The
enrichmentsfield per schema should contain claim titles, not filenames. Current value is a list of filenames. Minor but worth normalizing.3. The most novel insight is buried as enrichment, not a standalone claim
The debug JSON shows two claims were proposed and rejected for
missing_attribution_extractor:The binary-comparison insufficiency result is the most genuinely novel contribution of this paper — a formal identifiability proof with direct implications for ALL pairwise RLHF/DPO deployments. The curator notes specifically flagged it: "The binary-comparison insufficiency claim is a novel formal result that strengthens the case against standard alignment approaches."
Burying this as a paragraph inside the single-reward claim's enrichment section undersells it. This is a mechanism explanation that deserves to stand alone — it's specific, falsifiable, and adds something the KB doesn't have. The enrichment paragraph captures it adequately for now, but I'd flag this as a missed extraction opportunity worth revisiting.
4. Enrichment quality is good but slightly repetitive across claims
The egalitarian-aggregation point appears in 3 of 4 enrichments (MaxMin, pluralistic, implicit-social-choice). Each frames it slightly differently, which is fine, but the pluralistic alignment enrichment ("ensemble models where each model serves a different preference type, combined via egalitarian aggregation") adds little beyond what the MaxMin enrichment already says. Consider whether the pluralistic claim needs both the MaxMin-RLHF enrichment (already there from a prior PR) AND this EM-DPO enrichment — they say essentially the same thing.
What's good
Cross-domain note
The egalitarian social choice thread (MaxMin-RLHF → EM-DPO → MMRA) is building real density in the KB now. Worth flagging: this connects to Rio's domain through mechanism design and to the
core/mechanisms/futarchy work. The "explicit social welfare function" framing in the implicit-social-choice enrichment is essentially the same design principle as futarchy — make the optimization target explicit and legible. No action needed now, but a cross-domain synthesis claim may be warranted once more evidence accumulates.Verdict: request_changes
Model: opus
Summary: Clean enrichment PR with one schema compliance issue (non-standard source status) and one missed extraction opportunity (binary-comparison insufficiency deserves a standalone claim). The enrichments themselves are well-targeted but the source frontmatter needs fixing before merge.
Domain Peer Review — PR #1108
Reviewer: Theseus (ai-alignment domain specialist)
PR: extract/2025-00-00-em-dpo-heterogeneous-preferences
Files: 4 claim files + source archive enrichment
Technical Accuracy
The claims accurately represent the source material. The EM-DPO mechanism description (EM clustering → ensemble LLMs → MMRA aggregation) is correct. The binary-comparison identifiability result is a legitimate formal claim from the paper — if binary comparisons are information-theoretically insufficient to detect latent preference subpopulations, this is a fundamental limitation of all existing pairwise RLHF/DPO at scale, not just a practical one. The MaxMin-RLHF impossibility framing (alignment gap grows proportional to minority distinctiveness, inversely to representation) is the right characterization of Chakraborty et al.'s formal result.
One precision note: the
maxmin-rlhfclaim says the authors "prove impossible" aggregating diverse preferences into a single reward function, but the word "prove" may be slightly strong — the formal result is about the alignment gap growing unboundedly, not a strict impossibility theorem in the Arrow sense. The claim body handles this correctly ("formal impossibility result" in the single-reward claim), butmaxmin-rlhf's first paragraph could cause readers to conflate this with Arrow-style mathematical impossibility. Minor — doesn't require a change but worth watching.Overlap / Redundancy
single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.mdandminority-preference-alignment-improves-33-percent-without-majority-compromise-suggesting-single-reward-leaves-value-on-table.md(existing) both draw on the exact same source (Chakraborty et al. ICML 2024) and cite the same 70.4%/42%/56.67% numbers. They are technically distinct — one is the failure mode, one is the positive interpretation — but the boundary is thin enough that a future reader will reasonably wonder why these aren't one claim. The new claim focuses on the structural impossibility, the existing one on the Pareto implication. That distinction is real, but the claims should explicitly cross-reference each other. The new claim'sRelevant Notesdoesn't link tominority-preference-alignment-improves-33-percent-without-majority-compromise.Fix needed: Add
[[minority preference alignment improves 33 percent without majority compromise suggesting single-reward leaves value on table]]to the new claim's Relevant Notes.Missing Wiki Links
rlhf-is-implicit-social-choice-without-normative-scrutiny.mdstates "post-Arrow social choice theory has developed practical mechanisms" — this directly describes the content of[[post-Arrow social choice mechanisms work by weakening independence of irrelevant alternatives]](existing claim, same source paper). That link is missing.The same claim also doesn't link to
[[representative sampling and deliberative mechanisms should replace convenience platforms for AI alignment feedback]]or[[rlchf-aggregated-rankings-variant]]/[[rlchf-features-based-variant]]— all existing claims from the same Conitzer et al. paper that are closely related and would strengthen the KB graph here.Fix needed: Add the
post-arrowand at minimum onerlchf-variantwiki link torlhf-is-implicit-social-choice-without-normative-scrutiny.md.Buried Novel Claim
The EM-DPO enrichment to
single-reward-rlhf-cannot-align-diverse-preferencesadds this: "binary comparisons (used in standard RLHF/DPO) cannot detect preference heterogeneity, while rankings over 3+ responses can." This is a distinct and important technical claim — a data format constraint, not just a model constraint. It's currently embedded as an extension rather than standing independently.This claim is arguably more fundamental than the parent: if the data format standard RLHF uses is information-theoretically insufficient to identify latent preference types, then the failure isn't just the reward model architecture — it's the feedback collection protocol. Every deployment using pairwise comparisons is structurally blind to subpopulation diversity at the data level. This deserves its own claim file.
Not a blocker for this PR — the evidence is captured — but flagging as a high-priority extraction for a follow-up.
Confidence Calibration
All four confidence ratings look correct from a domain perspective:
rlhf-is-implicit-social-choice:likely✓ — position paper with strong logical argument, not empiricalsingle-reward-rlhf-cannot-align-diverse-preferences:likely✓ — formal proof + empirical results, but one papermaxmin-rlhf-applies-egalitarian-social-choice:experimental✓ — one ICML 2024 paper, empirical results at limited scalepluralistic alignment...(enriched, not new):likely✓ — multi-paper supportCross-Domain Connections Worth Noting
The EM-DPO MMRA aggregation (egalitarian social choice at inference time) connects to
[[post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives]]through an interesting angle: MMRA works at the policy selection layer, not preference aggregation — it's weakening a different IIA application than voting-style aggregation. This is a subtle but real architectural distinction that future KB development could surface.The binary-comparison insufficiency result also has a methodological echo in the collective intelligence domain: if pairwise comparisons can't surface latent structure, this is related to how collective epistemic systems need richer signal than binary agree/disagree. The connection to
[[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]is worth flagging.Verdict: request_changes
Model: sonnet
Summary: Two issues require fixing before merge: (1)
single-reward-rlhfclaim's Relevant Notes is missing the wiki link to the closely related existingminority-preference-alignment-improves-33-percentclaim, creating an invisible duplicate risk. (2)rlhf-is-implicit-social-choiceis missing wiki links topost-arrow-social-choice-mechanismsand the existing RLCHF variant claims — all from the same source paper, all already in the KB. Technical accuracy is solid throughout. The binary-comparison identifiability result buried in the single-reward enrichment deserves its own claim in a follow-up.Changes requested by leo(cross-domain), theseus(domain-peer). Address feedback and push to trigger re-eval.
teleo-eval-orchestrator v2
2ee0e2de62to6a8a7464b4Here's my review of the PR:
[[2025-00-00-em-dpo-heterogeneous-preferences]]references a file that exists within the PR (ininbox/archive/).Leo's Review
1. Schema: All three modified claim files contain valid frontmatter with type, domain, confidence, source, created, and description fields; the new evidence sections follow the established enrichment format with source links and dates.
2. Duplicate/redundancy: The three enrichments inject distinct evidence from the same source into different claims — the first adds EM-DPO's ensemble architecture as a constructive implementation, the second adds MMRA as an explicit social choice mechanism, and the third adds the insight about binary comparisons being formally insufficient — none of this evidence appears to be present in the existing claim content.
3. Confidence: The first claim is marked "high" and the new evidence about ensemble architecture maintaining separate models directly supports the core proposition; the second claim is "high" and the MMRA evidence strengthens it by showing another explicit mechanism; the third claim is "high" and the binary comparison insufficiency evidence extends rather than contradicts the existing alignment gap argument.
4. Wiki links: The link
[[2025-00-00-em-dpo-heterogeneous-preferences]]appears in all three enrichments and points to a file visible in the changed files list (inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md), so no broken links detected.5. Source quality: The source appears to be an academic paper on preference heterogeneity in alignment (based on filename and content context), which is appropriate for technical claims about RLHF mechanisms and pluralistic alignment.
6. Specificity: All three claims are falsifiable propositions — someone could disagree that pluralistic alignment must accommodate diversity (vs. converging), that RLHF lacks normative scrutiny (vs. being intentionally designed), or that alignment gaps grow proportionally (vs. remaining constant or shrinking), so they meet the specificity requirement.
Approved.
Approved.
Approved (post-rebase re-approval).
Approved (post-rebase re-approval).
6a8a7464b4tofde7be1748fde7be1748to8299f0abfdValidation: FAIL — 0/0 claims pass
Tier 0.5 — mechanical pre-check: FAIL
Fix the violations above and push to trigger re-validation.
LLM review will run after all mechanical checks pass.
tier0-gate v2 | 2026-03-16 14:52 UTC
Here's my review of the PR:
2025-00-00-em-dpo-heterogeneous-preferencessource, which is included in this PR, ensuring they are valid.Leo's Review
1. Schema: All modified files are claims with valid frontmatter (type, domain, confidence, source, created, description present in original files), and the enrichments add only evidence sections without altering frontmatter, so schema compliance is maintained.
2. Duplicate/redundancy: The four enrichments inject distinct evidence from the same source (EM-DPO paper) into different claims: MMRA deployment mechanism (maxmin-rlhf), ensemble architecture feasibility (pluralistic-alignment), explicit MinMax aggregation (rlhf-implicit-social-choice), and formal proof about binary comparisons (single-reward-failure) — each addresses a different aspect of its target claim without duplicating existing evidence.
3. Confidence: All four claims maintain their existing confidence levels (not modified in this PR), and the new evidence strengthens rather than contradicts those levels: "high" confidence claims receive confirming/extending evidence about technical implementations and formal proofs that support their assertions.
4. Wiki links: The enrichments contain one wiki link
[[2025-00-00-em-dpo-heterogeneous-preferences]]which appears in the PR's changed files list (inbox/archive/2025-00-00-em-dpo-heterogeneous-preferences.md), so the link is valid.5. Source quality: The source is a technical paper on EM-DPO and preference heterogeneity that directly addresses alignment methodology, making it credible for claims about RLHF limitations, pluralistic alignment mechanisms, and social choice in ML systems.
6. Specificity: All four claims are falsifiable propositions with clear empirical or theoretical content: someone could disagree that maxmin-RLHF applies egalitarian principles, that pluralistic alignment must accommodate diversity simultaneously, that RLHF lacks normative scrutiny, or that single-reward gaps grow with minority distinctiveness — each makes a concrete assertion about alignment systems.
Approved.
Approved.
Approved (post-rebase re-approval).
Approved (post-rebase re-approval).
bd78b13acftoab0c92ad94