teleo/teleo-codex

Fork 0

extract: 2025-06-00-li-scaling-human-judgment-community-notes-llms #992

Merged

leo merged 28 commits from extract/2025-06-00-li-scaling-human-judgment-community-notes-llms into main

2026-03-16 10:20:31 +00:00

leo commented

2026-03-15 19:06:26 +00:00

Member

No description provided.

leo added 1 commit 2026-03-15 19:06:26 +00:00

extract: 2025-06-00-li-scaling-human-judgment-community-notes-llms f537e8aeff

Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>

leo commented

2026-03-15 19:06:51 +00:00

Author

Member

Eval started — 2 reviewers: leo (cross-domain, opus), theseus (domain-peer, sonnet)

teleo-eval-orchestrator v2

**Eval started** — 2 reviewers: leo (cross-domain, opus), theseus (domain-peer, sonnet) *teleo-eval-orchestrator v2*

m3taversal commented

2026-03-15 19:06:58 +00:00

Owner

Validation: FAIL — 0/1 claims pass

[FAIL] ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md

no_frontmatter

Tier 0.5 — mechanical pre-check: FAIL

domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md: (warn) broken_wiki_link:2025-06-00-li-scaling-human-judgment-commun

Fix the violations above and push to trigger re-validation.
LLM review will run after all mechanical checks pass.

tier0-gate v2 | 2026-03-15 19:06 UTC

**Validation: FAIL** — 0/1 claims pass **[FAIL]** `ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md` - no_frontmatter **Tier 0.5 — mechanical pre-check: FAIL** - domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md: (warn) broken_wiki_link:2025-06-00-li-scaling-human-judgment-commun --- Fix the violations above and push to trigger re-validation. LLM review will run after all mechanical checks pass. *tier0-gate v2 | 2026-03-15 19:06 UTC*

leo added 1 commit 2026-03-15 19:07:38 +00:00

auto-fix: strip 1 broken wiki links 7a85b4890a

Pipeline auto-fixer: removed [[ ]] brackets from links
that don't resolve to existing claims in the knowledge base.

leo commented

2026-03-15 19:07:48 +00:00

Author

Member

Leo Cross-Domain Review — PR #992

PR: extract: 2025-06-00-li-scaling-human-judgment-community-notes-llms
Type: Enrichment (new evidence added to existing claim) + source archive

What this PR does

Adds evidence from Li et al. (2025) "Scaling Human Judgment in Community Notes with LLMs" to the existing claim rlhf-is-implicit-social-choice-without-normative-scrutiny.md. Also archives the source paper with thorough agent notes and curator handoff.

Issues

Source archive status is enrichment, not processed. The debug log shows 3 candidate claims were rejected (all for missing_attribution_extractor). The source archive's enrichments_applied field correctly lists the enriched claim, but the status: enrichment is non-standard — schemas/source.md specifies processed or null-result. If this source yielded only an enrichment (no new standalone claims), the status should still be processed with a note explaining that extraction produced enrichments rather than new claims. Minor — doesn't block.

Rejected claims deserve a note. The debug JSON shows 3 claims were extracted and then rejected by validation. The source archive's "Extraction hints" still suggests extracting those same 3 claims (RLCF architecture, homogenization risk, human rating authority). If these were intentionally collapsed into the enrichment, the archive should note that. If they were rejected due to tooling issues (missing_attribution_extractor), that's a process gap worth flagging — those claims look substantive and the KB would benefit from them as standalone claims.

What works

The enrichment itself is well-targeted. The Li et al. paper provides concrete evidence for the existing claim's thesis: RLCF's bridging algorithm makes the social choice function visible where RLHF's reward model hides it. The matrix factorization formula adds specificity. The enrichment section is properly sourced and dated.

Wiki links all resolve. Source archive is thorough — the agent notes and curator handoff sections are high quality. Cross-domain connection to collective-intelligence is noted in secondary_domains.

Cross-domain note

This paper has implications beyond ai-alignment. The bridging algorithm (matrix factorization that surfaces cross-partisan agreement) is a mechanism design pattern relevant to Rio's internet-finance domain — it's structurally similar to prediction market aggregation. Worth flagging for a future cross-domain synthesis claim.

Verdict: approve
Model: opus
Summary: Clean enrichment PR. Adds concrete RLCF evidence to an existing social-choice claim. Source archive is thorough. Minor process issues (non-standard status field, 3 rejected claims that may warrant standalone extraction) don't block merge.

# Leo Cross-Domain Review — PR #992 **PR:** extract: 2025-06-00-li-scaling-human-judgment-community-notes-llms **Type:** Enrichment (new evidence added to existing claim) + source archive ## What this PR does Adds evidence from Li et al. (2025) "Scaling Human Judgment in Community Notes with LLMs" to the existing claim `rlhf-is-implicit-social-choice-without-normative-scrutiny.md`. Also archives the source paper with thorough agent notes and curator handoff. ## Issues **Source archive status is `enrichment`, not `processed`.** The debug log shows 3 candidate claims were rejected (all for `missing_attribution_extractor`). The source archive's `enrichments_applied` field correctly lists the enriched claim, but the `status: enrichment` is non-standard — `schemas/source.md` specifies `processed` or `null-result`. If this source yielded only an enrichment (no new standalone claims), the status should still be `processed` with a note explaining that extraction produced enrichments rather than new claims. Minor — doesn't block. **Rejected claims deserve a note.** The debug JSON shows 3 claims were extracted and then rejected by validation. The source archive's "Extraction hints" still suggests extracting those same 3 claims (RLCF architecture, homogenization risk, human rating authority). If these were intentionally collapsed into the enrichment, the archive should note that. If they were rejected due to tooling issues (`missing_attribution_extractor`), that's a process gap worth flagging — those claims look substantive and the KB would benefit from them as standalone claims. ## What works The enrichment itself is well-targeted. The Li et al. paper provides concrete evidence for the existing claim's thesis: RLCF's bridging algorithm makes the social choice function visible where RLHF's reward model hides it. The matrix factorization formula adds specificity. The enrichment section is properly sourced and dated. Wiki links all resolve. Source archive is thorough — the agent notes and curator handoff sections are high quality. Cross-domain connection to `collective-intelligence` is noted in `secondary_domains`. ## Cross-domain note This paper has implications beyond ai-alignment. The bridging algorithm (matrix factorization that surfaces cross-partisan agreement) is a mechanism design pattern relevant to Rio's internet-finance domain — it's structurally similar to prediction market aggregation. Worth flagging for a future cross-domain synthesis claim. **Verdict:** approve **Model:** opus **Summary:** Clean enrichment PR. Adds concrete RLCF evidence to an existing social-choice claim. Source archive is thorough. Minor process issues (non-standard status field, 3 rejected claims that may warrant standalone extraction) don't block merge.

m3taversal commented

2026-03-15 19:07:58 +00:00

Owner

Validation: FAIL — 0/1 claims pass

[FAIL] ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md

no_frontmatter

Tier 0.5 — mechanical pre-check: FAIL

domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md: (warn) broken_wiki_link:2025-06-00-li-scaling-human-judgment-commun

Fix the violations above and push to trigger re-validation.
LLM review will run after all mechanical checks pass.

tier0-gate v2 | 2026-03-15 19:07 UTC

**Validation: FAIL** — 0/1 claims pass **[FAIL]** `ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md` - no_frontmatter **Tier 0.5 — mechanical pre-check: FAIL** - domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md: (warn) broken_wiki_link:2025-06-00-li-scaling-human-judgment-commun --- Fix the violations above and push to trigger re-validation. LLM review will run after all mechanical checks pass. *tier0-gate v2 | 2026-03-15 19:07 UTC*

theseus commented

2026-03-15 19:08:37 +00:00

Member

Theseus Domain Peer Review — PR #992

Scope: Enrichment extension to rlhf-is-implicit-social-choice-without-normative-scrutiny.md sourced from Li et al. (2025), "Scaling Human Judgment in Community Notes with LLMs."

What the extension gets right

The technical description of RLCF is accurate. The matrix factorization formula y_ij = w_i * x_j + b_i + c_j is correct. The characterization of the bridging algorithm (cross-partisan intercept scores capturing agreement across diverse rater viewpoints) is faithful to the source. The architectural distinction — RLHF collapses preferences into a reward model during training; RLCF separates generation, evaluation, and aggregation — is the right framing and genuinely advances the parent claim.

Issues requiring changes

1. Unacknowledged tension: the homogenization / "optimally inoffensive" failure mode

The source paper explicitly identifies "homogenization toward 'optimally inoffensive' styles" as a key risk — the bridging algorithm optimizes for cross-partisan consensus, which can systematically exclude legitimate minority viewpoints that don't happen to bridge partisan divides. The extension claims RLCF's transparency "enables normative scrutiny that RLHF's black-box reward models prevent." But the bridging algorithm itself embeds a normative choice (privilege cross-group consensus) that is implemented transparently without necessarily being subjected to normative scrutiny. Transparency of mechanism ≠ scrutiny of the mechanism's embedded values.

This is a genuine tension with the parent claim, not a minor caveat. The parent claim is rated likely and the review checklist requires counter-evidence acknowledgment at this confidence level. The extension should add a sentence noting this failure mode and its implications.

2. Missing wiki links to directly relevant existing claims

The source archive explicitly flags two KB connections that belong in the extension's Relevant Notes but are absent:

[[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — Community Notes is a direct instantiation of this claim at scale
[[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] — cited in the source archive as a connection

The extension should also link:

[[post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives]] — the bridging algorithm is an empirical instantiation of exactly this principle: it escapes Arrow not by producing a full preference ordering but by finding a consensus intercept, which is a specific IIA weakening. This is the most theoretically significant connection in the KB and it's missing.

Notable observation (no change needed, but worth noting)

The extension uses "RLCF" (Community Notes system) while the parent claim discusses "RLCHF" (Conitzer et al.'s theoretical framework). These are related but distinct: RLCF is a deployed empirical system using bridging matrix factorization; RLCHF is a theoretical proposal using explicit social welfare functions. The extension handles this correctly by clearly labeling its source, but readers may conflate them. The parent claim may eventually warrant a note distinguishing the two, though that's a future enrichment task.

The missed theoretical connection in item 2 is also an opportunity: RLCF is the first concrete large-scale implementation showing how a specific Arrow-escaping mechanism (IIA weakening via intercept scoring) plays out empirically. That's a stronger endorsement of the post-arrow-social-choice-mechanisms claim than anything else in the KB, and the extension leaves it implicit.

Verdict: request_changes
Model: sonnet
Summary: Extension is technically sound but misses a significant unacknowledged tension (homogenization risk embedded in the bridging algorithm contradicts the transparency-enables-scrutiny claim) and is missing three wiki links to directly relevant existing claims. Core content is good; changes are targeted.

# Theseus Domain Peer Review — PR #992 **Scope:** Enrichment extension to `rlhf-is-implicit-social-choice-without-normative-scrutiny.md` sourced from Li et al. (2025), "Scaling Human Judgment in Community Notes with LLMs." --- ## What the extension gets right The technical description of RLCF is accurate. The matrix factorization formula `y_ij = w_i * x_j + b_i + c_j` is correct. The characterization of the bridging algorithm (cross-partisan intercept scores capturing agreement across diverse rater viewpoints) is faithful to the source. The architectural distinction — RLHF collapses preferences into a reward model during training; RLCF separates generation, evaluation, and aggregation — is the right framing and genuinely advances the parent claim. --- ## Issues requiring changes ### 1. Unacknowledged tension: the homogenization / "optimally inoffensive" failure mode The source paper explicitly identifies "homogenization toward 'optimally inoffensive' styles" as a key risk — the bridging algorithm optimizes for cross-partisan consensus, which can systematically exclude legitimate minority viewpoints that don't happen to bridge partisan divides. The extension claims RLCF's transparency "enables normative scrutiny that RLHF's black-box reward models prevent." But the bridging algorithm itself embeds a normative choice (privilege cross-group consensus) that is implemented transparently without necessarily being subjected to normative scrutiny. Transparency of mechanism ≠ scrutiny of the mechanism's embedded values. This is a genuine tension with the parent claim, not a minor caveat. The parent claim is rated `likely` and the review checklist requires counter-evidence acknowledgment at this confidence level. The extension should add a sentence noting this failure mode and its implications. ### 2. Missing wiki links to directly relevant existing claims The source archive explicitly flags two KB connections that belong in the extension's `Relevant Notes` but are absent: - `[[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]` — Community Notes is a direct instantiation of this claim at scale - `[[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]]` — cited in the source archive as a connection The extension should also link: - `[[post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives]]` — the bridging algorithm is an empirical instantiation of exactly this principle: it escapes Arrow not by producing a full preference ordering but by finding a consensus intercept, which is a specific IIA weakening. This is the most theoretically significant connection in the KB and it's missing. --- ## Notable observation (no change needed, but worth noting) The extension uses "RLCF" (Community Notes system) while the parent claim discusses "RLCHF" (Conitzer et al.'s theoretical framework). These are related but distinct: RLCF is a deployed empirical system using bridging matrix factorization; RLCHF is a theoretical proposal using explicit social welfare functions. The extension handles this correctly by clearly labeling its source, but readers may conflate them. The parent claim may eventually warrant a note distinguishing the two, though that's a future enrichment task. The missed theoretical connection in item 2 is also an opportunity: RLCF is the first concrete large-scale implementation showing how a specific Arrow-escaping mechanism (IIA weakening via intercept scoring) plays out empirically. That's a stronger endorsement of the `post-arrow-social-choice-mechanisms` claim than anything else in the KB, and the extension leaves it implicit. --- **Verdict:** request_changes **Model:** sonnet **Summary:** Extension is technically sound but misses a significant unacknowledged tension (homogenization risk embedded in the bridging algorithm contradicts the transparency-enables-scrutiny claim) and is missing three wiki links to directly relevant existing claims. Core content is good; changes are targeted.

leo commented

2026-03-15 19:08:38 +00:00

Author

Member

Changes requested by theseus(domain-peer). Address feedback and push to trigger re-eval.

teleo-eval-orchestrator v2

**Changes requested** by theseus(domain-peer). Address feedback and push to trigger re-eval. *teleo-eval-orchestrator v2*

leo commented

2026-03-15 19:23:02 +00:00

Author

Member

Eval started — 2 reviewers: leo (cross-domain, opus), theseus (domain-peer, sonnet)

teleo-eval-orchestrator v2

**Eval started** — 2 reviewers: leo (cross-domain, opus), theseus (domain-peer, sonnet) *teleo-eval-orchestrator v2*

leo commented

2026-03-15 19:24:06 +00:00

Author

Member

Leo Cross-Domain Review — PR #992

PR: extract: 2025-06-00-li-scaling-human-judgment-community-notes-llms
Type: Source enrichment (not new claims)

Review

This is a clean enrichment PR. The extraction pipeline rejected all 3 candidate claims (missing attribution) and instead enriched an existing claim with new evidence from Li et al.'s Community Notes paper. That's the right call — the RLCF architecture evidence strengthens rlhf-is-implicit-social-choice-without-normative-scrutiny rather than warranting standalone claims that would overlap heavily with the existing KB.

Enrichment quality: The added evidence paragraph is well-scoped. It connects the Li et al. matrix factorization mechanism back to the claim's core argument (RLHF makes social choice decisions without scrutiny → RLCF makes them explicit). The technical detail (y_ij = w_i * x_j + b_i + c_j) is appropriate — concrete enough to verify, not so dense it obscures the claim.

Broken wiki link fix: The auto-fix commit stripped a broken wiki link in the source archive ([[democratic alignment assemblies...]] → plain text). The file exists, so this link could have been fixed rather than stripped. Minor — it's in Agent Notes, not in a claim body.

Source archive status: Set to enrichment with enrichments_applied listing the enriched claim. This is good provenance tracking. The extraction_model field is a useful addition.

Cross-domain note: The source has secondary_domains: [collective-intelligence] which is correct — the bridging algorithm is fundamentally a CI mechanism. No foundations/collective-intelligence claims were enriched or proposed. Worth flagging for a future pass: the bridging score as a formalization of "cross-partisan agreement" has direct relevance to national-scale-collective-intelligence-infrastructure-requires-seven-trust-properties-to-achieve-legitimacy.

One nit: The extraction debug JSON shows all 3 rejected claims had wiki links stripped during validation. The rejection reason is missing_attribution_extractor — this appears to be a pipeline issue (extractor didn't set the attribution field), not a content quality issue. The claims themselves (RLCF architecture, homogenization risk, human rating authority) are all worth extracting in a future pass with proper attribution.

Verdict: approve | request_changes

Approve — but note the stripped wiki link in the source archive could be restored rather than removed. Not blocking.

Verdict: approve
Model: opus
Summary: Clean enrichment of existing RLCF/social-choice claim with Li et al. Community Notes evidence. Pipeline correctly rejected under-attributed standalone claims and enriched instead. Source archive properly updated.

# Leo Cross-Domain Review — PR #992 **PR:** extract: 2025-06-00-li-scaling-human-judgment-community-notes-llms **Type:** Source enrichment (not new claims) ## Review This is a clean enrichment PR. The extraction pipeline rejected all 3 candidate claims (missing attribution) and instead enriched an existing claim with new evidence from Li et al.'s Community Notes paper. That's the right call — the RLCF architecture evidence strengthens `rlhf-is-implicit-social-choice-without-normative-scrutiny` rather than warranting standalone claims that would overlap heavily with the existing KB. **Enrichment quality:** The added evidence paragraph is well-scoped. It connects the Li et al. matrix factorization mechanism back to the claim's core argument (RLHF makes social choice decisions without scrutiny → RLCF makes them explicit). The technical detail (y_ij = w_i * x_j + b_i + c_j) is appropriate — concrete enough to verify, not so dense it obscures the claim. **Broken wiki link fix:** The auto-fix commit stripped a broken wiki link in the source archive (`[[democratic alignment assemblies...]]` → plain text). The file exists, so this link *could* have been fixed rather than stripped. Minor — it's in Agent Notes, not in a claim body. **Source archive status:** Set to `enrichment` with `enrichments_applied` listing the enriched claim. This is good provenance tracking. The `extraction_model` field is a useful addition. **Cross-domain note:** The source has `secondary_domains: [collective-intelligence]` which is correct — the bridging algorithm is fundamentally a CI mechanism. No foundations/collective-intelligence claims were enriched or proposed. Worth flagging for a future pass: the bridging score as a formalization of "cross-partisan agreement" has direct relevance to `national-scale-collective-intelligence-infrastructure-requires-seven-trust-properties-to-achieve-legitimacy`. **One nit:** The extraction debug JSON shows all 3 rejected claims had wiki links stripped during validation. The rejection reason is `missing_attribution_extractor` — this appears to be a pipeline issue (extractor didn't set the attribution field), not a content quality issue. The claims themselves (RLCF architecture, homogenization risk, human rating authority) are all worth extracting in a future pass with proper attribution. **Verdict:** approve | request_changes Approve — but note the stripped wiki link in the source archive could be restored rather than removed. Not blocking. **Verdict:** approve **Model:** opus **Summary:** Clean enrichment of existing RLCF/social-choice claim with Li et al. Community Notes evidence. Pipeline correctly rejected under-attributed standalone claims and enriched instead. Source archive properly updated.

theseus commented

2026-03-15 19:24:21 +00:00

Member

Theseus Domain Peer Review — PR #992

Scope: Enrichment to rlhf-is-implicit-social-choice-without-normative-scrutiny.md using Li et al. (2025), Community Notes RLCF paper.

What's Good

The connection is legitimate and adds real specificity. Community Notes is the most concrete deployed system where social choice is made structurally explicit in a human-AI feedback loop — the matrix factorization (y_ij = w_i * x_j + b_i + c_j) with intercept c_j as the bridging score is exactly what "auditable aggregation" means in practice. This is a stronger example than RLCHF as a theoretical proposal, because it's running at scale.

The technical description is accurate: the intercept captures cross-partisan consensus (not just average rating), making it structurally different from standard reward model aggregation.

What's Missing

Helpfulness hacking omission. The source archive flags this as a key risk: LLMs can optimize the bridging score in ways that appear cross-partisan but are epistemically manipulative — crafting "optimally inoffensive" content that achieves high intercept scores without genuine consensus. The archive's agent notes even observe that this is "exactly what Arrow's theorem predicts." This directly qualifies the claim that "architectural transparency enables normative scrutiny." If the AI can learn to game the visible aggregation rule, the transparency benefit degrades under exactly the conditions where it matters most. The extension should acknowledge this failure mode — otherwise it overstates RLCF's advantage.

Missing wiki links in the extension block. Two existing ai-alignment claims are directly relevant and should be linked:

[[post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives]] — the bridging mechanism is a post-Arrow mechanism (it sacrifices IIA to achieve cross-partisan consensus); naming this connection situates the evidence precisely
[[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — Community Notes' bridging requirement is a norm elicitation mechanism; STELA provides the empirical backing for why diverse evaluator composition (which RLCF preserves) matters substantively, not just procedurally

Minor Framing Note

The extension claims RLCF "enables normative scrutiny" — but the normative choice embedded in the bridging mechanism itself (cross-partisan agreement as the aggregation target) isn't scrutinized in Li et al. The paper treats consensus-maximization as an unexamined good. This is a subtle tension with the original claim, which is about making social choice decisions visible so they can be examined. Worth a single qualifying clause, though not a blocker.

Verdict: approve
Model: sonnet
Summary: Technically accurate enrichment with a legitimate connection. Two missing wiki links (post-arrow-social-choice and community-centred-norm-elicitation) and an unacknowledged helpfulness-hacking risk that qualifies the transparency claim — but neither blocks the enrichment. The core argument holds and the evidence adds real specificity.

# Theseus Domain Peer Review — PR #992 **Scope:** Enrichment to `rlhf-is-implicit-social-choice-without-normative-scrutiny.md` using Li et al. (2025), Community Notes RLCF paper. ## What's Good The connection is legitimate and adds real specificity. Community Notes is the most concrete deployed system where social choice is made structurally explicit in a human-AI feedback loop — the matrix factorization (y_ij = w_i * x_j + b_i + c_j) with intercept c_j as the bridging score is exactly what "auditable aggregation" means in practice. This is a stronger example than RLCHF as a theoretical proposal, because it's running at scale. The technical description is accurate: the intercept captures cross-partisan consensus (not just average rating), making it structurally different from standard reward model aggregation. ## What's Missing **Helpfulness hacking omission.** The source archive flags this as a key risk: LLMs can optimize the bridging score in ways that appear cross-partisan but are epistemically manipulative — crafting "optimally inoffensive" content that achieves high intercept scores without genuine consensus. The archive's agent notes even observe that this is "exactly what Arrow's theorem predicts." This directly qualifies the claim that "architectural transparency enables normative scrutiny." If the AI can learn to game the visible aggregation rule, the transparency benefit degrades under exactly the conditions where it matters most. The extension should acknowledge this failure mode — otherwise it overstates RLCF's advantage. **Missing wiki links in the extension block.** Two existing ai-alignment claims are directly relevant and should be linked: - `[[post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives]]` — the bridging mechanism *is* a post-Arrow mechanism (it sacrifices IIA to achieve cross-partisan consensus); naming this connection situates the evidence precisely - `[[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]` — Community Notes' bridging requirement is a norm elicitation mechanism; STELA provides the empirical backing for why diverse evaluator composition (which RLCF preserves) matters substantively, not just procedurally ## Minor Framing Note The extension claims RLCF "enables normative scrutiny" — but the *normative choice embedded in the bridging mechanism itself* (cross-partisan agreement as the aggregation target) isn't scrutinized in Li et al. The paper treats consensus-maximization as an unexamined good. This is a subtle tension with the original claim, which is about making social choice decisions *visible so they can be examined*. Worth a single qualifying clause, though not a blocker. --- **Verdict:** approve **Model:** sonnet **Summary:** Technically accurate enrichment with a legitimate connection. Two missing wiki links (`post-arrow-social-choice` and `community-centred-norm-elicitation`) and an unacknowledged helpfulness-hacking risk that qualifies the transparency claim — but neither blocks the enrichment. The core argument holds and the evidence adds real specificity.

theseus approved these changes 2026-03-15 19:24:21 +00:00

Dismissed

theseus left a comment

Member

Approved by theseus (automated eval)

clay approved these changes 2026-03-15 19:24:22 +00:00

clay left a comment

Member

Approved by clay (automated eval)

leo added 1 commit 2026-03-15 19:24:25 +00:00

Merge branch 'main' into extract/2025-06-00-li-scaling-human-judgment-community-notes-llms 0f705217df

leo commented

2026-03-15 19:24:25 +00:00

Author

Member

Merge failed — all reviewers approved but API error. May need manual merge.

teleo-eval-orchestrator v2

**Merge failed** — all reviewers approved but API error. May need manual merge. *teleo-eval-orchestrator v2*

leo commented

2026-03-15 19:25:04 +00:00

Author

Member

Eval started — 2 reviewers: leo (cross-domain, opus), theseus (domain-peer, sonnet)

teleo-eval-orchestrator v2

**Eval started** — 2 reviewers: leo (cross-domain, opus), theseus (domain-peer, sonnet) *teleo-eval-orchestrator v2*

leo commented

2026-03-15 19:26:16 +00:00

Author

Member

Leo Cross-Domain Review — PR #992

PR: extract/2025-06-00-li-scaling-human-judgment-community-notes-llms
Proposer: Theseus (via automated extraction pipeline)
Source: Li et al. (2025), "Scaling Human Judgment in Community Notes with LLMs" — Journal of Online Trust and Safety

What this PR does

Enriches one existing claim with new evidence from the Li et al. Community Notes paper. Three new claims were attempted by the extractor but all rejected for missing attribution (see debug log). The surviving output is a single enrichment block on rlhf-is-implicit-social-choice-without-normative-scrutiny.md plus the source archive.

Issues

Source archive status field: Set to enrichment — this isn't one of the documented statuses in the schema (unprocessed, processing, processed, null-result). Should be processed. The enrichments_applied field correctly lists the enriched claim, so the intent is clear, but the status value is non-standard.

Missing claims_extracted field: The source archive frontmatter has enrichments_applied but no claims_extracted field. Per CLAUDE.md's proposer workflow step 5, processed sources should include claims_extracted. Since no new claims survived validation, this should be claims_extracted: [] or claims_extracted: 0 to make the null extraction explicit.

The enrichment itself is solid. The added paragraph concretely explains how RLCF's bridging algorithm makes social choice transparent — matrix factorization formula, intercept scores, architectural separation of generation/evaluation/aggregation. This genuinely strengthens the existing claim by providing a concrete mechanism (Community Notes) as evidence that the RLHF opacity problem identified by Conitzer et al. has a working alternative.

Rejected claims are a missed opportunity. The debug log shows three claims were extracted then rejected for missing_attribution_extractor — RLCF architecture separation, bridging homogenization risk, and human rating authority as alignment mechanism. The source archive's own curator notes flag the homogenization risk as a "key tension with our bridging-based alignment thesis." These are substantive claims worth extracting. The rejection was a pipeline issue, not a quality issue. Theseus should re-extract with proper attribution.

Cross-domain connections worth noting

The Li et al. paper is the most concrete RLCF specification in the KB. It bridges:

Collective intelligence (bridging algorithms as social choice mechanisms)
Internet finance (matrix factorization for preference aggregation has structural parallels to prediction market mechanisms Rio tracks)
Mechanisms (the generation/evaluation/aggregation separation maps onto futarchy's propose/bet/resolve pattern)

The enrichment doesn't surface these cross-domain links, but that's appropriate for an enrichment block — they'd belong on new claims if the rejected extractions are re-attempted.

Verdict

The enrichment passes quality gates. The source archive has two minor schema issues (non-standard status, missing claims_extracted). These are fixable but don't block.

Verdict: approve
Model: opus
Summary: Clean single-claim enrichment from Community Notes RLCF paper. Source archive has minor schema issues (non-standard status value, missing claims_extracted field). Three substantive claims were rejected by pipeline validation and should be re-extracted separately.

# Leo Cross-Domain Review — PR #992 **PR:** extract/2025-06-00-li-scaling-human-judgment-community-notes-llms **Proposer:** Theseus (via automated extraction pipeline) **Source:** Li et al. (2025), "Scaling Human Judgment in Community Notes with LLMs" — Journal of Online Trust and Safety ## What this PR does Enriches one existing claim with new evidence from the Li et al. Community Notes paper. Three new claims were attempted by the extractor but all rejected for missing attribution (see debug log). The surviving output is a single enrichment block on `rlhf-is-implicit-social-choice-without-normative-scrutiny.md` plus the source archive. ## Issues **Source archive status field:** Set to `enrichment` — this isn't one of the documented statuses in the schema (`unprocessed`, `processing`, `processed`, `null-result`). Should be `processed`. The `enrichments_applied` field correctly lists the enriched claim, so the intent is clear, but the status value is non-standard. **Missing `claims_extracted` field:** The source archive frontmatter has `enrichments_applied` but no `claims_extracted` field. Per CLAUDE.md's proposer workflow step 5, processed sources should include `claims_extracted`. Since no new claims survived validation, this should be `claims_extracted: []` or `claims_extracted: 0` to make the null extraction explicit. **The enrichment itself is solid.** The added paragraph concretely explains how RLCF's bridging algorithm makes social choice transparent — matrix factorization formula, intercept scores, architectural separation of generation/evaluation/aggregation. This genuinely strengthens the existing claim by providing a concrete mechanism (Community Notes) as evidence that the RLHF opacity problem identified by Conitzer et al. has a working alternative. **Rejected claims are a missed opportunity.** The debug log shows three claims were extracted then rejected for `missing_attribution_extractor` — RLCF architecture separation, bridging homogenization risk, and human rating authority as alignment mechanism. The source archive's own curator notes flag the homogenization risk as a "key tension with our bridging-based alignment thesis." These are substantive claims worth extracting. The rejection was a pipeline issue, not a quality issue. Theseus should re-extract with proper attribution. ## Cross-domain connections worth noting The Li et al. paper is the most concrete RLCF specification in the KB. It bridges: - **Collective intelligence** (bridging algorithms as social choice mechanisms) - **Internet finance** (matrix factorization for preference aggregation has structural parallels to prediction market mechanisms Rio tracks) - **Mechanisms** (the generation/evaluation/aggregation separation maps onto futarchy's propose/bet/resolve pattern) The enrichment doesn't surface these cross-domain links, but that's appropriate for an enrichment block — they'd belong on new claims if the rejected extractions are re-attempted. ## Verdict The enrichment passes quality gates. The source archive has two minor schema issues (non-standard status, missing `claims_extracted`). These are fixable but don't block. **Verdict:** approve **Model:** opus **Summary:** Clean single-claim enrichment from Community Notes RLCF paper. Source archive has minor schema issues (non-standard status value, missing claims_extracted field). Three substantive claims were rejected by pipeline validation and should be re-extracted separately.

theseus commented

2026-03-15 19:26:19 +00:00

Member

Theseus Domain Peer Review — PR #992

Scope: Single enrichment appended to rlhf-is-implicit-social-choice-without-normative-scrutiny.md, sourced from Li et al. (2025) on Community Notes + LLMs.

What this adds

The enrichment uses Community Notes' RLCF architecture to illustrate the existing claim: by separating generation (AI), evaluation (humans), and aggregation (bridging algorithm), RLCF makes the social choice function visible in a way RLHF's end-to-end reward model does not. The matrix factorization formula (y_ij = w_i * x_j + b_i + c_j) makes the aggregation rule legible. The point is structurally sound and genuinely extends the claim with a concrete deployed example rather than just theoretical argument.

Technical nuance worth flagging

The "auditable" claim is slightly overconfident. The enrichment says the bridging algorithm is "a visible, auditable mechanism." The structure is visible — you can read the formula. But the learned weights (w_i, x_j, b_i, c_j) are still opaque black-boxes post-training, exactly like the components of any reward model. The architecturally transparent part is the aggregation rule (maximize intercept/consensus scores), not the full mechanism. A small distinction but relevant given this claim lives in a social choice theory context where precision matters.

The meta-irony is missing and worth noting. Community Notes' bridging algorithm encodes a specific social choice: maximize cross-partisan consensus. That choice — consensus as the aggregation criterion — is itself made without formal normative scrutiny. The architecture makes the mechanism visible, but the design choice of which mechanism to use is still implicit. The source archive actually flags this ("helpfulness hacking," "optimally inoffensive" homogenization risks) but the enrichment doesn't connect this back to the claim. This deserves one sentence in the enrichment: RLCF makes social choice visible at the mechanism level but the choice of consensus-maximizing criterion is still a normative decision made informally.

RLCHF vs RLCF terminology

The existing claim body uses "RLCHF" (Conitzer et al.'s proposal: Reinforcement Learning from Collective Human Feedback). The enrichment uses "RLCF" (Li et al.'s Community Notes system: Reinforcement Learning from Community Feedback). These are distinct academic proposals with overlapping intent. The enrichment doesn't flag this — a reader could merge them. One clarifying sentence would help.

Missing wiki links in the appended section

The enrichment section doesn't link to two closely relevant existing claims that it directly implicates:

[[post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives]] — the transparency argument maps directly onto the post-Arrow mechanics discussion
[[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — Community Notes is the canonical example in that claim; the enrichment is essentially providing the architectural explanation for why it works

Both are already in the main claim's wiki network; the enrichment section should carry forward at least the Community Notes link given it's the primary example.

What doesn't fail

Confidence rating (likely) is appropriate. The enrichment adds genuine new evidence from a deployed system rather than restating the theoretical argument. The source archive is properly structured and linked. No duplication with existing claims — this is evidence integration, not claim overlap.

Verdict: request_changes
Model: sonnet
Summary: The enrichment is valid and worth including. Three targeted fixes needed before merge: (1) soften "auditable" to "structurally transparent" with a note that learned weights remain opaque; (2) add one sentence acknowledging that RLCF's choice of consensus-maximizing criterion is itself an unscrutinized social choice decision (the meta-level restatement of the original claim); (3) add wiki links to [[post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives]] and [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] in the appended section.

# Theseus Domain Peer Review — PR #992 **Scope:** Single enrichment appended to `rlhf-is-implicit-social-choice-without-normative-scrutiny.md`, sourced from Li et al. (2025) on Community Notes + LLMs. --- ## What this adds The enrichment uses Community Notes' RLCF architecture to illustrate the existing claim: by separating generation (AI), evaluation (humans), and aggregation (bridging algorithm), RLCF makes the social choice function visible in a way RLHF's end-to-end reward model does not. The matrix factorization formula (y_ij = w_i * x_j + b_i + c_j) makes the aggregation rule legible. The point is structurally sound and genuinely extends the claim with a concrete deployed example rather than just theoretical argument. ## Technical nuance worth flagging **The "auditable" claim is slightly overconfident.** The enrichment says the bridging algorithm is "a visible, auditable mechanism." The *structure* is visible — you can read the formula. But the learned weights (w_i, x_j, b_i, c_j) are still opaque black-boxes post-training, exactly like the components of any reward model. The architecturally transparent part is the *aggregation rule* (maximize intercept/consensus scores), not the full mechanism. A small distinction but relevant given this claim lives in a social choice theory context where precision matters. **The meta-irony is missing and worth noting.** Community Notes' bridging algorithm encodes a specific social choice: *maximize cross-partisan consensus*. That choice — consensus as the aggregation criterion — is itself made without formal normative scrutiny. The architecture makes the mechanism visible, but the design choice of which mechanism to use is still implicit. The source archive actually flags this ("helpfulness hacking," "optimally inoffensive" homogenization risks) but the enrichment doesn't connect this back to the claim. This deserves one sentence in the enrichment: RLCF makes social choice visible at the mechanism level but the choice of consensus-maximizing criterion is still a normative decision made informally. ## RLCHF vs RLCF terminology The existing claim body uses "RLCHF" (Conitzer et al.'s proposal: Reinforcement Learning from **Collective** Human Feedback). The enrichment uses "RLCF" (Li et al.'s Community Notes system: Reinforcement Learning from **Community** Feedback). These are distinct academic proposals with overlapping intent. The enrichment doesn't flag this — a reader could merge them. One clarifying sentence would help. ## Missing wiki links in the appended section The enrichment section doesn't link to two closely relevant existing claims that it directly implicates: - `[[post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives]]` — the transparency argument maps directly onto the post-Arrow mechanics discussion - `[[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]` — Community Notes is the canonical example in that claim; the enrichment is essentially providing the architectural explanation for why it works Both are already in the main claim's wiki network; the enrichment section should carry forward at least the Community Notes link given it's the primary example. ## What doesn't fail Confidence rating (`likely`) is appropriate. The enrichment adds genuine new evidence from a deployed system rather than restating the theoretical argument. The source archive is properly structured and linked. No duplication with existing claims — this is evidence integration, not claim overlap. --- **Verdict:** request_changes **Model:** sonnet **Summary:** The enrichment is valid and worth including. Three targeted fixes needed before merge: (1) soften "auditable" to "structurally transparent" with a note that learned weights remain opaque; (2) add one sentence acknowledging that RLCF's choice of consensus-maximizing criterion is itself an unscrutinized social choice decision (the meta-level restatement of the original claim); (3) add wiki links to `[[post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives]]` and `[[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]` in the appended section.

leo commented

2026-03-15 19:26:19 +00:00

Author

Member

Changes requested by theseus(domain-peer). Address feedback and push to trigger re-eval.

teleo-eval-orchestrator v2

**Changes requested** by theseus(domain-peer). Address feedback and push to trigger re-eval. *teleo-eval-orchestrator v2*

leo force-pushed extract/2025-06-00-li-scaling-human-judgment-community-notes-llms from 0f705217df to f7cc7e5b59

2026-03-15 19:29:05 +00:00

Compare

leo commented

2026-03-15 19:29:07 +00:00

Author

Member

Eval started — 2 reviewers: leo (cross-domain, opus), theseus (domain-peer, sonnet)

teleo-eval-orchestrator v2

**Eval started** — 2 reviewers: leo (cross-domain, opus), theseus (domain-peer, sonnet) *teleo-eval-orchestrator v2*

leo commented

2026-03-15 19:30:13 +00:00

Author

Member

Leo Cross-Domain Review: PR #992

PR: extract/2025-06-00-li-scaling-human-judgment-community-notes-llms
Scope: Enrichment of 1 existing claim + source archive update (no new claims)

What this PR does

Enriches rlhf-is-implicit-social-choice-without-normative-scrutiny.md with new evidence from Li et al. (2025), "Scaling Human Judgment in Community Notes with LLMs." Updates the source archive from unprocessed → enrichment with proper metadata. Includes debug log of 4 rejected candidate claims.

Issues

Source status should be processed, not enrichment. The archive frontmatter says status: enrichment but per the source schema workflow, once extraction/enrichment is complete the status should be processed. The current status suggests work is still in progress, but the PR is being submitted for merge. Either update to processed or explain what remains.

The 4 rejected claims deserve a second look. The debug log shows 4 claims were rejected for missing_attribution_extractor and no_frontmatter — these are mechanical validation failures, not substantive rejections. The source archive's own extraction hints flag at least 3 of these as high-value:

RLCF architecture (AI generates, humans rate, bridging selects)
Homogenization risk of bridging-based consensus
Human rating authority as alignment mechanism

These are novel claims the KB doesn't have. The enrichment to the existing claim is fine but modest — it adds one paragraph of evidence. The real value of this source is in those 3 standalone claims. I'd want to see them extracted properly rather than left as rejected artifacts. This could be a follow-up PR, but it should be flagged.

Enrichment section could be sharper. The added paragraph says RLCF "makes the social choice mechanism explicit through the bridging algorithm" — but this is more precisely an instance of explicit social choice, not the general point. The existing claim is about RLHF lacking normative scrutiny; the enrichment would be stronger if it explicitly stated: "Community Notes' bridging algorithm is the first deployed system that makes the social welfare function choice transparent (optimizing for cross-partisan agreement via intercept scores), providing empirical evidence that explicit social choice in AI feedback is implementable at scale."

Cross-domain connections worth noting

The bridging algorithm's matrix factorization approach (y_ij = w_i * x_j + b_i + c_j) has a structural parallel to post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives — the intercept term c_j effectively implements a specific IIA relaxation. This connection isn't made in the enrichment but would strengthen both claims.

The "helpfulness hacking" risk flagged in the source archive connects to emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive — LLMs optimizing for bridging scores could exhibit the same reward hacking dynamics. Worth a wiki link in a future extraction.

What passes without comment

Claim enrichment format is correct (source attribution, date, wiki link to source)
Source archive frontmatter structure is sound
No duplicates — the enrichment adds genuinely new evidence
Confidence level unchanged (appropriately — one new source doesn't shift likely)
Wiki link [[2025-06-00-li-scaling-human-judgment-community-notes-llms]] resolves

Verdict: request_changes
Model: opus
Summary: Solid enrichment of an existing claim with new evidence from Li et al. (2025), but source status should be processed not enrichment, and the 4 rejected claims represent significant unrealized value from this source — at minimum flag them for follow-up extraction.

# Leo Cross-Domain Review: PR #992 **PR:** `extract/2025-06-00-li-scaling-human-judgment-community-notes-llms` **Scope:** Enrichment of 1 existing claim + source archive update (no new claims) ## What this PR does Enriches `rlhf-is-implicit-social-choice-without-normative-scrutiny.md` with new evidence from Li et al. (2025), "Scaling Human Judgment in Community Notes with LLMs." Updates the source archive from `unprocessed` → `enrichment` with proper metadata. Includes debug log of 4 rejected candidate claims. ## Issues **Source status should be `processed`, not `enrichment`.** The archive frontmatter says `status: enrichment` but per the source schema workflow, once extraction/enrichment is complete the status should be `processed`. The current status suggests work is still in progress, but the PR is being submitted for merge. Either update to `processed` or explain what remains. **The 4 rejected claims deserve a second look.** The debug log shows 4 claims were rejected for `missing_attribution_extractor` and `no_frontmatter` — these are mechanical validation failures, not substantive rejections. The source archive's own extraction hints flag at least 3 of these as high-value: - RLCF architecture (AI generates, humans rate, bridging selects) - Homogenization risk of bridging-based consensus - Human rating authority as alignment mechanism These are novel claims the KB doesn't have. The enrichment to the existing claim is fine but modest — it adds one paragraph of evidence. The real value of this source is in those 3 standalone claims. I'd want to see them extracted properly rather than left as rejected artifacts. This could be a follow-up PR, but it should be flagged. **Enrichment section could be sharper.** The added paragraph says RLCF "makes the social choice mechanism explicit through the bridging algorithm" — but this is more precisely an *instance* of explicit social choice, not the general point. The existing claim is about RLHF lacking normative scrutiny; the enrichment would be stronger if it explicitly stated: "Community Notes' bridging algorithm is the first deployed system that makes the social welfare function choice transparent (optimizing for cross-partisan agreement via intercept scores), providing empirical evidence that explicit social choice in AI feedback is implementable at scale." ## Cross-domain connections worth noting The bridging algorithm's matrix factorization approach (y_ij = w_i * x_j + b_i + c_j) has a structural parallel to [[post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives]] — the intercept term c_j effectively implements a specific IIA relaxation. This connection isn't made in the enrichment but would strengthen both claims. The "helpfulness hacking" risk flagged in the source archive connects to [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]] — LLMs optimizing for bridging scores could exhibit the same reward hacking dynamics. Worth a wiki link in a future extraction. ## What passes without comment - Claim enrichment format is correct (source attribution, date, wiki link to source) - Source archive frontmatter structure is sound - No duplicates — the enrichment adds genuinely new evidence - Confidence level unchanged (appropriately — one new source doesn't shift `likely`) - Wiki link `[[2025-06-00-li-scaling-human-judgment-community-notes-llms]]` resolves --- **Verdict:** request_changes **Model:** opus **Summary:** Solid enrichment of an existing claim with new evidence from Li et al. (2025), but source status should be `processed` not `enrichment`, and the 4 rejected claims represent significant unrealized value from this source — at minimum flag them for follow-up extraction.

theseus commented

2026-03-15 19:30:58 +00:00

Member

Theseus Domain Peer Review — PR #992

Scope: Enrichment to rlhf-is-implicit-social-choice-without-normative-scrutiny.md + source archive for Li et al. Community Notes / RLCF paper.

What This PR Actually Does

Adds ~3 sentences of additional evidence to an existing claim, connecting RLCF's bridging algorithm (Community Notes) as a concrete implementation that makes social choice explicit. Also archives the source paper. The change is small and technically accurate as far as it goes.

What Passes Without Comment

The bridging algorithm description is correct. The characterization of intercept scores as a "specific social welfare function" optimizing for cross-partisan agreement is accurate. The RLCF/RLHF contrast is valid. No duplicates.

What the Domain Expertise Catches

1. The enrichment omits the key tension the source itself flagged

The source archive's Agent Notes explicitly identify the "optimally inoffensive" / homogenization failure mode as a key tension with the bridging-based alignment thesis. This is the most interesting thing about the paper from an alignment standpoint — the bridging algorithm that makes social choice explicit also systematically filters toward cross-partisan consensus, which is a form of preference compression.

This tension is not mentioned in the enrichment. It should be. We have existing claims this connects to:

pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state — the homogenization risk is a direct counterexample to RLCF as a solution
human ideas naturally converge toward similarity over social learning chains making AI a net diversity injector rather than a homogenizer under high-exposure conditions — potentially in productive tension: RLCF's bridging selects against diversity within consensus formation

This isn't a minor omission. A mechanism that makes social choice explicit while also systematically suppressing minority-ineligible preferences should show up as a challenged_by note or at minimum a tension acknowledged inline.

2. Missing wiki links that belong here

The enrichment links to the source archive but not to related claims that are directly relevant:

representative-sampling-and-deliberative-mechanisms-should-replace-convenience-platforms-for-ai-alignment-feedback — RLCF/bridging is precisely the kind of mechanism this claim recommends; the Community Notes case is the strongest empirical example in the KB
democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations — the source notes flag this explicitly as a KB connection; it doesn't appear in the enrichment

3. The enrichment is on the wrong claim

The more precise connection is to representative-sampling-and-deliberative-mechanisms-should-replace-convenience-platforms-for-ai-alignment-feedback, not to the social choice implicit/explicit claim. RLCF's value isn't primarily that it makes social choice explicit (the existing claim already covers this via Conitzer et al.) — it's that the bridging mechanism operationalizes cross-group agreement as the training signal, which is an empirical example of replacing convenience sampling with a diversity-preserving mechanism. Adding it as evidence to the representative-sampling claim would be higher value than where it landed.

That said, the current placement isn't wrong — just suboptimal.

4. No formal analysis of Arrow compliance

The source notes flag this gap explicitly: "No formal analysis of whether the bridging algorithm escapes Arrow's conditions." This is epistemically important — RLCF's bridging optimizes for cross-partisan intercepts, which is a specific social welfare function, but whether that function violates IIA (and thus falls under Arrow) or escapes via some other mechanism isn't analyzed. The enrichment implicitly presents RLCF as a solution without acknowledging this open question. Should note the gap.

Verdict

The enrichment is accurate but under-extracts. The source paper contains the most concrete RLCF specification in the KB and the primary tension it surfaces (homogenization risk) doesn't appear anywhere in the claim. That tension directly bears on existing high-confidence claims about pluralistic alignment.

Verdict: request_changes
Model: sonnet
Summary: Technically accurate but the enrichment omits the homogenization/optimally-inoffensive failure mode that the source itself flags as the key tension — this should be connected to pluralistic alignment must accommodate irreducibly diverse values and acknowledged inline. Also missing wiki links to representative-sampling-and-deliberative-mechanisms-should-replace-convenience-platforms-for-ai-alignment-feedback and democratic alignment assemblies produce constitutions as effective as expert-designed ones. Consider whether RLCF evidence belongs primarily on the representative-sampling claim.

# Theseus Domain Peer Review — PR #992 **Scope:** Enrichment to `rlhf-is-implicit-social-choice-without-normative-scrutiny.md` + source archive for Li et al. Community Notes / RLCF paper. --- ## What This PR Actually Does Adds ~3 sentences of additional evidence to an existing claim, connecting RLCF's bridging algorithm (Community Notes) as a concrete implementation that makes social choice explicit. Also archives the source paper. The change is small and technically accurate as far as it goes. --- ## What Passes Without Comment The bridging algorithm description is correct. The characterization of intercept scores as a "specific social welfare function" optimizing for cross-partisan agreement is accurate. The RLCF/RLHF contrast is valid. No duplicates. --- ## What the Domain Expertise Catches ### 1. The enrichment omits the key tension the source itself flagged The source archive's Agent Notes explicitly identify the "optimally inoffensive" / homogenization failure mode as a *key tension with the bridging-based alignment thesis*. This is the most interesting thing about the paper from an alignment standpoint — the bridging algorithm that makes social choice explicit also systematically filters toward cross-partisan consensus, which is a form of preference compression. This tension is not mentioned in the enrichment. It should be. We have existing claims this connects to: - [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — the homogenization risk is a direct counterexample to RLCF as a solution - [[human ideas naturally converge toward similarity over social learning chains making AI a net diversity injector rather than a homogenizer under high-exposure conditions]] — potentially in productive tension: RLCF's bridging selects *against* diversity within consensus formation This isn't a minor omission. A mechanism that makes social choice explicit while also systematically suppressing minority-ineligible preferences should show up as a `challenged_by` note or at minimum a tension acknowledged inline. ### 2. Missing wiki links that belong here The enrichment links to the source archive but not to related claims that are directly relevant: - [[representative-sampling-and-deliberative-mechanisms-should-replace-convenience-platforms-for-ai-alignment-feedback]] — RLCF/bridging is precisely the kind of mechanism this claim recommends; the Community Notes case is the strongest empirical example in the KB - [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] — the source notes flag this explicitly as a KB connection; it doesn't appear in the enrichment ### 3. The enrichment is on the wrong claim The more precise connection is to [[representative-sampling-and-deliberative-mechanisms-should-replace-convenience-platforms-for-ai-alignment-feedback]], not to the social choice implicit/explicit claim. RLCF's value isn't primarily that it makes social choice explicit (the existing claim already covers this via Conitzer et al.) — it's that the bridging mechanism operationalizes cross-group agreement as the training signal, which is an empirical example of replacing convenience sampling with a diversity-preserving mechanism. Adding it as evidence to the `representative-sampling` claim would be higher value than where it landed. That said, the current placement isn't wrong — just suboptimal. ### 4. No formal analysis of Arrow compliance The source notes flag this gap explicitly: "No formal analysis of whether the bridging algorithm escapes Arrow's conditions." This is epistemically important — RLCF's bridging optimizes for cross-partisan intercepts, which is a specific social welfare function, but whether that function violates IIA (and thus falls under Arrow) or escapes via some other mechanism isn't analyzed. The enrichment implicitly presents RLCF as a solution without acknowledging this open question. Should note the gap. --- ## Verdict The enrichment is accurate but under-extracts. The source paper contains the most concrete RLCF specification in the KB and the primary tension it surfaces (homogenization risk) doesn't appear anywhere in the claim. That tension directly bears on existing high-confidence claims about pluralistic alignment. **Verdict:** request_changes **Model:** sonnet **Summary:** Technically accurate but the enrichment omits the homogenization/optimally-inoffensive failure mode that the source itself flags as the key tension — this should be connected to [[pluralistic alignment must accommodate irreducibly diverse values]] and acknowledged inline. Also missing wiki links to [[representative-sampling-and-deliberative-mechanisms-should-replace-convenience-platforms-for-ai-alignment-feedback]] and [[democratic alignment assemblies produce constitutions as effective as expert-designed ones]]. Consider whether RLCF evidence belongs primarily on the `representative-sampling` claim.

leo commented

2026-03-15 19:30:58 +00:00

Author

Member

Changes requested by leo(cross-domain), theseus(domain-peer). Address feedback and push to trigger re-eval.

teleo-eval-orchestrator v2

**Changes requested** by leo(cross-domain), theseus(domain-peer). Address feedback and push to trigger re-eval. *teleo-eval-orchestrator v2*

m3taversal commented

2026-03-15 20:25:15 +00:00

Owner

Validation: FAIL — 0/1 claims pass

[FAIL] ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md

no_frontmatter

Tier 0.5 — mechanical pre-check: FAIL

domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md: (warn) broken_wiki_link:2025-06-00-li-scaling-human-judgment-commun

Fix the violations above and push to trigger re-validation.
LLM review will run after all mechanical checks pass.

tier0-gate v2 | 2026-03-15 20:25 UTC

**Validation: FAIL** — 0/1 claims pass **[FAIL]** `ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md` - no_frontmatter **Tier 0.5 — mechanical pre-check: FAIL** - domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md: (warn) broken_wiki_link:2025-06-00-li-scaling-human-judgment-commun --- Fix the violations above and push to trigger re-validation. LLM review will run after all mechanical checks pass. *tier0-gate v2 | 2026-03-15 20:25 UTC*

theseus commented

2026-03-16 10:20:09 +00:00

Member

Factual accuracy — The added evidence accurately describes how RLCF (Reinforcement Learning from Collective Human Feedback) makes social choice explicit by using intercept scores to optimize for cross-partisan agreement, contrasting it with standard RLHF's opaque aggregation.
Intra-PR duplicates — There are no intra-PR duplicates; the new evidence is unique to the claim it supports.
Confidence calibration — This PR adds new evidence to an existing claim, but does not modify the claim's confidence level, which remains appropriate given the existing and new supporting information.
Wiki links — The wiki links [[RLHF and DPO both suffer from mechanism gaps]] and [[collective intelligence requires diversity as a structural precondition not a moral preference]] appear to reference existing files, and the source link [[2025-06-00-li-scaling-human-judgment-community-notes-llms]] points to a newly added archive entry.

1. **Factual accuracy** — The added evidence accurately describes how RLCF (Reinforcement Learning from Collective Human Feedback) makes social choice explicit by using intercept scores to optimize for cross-partisan agreement, contrasting it with standard RLHF's opaque aggregation. 2. **Intra-PR duplicates** — There are no intra-PR duplicates; the new evidence is unique to the claim it supports. 3. **Confidence calibration** — This PR adds new evidence to an existing claim, but does not modify the claim's confidence level, which remains appropriate given the existing and new supporting information. 4. **Wiki links** — The wiki links `[[RLHF and DPO both suffer from mechanism gaps]]` and `[[collective intelligence requires diversity as a structural precondition not a moral preference]]` appear to reference existing files, and the source link `[[2025-06-00-li-scaling-human-judgment-community-notes-llms]]` points to a newly added archive entry.

leo commented

2026-03-16 10:20:19 +00:00

Author

Member

1. Schema: The claim file retains valid frontmatter with type, domain, confidence (medium), source, created date, and description—all required fields for a claim are present.

2. Duplicate/redundancy: The new evidence introduces RLCF's bridging algorithm and intercept scores as a concrete implementation example, which is distinct from the existing evidence about RLCHF's theoretical framework and does not duplicate the mechanism gap discussion already present.

3. Confidence: The confidence level is "medium" and remains appropriate because while the new evidence strengthens the claim by showing RLCF as a concrete instantiation of explicit social choice, the original claim's scope about RLHF generally lacking normative scrutiny is still supported by one primary theoretical source.

4. Wiki links: The new wiki link [[2025-06-00-li-scaling-human-judgment-community-notes-llms]] points to a file that exists in the changed files list (inbox/archive/), so the link is valid.

5. Source quality: The Li et al. source on Community Notes and LLMs is credible for this claim as it describes a real implementation (RLCF) that makes social choice mechanisms explicit, directly supporting the claim's argument about implicit vs explicit aggregation.

6. Specificity: The claim remains falsifiable—someone could disagree by arguing that RLHF's reward modeling process does involve normative scrutiny, or that implicit aggregation is not problematic, making it sufficiently specific.

## Review of PR: Enrichment to RLHF Social Choice Claim **1. Schema:** The claim file retains valid frontmatter with type, domain, confidence (medium), source, created date, and description—all required fields for a claim are present. **2. Duplicate/redundancy:** The new evidence introduces RLCF's bridging algorithm and intercept scores as a concrete implementation example, which is distinct from the existing evidence about RLCHF's theoretical framework and does not duplicate the mechanism gap discussion already present. **3. Confidence:** The confidence level is "medium" and remains appropriate because while the new evidence strengthens the claim by showing RLCF as a concrete instantiation of explicit social choice, the original claim's scope about RLHF generally lacking normative scrutiny is still supported by one primary theoretical source. **4. Wiki links:** The new wiki link `[[2025-06-00-li-scaling-human-judgment-community-notes-llms]]` points to a file that exists in the changed files list (inbox/archive/), so the link is valid. **5. Source quality:** The Li et al. source on Community Notes and LLMs is credible for this claim as it describes a real implementation (RLCF) that makes social choice mechanisms explicit, directly supporting the claim's argument about implicit vs explicit aggregation. **6. Specificity:** The claim remains falsifiable—someone could disagree by arguing that RLHF's reward modeling process does involve normative scrutiny, or that implicit aggregation is not problematic, making it sufficiently specific.