| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| source | Scaling Human Judgment in Community Notes with LLMs | Haiwen Li et al. | https://arxiv.org/abs/2506.24118 | 2025-06-30 | ai-alignment | | paper | unprocessed | high | |
## Content
Proposes a hybrid model for Community Notes where both humans and LLMs write notes, but humans alone rate them. This is the closest existing specification of RLCF (Reinforcement Learning from Community Feedback).
Architecture:
- LLMs automate: post selection (identifying misleading content), research, evidence synthesis, note composition
- Humans retain: rating authority, determining what's "helpful enough to show"
- Notes must receive support from raters with diverse viewpoints to surface (bridging mechanism)
RLCF Training Signal:
- Train reward models to predict how diverse user types would rate notes
- Use predicted intercept scores (the bridging component) as training signal
- Balances optimization with diversity by rewarding stylistic novelty alongside predicted helpfulness
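The training signal above can be sketched as a reward function: predicted bridging intercept plus a novelty term. This is a minimal illustration of the idea, not the paper's implementation; the function name, the embedding-distance notion of novelty, and the toy predictor are all assumptions.

```python
import numpy as np

def rlcf_reward(note_embedding, predict_intercept, prior_embeddings,
                novelty_weight=0.1):
    """Sketch of the RLCF training signal: predicted bridging intercept
    (helpfulness across diverse rater types) plus a novelty bonus that
    pushes back against collapse into one "optimally inoffensive" style.
    All names here are illustrative, not taken from the paper."""
    helpfulness = predict_intercept(note_embedding)
    if not prior_embeddings:
        return helpfulness
    # Novelty = distance to the nearest previously generated note.
    novelty = min(float(np.linalg.norm(note_embedding - e))
                  for e in prior_embeddings)
    return helpfulness + novelty_weight * novelty

# Toy demo: hold predicted helpfulness fixed at zero, vary only style.
prior = [np.zeros(2)]                      # one previously generated note
flat_intercept = lambda e: 0.0             # toy predictor (assumption)
dup_reward = rlcf_reward(np.zeros(2), flat_intercept, prior)
novel_reward = rlcf_reward(np.array([1.0, -1.0]), flat_intercept, prior)
```

With equal predicted helpfulness, the stylistically distinct note earns strictly more reward than a restatement of an earlier note, which is the mechanism the paper proposes for balancing optimization against homogenization.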
Bridging Algorithm:
- Matrix factorization: y_ij ≈ w_i · x_j + b_i + c_j, where the note intercept c_j is the bridging score
- Predicts ratings based on user factors, note factors, and intercepts
- Intercept captures what people with opposing views agree on
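The factorization above can be sketched as a small gradient-descent fit. This is a minimal illustration under the note's notation, not the production Community Notes scorer; the function name, hyperparameters, and toy ratings are assumptions.

```python
import numpy as np

def fit_bridging_mf(ratings, n_users, n_notes, k=1, lr=0.05, reg=0.02,
                    epochs=300, seed=0):
    """Fit y_ij ≈ w_i · x_j + b_i + c_j by stochastic gradient descent.

    ratings: iterable of (user i, note j, rating y) triples.
    Returns user factors w, note factors x, user intercepts b,
    note intercepts c. The note intercept c_j is the bridging score:
    the rating component explained by the note itself rather than by
    user-note viewpoint agreement (captured by w_i · x_j)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=(n_users, k))
    x = rng.normal(scale=0.1, size=(n_notes, k))
    b = np.zeros(n_users)
    c = np.zeros(n_notes)
    for _ in range(epochs):
        for i, j, y in ratings:
            err = y - (w[i] @ x[j] + b[i] + c[j])
            w[i] += lr * (err * x[j] - reg * w[i])
            x[j] += lr * (err * w[i] - reg * x[j])
            b[i] += lr * (err - reg * b[i])
            c[j] += lr * (err - reg * c[j])
    return w, x, b, c

# Toy data: users 0-1 and 2-3 hold opposing views. Note 0 is rated
# helpful by everyone (bridging); note 1 only by one side (partisan).
ratings = [(0, 0, 1), (1, 0, 1), (2, 0, 1), (3, 0, 1),
           (0, 1, 1), (1, 1, 1), (2, 1, 0), (3, 1, 0)]
w, x, b, c = fit_bridging_mf(ratings, n_users=4, n_notes=2)
```

On the toy data the cross-viewpoint note receives a higher intercept than the partisan note, because regularization pushes the partisan split into the w_i · x_j interaction term rather than into c_j.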
Key Risks:
- "Helpfulness hacking" — LLMs crafting persuasive but inaccurate notes
- Human contributor engagement declining with AI-generated content
- Homogenization toward "optimally inoffensive" styles
- Rater capacity overwhelmed by LLM volume
Published in: Journal of Online Trust and Safety
## Agent Notes
Why this matters: This is the most concrete RLCF specification that exists. It bridges Audrey Tang's philosophical framework with an implementable mechanism. The key insight: RLCF is not just a reward signal; it is an architecture in which AI generates and humans evaluate, with a bridging algorithm ensuring pluralistic selection.
What surprised me: The "helpfulness hacking" and "optimally inoffensive" risks are exactly what Arrow's theorem predicts. The paper acknowledges these risks but does not connect them to Arrow formally.
What I expected but didn't find: No formal analysis of whether the bridging algorithm escapes Arrow's conditions; no comparison with PAL or other pluralistic mechanisms; no empirical results beyond the Community Notes deployment.
KB connections: Directly addresses the RLCF specification gap flagged in previous sessions. Connects to "democratic alignment assemblies produce constitutions as effective as expert-designed ones" and "community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules".
Extraction hints: Extract claims about (1) the RLCF architecture (AI generates, humans rate, bridging selects), (2) the homogenization risk of bridging-based consensus, and (3) human rating authority as an alignment mechanism.
Context: Core paper for the RLCF research thread. Fills the "technical specification" gap identified in sessions 2 and 3.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: "democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations"
WHY ARCHIVED: First concrete specification of RLCF; transitions from design principle to implementable mechanism
EXTRACTION HINT: Focus on the architecture (who generates, who rates, what selects) and the homogenization risk; the "optimally inoffensive" failure mode is a key tension with our bridging-based alignment thesis