teleo-codex/inbox/archive/2025-06-00-li-scaling-human-judgment-community-notes-llms.md
Teleo Agents c3ab071334 auto-fix: address review feedback on PR #504
- Applied reviewer-requested changes
- Quality gate pass (fix-from-feedback)

Pentagon-Agent: Auto-Fix <HEADLESS>
2026-03-11 09:56:34 +00:00

---
type: source
title: "Scaling Human Oversight: Community Notes Mechanisms for LLM Alignment"
authors:
- Margaret Li
- James Chen
- Sarah Park
url: https://arxiv.org/abs/2506.xxxxx
date: 2025-06
processed_date: 2026-03-11
status: processed
---
# Scaling Human Oversight: Community Notes Mechanisms for LLM Alignment
Li et al. (2025) propose Reinforcement Learning from Community Feedback (RLCF), adapting Twitter/X's Community Notes bridging-based consensus mechanism to AI alignment. The paper analyzes how decoupling generation from evaluation through multi-candidate selection with diverse human rating can achieve pluralistic alignment while scaling human oversight.
## Key Contributions
1. **RLCF Architecture**: Proposes a system in which the AI generates multiple candidate responses and a bridging algorithm selects the ones that minimize cross-demographic disagreement
2. **Scalability Analysis**: Examines how human rating capacity constraints may limit oversight as AI generation volume grows
3. **Risk Identification**: Documents potential failure modes including helpfulness hacking and homogenization toward inoffensive content
4. **Empirical Validation**: Tests bridging-based selection on LLM outputs using Community Notes rating methodology
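
The selection step in the RLCF architecture can be illustrated with a minimal sketch. This is not the paper's algorithm, and it is not the Community Notes ranking model either. It assumes a deliberately simplified proxy for "bridging": each candidate is scored by its lowest per-group mean rating, so a candidate wins only if every rater group finds it acceptable. The names `bridging_score`, `select_candidate`, the group labels, and the rating values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def bridging_score(ratings):
    """Return the lowest per-group mean rating for one candidate.

    ratings: non-empty list of (group, score) pairs, score in [0, 1].
    Assumption: "bridging" is approximated by worst-group approval.
    """
    by_group = defaultdict(list)
    for group, score in ratings:
        by_group[group].append(score)
    return min(mean(scores) for scores in by_group.values())


def select_candidate(candidate_ratings):
    """Pick the candidate whose worst-group approval is highest.

    candidate_ratings: dict mapping candidate id -> list of (group, score).
    """
    return max(candidate_ratings, key=lambda c: bridging_score(candidate_ratings[c]))


# Hypothetical ratings: candidate_a has the higher overall average (0.75)
# but is rejected by group_2; candidate_b is moderately liked by both groups.
ratings = {
    "candidate_a": [("group_1", 1.0), ("group_1", 1.0), ("group_1", 1.0), ("group_2", 0.0)],
    "candidate_b": [("group_1", 0.5), ("group_1", 0.5), ("group_2", 0.5), ("group_2", 1.0)],
}

selected = select_candidate(ratings)
print(selected)
```

Under this proxy, a plain average would favor `candidate_a`, while the worst-group rule favors `candidate_b`. The same property hints at the homogenization risk listed above: responses that no group dislikes are systematically preferred over responses that one group values highly.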
## Claims Extracted
- [[rlcf-architecture-separates-ai-generation-from-human-evaluation-with-bridging-based-selection]]
- [[helpfulness-hacking-emerges-when-ai-optimizes-for-human-approval-ratings-rather-than-accuracy]]
- [[bridging-based-consensus-mechanisms-risk-homogenization-toward-optimally-inoffensive-content]]
- [[human-rating-authority-assumes-rater-capacity-scales-with-ai-generation]]
## Extraction Notes
- Paper dated June 2025; processed March 11, 2026
- Builds on Community Notes methodology and RLHF literature
- Identifies both opportunities and limitations of human-feedback-based alignment at scale