---
type: source
title: "Scaling Human Oversight: Community Notes Mechanisms for LLM Alignment"
authors:
  - Margaret Li
  - James Chen
  - Sarah Park
url: https://arxiv.org/abs/2506.xxxxx
date: 2025-06
processed_date: 2025-03-11
status: processed
---

# Scaling Human Oversight: Community Notes Mechanisms for LLM Alignment

Li et al. (2025) propose Reinforcement Learning from Community Feedback (RLCF), adapting the bridging-based consensus mechanism behind Twitter/X's Community Notes to AI alignment. The paper analyzes how decoupling generation from evaluation, with the AI producing multiple candidate responses and demographically diverse human raters selecting among them, can achieve pluralistic alignment while scaling human oversight.

## Key Contributions

1. **RLCF Architecture**: Proposes a system in which the AI generates multiple candidates and a bridging algorithm selects the response that minimizes cross-demographic disagreement (see the sketch at the end of this note)
2. **Scalability Analysis**: Examines how constraints on human rating capacity may limit oversight as AI generation volume grows
3. **Risk Identification**: Documents potential failure modes, including helpfulness hacking and homogenization toward inoffensive content
4. **Empirical Validation**: Tests bridging-based selection on LLM outputs using the Community Notes rating methodology

## Claims Extracted

- [[rlcf-architecture-separates-ai-generation-from-human-evaluation-with-bridging-based-selection]]
- [[helpfulness-hacking-emerges-when-ai-optimizes-for-human-approval-ratings-rather-than-accuracy]]
- [[bridging-based-consensus-mechanisms-risk-homogenization-toward-optimally-inoffensive-content]]
- [[human-rating-authority-assumes-rater-capacity-scales-with-ai-generation]]

## Extraction Notes

- Paper dated June 2025, processed March 11, 2025
- Builds on the Community Notes methodology and RLHF literature
- Identifies both opportunities and limitations of human-feedback-based alignment at scale
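
## Bridging Selection (Illustrative Sketch)

The note above describes selection as "minimizing cross-demographic disagreement" without giving the algorithm, so the following Python sketch is an assumption-laden illustration, not the authors' method. It scores each candidate by mean approval minus a penalty on the spread of per-group mean ratings; the function names, the scoring rule, and the `disagreement_weight` parameter are all hypothetical. The production Community Notes algorithm instead factorizes the rater-item matrix so an item's intercept captures approval not explained by viewpoint factors; the group-mean penalty below is a much simpler stand-in.

```python
# Toy bridging-based selection: NOT the paper's implementation.
# Assumes each candidate response has been rated by humans from
# several demographic groups, with ratings in [0, 1].
import numpy as np

def bridging_score(group_ratings: dict[str, list[float]],
                   disagreement_weight: float = 1.0) -> float:
    """Mean approval minus a penalty for cross-group disagreement.

    group_ratings maps a demographic group label to that group's
    ratings for one candidate. The rule and weight are assumptions
    made for illustration.
    """
    group_means = np.array([np.mean(r) for r in group_ratings.values()])
    approval = group_means.mean()      # average approval across groups
    disagreement = group_means.std()   # spread of per-group means
    return approval - disagreement_weight * disagreement

def select_candidate(candidates: list[str],
                     ratings: list[dict[str, list[float]]]) -> str:
    """Return the candidate whose ratings best bridge the groups."""
    scores = [bridging_score(g) for g in ratings]
    return candidates[int(np.argmax(scores))]

# Candidate A polarizes the groups; candidate B earns moderate but
# consistent approval, so the bridging score prefers B.
candidates = ["response A", "response B"]
ratings = [
    {"group_1": [0.9, 1.0, 0.8], "group_2": [0.2, 0.1, 0.3]},
    {"group_1": [0.7, 0.6, 0.7], "group_2": [0.6, 0.7, 0.6]},
]
print(select_candidate(candidates, ratings))  # -> "response B"
```

Even this toy version makes the paper's homogenization risk concrete: a candidate that says little but offends no group can outscore a substantive candidate that splits the groups.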