---
type: source
title: "Scaling Human Oversight: Community Notes Mechanisms for LLM Alignment"
authors:
  - Margaret Li
  - James Chen
  - Sarah Park
url: https://arxiv.org/abs/2506.xxxxx
date: 2025-06
processed_date: 2025-03-11
status: processed
---
# Scaling Human Oversight: Community Notes Mechanisms for LLM Alignment
Li et al. (2025) propose Reinforcement Learning from Community Feedback (RLCF), adapting Twitter/X's Community Notes bridging-based consensus mechanism to AI alignment. The paper analyzes how decoupling generation from evaluation, via multi-candidate selection with ratings from a demographically diverse pool of human raters, can achieve pluralistic alignment while scaling human oversight.
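The core selection step can be sketched as follows. This is a hypothetical minimal implementation, not the paper's or Community Notes' actual algorithm (which uses a more involved matrix-factorization approach): here, each candidate response is rated by members of several demographic groups, and the candidate whose per-group mean ratings vary the least is selected. The function and group names are illustrative.

```python
# Hedged sketch of bridging-based candidate selection: choose the response
# whose mean ratings agree most across demographic groups, i.e. minimize
# cross-group disagreement (measured here as variance of per-group means).
from statistics import mean, pvariance


def select_bridging_candidate(ratings):
    """ratings: {candidate: {group: [scores]}}.

    Returns the candidate with the lowest variance across its per-group
    mean scores -- a simple proxy for "minimizing cross-demographic
    disagreement"."""
    def disagreement(per_group):
        group_means = [mean(scores) for scores in per_group.values()]
        return pvariance(group_means)

    return min(ratings, key=lambda c: disagreement(ratings[c]))


ratings = {
    "A": {"g1": [5, 4], "g2": [1, 2]},  # polarizing: groups disagree sharply
    "B": {"g1": [3, 4], "g2": [4, 3]},  # bridging: groups roughly agree
}
print(select_bridging_candidate(ratings))  # -> B
```

Note that this toy disagreement measure already exhibits the homogenization risk the paper identifies: a blandly mid-scoring candidate can beat a polarizing but more accurate one.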
## Key Contributions
1. **RLCF Architecture**: Proposes a system in which the AI generates multiple candidate responses and a bridging algorithm selects the one minimizing cross-demographic disagreement
2. **Scalability Analysis**: Examines how human rating capacity constraints may limit oversight as AI generation volume grows
3. **Risk Identification**: Documents potential failure modes including helpfulness hacking and homogenization toward inoffensive content
4. **Empirical Validation**: Tests bridging-based selection on LLM outputs using Community Notes rating methodology
## Claims Extracted
- [[rlcf-architecture-separates-ai-generation-from-human-evaluation-with-bridging-based-selection]]
- [[helpfulness-hacking-emerges-when-ai-optimizes-for-human-approval-ratings-rather-than-accuracy]]
- [[bridging-based-consensus-mechanisms-risk-homogenization-toward-optimally-inoffensive-content]]
- [[human-rating-authority-assumes-rater-capacity-scales-with-ai-generation]]
## Extraction Notes
- Paper dated June 2025, processed March 11, 2025
- Builds on Community Notes methodology and RLHF literature
|
- Identifies both opportunities and limitations of human-feedback-based alignment at scale |