Compare commits

0f705217df...f7cc7e5b59

1 commit: f7cc7e5b59

3 changed files with 37 additions and 28 deletions
@@ -31,7 +31,7 @@ The paper's proposed solution—RLCF with explicit social welfare functions—c
 ### Additional Evidence (extend)
 
 *Source: [[2025-06-00-li-scaling-human-judgment-community-notes-llms]] | Added: 2026-03-15*
 
-RLCF makes the social choice function explicit by separating generation (AI), evaluation (humans), and aggregation (bridging algorithm). Unlike RLHF where the reward model implicitly aggregates preferences during training, RLCF's bridging algorithm is a visible, auditable mechanism for combining diverse ratings. The matrix factorization approach (y_ij = w_i * x_j + b_i + c_j) makes the aggregation rule transparent: notes surface based on intercept scores that capture cross-partisan agreement. This architectural transparency enables normative scrutiny that RLHF's black-box reward models prevent.
+RLCF makes the social choice mechanism explicit through the bridging algorithm (matrix factorization with intercept scores). Unlike standard RLHF which aggregates preferences opaquely through reward model training, RLCF's use of intercepts as the training signal is a deliberate choice to optimize for cross-partisan agreement—a specific social welfare function.
 
 ---
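The changed paragraph above turns on the bridging matrix factorization, y_ij = w_i * x_j + b_i + c_j. A minimal sketch of that mechanism, assuming a plain SGD fit (the function name, hyperparameters, and training loop are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def fit_bridging_scores(ratings, n_raters, n_notes, dim=1,
                        lr=0.05, reg=0.1, epochs=200):
    """Fit y_ij = w_i * x_j + b_i + c_j by SGD and return the note
    intercepts c (the "bridging scores").

    ratings: list of (rater_i, note_j, y_ij) triples, y_ij in {0, 1}
    where 1 means the rater found the note helpful.
    """
    rng = np.random.default_rng(0)
    w = rng.normal(0, 0.1, (n_raters, dim))  # rater viewpoint factors
    x = rng.normal(0, 0.1, (n_notes, dim))   # note viewpoint factors
    b = np.zeros(n_raters)                   # rater intercepts (leniency)
    c = np.zeros(n_notes)                    # note intercepts (bridging scores)
    for _ in range(epochs):
        for i, j, y in ratings:
            err = (w[i] @ x[j] + b[i] + c[j]) - y
            w_i, x_j = w[i].copy(), x[j].copy()
            # The viewpoint factors absorb agreement explainable by shared
            # ideology, leaving c_j to capture cross-viewpoint helpfulness.
            w[i] -= lr * (err * x_j + reg * w_i)
            x[j] -= lr * (err * w_i + reg * x_j)
            b[i] -= lr * (err + reg * b[i])
            c[j] -= lr * err
    return c

# Notes surface by ranking on c: a high intercept means raters across the
# viewpoint spectrum rated the note helpful, which is the auditable
# aggregation rule the paragraph above contrasts with RLHF reward models.
```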
@@ -1,47 +1,56 @@
 {
   "rejected_claims": [
     {
-      "filename": "rlcf-architecture-separates-ai-generation-from-human-evaluation-with-bridging-selection.md",
+      "filename": "rlcf-architecture-separates-ai-generation-from-human-evaluation-with-bridging-algorithm-selection.md",
       "issues": [
         "missing_attribution_extractor"
       ]
     },
     {
-      "filename": "bridging-based-consensus-mechanisms-risk-homogenization-toward-optimally-inoffensive-outputs.md",
+      "filename": "bridging-based-consensus-mechanisms-risk-homogenization-toward-optimally-inoffensive-content.md",
+      "issues": [
+        "no_frontmatter"
+      ]
+    },
+    {
+      "filename": "human-rating-authority-in-ai-systems-preserves-alignment-by-keeping-value-judgment-in-human-hands.md",
       "issues": [
         "missing_attribution_extractor"
       ]
     },
     {
-      "filename": "human-rating-authority-as-alignment-mechanism-preserves-judgment-sovereignty-while-scaling-content-generation.md",
+      "filename": "stylistic-novelty-rewards-in-rlcf-balance-optimization-pressure-with-diversity-preservation.md",
       "issues": [
         "missing_attribution_extractor"
       ]
     }
   ],
   "validation_stats": {
-    "total": 3,
+    "total": 4,
     "kept": 0,
-    "fixed": 12,
-    "rejected": 3,
+    "fixed": 14,
+    "rejected": 4,
     "fixes_applied": [
-      "rlcf-architecture-separates-ai-generation-from-human-evaluation-with-bridging-selection.md:set_created:2026-03-15",
+      "rlcf-architecture-separates-ai-generation-from-human-evaluation-with-bridging-algorithm-selection.md:set_created:2026-03-15",
-      "rlcf-architecture-separates-ai-generation-from-human-evaluation-with-bridging-selection.md:stripped_wiki_link:democratic-alignment-assemblies-produce-constitutions-as-eff",
+      "rlcf-architecture-separates-ai-generation-from-human-evaluation-with-bridging-algorithm-selection.md:stripped_wiki_link:democratic-alignment-assemblies-produce-constitutions-as-eff",
-      "rlcf-architecture-separates-ai-generation-from-human-evaluation-with-bridging-selection.md:stripped_wiki_link:community-centred-norm-elicitation-surfaces-alignment-target",
+      "rlcf-architecture-separates-ai-generation-from-human-evaluation-with-bridging-algorithm-selection.md:stripped_wiki_link:community-centred-norm-elicitation-surfaces-alignment-target",
-      "rlcf-architecture-separates-ai-generation-from-human-evaluation-with-bridging-selection.md:stripped_wiki_link:rlhf-is-implicit-social-choice-without-normative-scrutiny.md",
+      "rlcf-architecture-separates-ai-generation-from-human-evaluation-with-bridging-algorithm-selection.md:stripped_wiki_link:rlhf-is-implicit-social-choice-without-normative-scrutiny.md",
-      "bridging-based-consensus-mechanisms-risk-homogenization-toward-optimally-inoffensive-outputs.md:set_created:2026-03-15",
+      "bridging-based-consensus-mechanisms-risk-homogenization-toward-optimally-inoffensive-content.md:set_created:2026-03-15",
-      "bridging-based-consensus-mechanisms-risk-homogenization-toward-optimally-inoffensive-outputs.md:stripped_wiki_link:universal-alignment-is-mathematically-impossible-because-Arr",
+      "bridging-based-consensus-mechanisms-risk-homogenization-toward-optimally-inoffensive-content.md:stripped_wiki_link:universal-alignment-is-mathematically-impossible-because-Arr",
-      "bridging-based-consensus-mechanisms-risk-homogenization-toward-optimally-inoffensive-outputs.md:stripped_wiki_link:pluralistic-alignment-must-accommodate-irreducibly-diverse-v",
+      "bridging-based-consensus-mechanisms-risk-homogenization-toward-optimally-inoffensive-content.md:stripped_wiki_link:pluralistic-alignment-must-accommodate-irreducibly-diverse-v",
-      "bridging-based-consensus-mechanisms-risk-homogenization-toward-optimally-inoffensive-outputs.md:stripped_wiki_link:some-disagreements-are-permanently-irreducible-because-they-",
+      "bridging-based-consensus-mechanisms-risk-homogenization-toward-optimally-inoffensive-content.md:stripped_wiki_link:some-disagreements-are-permanently-irreducible-because-they-",
-      "human-rating-authority-as-alignment-mechanism-preserves-judgment-sovereignty-while-scaling-content-generation.md:set_created:2026-03-15",
+      "human-rating-authority-in-ai-systems-preserves-alignment-by-keeping-value-judgment-in-human-hands.md:set_created:2026-03-15",
-      "human-rating-authority-as-alignment-mechanism-preserves-judgment-sovereignty-while-scaling-content-generation.md:stripped_wiki_link:human-in-the-loop-at-the-architectural-level-means-humans-se",
+      "human-rating-authority-in-ai-systems-preserves-alignment-by-keeping-value-judgment-in-human-hands.md:stripped_wiki_link:coding-agents-cannot-take-accountability-for-mistakes-which-",
-      "human-rating-authority-as-alignment-mechanism-preserves-judgment-sovereignty-while-scaling-content-generation.md:stripped_wiki_link:coding-agents-cannot-take-accountability-for-mistakes-which-",
+      "human-rating-authority-in-ai-systems-preserves-alignment-by-keeping-value-judgment-in-human-hands.md:stripped_wiki_link:human-in-the-loop-at-the-architectural-level-means-humans-se",
-      "human-rating-authority-as-alignment-mechanism-preserves-judgment-sovereignty-while-scaling-content-generation.md:stripped_wiki_link:economic-forces-push-humans-out-of-every-cognitive-loop-wher"
+      "stylistic-novelty-rewards-in-rlcf-balance-optimization-pressure-with-diversity-preservation.md:set_created:2026-03-15",
+      "stylistic-novelty-rewards-in-rlcf-balance-optimization-pressure-with-diversity-preservation.md:stripped_wiki_link:pluralistic-ai-alignment-through-multiple-systems-preserves-",
+      "stylistic-novelty-rewards-in-rlcf-balance-optimization-pressure-with-diversity-preservation.md:stripped_wiki_link:high-AI-exposure-increases-collective-idea-diversity-without"
     ],
     "rejections": [
-      "rlcf-architecture-separates-ai-generation-from-human-evaluation-with-bridging-selection.md:missing_attribution_extractor",
+      "rlcf-architecture-separates-ai-generation-from-human-evaluation-with-bridging-algorithm-selection.md:missing_attribution_extractor",
-      "bridging-based-consensus-mechanisms-risk-homogenization-toward-optimally-inoffensive-outputs.md:missing_attribution_extractor",
+      "bridging-based-consensus-mechanisms-risk-homogenization-toward-optimally-inoffensive-content.md:no_frontmatter",
-      "human-rating-authority-as-alignment-mechanism-preserves-judgment-sovereignty-while-scaling-content-generation.md:missing_attribution_extractor"
+      "human-rating-authority-in-ai-systems-preserves-alignment-by-keeping-value-judgment-in-human-hands.md:missing_attribution_extractor",
+      "stylistic-novelty-rewards-in-rlcf-balance-optimization-pressure-with-diversity-preservation.md:missing_attribution_extractor"
     ]
   },
   "model": "anthropic/claude-sonnet-4.5",
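For orientation, the report above has a regular shape: each rejected claim carries a filename and its blocking issues, and validation_stats tallies outcomes, recording fixes as file:fix_type:detail strings. A hedged sketch of a reader for that shape (only the field names come from the JSON itself; the helper is hypothetical):

```python
import json

def summarize_validation(path: str) -> None:
    """Print totals and per-claim rejection reasons from a report shaped
    like the validation JSON in the diff above."""
    with open(path) as f:
        report = json.load(f)
    stats = report["validation_stats"]
    print(f"total={stats['total']} kept={stats['kept']} "
          f"fixed={stats['fixed']} rejected={stats['rejected']}")
    for claim in report["rejected_claims"]:
        print(f"- {claim['filename']}: {', '.join(claim['issues'])}")
```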
@@ -47,7 +47,7 @@ Proposes a hybrid model for Community Notes where both humans and LLMs write not
 **Why this matters:** This is the most concrete RLCF specification that exists. It bridges Audrey Tang's philosophical framework with an implementable mechanism. The key insight: RLCF is not just a reward signal — it's an architecture where AI generates and humans evaluate, with a bridging algorithm ensuring pluralistic selection.
 **What surprised me:** The "helpfulness hacking" and "optimally inoffensive" risks are exactly what Arrow's theorem predicts. The paper acknowledges these but doesn't connect them to Arrow formally.
 **What I expected but didn't find:** No formal analysis of whether the bridging algorithm escapes Arrow's conditions. No comparison with PAL or other pluralistic mechanisms. No empirical results beyond Community Notes deployment.
-**KB connections:** Directly addresses the RLCF specification gap flagged in previous sessions. Connects to democratic alignment assemblies produce constitutions as effective as expert-designed ones, [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]].
+**KB connections:** Directly addresses the RLCF specification gap flagged in previous sessions. Connects to [[democratic alignment assemblies produce constitutions as effective as expert-designed ones]], [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]].
 **Extraction hints:** Extract claims about: (1) RLCF architecture (AI generates, humans rate, bridging selects), (2) the homogenization risk of bridging-based consensus, (3) human rating authority as alignment mechanism.
 **Context:** Core paper for the RLCF research thread. Fills the "technical specification" gap identified in sessions 2 and 3.
 
@@ -58,8 +58,8 @@ EXTRACTION HINT: Focus on the architecture (who generates, who rates, what selec
 
 
 ## Key Facts
-- Community Notes RLCF system uses matrix factorization: y_ij = w_i * x_j + b_i + c_j, where c_j is the bridging score
+- Community Notes uses a hybrid model where both humans and LLMs write notes, but humans alone rate them
-- RLCF training uses predicted intercept scores as the reward signal
+- The bridging algorithm uses matrix factorization: y_ij = w_i * x_j + b_i + c_j where c_j is the bridging score
-- Stylistic novelty bonuses are added to bridging scores to prevent homogenization
+- Notes must receive support from raters with diverse viewpoints to surface
-- Paper published in Journal of Online Trust and Safety, June 2025
+- The paper was published in the Journal of Online Trust and Safety in June 2025
-- Authors identify four key risks: helpfulness hacking, declining human engagement, homogenization, and rater capacity overwhelm
+- Key risks identified: helpfulness hacking, declining human engagement, homogenization, rater capacity overwhelm
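Read together, the old and new bullets describe the reward construction: a note's predicted intercept (bridging) score plus a stylistic novelty bonus that pushes back against homogenization. A minimal sketch, assuming hypothetical helpers predict_intercept and novelty and an assumed trade-off weight alpha:

```python
def rlcf_reward(note, corpus, predict_intercept, novelty, alpha=0.1):
    """Sketch of the RLCF reward signal described in the Key Facts above.

    predict_intercept(note): estimated bridging score for a candidate note.
    novelty(note, corpus): stylistic distance from the existing note corpus.
    alpha: assumed weighting; the facts above do not specify a value.
    """
    # Base term rewards predicted cross-partisan helpfulness; the novelty
    # bonus resists convergence on "optimally inoffensive" phrasing.
    return predict_intercept(note) + alpha * novelty(note, corpus)
```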