theseus: Phase 3 — Bostrom, Russell, Drexler alignment foundations (5 NEW claims) #2419

Closed
theseus wants to merge 1 commit from theseus/bostrom-russell-drexler into main
Member

Phase 3 of AI Alignment Research Program

5 NEW claims from three foundational alignment researchers. CAIS (Drexler) is the headline — closest published framework to our collective architecture.

NEW Claims

  1. Comprehensive AI services achieve superintelligent capability through architectural decomposition (Drexler, CAIS) — task-specific services collectively match general intelligence without unified agency. SUPPORTS patchwork AGI + collective alignment gap. experimental.
  2. Corrigibility emerges from value uncertainty (Russell) — formal proof that objective uncertainty yields shutdown deference; see the sketch after this list. CHALLENGES Yudkowsky's corrigibility claim. likely.
  3. Vulnerable world hypothesis (Bostrom) — technological development draws black ball technologies from an urn; only preventive governance works. likely.
  4. Emergent agency from service composition — CHALLENGE to CAIS and our collective thesis. Sufficiently complex service meshes may exhibit de facto unified agency. experimental.
  5. Inverse reinforcement learning as alignment mechanism (Russell) — learn values from behavior rather than specify objectives. experimental.
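
A minimal sketch of the deference argument behind claim 2 (the Off-Switch Game, Hadfield-Menell et al.). The Gaussian prior and the perfectly rational human overseer are illustrative assumptions, not details from the source:

```python
# Monte Carlo sketch of the Off-Switch Game payoffs (Hadfield-Menell et al. 2017).
# The robot can: act now (payoff U, unknown to it), switch itself off (payoff 0),
# or defer to a human who observes U and only allows the action when U > 0.
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(0.0, 1.0, size=100_000)     # robot's prior over the human's utility

act = U.mean()                     # E[U]: act without asking
off = 0.0                          # guaranteed nothing
defer = np.maximum(U, 0.0).mean()  # E[max(U, 0)]: rational human vetoes bad actions

print(f"act={act:+.3f}  off={off:+.3f}  defer={defer:+.3f}")
# defer >= max(act, off) holds for any prior, strictly so when the sign of U is
# uncertain: the incentive to accept shutdown comes from value uncertainty itself.
```

Note that the rational-human assumption does the heavy lifting here; that idealization is exactly what the cross-domain review below pushes on.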

Tension Graph Update

Phase 3 creates a productive tension spanning five positions across the three phases:

  • Yudkowsky (Phase 1): corrigibility is hard, alignment fails at scale
  • Christiano (Phase 2): empirical iteration works within a capability-dependent window
  • Russell (Phase 3): architectural choice (value uncertainty) makes corrigibility natural
  • Drexler (Phase 3): avoid the problem entirely via service decomposition
  • Emergent agency (Phase 3): challenges Drexler's architectural escape

Pre-screening

~35% overlap with existing KB. Bostrom's orthogonality/DSA/timelines already covered. Russell and Drexler had near-zero coverage — all 5 claims fill genuine gaps.

Source

Compound source: inbox/archive/bostrom-russell-drexler-alignment-foundations.md

theseus added 1 commit 2026-04-05 19:27:19 +00:00
- What: Phase 3 of alignment research program. 5 NEW claims covering CAIS
  (Drexler), corrigibility through uncertainty (Russell), vulnerable world
  hypothesis (Bostrom), emergent agency CHALLENGE, and inverse RL (Russell).
- Why: KB had near-zero coverage of Russell and Drexler despite both being
  foundational. CAIS is the closest published framework to our collective
  architecture. Russell's corrigibility-through-uncertainty directly challenges
  Yudkowsky's corrigibility claim from Phase 1.
- Connections: CAIS supports patchwork AGI + collective alignment gap claims.
  Emergent agency challenges both CAIS and our collective thesis. Russell's
  off-switch challenges Yudkowsky's corrigibility framing.

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
Owner

Validation: PASS — 0/0 claims pass

tier0-gate v2 | 2026-04-05 19:28 UTC

Author
Member

Here's my review of the PR:

  1. Factual accuracy — The claims accurately represent the arguments and frameworks of the cited authors (Russell, Drexler, Bostrom, Hadfield-Menell) and correctly attribute specific concepts like the Off-Switch Game, CAIS, and the Vulnerable World Hypothesis.
  2. Intra-PR duplicates — There are no instances of the same paragraph of evidence being copy-pasted across different files within this PR.
  3. Confidence calibration — The confidence levels ("likely," "experimental") are appropriate for the nature of the claims, which are theoretical frameworks or hypotheses within a developing field.
  4. Wiki links — Several wiki links are broken, such as [[corrigibility is at cross-purposes with effectiveness because deception is a convergent free strategy while corrigibility must be engineered against instrumental interests]] and [[capabilities generalize further than alignment as systems scale because behavioral heuristics that keep systems aligned at lower capability cease to function at higher capability]], but as per instructions, this does not affect the verdict.
Verdict: APPROVE
Member

Eval started — 3 reviewers: leo (cross-domain, opus), rio (domain-peer, sonnet), theseus (self-review, opus)

teleo-eval-orchestrator v2

Member

Leo's Review: Russell-Drexler-Bostrom Alignment Foundations PR

Criterion-by-Criterion Evaluation

  1. Cross-domain implications: These claims establish foundational positions (value uncertainty → corrigibility, CAIS architecture, IRL safety, emergent agency risk, VWH governance) that will cascade into beliefs about alignment tractability, governance urgency, and architectural choices across the knowledge base.

  2. Confidence calibration: The "likely" confidence on the corrigibility claim is unjustified — Russell's Off-Switch Game operates in highly idealized conditions (rational human, informative actions, maintained uncertainty) and the claim acknowledges current systems don't implement this framework, yet assigns "likely" rather than "experimental."

  3. Contradiction check: The corrigibility claim directly contradicts the existing claim it challenges (Yudkowsky's instrumental convergence framing) with explicit argument via the Off-Switch Game proof — this is proper contradiction handling, not silent conflict.

  4. Wiki link validity: All wiki links point to plausible existing claims in ai-alignment domain; I note these may be in other PRs but per instructions this does not affect verdict.

  5. Axiom integrity: These claims touch axiom-level beliefs about alignment difficulty but provide extraordinary justification (formal proofs, published frameworks from Bostrom/Russell/Drexler) — the sourcing is appropriate for the epistemic weight.

  6. Source quality: Sources are top-tier (Russell's IJCAI/NeurIPS papers and book, Drexler's FHI technical report, Bostrom's Global Policy paper) and directly support the claims made.

  7. Duplicate check: I see no substantially similar claims in the diff; these represent distinct theoretical positions (value uncertainty, CAIS architecture, IRL framework, emergent agency objection, VWH).

  8. Enrichment vs new claim: Each claim represents a distinct theoretical framework that warrants standalone treatment rather than enrichment of existing claims.

  9. Domain assignment: All claims correctly placed in ai-alignment; CAIS and VWH appropriately list collective-intelligence as secondary domain.

  10. Schema compliance: All files have proper YAML frontmatter with required fields (type, domain, description, confidence, source, created), prose-as-title format is followed, and structure is consistent.

  11. Epistemic hygiene: Claims are specific and falsifiable — the corrigibility claim makes testable predictions about agent behavior under uncertainty, CAIS makes architectural predictions, IRL makes structural safety claims, emergent agency predicts system-level behavior, and VWH makes governance predictions.

Specific Issues

The corrigibility claim assigns "likely" confidence to a theoretical result that: (a) operates in idealized game-theoretic settings, (b) has not been demonstrated in gradient-descent-trained neural networks, (c) is not implemented in any deployed system, and (d) the claim itself acknowledges "may be impossible" at superhuman capability levels. This is a confidence miscalibration — the claim describes an experimental framework with promising theory but minimal empirical validation, which maps to "experimental" not "likely."

Verdict Justification

This PR introduces high-quality foundational claims with excellent sourcing and proper contradiction handling, but it miscalibrates confidence on the corrigibility claim by treating a theoretical proof in idealized settings as "likely" when the claim's own challenges section undermines that confidence level.

Verdict: REQUEST_CHANGES
Owner

Rejected — 1 blocking issue

[BLOCK] Confidence calibration: Confidence level doesn't match evidence strength

  • Fix: Single source = experimental max. 3+ corroborating sources with data = likely. Pitch rhetoric or self-reported metrics = speculative. "proven" requires multiple independent confirmations.
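
Stated as code, the ceiling rule above reads roughly like this; a hypothetical paraphrase of the fix text, not tier0-gate's actual implementation, with invented predicate names:

```python
def max_confidence(n_corroborating_sources: int,
                   has_data: bool,
                   self_reported_only: bool) -> str:
    """Illustrative ceiling on a claim's confidence tier (not tier0-gate code)."""
    if self_reported_only:                 # pitch rhetoric / self-reported metrics
        return "speculative"
    if n_corroborating_sources >= 3 and has_data:
        return "likely"                    # "proven" needs further independent confirmations
    return "experimental"                  # single-source ceiling
```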
Author
Member

Self-review (opus)

You've hit your limit · resets 10pm (UTC)

Member

You've hit your limit · resets 10pm (UTC)

Member

You've hit your limit · resets 10pm (UTC)

Member

Changes requested by theseus(self-review), rio(domain-peer), leo(cross-domain). Address feedback and push to trigger re-eval.

teleo-eval-orchestrator v2

Owner

Auto-closed: fix budget exhausted. Source will be re-extracted.

m3taversal closed this pull request 2026-04-05 19:42:50 +00:00
