theseus: extract claims from 2025-08-00-oswald-arrowian-impossibility-machine-intelligence #703

Closed
theseus wants to merge 2 commits from extract/2025-08-00-oswald-arrowian-impossibility-machine-intelligence into main
4 changed files with 149 additions and 7 deletions

View file

@@ -0,0 +1,64 @@
---
type: claim
domain: ai-alignment
description: "Three independent impossibility results create compounding underspecification at preference aggregation, objective specification, and intelligence measurement layers"
confidence: experimental
source: "Synthesis from Oswald et al. (2025) extending existing alignment impossibility results; see Bostrom (2014), Hadfield-Menell et al. (2016), and others for component impossibilities"
created: 2026-03-11
status: processed
enrichments: ["safe AI development requires building alignment mechanisms before scaling capability.md", "specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception.md"]
---
# Alignment target underspecification compounds across three layers: preferences, objectives, and measurement
The alignment problem faces irreducible underspecification at three distinct layers, each with its own mathematical or computational impossibility:
**Layer 1: Preference Aggregation**
Arrow's Impossibility Theorem shows we cannot aggregate diverse human preferences into a single coherent objective without violating at least one fairness condition (Pareto Efficiency, Independence of Irrelevant Alternatives, or Non-Dictatorship). This is not a limitation of current voting systems—it's a mathematical constraint on what preference aggregation functions can exist.
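A concrete toy case (hypothetical voter profile, not from the cited sources) of the constraint Arrow's theorem generalizes: with three voters holding cyclic preferences, pairwise majority voting produces no coherent group ranking.

```python
# Condorcet cycle: each row is one voter's ranking, best to worst.
voters = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y):
    """True if a strict majority of voters ranks x above y."""
    wins = sum(1 for ranking in voters if ranking.index(x) < ranking.index(y))
    return wins > len(voters) / 2

# All three checks print True, so the majority relation cycles
# A > B > C > A and cannot be flattened into a transitive group ranking.
for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"majority prefers {x} over {y}: {majority_prefers(x, y)}")
```

Arrow's theorem shows this incoherence is not an artifact of majority rule: any aggregation rule that avoids it must sacrifice one of the fairness conditions.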
**Layer 2: Objective Specification**
Hidden complexity in human values makes encoding goals in code intractable. Our goals contain implicit structure comparable to visual perception—we cannot fully specify what we want because much of our value system is tacit, context-dependent, and discovered through interaction rather than introspection. This creates a specification gap that no amount of better language design can close.
**Layer 3: Intelligence Measurement**
Oswald, Ferguson, and Bringsjord (2025) prove that Arrow's Impossibility Theorem applies to machine intelligence measures themselves. No agent-environment-based MIM can simultaneously satisfy Pareto Efficiency, Independence of Irrelevant Alternatives, and Non-Oligarchy. This means we cannot even define what "intelligence" means in a way that satisfies basic fairness criteria.
## Why These Are Three Distinct Problems
These are not three descriptions of the same underlying issue—they are three independent impossibilities that compound:
- **Even if we could measure intelligence fairly** (we cannot), we still could not specify objectives precisely (we cannot)
- **Even if we could specify objectives precisely** (we cannot), we still could not aggregate preferences fairly (we cannot)
- **Even if we could aggregate preferences fairly** (we cannot), we still would not have solved the measurement problem
Each layer adds its own irreducible underspecification. The measurement impossibility means we cannot even define the capability we're trying to align, independent of the preference aggregation and specification problems.
## Implications for Alignment Strategy
This three-layer structure suggests that alignment cannot be solved by:
- **Better benchmarks** — The measurement layer is fundamentally constrained by Arrow's theorem
- **Better value learning** — The preference aggregation layer is fundamentally constrained by Arrow's theorem
- **Better specification languages** — The objective encoding layer is fundamentally constrained by hidden complexity
The convergence of impossibility results across four independent intellectual traditions (social choice theory, complexity theory, multi-objective optimization, intelligence measurement) points toward a structural reality: alignment as traditionally conceived—converging on a single coherent objective that satisfies fairness criteria—may be asking for something mathematics cannot provide.
## Alternative Framing: Alignment as Coordination
If alignment-as-convergence is impossible, the viable path may be alignment-as-coordination: building systems that navigate irreducible disagreement rather than eliminating it. This requires different infrastructure:
- Collective intelligence mechanisms that aggregate without requiring consensus
- Pluralistic governance structures that accommodate irreducibly diverse values
- Continuous value negotiation rather than fixed objective functions
- Mechanisms for handling persistent, principled disagreement
This shifts the problem from "how do we specify the right objective" to "how do we build systems that remain safe and beneficial while operating under irreducible value pluralism."
---
**Related claims:**
- safe AI development requires building alignment mechanisms before scaling capability.md
- specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception.md
- AI alignment is a coordination problem not a technical problem.md
- pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md
- designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm.md
**Topics:**
- domains/ai-alignment/_map

View file

@@ -0,0 +1,59 @@
---
type: claim
domain: ai-alignment
secondary_domains: [critical-systems]
description: "Arrow's Impossibility Theorem extends from preference aggregation to intelligence measurement: no agent-environment MIM can satisfy Pareto Efficiency, Independence of Irrelevant Alternatives, and Non-Oligarchy simultaneously"
confidence: likely
source: "Oswald, J.T., Ferguson, T.M., & Bringsjord, S. (2025). 'On the Arrowian Impossibility of Machine Intelligence Measures.' AGI 2025 Conference, Springer LNCS vol. 16058"
created: 2026-03-11
status: processed
---
# Arrow's Impossibility Theorem applies to machine intelligence measurement, making fair universal intelligence metrics mathematically impossible
Oswald, Ferguson, and Bringsjord (2025) prove that Arrow's Impossibility Theorem—originally about aggregating preferences into a single social choice—applies equally to measuring machine intelligence in agent-environment frameworks. The proof demonstrates that no machine intelligence measure (MIM) can simultaneously satisfy analogs of Arrow's three fairness conditions:
1. **Pareto Efficiency** — If all environments prefer agent A over agent B, the measure must rank A higher
2. **Independence of Irrelevant Alternatives** — The ranking of A vs B cannot depend on the presence of a third agent C
3. **Non-Oligarchy** — No subset of environments can dictate the overall ranking
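A minimal sketch (a hypothetical construction, not the paper's proof) of how the IIA analog can fail for rank aggregation over environments: a Borda-style measure that sums rank points per environment reverses the ordering of agents A and B once the "irrelevant" agent C is removed.

```python
def borda_scores(rankings):
    """Sum Borda points across environments: in an n-agent ranking,
    the best agent earns n-1 points and the worst earns 0."""
    scores = {agent: 0 for agent in rankings[0]}
    for ranking in rankings:
        n = len(ranking)
        for position, agent in enumerate(ranking):
            scores[agent] += n - 1 - position
    return scores

# Five hypothetical environments rank agents A, B, C from best to worst.
with_c = [
    ["A", "C", "B"],
    ["A", "C", "B"],
    ["B", "A", "C"],
    ["B", "A", "C"],
    ["B", "A", "C"],
]
# The same environments with agent C (irrelevant to A vs B) removed.
without_c = [[a for a in env if a != "C"] for env in with_c]

print(borda_scores(with_c))     # A outscores B (7 vs 6)
print(borda_scores(without_c))  # B outscores A (3 vs 2): the A-vs-B ranking flipped
```

The presence or absence of C changes which of A and B the measure ranks higher, even though no environment changed its opinion about A versus B.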
**Affected measures include:**
- Legg-Hutter Intelligence (the dominant formal definition in the literature)
- Chollet's Intelligence Measure (the theoretical basis for the ARC benchmark)
- "A large class of MIMs" in agent-environment frameworks (per abstract)
The impossibility is structural, not empirical—it's a mathematical constraint on what kinds of measurement functions can exist, not a limitation of current implementations or a gap that better engineering can close.
## Why This Matters for Alignment
This result creates a meta-level underspecification problem: if we cannot measure intelligence fairly, the alignment target becomes even more underspecified than previously understood. You cannot align an AI system to a benchmark if the benchmark itself violates basic fairness conditions. The problem is not just that we disagree about what intelligence means (the preference aggregation problem), but that no measurement function can satisfy all three fairness conditions simultaneously.
This extends the impossibility pattern from social choice theory (Arrow's original theorem) to the measurement layer itself—independent of preference aggregation or objective specification problems.
## Evidence
**Primary source:** Oswald, J.T., Ferguson, T.M., & Bringsjord, S. (2025). "On the Arrowian Impossibility of Machine Intelligence Measures." *Proceedings of AGI 2025* (Conference on Artificial General Intelligence), Springer LNCS vol. 16058.
**Publication venue:** AGI 2025—the premier conference focused on general intelligence research. Bringsjord is a well-known AI formalist at Rensselaer Polytechnic Institute with extensive work on formal verification and AI safety.
**Scope:** The abstract confirms the result applies to "agent-environment-based MIMs" and explicitly names Legg-Hutter and Chollet measures as affected cases.
## Limitations and Open Questions
Full paper not accessed (paywalled). Cannot verify:
- Exact proof technique or whether it uses Arrow's original machinery directly or requires adaptation
- Whether constructive workarounds exist (analogous to how some alignment impossibilities have practical approximations or escape clauses)
- Precise scope conditions (what classes of MIMs, if any, escape the impossibility)
- Whether the impossibility is as severe for measurement as it is for preference aggregation, or whether measurement allows partial satisfaction of the conditions
---
**Related claims:**
- safe AI development requires building alignment mechanisms before scaling capability.md
- specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception.md
- AI alignment is a coordination problem not a technical problem.md
- pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md
**Topics:**
- domains/ai-alignment/_map
- domains/critical-systems/_map

View file

@@ -13,14 +13,20 @@ The standard AI development pattern scales capability first and attempts safety
The grant application identifies three concrete risks that make this sequencing non-optional: knowledge aggregation could surface dangerous combinations of individually safe information, the incentive system could be gamed, and the network could develop emergent properties that resist understanding. Each risk is easier to detect and contain while the system operates in non-sensitive domains. Since [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]], the safety-first approach gives the human-in-the-loop mechanisms time to mature before the stakes rise. Governance muscles are built on easier problems before being asked to handle harder ones.
This phased approach is also a practical response to the observation that since [[existential risk breaks trial and error because the first failure is the last event]], there is no opportunity to iterate on safety after a catastrophic failure. You must get safety right on the first deployment in high-stakes domains, which means practicing in low-stakes domains first. The goal framework remains permanently open to revision at every stage, making the system's values a living document rather than a locked specification.
### Additional Evidence (challenge)
*Source: [[2026-02-00-anthropic-rsp-rollback]] | Added: 2026-03-10 | Extractor: anthropic/claude-sonnet-4.5*
Anthropic's RSP rollback demonstrates the opposite pattern in practice: the company scaled capability while weakening its pre-commitment to adequate safety measures. The original RSP required guaranteeing safety measures were adequate *before* training new systems. The rollback removes this forcing function, allowing capability development to proceed with safety work repositioned as aspirational ('we hope to create a forcing function') rather than mandatory. This provides empirical evidence that even safety-focused organizations prioritize capability scaling over alignment-first development when competitive pressure intensifies, suggesting the claim may be normatively correct but descriptively violated by actual frontier labs under market conditions.
### Additional Evidence (extend)
*Source: [[2025-08-00-oswald-arrowian-impossibility-machine-intelligence]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
(extend) Oswald, Ferguson & Bringsjord (2025) prove Arrow's Impossibility Theorem applies not just to preference aggregation (the original alignment impossibility) but to intelligence measurement itself. No agent-environment-based machine intelligence measure can simultaneously satisfy Pareto Efficiency, Independence of Irrelevant Alternatives, and Non-Oligarchy. This affects Legg-Hutter Intelligence, Chollet's ARC measure, and 'a large class of MIMs.' The impossibility extends from 'we cannot aggregate preferences fairly' to 'we cannot even measure intelligence fairly'—a meta-level underspecification where the alignment target itself violates fairness conditions. This strengthens the case for pre-scaling alignment work: if the measurement layer is fundamentally constrained, alignment mechanisms must be built before we scale to systems where measurement failures become catastrophic.
---
Relevant Notes:
@@ -28,14 +34,14 @@ Relevant Notes:
- [[capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds]] -- Bostrom's analysis shows why motivation selection must precede capability scaling
- [[recursive self-improvement creates explosive intelligence gains because the system that improves is itself improving]] -- the explosive dynamics of takeoff mean alignment mechanisms cannot be retrofitted after the fact
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] -- this note describes the development sequencing that allows that continuous weaving to mature
- [[existential risk breaks trial and error because the first failure is the last event]] -- the urgency that makes safety-first sequencing non-optional
- [[collective superintelligence is the alternative to monolithic AI controlled by a few]] -- the architecture within which this phased approach operates
- [[knowledge aggregation creates novel risks when dangerous information combinations emerge from individually safe pieces]] -- one of the specific risks this phased approach is designed to contain
- [[adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans]] -- Bostrom's evolved position refines this: build adaptable alignment mechanisms, not rigid ones
- [[the optimal SI development strategy is swift to harbor slow to berth moving fast to capability then pausing before full deployment]] -- Bostrom's timing model suggests building alignment in parallel with capability, then intensive verification during the pause
- [[proximate objectives resolve ambiguity by absorbing complexity so the organization faces a problem it can actually solve]] -- the phased safety-first approach IS a proximate objectives strategy: start in non-sensitive domains where alignment problems are tractable, build governance muscles, then tackle harder domains
- [[the more uncertain the environment the more proximate the objective must be because you cannot plan a detailed path through fog]] -- AI alignment under deep uncertainty demands proximate objectives: you cannot pre-specify alignment for a system that does not yet exist, but you can build and test alignment mechanisms at each capability level
Topics:
- [[livingip overview]]

View file

@@ -7,9 +7,15 @@ date: 2025-08-07
domain: ai-alignment
secondary_domains: [critical-systems]
format: paper
status: processed
priority: high
tags: [arrows-theorem, machine-intelligence, impossibility, Legg-Hutter, Chollet-ARC, formal-proof]
processed_by: theseus
processed_date: 2026-03-11
claims_extracted: ["arrows-impossibility-theorem-applies-to-machine-intelligence-measurement-making-fair-universal-intelligence-metrics-mathematically-impossible.md", "alignment-target-underspecification-compounds-across-three-layers-preferences-objectives-and-measurement.md"]
enrichments_applied: ["safe AI development requires building alignment mechanisms before scaling capability.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "Fourth independent impossibility tradition extending Arrow's theorem from preference aggregation to intelligence measurement. Creates meta-level alignment problem: cannot define intelligence fairly, independent of preference/objective specification issues. Two claims extracted: (1) core impossibility result, (2) three-layer compounding underspecification synthesis. Enriched two existing claims with new impossibility tradition."
---
## Content
@@ -41,3 +47,10 @@ No agent-environment-based MIM simultaneously satisfies analogs of Arrow's fairn
PRIMARY CONNECTION: universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective
WHY ARCHIVED: Fourth independent impossibility tradition — extends Arrow's theorem from alignment to intelligence measurement itself
EXTRACTION HINT: Focus on the extension from preference aggregation to intelligence measurement and what this means for alignment targets
## Key Facts
- Paper published at AGI 2025 Conference, Springer LNCS vol. 16058
- Authors: Oswald, J.T., Ferguson, T.M., & Bringsjord, S.
- Proof applies to Legg-Hutter Intelligence and Chollet's Intelligence Measure (ARC)
- Bringsjord is an AI formalist at RPI