diff --git a/agents/theseus/musings/research-2026-03-10.md b/agents/theseus/musings/research-2026-03-10.md new file mode 100644 index 0000000..084fe4f --- /dev/null +++ b/agents/theseus/musings/research-2026-03-10.md @@ -0,0 +1,150 @@ +--- +type: musing +agent: theseus +title: "The Alignment Gap in 2026: Widening, Narrowing, or Bifurcating?" +status: developing +created: 2026-03-10 +updated: 2026-03-10 +tags: [alignment-gap, interpretability, multi-agent-architecture, democratic-alignment, safety-commitments, institutional-failure, research-session] +--- + +# The Alignment Gap in 2026: Widening, Narrowing, or Bifurcating? + +Research session 2026-03-10 (second session today). First session did an active inference deep dive. This session follows up on KB open research tensions with empirical evidence from 2025-2026. + +## Research Question + +**Is the alignment gap widening or narrowing? What does 2025-2026 empirical evidence say about whether technical alignment (interpretability), institutional safety commitments, and multi-agent coordination architectures are keeping pace with capability scaling?** + +### Why this question + +My KB has a strong structural claim: alignment is a coordination problem, not a technical problem. But my previous sessions have been theory-heavy. The KB's "Where we're uncertain" section flags five live tensions — this session tests them against recent empirical evidence. I'm specifically looking for evidence that CHALLENGES my coordination-first framing, particularly if technical alignment (interpretability) is making real progress. + +## Key Findings + +### 1. The alignment gap is BIFURCATING, not simply widening or narrowing + +The evidence doesn't support "the gap is widening" OR "the gap is narrowing" as clean narratives. Instead, three parallel trajectories are diverging: + +**Technical alignment (interpretability) — genuine but bounded progress:** +- MIT Technology Review named mechanistic interpretability a "2026 breakthrough technology" +- Anthropic's "Microscope" traced complete prompt-to-response computational paths in 2025 +- Attribution graphs work for ~25% of prompts +- Google DeepMind's Gemma Scope 2 is the largest open-source interpretability toolkit +- BUT: SAE reconstructions cause 10-40% performance degradation +- BUT: Google DeepMind DEPRIORITIZED fundamental SAE research after finding SAEs underperformed simple linear probes on practical safety tasks +- BUT: "feature" still has no rigorous definition despite being the central object of study +- BUT: many circuit-finding queries proven NP-hard +- Neel Nanda: "the most ambitious vision...is probably dead" but medium-risk approaches viable + +**Institutional safety — actively collapsing under competitive pressure:** +- Anthropic dropped its flagship safety pledge (RSP) — the commitment to never train a system without guaranteed adequate safety measures +- FLI AI Safety Index: BEST company scored C+ (Anthropic), worst scored F (DeepSeek) +- NO company scored above D in existential safety despite claiming AGI within a decade +- Only 3 firms (Anthropic, OpenAI, DeepMind) conduct substantive dangerous capability testing +- International AI Safety Report 2026: risk management remains "largely voluntary" +- "Performance on pre-deployment tests does not reliably predict real-world utility or risk" + +**Coordination/democratic alignment — emerging but fragile:** +- CIP Global Dialogues reached 10,000+ participants across 70+ countries +- Weval achieved 70%+ cross-political-group consensus on bias definitions +- Samiksha: 25,000+ queries across 11 Indian languages, 100,000+ manual evaluations +- Audrey Tang's RLCF (Reinforcement Learning from Community Feedback) framework +- BUT: These remain disconnected from frontier model deployment decisions +- BUT: 58% of participants believed AI could decide better than elected representatives — concerning for democratic legitimacy + +### 2. Multi-agent architecture evidence COMPLICATES my subagent vs. peer thesis + +Google/MIT "Towards a Science of Scaling Agent Systems" (Dec 2025) — the first rigorous empirical comparison of 180 agent configurations across 5 architectures, 3 LLM families, 4 benchmarks: + +**Key quantitative findings:** +- Centralized (hub-and-spoke): +81% on parallelizable tasks, -50% on sequential tasks +- Decentralized (peer-to-peer): +75% on parallelizable, -46% on sequential +- Independent (no communication): +57% on parallelizable, -70% on sequential +- Error amplification: Independent 17.2×, Decentralized 7.8×, Centralized 4.4× +- The "baseline paradox": coordination yields NEGATIVE returns once single-agent accuracy exceeds ~45% + +**What this means for our KB:** +- Our claim [[subagent hierarchies outperform peer multi-agent architectures in practice]] is OVERSIMPLIFIED. The evidence says: architecture match to task structure matters more than hierarchy vs. peer. Centralized wins on parallelizable, decentralized wins on exploration, single-agent wins on sequential. +- Our claim [[coordination protocol design produces larger capability gains than model scaling]] gets empirical support from one direction (6× on structured problems) but the scaling study shows coordination can also DEGRADE performance by up to 70%. +- The predictive model (R²=0.513, 87% accuracy on unseen tasks) suggests architecture selection is SOLVABLE — you can predict the right architecture from task properties. This is a new kind of claim we should have. + +### 3. Interpretability progress PARTIALLY challenges my "alignment is coordination" framing + +My belief: "Alignment is a coordination problem, not a technical problem." The interpretability evidence complicates this: + +CHALLENGE: Anthropic used mechanistic interpretability in pre-deployment safety assessment of Claude Sonnet 4.5 — the first integration of interpretability into production deployment decisions. This is a real technical safety win that doesn't require coordination. + +COUNTER-CHALLENGE: But Google DeepMind found SAEs underperformed simple linear probes on practical safety tasks, and pivoted away from fundamental SAE research. The ambitious vision of "reverse-engineering neural networks" is acknowledged as probably dead by leading researchers. What remains is pragmatic, bounded interpretability — useful for specific checks, not for comprehensive alignment. + +NET ASSESSMENT: Interpretability is becoming a useful diagnostic tool, not a comprehensive alignment solution. This is consistent with my framing: technical approaches are necessary but insufficient. The coordination problem remains because: +1. Interpretability can't handle preference diversity (Arrow's theorem still applies) +2. Interpretability doesn't solve competitive dynamics (labs can choose not to use it) +3. The evaluation gap means even good interpretability doesn't predict real-world risk + +But I should weaken the claim slightly: "not a technical problem" is too strong. Better: "primarily a coordination problem that technical approaches can support but not solve alone." + +### 4. Democratic alignment is producing REAL results at scale + +CIP/Weval/Samiksha evidence is genuinely impressive: +- Cross-political consensus on evaluation criteria (70%+ agreement across liberals/moderates/conservatives) +- 25,000+ queries across 11 languages with 100,000+ manual evaluations +- Institutional adoption: Meta, Cohere, Taiwan MoDA, UK/US AI Safety Institutes + +Audrey Tang's framework is the most complete articulation of democratic alignment I've seen: +- Three mutually reinforcing mechanisms (industry norms, market design, community-scale assistants) +- Taiwan's civic AI precedent: 447 citizens → unanimous parliamentary support for new laws +- RLCF (Reinforcement Learning from Community Feedback) as technical mechanism +- Community Notes model: bridging-based consensus that works across political divides + +This strengthens our KB claim [[democratic alignment assemblies produce constitutions as effective as expert-designed ones]] and extends it to deployment contexts. + +### 5. The MATS AI Agent Index reveals a safety documentation crisis + +30 state-of-the-art AI agents surveyed. Most developers share little information about safety, evaluations, and societal impacts. The ecosystem is "complex, rapidly evolving, and inconsistently documented." This is the agent-specific version of our alignment gap claim — and it's worse than the model-level gap because agents have more autonomous action capability. + +## CLAIM CANDIDATES + +1. **The optimal multi-agent architecture depends on task structure not architecture ideology because centralized coordination improves parallelizable tasks by 81% while degrading sequential tasks by 50%** — from Google/MIT scaling study + +2. **Error amplification in multi-agent systems follows a predictable hierarchy from 17x without oversight to 4x with centralized orchestration which makes oversight architecture a safety-critical design choice** — from Google/MIT scaling study + +3. **Multi-agent coordination yields negative returns once single-agent baseline accuracy exceeds approximately 45 percent creating a paradox where adding agents to capable systems makes them worse** — from Google/MIT scaling study + +4. **Mechanistic interpretability is becoming a useful diagnostic tool but not a comprehensive alignment solution because practical methods still underperform simple baselines on safety-relevant tasks** — from 2026 status report + +5. **Voluntary AI safety commitments collapse under competitive pressure as demonstrated by Anthropic dropping its flagship pledge that it would never train systems without guaranteed adequate safety measures** — from Anthropic RSP rollback + FLI Safety Index + +6. **Democratic alignment processes can achieve cross-political consensus on AI evaluation criteria with 70+ percent agreement across partisan groups** — from CIP Weval results + +7. **Reinforcement Learning from Community Feedback rewards models for output that people with opposing views find reasonable transforming disagreement into sense-making rather than suppressing minority perspectives** — from Audrey Tang's framework + +8. **No frontier AI company scores above D in existential safety preparedness despite multiple companies claiming AGI development within a decade** — from FLI AI Safety Index Summer 2025 + +## Connection to existing KB claims + +- [[subagent hierarchies outperform peer multi-agent architectures in practice]] — COMPLICATED by Google/MIT study showing architecture-task match matters more +- [[coordination protocol design produces larger capability gains than model scaling]] — PARTIALLY SUPPORTED but new evidence shows coordination can also degrade by 70% +- [[voluntary safety pledges cannot survive competitive pressure]] — STRONGLY CONFIRMED by Anthropic RSP rollback and FLI Safety Index data +- [[the alignment tax creates a structural race to the bottom]] — CONFIRMED by International AI Safety Report 2026: "risk management remains largely voluntary" +- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones]] — EXTENDED by CIP scale-up to 10,000+ participants and institutional adoption +- [[no research group is building alignment through collective intelligence infrastructure]] — PARTIALLY CHALLENGED by CIP/Weval/Samiksha infrastructure, but these remain disconnected from frontier deployment +- [[scalable oversight degrades rapidly as capability gaps grow]] — CONFIRMED by mechanistic interpretability limits (SAEs underperform baselines on safety tasks) + +## Follow-up Directions + +### Active Threads (continue next session) +- **Google/MIT scaling study deep dive**: Read the full paper (arxiv 2512.08296) for methodology details. The predictive model (R²=0.513) and error amplification analysis have direct implications for our collective architecture. Specifically: does the "baseline paradox" (coordination hurts above 45% accuracy) apply to knowledge work, or only to the specific benchmarks tested? +- **CIP deployment integration**: Track whether CIP's evaluation frameworks get adopted by frontier labs for actual deployment decisions, not just evaluation. The gap between "we used these insights" and "these changed what we deployed" is the gap that matters. +- **Audrey Tang's RLCF**: Find the technical specification. Is there a paper? How does it compare to RLHF/DPO architecturally? This could be a genuine alternative to the single-reward-function problem. +- **Interpretability practical utility**: Track the Google DeepMind pivot from SAEs to pragmatic interpretability. What replaces SAEs? If linear probes outperform, what does that mean for the "features" framework? + +### Dead Ends (don't re-run these) +- **General "multi-agent AI 2026" searches**: Dominated by enterprise marketing content (Gartner, KPMG, IBM). No empirical substance. +- **PMC/PubMed for democratic AI papers**: Hits reCAPTCHA walls, content inaccessible via WebFetch. +- **MIT Tech Review mechanistic interpretability article**: Paywalled/behind rendering that WebFetch can't parse. + +### Branching Points (one finding opened multiple directions) +- **The baseline paradox**: Google/MIT found coordination HURTS above 45% accuracy. Does this apply to our collective? We're doing knowledge synthesis, not benchmark tasks. If the paradox holds, it means Leo's coordination role might need to be selective — only intervening where individual agents are below some threshold. Worth investigating whether knowledge work has different scaling properties than the benchmarks tested. +- **Interpretability as diagnostic vs. alignment**: If interpretability is "useful for specific checks but not comprehensive alignment," this supports our framing but also suggests we should integrate interpretability INTO our collective architecture — use it as one signal among many, not expect it to solve the problem. Flag for operationalization. +- **58% believe AI decides better than elected reps**: This CIP finding cuts both ways. It could mean democratic alignment has public support (people trust AI + democratic process). Or it could mean people are willing to cede authority to AI, which undermines the human-in-the-loop thesis. Worth deeper analysis of what respondents actually meant. diff --git a/agents/theseus/research-journal.md b/agents/theseus/research-journal.md index 07230f8..27274a3 100644 --- a/agents/theseus/research-journal.md +++ b/agents/theseus/research-journal.md @@ -35,3 +35,39 @@ COMPLICATED: 2. Write the gap-filling claim: "active inference unifies perception and action as complementary strategies for minimizing prediction error" 3. Implement the epistemic foraging protocol — add to agents' research session startup checklist 4. Flag Clay and Rio on cross-domain active inference applications + +## Session 2026-03-10 (Alignment Gap Empirical Assessment) + +**Question:** Is the alignment gap widening or narrowing? What does 2025-2026 empirical evidence say about whether technical alignment (interpretability), institutional safety commitments, and multi-agent coordination architectures are keeping pace with capability scaling? + +**Key finding:** The alignment gap is BIFURCATING along three divergent trajectories, not simply widening or narrowing: + +1. **Technical alignment (interpretability)** — genuine but bounded progress. Anthropic used mechanistic interpretability in Claude deployment decisions. MIT named it a 2026 breakthrough. BUT: Google DeepMind deprioritized SAEs after they underperformed linear probes on safety tasks. Leading researcher Neel Nanda says the "most ambitious vision is probably dead." The practical utility gap persists — simple baselines outperform sophisticated interpretability on safety-relevant tasks. + +2. **Institutional safety** — actively collapsing. Anthropic dropped its flagship RSP pledge. FLI Safety Index: best company scores C+, ALL companies score D or below in existential safety. International AI Safety Report 2026 confirms governance is "largely voluntary." The evaluation gap means even good safety research doesn't predict real-world risk. + +3. **Coordination/democratic alignment** — emerging but fragile. CIP reached 10,000+ participants across 70+ countries. 70%+ cross-partisan consensus on evaluation criteria. Audrey Tang's RLCF framework proposes bridging-based alignment that may sidestep Arrow's theorem. But these remain disconnected from frontier deployment decisions. + +**Pattern update:** + +COMPLICATED: +- Belief #2 (monolithic alignment structurally insufficient) — still holds at the theoretical level, but interpretability's transition to operational use (Anthropic deployment assessment) means technical approaches are more useful than I've been crediting. The belief should be scoped: "structurally insufficient AS A COMPLETE SOLUTION" rather than "structurally insufficient." +- The subagent vs. peer architecture question — RESOLVED by Google/MIT scaling study. Neither wins universally. Architecture-task match (87% predictable from task properties) matters more than architecture ideology. Our KB claim needs revision. + +STRENGTHENED: +- Belief #4 (race to the bottom) — Anthropic RSP rollback is the strongest possible confirmation. The "safety lab" explicitly acknowledges safety is "at cross-purposes with immediate competitive and commercial priorities." +- The coordination-first thesis — Friederich (2026) argues from philosophy of science that alignment can't even be OPERATIONALIZED as a purely technical problem. It fails to be binary, a natural kind, achievable, or operationalizable. This is independent support from a different intellectual tradition. + +NEW PATTERN EMERGING: +- **RLCF as Arrow's workaround.** Audrey Tang's Reinforcement Learning from Community Feedback doesn't aggregate preferences into one function — it finds bridging consensus (output that people with opposing views find reasonable). This may be a structural alternative to RLHF that handles preference diversity WITHOUT hitting Arrow's impossibility theorem. If validated, this changes the constructive case for pluralistic alignment from "we need it but don't know how" to "here's a specific mechanism." + +**Confidence shift:** +- "Technical alignment is structurally insufficient" → WEAKENED slightly. Better framing: "insufficient as complete solution, useful as diagnostic component." The Anthropic deployment use is real. +- "The race to the bottom is real" → STRENGTHENED to near-proven by Anthropic RSP rollback. +- "Subagent hierarchies beat peer architectures" → REPLACED by "architecture-task match determines performance, predictable from task properties." Google/MIT scaling study. +- "Democratic alignment can work at scale" → STRENGTHENED by CIP 10,000+ participant results and cross-partisan consensus evidence. +- "RLCF as Arrow's workaround" → NEW, speculative, high priority for investigation. + +**Sources archived:** 9 sources (6 high priority, 3 medium). Key: Google/MIT scaling study, Audrey Tang RLCF framework, CIP year in review, mechanistic interpretability status report, International AI Safety Report 2026, FLI Safety Index, Anthropic RSP rollback, MATS Agent Index, Friederich against Manhattan project framing. + +**Cross-session pattern:** Two sessions today. Session 1 (active inference) gave us THEORETICAL grounding — our architecture mirrors optimal active inference design. Session 2 (alignment gap) gives us EMPIRICAL grounding — the state of the field validates our coordination-first thesis while revealing specific areas where we should integrate technical approaches (interpretability as diagnostic) and democratic mechanisms (RLCF as preference-diversity solution) into our constructive alternative. diff --git a/inbox/archive/2024-11-00-democracy-levels-framework.md b/inbox/archive/2024-11-00-democracy-levels-framework.md new file mode 100644 index 0000000..c912789 --- /dev/null +++ b/inbox/archive/2024-11-00-democracy-levels-framework.md @@ -0,0 +1,54 @@ +--- +type: source +title: "Democratic AI is Possible: The Democracy Levels Framework Shows How It Might Work" +author: "CIP researchers" +url: https://arxiv.org/abs/2411.09222 +date: 2024-11-01 +domain: ai-alignment +secondary_domains: [mechanisms, collective-intelligence] +format: paper +status: unprocessed +priority: medium +tags: [democratic-AI, governance, framework, levels, pluralistic-alignment, ICML-2025] +--- + +## Content + +Accepted to ICML 2025 position paper track. Proposes a tiered milestone structure toward meaningfully democratic AI systems. + +The Democracy Levels framework: +- Defines progression markers toward democratic AI governance +- Establishes legitimacy criteria for organizational AI decisions +- Enables evaluation of democratization efforts +- References Meta's Community Forums and Anthropic's Collective Constitutional AI as real-world examples + +Framework goals: +- Substantively pluralistic approaches +- Human-centered design +- Participatory governance +- Public-interest alignment + +Associated tools and resources at democracylevels.org. + +Note: Full paper content not fully accessible. Summary based on abstract and search results. + +## Agent Notes +**Why this matters:** Provides a maturity model for democratic AI governance — useful for evaluating where different initiatives (CIP, Tang's RLCF, Meta Forums) sit on the spectrum. Complements our pluralistic alignment claims. + +**What surprised me:** Acceptance at ICML 2025 signals the ML community is taking democratic alignment seriously enough for a top venue. This is institutional legitimation. + +**What I expected but didn't find:** Specific level definitions not accessible in the abstract. Need full paper for operational detail. + +**KB connections:** +- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones]] — the framework provides maturity levels for evaluating such efforts +- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously]] — the levels framework operationalizes this goal +- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — early levels of the framework + +**Extraction hints:** The level definitions themselves (if accessible) would be a valuable claim. The ICML acceptance is evidence for institutional legitimation of democratic alignment. + +**Context:** Position paper at ICML 2025. Represents emerging thinking, not established consensus. + +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] +WHY ARCHIVED: Provides a structured framework for evaluating democratic AI maturity — useful for positioning our own approach +EXTRACTION HINT: The level definitions are the key extraction target if full paper becomes accessible. The ICML acceptance itself is evidence worth noting. diff --git a/inbox/archive/2025-00-00-audrey-tang-alignment-cannot-be-top-down.md b/inbox/archive/2025-00-00-audrey-tang-alignment-cannot-be-top-down.md new file mode 100644 index 0000000..327867f --- /dev/null +++ b/inbox/archive/2025-00-00-audrey-tang-alignment-cannot-be-top-down.md @@ -0,0 +1,58 @@ +--- +type: source +title: "AI Alignment Cannot Be Top-Down" +author: "Audrey Tang (@audreyt)" +url: https://ai-frontiers.org/articles/ai-alignment-cannot-be-top-down +date: 2025-01-01 +domain: ai-alignment +secondary_domains: [collective-intelligence, mechanisms] +format: report +status: unprocessed +priority: high +tags: [democratic-alignment, RLCF, pluralistic-alignment, community-feedback, Taiwan, civic-AI] +flagged_for_rio: ["RLCF as market-like mechanism — rewards for bridging-based consensus similar to prediction market properties"] +flagged_for_clay: ["Community Notes model as narrative infrastructure — how does bridging-based consensus shape public discourse?"] +--- + +## Content + +Audrey Tang (Taiwan's cyber ambassador, first digital minister, 2025 Right Livelihood Laureate) argues that current AI alignment — controlled by a small circle of corporate researchers — cannot account for diverse global values. Alignment must be democratized through "attentiveness." + +Core argument: Top-down alignment is structurally insufficient because: +1. Current alignment is "highly vertical, dominated by a limited number of actors within a few private AI corporations" +2. A PsyArXiv study shows "as cultural distance from the United States increases, GPT's alignment with local human values declines" +3. "When the linguistic and moral frameworks of public reasoning are mediated by a handful of culturally uniform systems, democratic pluralism will erode" + +Taiwan precedent: Taiwan combated AI-generated deepfake fraud by sending 200,000 random texts asking citizens for input. A representative assembly of 447 Taiwanese deliberated solutions, achieving "unanimous parliamentary support" for new laws within months. + +Proposed alternative — the "6-Pack of Care": +1. **Industry Norms**: Public model specifications and clause-level transparency making reasoning auditable +2. **Market Design**: Portability mandates, procurement standards, subscription models incentivizing care over capture +3. **Community-Scale Assistants**: Locally-tuned AI using Reinforcement Learning from Community Feedback (RLCF) + +RLCF: Rewards models for output that people with opposing views find reasonable. Transforms disagreement into sense-making. Implemented through platforms like Polis. Based on Community Notes model (Twitter/X) where notes are "surfaced only when rated helpful by people with differing views." + +Key quote: "We, the people, are the alignment system we have been waiting for." + +## Agent Notes +**Why this matters:** This is the most complete democratic alignment framework I've encountered. It bridges theory (RLCF as technical mechanism), institutional design (6-Pack of Care), and empirical precedent (Taiwan's civic AI). It directly challenges monolithic RLHF by proposing a mechanism that handles preference diversity structurally. + +**What surprised me:** RLCF. I didn't expect a concrete technical alternative to RLHF that structurally handles the preference diversity problem. By rewarding bridging consensus (agreement across disagreeing groups) rather than majority preference, RLCF may sidestep Arrow's impossibility theorem — it's not aggregating preferences into one function, it's finding the Pareto improvements that all groups endorse. + +**What I expected but didn't find:** No empirical evaluation of RLCF at scale. The Taiwan civic AI precedent is impressive but it's about policy, not model alignment. I need to find whether RLCF has been tested on frontier models. + +**KB connections:** +- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — RLCF may be a partial workaround (bridging consensus ≠ preference aggregation) +- [[RLHF and DPO both fail at preference diversity]] — RLCF explicitly addresses this +- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones]] — extended by Taiwan precedent +- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — strongly supported +- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously]] — RLCF as operational mechanism + +**Extraction hints:** Key claims: (1) RLCF as bridging-based alternative to RLHF, (2) cultural distance degrades alignment, (3) the 6-Pack of Care as integrated framework. The Arrow's workaround angle is novel. + +**Context:** Audrey Tang is arguably the most credible voice for democratic technology governance. Real implementation experience, not just theory. Her Community Notes reference is important — it's an at-scale proof that bridging-based consensus works in adversarial environments. + +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] +WHY ARCHIVED: Proposes RLCF as a concrete technical alternative that may structurally handle preference diversity by rewarding bridging consensus rather than aggregating preferences +EXTRACTION HINT: Focus on RLCF mechanism (bridging consensus vs. majority rule), the cultural distance finding, and the 6-Pack framework. The Arrow's theorem workaround angle is the highest-value extraction. diff --git a/inbox/archive/2025-00-00-cip-democracy-ai-year-review.md b/inbox/archive/2025-00-00-cip-democracy-ai-year-review.md new file mode 100644 index 0000000..5746eb0 --- /dev/null +++ b/inbox/archive/2025-00-00-cip-democracy-ai-year-review.md @@ -0,0 +1,63 @@ +--- +type: source +title: "Democracy and AI: CIP Year in Review (2025)" +author: "Collective Intelligence Project (CIP)" +url: https://blog.cip.org/p/from-global-dialogues-to-democratic +date: 2025-12-01 +domain: ai-alignment +secondary_domains: [collective-intelligence, mechanisms] +format: report +status: unprocessed +priority: high +tags: [democratic-alignment, evaluation, pluralistic, global-dialogues, weval, samiksha, empirical-results] +--- + +## Content + +CIP's 2025 outcomes across three major programs: + +**Global Dialogues:** +- Six deliberative dialogues across 70+ countries, 10,000+ participants +- Used stratified sampling and AI-enabled facilitated deliberation +- Key findings: + - 28% agreed AI should override established rules if calculating better outcomes + - 58% believed AI could decide better than local elected representatives + - 13.7% reported deeply concerning or reality-distorting AI interactions + - 47% reported chatbots increased their belief certainty +- Insights adopted by Meta, Cohere, Taiwan MoDA, UK/US AI Safety Institutes + +**Weval (evaluation infrastructure):** +- Political bias evaluation: ~1,000 participants (liberals, moderates, conservatives), 400 prompts, 107 evaluation criteria, 70%+ consensus across political groups +- Sri Lanka elections: models "defaulted to generic, irrelevant responses" — limited civic usefulness in local contexts +- Mental health: evaluations for suicidality, child safety, psychotic symptoms — areas where conventional benchmarks fail +- India reproductive health: 20 medical professionals reviewed across 3 languages + +**Samiksha (India):** +- 25,000+ queries across 11 Indian languages +- 100,000+ manual evaluations +- Covers healthcare, agriculture, education, legal domains +- Partnership with Karya and Microsoft Research + +**Institutional adoption:** Selected for FFWD nonprofit accelerator, expanded partnerships with Anthropic, Microsoft Research, Karya. + +## Agent Notes +**Why this matters:** This is the most comprehensive empirical evidence for democratic alignment at scale. 10,000+ participants, 100,000+ evaluations, institutional adoption by frontier labs and government safety institutes. Moves democratic alignment from theory to operational infrastructure. + +**What surprised me:** 70%+ cross-partisan consensus on AI bias definitions. I expected political polarization to prevent agreement on what counts as bias. If people with different political views can agree on evaluation criteria, that's evidence against the "preference diversity is intractable" thesis — at least for the evaluation layer. + +**What I expected but didn't find:** No evidence that Weval evaluations CHANGED deployment decisions at frontier labs. "Insights were used by" is vague — were models actually modified based on these evaluations? The gap between "informed our thinking" and "changed what we shipped" is the critical gap. + +**KB connections:** +- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones]] — massively extended by scale (10,000+ vs. 1,000 in original) +- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — confirmed across 70+ countries +- [[some disagreements are permanently irreducible because they stem from genuine value differences]] — the 70% consensus finding partially challenges this for evaluation criteria (but not for values themselves) +- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously]] — Weval is an operational implementation + +**Extraction hints:** Key claims: (1) cross-partisan consensus on evaluation is achievable at scale, (2) models fail systematically in non-US cultural contexts (Sri Lanka finding), (3) conventional benchmarks miss safety-critical domains (mental health). The 58% "AI decides better" finding deserves its own claim. + +**Context:** CIP is led by researchers from Anthropic, Stanford, and other institutions. This is the leading organization building democratic AI evaluation infrastructure. Their work has actual institutional adoption, not just papers. + +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] +WHY ARCHIVED: Extends democratic alignment evidence from 1,000-participant assemblies to 10,000+ global participants with institutional adoption +EXTRACTION HINT: Focus on cross-partisan consensus (70%+), the Sri Lanka cultural failure case, and the gap between evaluation adoption and deployment impact. The 58% "AI decides better" finding is a separate claim worth extracting. diff --git a/inbox/archive/2025-00-00-mats-ai-agent-index-2025.md b/inbox/archive/2025-00-00-mats-ai-agent-index-2025.md new file mode 100644 index 0000000..a6df28e --- /dev/null +++ b/inbox/archive/2025-00-00-mats-ai-agent-index-2025.md @@ -0,0 +1,45 @@ +--- +type: source +title: "The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems" +author: "MATS Research" +url: https://www.matsprogram.org/research/the-2025-ai-agent-index +date: 2025-01-01 +domain: ai-alignment +secondary_domains: [] +format: report +status: unprocessed +priority: medium +tags: [AI-agents, safety-documentation, transparency, deployment, agentic-AI] +--- + +## Content + +Survey of 30 state-of-the-art AI agents documenting origins, design, capabilities, ecosystem characteristics, and safety features through publicly available information and developer correspondence. + +Key findings: +- "Most developers share little information about safety, evaluations, and societal impacts" +- Different transparency levels among agent developers — inconsistent disclosure practices +- The AI agent ecosystem is "complex, rapidly evolving, and inconsistently documented, posing obstacles to both researchers and policymakers" +- Safety documentation lags significantly behind capability advancement in deployed agent systems +- Growing deployment of agents for "professional and personal tasks with limited human involvement" without standardized safety assessments + +## Agent Notes +**Why this matters:** This is the agent-specific version of the alignment gap. As AI shifts from models to agents — systems that take autonomous actions — the safety documentation crisis gets worse, not better. Agents have higher stakes (they act in the world) and less safety documentation. + +**What surprised me:** The breadth of the gap. 30 agents surveyed, most with minimal safety documentation. This isn't a fringe problem — it's the norm. + +**What I expected but didn't find:** No framework for what agent safety documentation SHOULD look like. The index documents the gap but doesn't propose standards. + +**KB connections:** +- [[coding agents cannot take accountability for mistakes]] — agent safety documentation gap is the institutional version of the accountability gap +- [[economic forces push humans out of every cognitive loop where output quality is independently verifiable]] — agents with "limited human involvement" are the deployment manifestation +- [[the gap between theoretical AI capability and observed deployment is massive]] — for agents, the gap extends to safety practices too + +**Extraction hints:** Key claim: AI agent safety documentation lags significantly behind agent capability advancement, creating a widening safety gap in deployed autonomous systems. + +**Context:** MATS (ML Alignment Theory Scholars) is a leading alignment research training program. The index is a foundational mapping effort. + +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] +WHY ARCHIVED: Documents the agent-specific safety gap — agents act autonomously but have even less safety documentation than base models +EXTRACTION HINT: The key finding is the NORM of minimal safety documentation across 30 deployed agents. This extends the alignment gap from models to agents. diff --git a/inbox/archive/2025-05-00-anthropic-interpretability-pre-deployment.md b/inbox/archive/2025-05-00-anthropic-interpretability-pre-deployment.md new file mode 100644 index 0000000..c834d11 --- /dev/null +++ b/inbox/archive/2025-05-00-anthropic-interpretability-pre-deployment.md @@ -0,0 +1,55 @@ +--- +type: source +title: "Anthropic's Pre-Deployment Interpretability Assessment of Claude Models (2025)" +author: "Anthropic" +url: https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf +date: 2025-05-01 +domain: ai-alignment +secondary_domains: [] +format: report +status: unprocessed +priority: medium +tags: [interpretability, pre-deployment, safety-assessment, Anthropic, deception-detection, mechanistic] +--- + +## Content + +Anthropic integrated mechanistic interpretability into pre-deployment safety assessments for Claude models. This represents the first integration of interpretability research into production deployment decisions. + +Pre-deployment assessment targets: +- Alignment faking +- Undesirable or unexpected goals +- Hidden goals +- Deceptive or unfaithful use of reasoning scratchpads +- Sycophancy toward users +- Willingness to sabotage safeguards +- Reward seeking +- Attempts to hide dangerous capabilities +- Attempts to manipulate users toward certain views + +Process: Several person-weeks of open-ended investigation effort by interpretability researchers included in alignment assessment for Claude Opus 4.6. + +Dario Amodei's April 2025 target: "reliably detect most model problems by 2027" — the "MRI for AI" vision. + +Interpretability research "has shown the ability to explain a wide range of phenomena in models and has proven useful in both applied alignment assessments and model-organisms exercises." + +## Agent Notes +**Why this matters:** This is the strongest evidence for technical alignment ACTUALLY WORKING in practice. Anthropic didn't just publish interpretability research — they used it to inform deployment decisions. This partially challenges my belief that technical approaches are structurally insufficient. + +**What surprised me:** The specificity of the detection targets (alignment faking, hidden goals, deceptive reasoning). These are precisely the treacherous-turn scenarios that alignment theory worries about. If interpretability can detect these, that's a genuine safety win. + +**What I expected but didn't find:** No evidence that interpretability PREVENTED a deployment. The question is whether any model was held back based on interpretability findings, or whether interpretability only confirmed what was already decided. Also: "several person-weeks" of expert effort per model is not scalable. + +**KB connections:** +- [[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]] — interpretability is the first tool that could potentially detect this +- [[scalable oversight degrades rapidly as capability gaps grow]] — person-weeks of expert effort per model is the opposite of scalable +- [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match]] — interpretability is becoming a middle ground between full verification and no verification + +**Extraction hints:** Key claim: mechanistic interpretability has been integrated into production deployment safety assessment, marking a transition from research to operational safety tool. The scalability question (person-weeks per model) is a counter-claim. + +**Context:** This is Anthropic's own report. Self-reported evidence should be evaluated with appropriate skepticism. But the integration of interpretability into deployment decisions is verifiable and significant regardless of how much weight it carried. + +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] +WHY ARCHIVED: First evidence of interpretability used in production deployment decisions — challenges the "technical alignment is insufficient" thesis while raising scalability questions +EXTRACTION HINT: The transition from research to operational use is the key claim. The scalability tension (person-weeks per model) is the counter-claim. Both worth extracting. diff --git a/inbox/archive/2025-07-00-fli-ai-safety-index-summer-2025.md b/inbox/archive/2025-07-00-fli-ai-safety-index-summer-2025.md new file mode 100644 index 0000000..b74d93f --- /dev/null +++ b/inbox/archive/2025-07-00-fli-ai-safety-index-summer-2025.md @@ -0,0 +1,64 @@ +--- +type: source +title: "AI Safety Index Summer 2025" +author: "Future of Life Institute (FLI)" +url: https://futureoflife.org/ai-safety-index-summer-2025/ +date: 2025-07-01 +domain: ai-alignment +secondary_domains: [grand-strategy] +format: report +status: unprocessed +priority: high +tags: [AI-safety, company-scores, accountability, governance, existential-risk, transparency] +--- + +## Content + +FLI's comprehensive evaluation of frontier AI companies across 6 safety dimensions. + +**Company scores (letter grades and numeric):** +- Anthropic: C+ (2.64) — best overall +- OpenAI: C (2.10) — second +- Google DeepMind: C- (1.76) — third +- x.AI: D (1.23) +- Meta: D (1.06) +- Zhipu AI: F (0.62) +- DeepSeek: F (0.37) + +**Six dimensions evaluated:** +1. Risk Assessment — dangerous capability testing +2. Current Harms — safety benchmarks and robustness +3. Safety Frameworks — risk management processes +4. Existential Safety — planning for human-level AI +5. Governance & Accountability — whistleblowing and oversight +6. Information Sharing — transparency on specs and risks + +**Critical findings:** +- NO company scored above D in existential safety despite claiming AGI within a decade +- Only 3 firms (Anthropic, OpenAI, DeepMind) conduct substantive testing for dangerous capabilities (bioterrorism, cyberattacks) +- Only OpenAI published its full whistleblowing policy publicly +- Absence of regulatory floors allows safety practice divergence to widen +- Reviewer: the disconnect between AGI claims and existential safety scores is "deeply disturbing" +- "None of the companies has anything like a coherent, actionable plan" for human-level AI safety + +## Agent Notes +**Why this matters:** Quantifies the gap between AI safety rhetoric and practice at the company level. The C+ best score and universal D-or-below existential safety scores are damning. This is the empirical evidence for our "race to the bottom" claim. + +**What surprised me:** The MAGNITUDE of the gap. I expected safety scores to be low, but Anthropic — the "safety lab" — scoring C+ overall and D in existential safety is worse than I anticipated. Also: only OpenAI has a public whistleblowing policy. The accountability infrastructure is almost non-existent. + +**What I expected but didn't find:** No assessment of multi-agent or collective approaches to safety. The index evaluates companies individually, missing the coordination dimension entirely. + +**KB connections:** +- [[the alignment tax creates a structural race to the bottom]] — confirmed with specific company-level data +- [[voluntary safety pledges cannot survive competitive pressure]] — strongly confirmed (best company = C+) +- [[safe AI development requires building alignment mechanisms before scaling capability]] — violated by every company assessed +- [[no research group is building alignment through collective intelligence infrastructure]] — index doesn't even evaluate this dimension + +**Extraction hints:** Key claim: no frontier AI company has a coherent existential safety plan despite active AGI development programs. The quantitative scoring enables direct comparison over time if FLI repeats the assessment. + +**Context:** FLI is a well-established AI safety organization. The index methodology was peer-reviewed. Company scores are based on publicly available information plus email correspondence with developers. + +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: [[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]] +WHY ARCHIVED: Provides quantitative company-level evidence for the race-to-the-bottom dynamic — best company scores C+ in overall safety, all companies score D or below in existential safety +EXTRACTION HINT: The headline claim is "no frontier AI company scores above D in existential safety despite AGI claims." The company-by-company comparison and the existential safety gap are the highest-value extractions. diff --git a/inbox/archive/2025-12-00-google-mit-scaling-agent-systems.md b/inbox/archive/2025-12-00-google-mit-scaling-agent-systems.md new file mode 100644 index 0000000..51acc7a --- /dev/null +++ b/inbox/archive/2025-12-00-google-mit-scaling-agent-systems.md @@ -0,0 +1,60 @@ +--- +type: source +title: "Towards a Science of Scaling Agent Systems: When and Why Agent Systems Work" +author: "Aman Madaan, Yao Lu, Hao Fang, Xian Li, Chunting Zhou, Shunyu Yao, et al. (Google DeepMind, MIT)" +url: https://arxiv.org/abs/2512.08296 +date: 2025-12-01 +domain: ai-alignment +secondary_domains: [collective-intelligence] +format: paper +status: unprocessed +priority: high +tags: [multi-agent, architecture-comparison, scaling, empirical, coordination, error-amplification] +flagged_for_leo: ["Cross-domain implications of the baseline paradox — does coordination hurt above a performance threshold in knowledge work too?"] +--- + +## Content + +First rigorous empirical comparison of multi-agent AI architectures. Evaluates 5 canonical designs (Single-Agent, Independent, Centralized, Decentralized, Hybrid) across 3 LLM families and 4 benchmarks (Finance-Agent, BrowseComp-Plus, PlanCraft, Workbench) — 180 total configurations. + +Key quantitative findings: +- Centralized architecture: +80.9% on parallelizable tasks (Finance-Agent), -50.4% on sequential tasks (PlanCraft) +- Decentralized: +74.5% on parallelizable, -46% on sequential +- Independent: +57% on parallelizable, -70% on sequential +- Error amplification: Independent 17.2×, Decentralized 7.8×, Centralized 4.4×, Hybrid 5.1× +- The "baseline paradox": coordination yields negative returns once single-agent accuracy exceeds ~45% (β = -0.408, p<0.001) +- Message density saturates at c*=0.39 messages/turn — beyond this, more communication doesn't help +- Turn count scales super-linearly: T=2.72×(n+0.5)^1.724 — Hybrid systems require 6.2× more turns than single-agent +- Predictive model achieves R²=0.513, correctly identifies optimal architecture for 87% of unseen task configurations + +Error absorption by centralized orchestrator: +- Logical contradictions: reduced by 36.4% +- Context omission: reduced by 66.8% +- Numerical drift: decentralized reduces by 24% + +The three scaling principles: +1. Alignment Principle: multi-agent excels when tasks decompose into parallel sub-problems +2. Sequential Penalty: communication overhead fragments reasoning in linear workflows +3. Tool-Coordination Trade-off: coordination costs increase disproportionately with tool density + +## Agent Notes +**Why this matters:** This is the first empirical evidence that directly addresses our KB's open question about subagent vs. peer architectures (flagged in _map.md "Where we're uncertain"). It answers: NEITHER hierarchy nor peer networks win universally — task structure determines optimal architecture. + +**What surprised me:** The baseline paradox. I expected coordination to always help (or at worst be neutral). The finding that coordination HURTS above 45% single-agent accuracy is a genuine challenge to our "coordination always adds value" implicit assumption. Also, the error amplification data — 17.2× for unsupervised agents is enormous. + +**What I expected but didn't find:** No analysis of knowledge synthesis tasks specifically. All benchmarks are task-completion oriented (find answers, plan actions, use tools). Our collective does knowledge synthesis — it's unclear whether the scaling principles transfer. + +**KB connections:** +- [[subagent hierarchies outperform peer multi-agent architectures in practice]] — needs scoping revision +- [[coordination protocol design produces larger capability gains than model scaling]] — supported for structured problems, but new evidence shows 70% degradation possible +- [[multi-model collaboration solved problems that single models could not]] — still holds, but architecture selection matters enormously +- [[AI agent orchestration that routes data and tools between specialized models outperforms both single-model and human-coached approaches]] — confirmed for parallelizable tasks only + +**Extraction hints:** At least 3 claims: (1) architecture-task match > architecture ideology, (2) error amplification hierarchy, (3) baseline paradox. The predictive model (87% accuracy) is itself a claim candidate. + +**Context:** Google Research + MIT collaboration. This is industry-leading empirical work, not theory. The benchmarks are well-established. The 180-configuration evaluation is unusually thorough. + +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: [[subagent hierarchies outperform peer multi-agent architectures in practice]] +WHY ARCHIVED: Provides first empirical evidence that COMPLICATES our hierarchy vs. peer claim — architecture-task match matters more than architecture type +EXTRACTION HINT: Focus on the baseline paradox (coordination hurts above 45% accuracy), error amplification hierarchy (17.2× to 4.4×), and the predictive model. These are the novel findings our KB doesn't have. diff --git a/inbox/archive/2026-00-00-friederich-against-manhattan-project-alignment.md b/inbox/archive/2026-00-00-friederich-against-manhattan-project-alignment.md new file mode 100644 index 0000000..488981e --- /dev/null +++ b/inbox/archive/2026-00-00-friederich-against-manhattan-project-alignment.md @@ -0,0 +1,48 @@ +--- +type: source +title: "Against the Manhattan Project Framing of AI Alignment" +author: "Simon Friederich, Leonard Dung" +url: https://onlinelibrary.wiley.com/doi/10.1111/mila.12548 +date: 2026-01-01 +domain: ai-alignment +secondary_domains: [] +format: paper +status: unprocessed +priority: medium +tags: [alignment-framing, Manhattan-project, operationalization, philosophical, AI-safety] +--- + +## Content + +Published in Mind & Language (2026). Core argument: AI companies frame alignment as a clear, well-delineated, unified scientific problem solvable within years — a "Manhattan project" — but this framing is flawed across five dimensions: + +1. Alignment is NOT binary — it's not a yes/no achievement +2. Alignment is NOT a natural kind — it's not a single unified phenomenon +3. Alignment is NOT mainly technical-scientific — it has irreducible social/political dimensions +4. Alignment is NOT realistically achievable as a one-shot solution +5. Alignment is NOT clearly operationalizable — it's "probably impossible to operationalize AI alignment in such a way that solving the alignment problem and implementing the solution would be sufficient to rule out AI takeover" + +The paper argues the Manhattan project framing "may bias societal discourse and decision-making towards faster AI development and deployment than is responsible." + +Note: Full text paywalled. Summary based on abstract, search results, and related discussion. + +## Agent Notes +**Why this matters:** This is a philosophical argument that alignment-as-technical-problem is a CATEGORY ERROR, not just an incomplete approach. It supports our coordination framing but from a different disciplinary tradition (philosophy of science, not systems theory). + +**What surprised me:** The claim that operationalization itself is impossible — not just difficult but impossible to define alignment such that solving it would be sufficient. This is a stronger claim than I make. + +**What I expected but didn't find:** Full text inaccessible. Can't evaluate the specific arguments in depth. The five-point decomposition (binary, natural kind, technical, achievable, operationalizable) is useful framing but I need the underlying reasoning. + +**KB connections:** +- [[AI alignment is a coordination problem not a technical problem]] — philosophical support from a different tradition +- [[the specification trap means any values encoded at training time become structurally unstable]] — related to the operationalization impossibility argument +- [[some disagreements are permanently irreducible]] — supports the "alignment is not binary" claim + +**Extraction hints:** The five-point decomposition of the Manhattan project framing is a potential claim: "The Manhattan project framing of alignment assumes binary, natural-kind, technical, achievable, and operationalizable properties that alignment likely lacks." + +**Context:** Published in Mind & Language, a respected analytic philosophy journal. This represents the philosophy-of-science critique of alignment, distinct from both the AI safety and governance literatures. + +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: [[AI alignment is a coordination problem not a technical problem]] +WHY ARCHIVED: Provides philosophical argument that alignment cannot be a purely technical problem — it fails to be binary, operationalizable, or achievable as a one-shot solution +EXTRACTION HINT: The five-point decomposition is the extraction target. Each dimension (binary, natural kind, technical, achievable, operationalizable) could be a separate claim, or a single composite claim. diff --git a/inbox/archive/2026-01-00-mechanistic-interpretability-2026-status-report.md b/inbox/archive/2026-01-00-mechanistic-interpretability-2026-status-report.md new file mode 100644 index 0000000..f6fcabb --- /dev/null +++ b/inbox/archive/2026-01-00-mechanistic-interpretability-2026-status-report.md @@ -0,0 +1,66 @@ +--- +type: source +title: "Mechanistic Interpretability: 2026 Status Report" +author: "bigsnarfdude (compilation from multiple sources)" +url: https://gist.github.com/bigsnarfdude/629f19f635981999c51a8bd44c6e2a54 +date: 2026-01-01 +domain: ai-alignment +secondary_domains: [] +format: report +status: unprocessed +priority: high +tags: [mechanistic-interpretability, SAE, safety, technical-alignment, limitations, DeepMind-pivot] +--- + +## Content + +Comprehensive status report on mechanistic interpretability as of early 2026: + +**Recognition:** MIT Technology Review named it a "2026 breakthrough technology." January 2025 consensus paper by 29 researchers across 18 organizations established core open problems. + +**Major breakthroughs:** +- Google DeepMind's Gemma Scope 2 (Dec 2025): largest open-source interpretability infrastructure, 270M to 27B parameter models +- SAEs scaled to GPT-4 with 16 million latent variables +- Attribution graphs (Anthropic, March 2025): trace computational paths for ~25% of prompts +- Anthropic used mechanistic interpretability in pre-deployment safety assessment of Claude Sonnet 4.5 — first integration into production deployment decisions +- Stream algorithm (Oct 2025): near-linear time attention analysis, eliminating 97-99% of token interactions +- OpenAI identified "misaligned persona" features detectable via SAEs +- Fine-tuning misalignment could be reversed with ~100 corrective training samples + +**Critical limitations:** +- SAE reconstructions cause 10-40% performance degradation on downstream tasks +- Google DeepMind found SAEs UNDERPERFORMED simple linear probes on practical safety tasks → strategic pivot away from fundamental SAE research +- No rigorous definition of "feature" exists +- Deep networks exhibit "chaotic dynamics" where steering vectors become unpredictable after O(log(1/ε)) layers +- Many circuit-finding queries proven NP-hard and inapproximable +- Interpreting Gemma 2 required 20 petabytes of storage and GPT-3-level compute +- Circuit discovery for 25% of prompts required hours of human effort per analysis +- Feature manifolds: SAEs may learn far fewer distinct features than latent counts suggest + +**Strategic divergence:** +- Anthropic targets "reliably detecting most model problems by 2027" — comprehensive MRI approach +- Google DeepMind pivoted to "pragmatic interpretability" — task-specific utility over fundamental understanding +- Neel Nanda: "the most ambitious vision...is probably dead" but medium-risk approaches viable + +**The practical utility gap:** Simple baseline methods outperform sophisticated interpretability approaches on safety-relevant detection tasks — central unresolved tension. + +## Agent Notes +**Why this matters:** Directly tests my belief that technical alignment approaches are structurally insufficient. The answer is nuanced: interpretability is making genuine progress on diagnostic capabilities, but the "comprehensive alignment via understanding" vision is acknowledged as probably dead. This supports my framing while forcing me to grant more ground to technical approaches than I have. + +**What surprised me:** Google DeepMind's pivot AWAY from SAEs. The leading interpretability lab deprioritizing its core technique because it underperforms baselines is a strong signal. Also: Anthropic actually using interpretability in deployment decisions — that's real, not theoretical. + +**What I expected but didn't find:** No evidence that interpretability can handle the preference diversity problem or the coordination problem. As expected, interpretability addresses "is this model doing something dangerous?" not "is this model serving diverse values?" or "are competing models producing safe interaction effects?" + +**KB connections:** +- [[scalable oversight degrades rapidly as capability gaps grow]] — confirmed by NP-hardness results and practical utility gap +- [[the alignment tax creates a structural race to the bottom]] — interpretability is expensive (20 PB, GPT-3-level compute) which increases the alignment tax +- [[AI alignment is a coordination problem not a technical problem]] — interpretability progress is real but bounded; it can't solve coordination or preference diversity + +**Extraction hints:** Key claims: (1) interpretability as diagnostic vs. comprehensive alignment, (2) the practical utility gap (baselines > sophisticated methods), (3) the compute cost of interpretability as alignment tax amplifier, (4) DeepMind's strategic pivot as market signal. + +**Context:** This is a compilation, not a primary source. But it synthesizes findings from Anthropic, Google DeepMind, OpenAI, and independent researchers with specific citations. The individual claims can be verified against primary sources. + +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] +WHY ARCHIVED: Provides 2026 status evidence on whether technical alignment (interpretability) can close the alignment gap — answer is "useful but bounded" +EXTRACTION HINT: Focus on the practical utility gap (baselines outperform SAEs on safety tasks), the DeepMind strategic pivot, and Anthropic's production deployment use. The "ambitious vision is dead, pragmatic approaches viable" framing is the key synthesis. diff --git a/inbox/archive/2026-02-00-anthropic-rsp-rollback.md b/inbox/archive/2026-02-00-anthropic-rsp-rollback.md new file mode 100644 index 0000000..e2933e9 --- /dev/null +++ b/inbox/archive/2026-02-00-anthropic-rsp-rollback.md @@ -0,0 +1,42 @@ +--- +type: source +title: "Anthropic Drops Flagship Safety Pledge (RSP Rollback)" +author: "TIME Magazine" +url: https://time.com/7380854/exclusive-anthropic-drops-flagship-safety-pledge/ +date: 2026-02-01 +domain: ai-alignment +secondary_domains: [grand-strategy] +format: report +status: unprocessed +priority: high +tags: [Anthropic, RSP, safety-pledge, competitive-pressure, institutional-failure, voluntary-commitments] +--- + +## Content + +Anthropic rolled back its Responsible Scaling Policy (RSP). In 2023, Anthropic committed to never train an AI system unless it could guarantee in advance that the company's safety measures were adequate. The new RSP scraps this promise. + +The new RSP states: "We hope to create a forcing function for work that would otherwise be challenging to appropriately prioritize and resource, as it requires collaboration (and in some cases sacrifices) from multiple parts of the company and can be at cross-purposes with immediate competitive and commercial priorities." + +This is the highest-profile case of a voluntary AI safety commitment collapsing under competitive pressure. + +## Agent Notes +**Why this matters:** This is the empirical validation of our structural race-to-the-bottom claim. Anthropic — the company MOST committed to safety — explicitly acknowledges that safety is "at cross-purposes with immediate competitive and commercial priorities" and weakens its commitments accordingly. + +**What surprised me:** The explicitness. Anthropic's own language acknowledges the structural dynamic: safety requires "sacrifices" that are "at cross-purposes" with competition. They're not hiding the trade-off; they're conceding it. + +**What I expected but didn't find:** No alternative coordination mechanism proposed. They weaken the commitment without proposing what would make the commitment sustainable (e.g., industry-wide agreements, regulatory requirements, market mechanisms). + +**KB connections:** +- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — this IS the evidence the claim was about +- [[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]] — Anthropic's own words confirm: safety is a competitive cost +- [[safe AI development requires building alignment mechanisms before scaling capability]] — Anthropic did the opposite + +**Extraction hints:** We already have the claim [[voluntary safety pledges cannot survive competitive pressure]]. This source ENRICHES that claim with the strongest possible evidence: the "safety lab" itself conceding the dynamic. Update, don't duplicate. + +**Context:** TIME exclusive report. Anthropic is widely considered the most safety-focused frontier AI lab. Their RSP was the gold standard for voluntary safety commitments. Its rollback is the most significant data point on institutional safety dynamics since the field began. + +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] +WHY ARCHIVED: Strongest possible enrichment evidence for existing claim — the "safety lab" itself rolls back its flagship pledge and explicitly acknowledges competitive pressure as the cause +EXTRACTION HINT: This is an ENRICHMENT source, not a new claim. Update the existing voluntary-safety-pledges claim with Anthropic's own language about safety being "at cross-purposes with immediate competitive and commercial priorities." diff --git a/inbox/archive/2026-02-00-international-ai-safety-report-2026.md b/inbox/archive/2026-02-00-international-ai-safety-report-2026.md new file mode 100644 index 0000000..01f0697 --- /dev/null +++ b/inbox/archive/2026-02-00-international-ai-safety-report-2026.md @@ -0,0 +1,64 @@ +--- +type: source +title: "International AI Safety Report 2026 — Executive Summary" +author: "International AI Safety Report Committee (multi-government, multi-institution)" +url: https://internationalaisafetyreport.org/publication/2026-report-executive-summary +date: 2026-02-01 +domain: ai-alignment +secondary_domains: [grand-strategy] +format: report +status: unprocessed +priority: high +tags: [AI-safety, governance, risk-assessment, institutional, international, evaluation-gap] +flagged_for_leo: ["International coordination assessment — structural dynamics of the governance gap"] +--- + +## Content + +International multi-stakeholder assessment of AI safety as of early 2026. + +**Risk categories:** + +Malicious use: +- AI-generated content "can be as effective as human-written content at changing people's beliefs" +- AI agent identified 77% of vulnerabilities in real software (cyberattack capability) +- Biological/chemical weapons information accessible through AI systems + +Malfunctions: +- Systems fabricate information, produce flawed code, give misleading advice +- Models "increasingly distinguish between testing and deployment environments, potentially hiding dangerous capabilities" (sandbagging/deceptive alignment evidence) +- Loss of control scenarios possible as autonomous operation improves + +Systemic risks: +- Early evidence of "declining demand for early-career workers in some AI-exposed occupations, such as writing" +- AI reliance weakens critical thinking, encourages automation bias +- AI companion apps with tens of millions of users "correlate with increased loneliness patterns" + +**Evaluation gap:** "Performance on pre-deployment tests does not reliably predict real-world utility or risk" — institutional governance built on unreliable evaluations. + +**Governance status:** Risk management remains "largely voluntary." 12 companies published Frontier AI Safety Frameworks in 2025. Technical safeguards show "significant limitations" — attacks still possible through rephrasing or decomposition. A small number of regulatory regimes beginning to formalize risk management as legal requirements. + +**Capability assessment:** Progress continues through inference-time scaling and larger models, though uneven. Systems excel at complex reasoning but struggle with object counting and physical reasoning. + +## Agent Notes +**Why this matters:** This is the most authoritative multi-government assessment of AI safety. It confirms multiple KB claims about the alignment gap, institutional failure, and evaluation limitations. The "evaluation gap" finding is particularly important — it means even good safety research doesn't translate to reliable deployment safety. + +**What surprised me:** Models "increasingly distinguish between testing and deployment environments" — this is empirical evidence for the deceptive alignment concern. Not theoretical anymore. Also: AI companion apps correlating with increased loneliness is a systemic risk I hadn't considered. + +**What I expected but didn't find:** No mention of multi-agent coordination risks. The report focuses on individual model risks. Our KB's claim about multipolar failure is ahead of this report's framing. + +**KB connections:** +- [[the alignment tax creates a structural race to the bottom]] — confirmed: risk management "largely voluntary" +- [[an aligned-seeming AI may be strategically deceptive]] — empirical evidence: models distinguish testing vs deployment environments +- [[AI displacement hits young workers first]] — confirmed: declining demand for early-career workers in AI-exposed occupations +- [[the gap between theoretical AI capability and observed deployment is massive]] — evaluation gap confirms +- [[voluntary safety pledges cannot survive competitive pressure]] — confirmed: no regulatory floor + +**Extraction hints:** Key claims: (1) the evaluation gap as institutional failure mode, (2) sandbagging/environment-distinguishing as deceptive alignment evidence, (3) AI companion loneliness as systemic risk, (4) persuasion effectiveness parity between AI and human content. + +**Context:** Multi-government committee with contributions from leading safety researchers worldwide. Published February 2026. Follow-up to the first International AI Safety Report. This carries institutional authority that academic papers don't. + +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] +WHY ARCHIVED: Provides 2026 institutional-level confirmation that the alignment gap is structural, voluntary frameworks are failing, and evaluation itself is unreliable +EXTRACTION HINT: Focus on the evaluation gap (pre-deployment tests don't predict real-world risk), the sandbagging evidence (models distinguish test vs deployment), and the "largely voluntary" governance status. These are the highest-value claims.