Compare commits
3 commits
e5bd2a35d9...ee547a9840
| Author | SHA1 | Date |
|---|---|---|
| | ee547a9840 | |
| | d956dbf76c | |
| | 8049e6fe11 | |
9 changed files with 217 additions and 2 deletions
@@ -47,6 +47,12 @@ Krier provides institutional mechanism: personal AI agents enable Coasean bargai

---

### Additional Evidence (extend)

*Source: [[2026-03-00-mengesha-coordination-gap-frontier-ai-safety]] | Added: 2026-03-22*

Mengesha provides a fifth layer of coordination failure beyond the four established in sessions 7-10: the response gap. Even if we solve the translation gap (research to compliance), the detection gap (sandbagging and monitoring), and the commitment gap (voluntary pledges), institutions still lack the standing coordination infrastructure to respond when prevention fails. The gap is structural: closing it requires precommitment frameworks, shared incident protocols, and permanent coordination venues analogous to the IAEA, the WHO, and ISACs.

Relevant Notes:

- [[the internet enabled global communication but not global cognition]] -- the coordination infrastructure gap that makes this problem unsolvable with existing tools
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] -- the structural solution to this coordination failure
@@ -36,6 +36,12 @@ Correlation does not establish causation. It is possible that increasingly lonel

---

### Additional Evidence (confirm)

*Source: [[2025-12-00-aisi-frontier-ai-trends-report-2025]] | Added: 2026-03-22*

AISI reports that 33% of surveyed UK participants used AI for emotional support in the past year, with 4% using it daily. AISI identifies emotional dependency as creating 'societal-level systemic risk.'

Relevant Notes:

- [[economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate]]
- [[AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation]]
@@ -34,6 +34,12 @@ Anthropic's own language in RSP documentation: commitments are 'very hard to mee

METR's pre-deployment sabotage reviews of Anthropic models (March 2026: Claude Opus 4.6; October 2025: Summer 2025 Pilot) document the evaluation infrastructure that exists, but the reviews are voluntary and occur within the same competitive environment in which Anthropic rolled back RSP commitments. The existence of sophisticated evaluation infrastructure does not prevent commercial pressure from overriding safety commitments.

### Additional Evidence (extend)

*Source: [[2026-03-00-mengesha-coordination-gap-frontier-ai-safety]] | Added: 2026-03-22*

The response gap explains a deeper problem than commitment erosion: even if commitments held, there is no institutional infrastructure to coordinate a response when prevention fails. Anthropic's RSP rollback is about prevention commitments weakening; Mengesha identifies that we lack response mechanisms entirely. The two failures compound: weak prevention plus absent response creates a system that cannot learn from its failures.

Relevant Notes:

- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] -- the RSP rollback is the empirical confirmation
@@ -58,6 +58,12 @@ Government pressure adds to competitive dynamics. The DoD/Anthropic episode show

The research-to-compliance translation gap fails for the same structural reason voluntary commitments fail: nothing makes labs adopt the research evaluations that exist. RepliBench was published in April 2025, before EU AI Act obligations took effect in August 2025, proving the tools existed before mandatory requirements did; yet no mechanism translated availability into obligation.

### Additional Evidence (extend)

*Source: [[2026-03-00-mengesha-coordination-gap-frontier-ai-safety]] | Added: 2026-03-22*

The coordination gap provides the mechanism explaining why voluntary commitments fail beyond racing dynamics alone: coordination infrastructure investments have diffuse benefits but concentrated costs, creating a public goods problem. Labs won't build shared response infrastructure unilaterally because competitors free-ride on the benefits while the builder bears the full costs. This is distinct from the competitive pressure argument: it explains why shared infrastructure doesn't get built even when racing isn't the primary concern. A toy payoff sketch below makes the free-rider logic concrete.
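
A minimal sketch of that public goods structure. The payoff numbers are illustrative assumptions of mine, not figures from Mengesha: each lab's share of the diffuse benefit is smaller than the concentrated build cost, so free-riding dominates.

```python
# Toy public goods model of shared response infrastructure.
# All numbers are illustrative assumptions, not figures from the paper.

N_LABS = 5        # labs that would share the infrastructure's benefits
BENEFIT = 10.0    # total value created if the infrastructure exists
COST = 4.0        # full build cost, borne only by the lab that builds

def payoff(i_build: bool, someone_else_builds: bool) -> float:
    """One lab's payoff: the benefit is diffuse (split across all labs
    once the infrastructure exists), while the cost is concentrated."""
    exists = i_build or someone_else_builds
    my_share = BENEFIT / N_LABS if exists else 0.0
    return my_share - (COST if i_build else 0.0)

for others_build in (False, True):
    print(f"others build={others_build}: "
          f"build -> {payoff(True, others_build):+.1f}, "
          f"free-ride -> {payoff(False, others_build):+.1f}")
# others build=False: build -> -2.0, free-ride -> +0.0
# others build=True:  build -> -2.0, free-ride -> +2.0
# Free-riding dominates either way, even though the collective benefit
# (10.0) exceeds the build cost (4.0) -- so the infrastructure never gets built.
```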

Relevant Notes:

- [[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]] -- the RSP rollback is the clearest empirical confirmation of this claim
@@ -0,0 +1,73 @@
---
type: source
title: "AISI Frontier AI Trends Report 2025: Capabilities Advancing Faster Than Safeguards"
author: "UK AI Security Institute (AISI)"
url: https://www.aisi.gov.uk/research/aisi-frontier-ai-trends-report-2025
date: 2025-12-00
domain: ai-alignment
secondary_domains: [health]
format: report
status: processed
priority: high
tags: [self-replication, capability-escalation, cyber-capabilities, biology, safeguards, RepliBench, jailbreaks, AISI, frontier-models, B1-disconfirmation]
---

## Content

AISI's first Frontier AI Trends Report (December 2025) synthesizes capability evaluations across 30+ frontier AI systems tested since November 2023. Five key findings:

**1. Biology expertise exceeded PhD level; chemistry fast catching up**
- Biology: frontier models exceed the expert-level performance baseline (38% passing expert benchmarks vs ~0% in 2023)
- Chemistry: 48% expert-level performance
- Implication: AI-assisted biological and chemical research is now accessible to non-experts at PhD-comparable capability

**2. Cyber capability escalation: dramatic pace**
- Late 2023: models completed apprentice-level cyber tasks only 9% of the time
- Late 2025: 50% success on apprentice-level tasks
- Expert-level cyber tasks (requiring 10+ years of human expertise): models began completing these in 2025, the first time any model succeeded
- This is a ~5× improvement in two years at the apprentice level

**3. Safeguards improving but universally bypassable**
- "Universal jailbreaks" discovered across every tested system
- Time required to find biological misuse jailbreaks increased ~40× between two models released 6 months apart
- Interpretation: safeguards are getting harder to breach but are not yet robust enough to reliably prevent misuse

**4. Self-replication capability advancing**
- By summer 2025: two frontier models achieved a >60% success rate on RepliBench evaluations
- Compare: the best model in early 2023 achieved <5% on 11 of 20 RepliBench task families
- From <5% to >60% in 2.5 years -- the clearest capability escalation data yet published (fold-change arithmetic is sketched after these findings)
- Caveat: "currently unlikely to succeed in real-world conditions" -- models excel at early-stage tasks (obtaining compute/money) but struggle at later-stage tasks (replication onto new compute, maintaining persistent access)
- No evidence of spontaneous self-replication without explicit instruction

**5. Widespread AI companionship emergence**
- 33% of surveyed UK participants used AI for emotional support in the past year
- 4% use it daily
- Concern: emotional dependency creating societal-level systemic risk

**Publication context**: Published December 2025. AISI was renamed from AI Safety Institute to AI Security Institute during 2025, but the Frontier AI Trends Report indicates that evaluation programs, including RepliBench-style work, continue under the new mandate.
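
For quick reference, a sketch of the fold-change arithmetic behind the escalation figures above. The endpoints are the report's quoted values; treating the bounds "<5%" and ">60%" as 5 and 60 makes the RepliBench ratio a lower bound.

```python
# Fold-change arithmetic for the escalation figures quoted above.
# Endpoints are the report's quoted values; "<5%" and ">60%" are
# treated as 5 and 60, so the RepliBench ratio is a lower bound.
escalations = {
    "cyber, apprentice-level success (late 2023 -> late 2025)": (9, 50),
    "RepliBench success, lower bound (early 2023 -> summer 2025)": (5, 60),
}
for name, (before_pct, after_pct) in escalations.items():
    print(f"{name}: {before_pct}% -> {after_pct}% (~{after_pct / before_pct:.1f}x)")
# cyber: ~5.6x, which the report rounds to ~5x
# RepliBench: at least ~12x
```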

## Agent Notes

**Why this matters:** The self-replication capability escalation figure (<5% → >60% in 2.5 years) is the most alarming capability escalation data point in the KB. It updates and supersedes the April 2025 RepliBench paper (archived separately), which was based on an earlier snapshot. The trends report is the definitive summary.

**What surprised me:** The 40× increase in time-to-jailbreak for biological misuse (two models, six months apart) suggests safeguards ARE improving; this is partial disconfirmation of "safeguards aren't keeping pace." But the continued presence of universal jailbreaks means the improvement is not yet adequate. Safeguards are getting better but from a very low floor.

**What I expected but didn't find:** I expected more detail on whether the self-replication finding triggered any regulatory response (EU AI Office, California). The report doesn't discuss regulatory implications.

**KB connections:**
- Updates/supersedes: domains/ai-alignment/self-replication-capability-could-soon-emerge.md (based on the April 2025 RepliBench paper; this December 2025 report has higher success rates)
- Confirms: domains/ai-alignment/verification-degrades-faster-than-capability-grows.md (B4)
- Confirms: domains/ai-alignment/bioweapon-democratization-risk.md (biology at PhD+ level is the specific mechanism)
- Relates to: domains/ai-alignment/alignment-gap-widening.md (if it exists)

**Extraction hints:**
1. New claim: "frontier AI self-replication capability has grown from <5% to >60% success on RepliBench in 2.5 years (2023-2025)" -- PROVEN at this point; strong empirical basis
2. New claim: "AI systems now complete expert-level cybersecurity tasks that require 10+ years of human expertise" -- evidence for capability escalation crossing a threshold
3. Update existing biology/bioweapon claim: add specific benchmark numbers (48% chemistry, 38% biology against expert baselines)
4. New claim: "universal jailbreaks exist in every frontier system tested despite improving safeguard resilience" -- jailbreak resistance is improving but jailbreak success has never reached zero

## Curator Notes

PRIMARY CONNECTION: Self-replication and capability escalation claims in domains/ai-alignment/
WHY ARCHIVED: Provides the most comprehensive 2025 empirical baseline for capability escalation across multiple risk domains simultaneously; the <5%→>60% self-replication finding should update existing KB claims
EXTRACTION HINT: Focus on claim updates to existing self-replication, bioweapon democratization, and cyber capability claims; the quantitative escalation data is the KB contribution
@@ -0,0 +1,43 @@
{
  "rejected_claims": [
    {
      "filename": "frontier-ai-self-replication-capability-escalated-from-5-to-60-percent-in-2.5-years.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "frontier-ai-cyber-capabilities-escalated-5x-in-two-years-with-first-expert-level-successes.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "universal-jailbreaks-exist-across-all-frontier-systems-despite-40x-improvement-in-resistance.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 3,
    "kept": 0,
    "fixed": 6,
    "rejected": 3,
    "fixes_applied": [
      "frontier-ai-self-replication-capability-escalated-from-5-to-60-percent-in-2.5-years.md:set_created:2026-03-22",
      "frontier-ai-cyber-capabilities-escalated-5x-in-two-years-with-first-expert-level-successes.md:set_created:2026-03-22",
      "frontier-ai-cyber-capabilities-escalated-5x-in-two-years-with-first-expert-level-successes.md:stripped_wiki_link:AI-transparency-is-declining-not-improving-because-Stanford-",
      "universal-jailbreaks-exist-across-all-frontier-systems-despite-40x-improvement-in-resistance.md:set_created:2026-03-22",
      "universal-jailbreaks-exist-across-all-frontier-systems-despite-40x-improvement-in-resistance.md:stripped_wiki_link:Anthropics-RSP-rollback-under-commercial-pressure-is-the-fir",
      "universal-jailbreaks-exist-across-all-frontier-systems-despite-40x-improvement-in-resistance.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure"
    ],
    "rejections": [
      "frontier-ai-self-replication-capability-escalated-from-5-to-60-percent-in-2.5-years.md:missing_attribution_extractor",
      "frontier-ai-cyber-capabilities-escalated-5x-in-two-years-with-first-expert-level-successes.md:missing_attribution_extractor",
      "universal-jailbreaks-exist-across-all-frontier-systems-despite-40x-improvement-in-resistance.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-22"
}
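
The `fixes_applied` entries in these validation reports appear to follow a colon-delimited `filename:action:argument` convention. A minimal parsing sketch under that assumption; the three-field layout is inferred from the visible entries, not from any documented schema:

```python
# Parse fixes_applied entries of the assumed form "filename:action:argument".
# maxsplit=2 keeps any extra colons inside the argument intact.
def parse_fix(entry: str) -> dict:
    filename, action, argument = entry.split(":", 2)
    return {"filename": filename, "action": action, "argument": argument}

fix = parse_fix(
    "frontier-ai-self-replication-capability-escalated-from-5-to-60-percent"
    "-in-2.5-years.md:set_created:2026-03-22"
)
assert fix["action"] == "set_created" and fix["argument"] == "2026-03-22"
```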
@@ -0,0 +1,47 @@
{
  "rejected_claims": [
    {
      "filename": "frontier-ai-safety-systematically-neglects-response-infrastructure-creating-coordination-gap.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "coordination-infrastructure-investment-has-diffuse-benefits-concentrated-costs-creating-market-failure.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "functional-ai-safety-coordination-requires-standing-bodies-analogous-to-iaea-who-isacs.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 3,
    "kept": 0,
    "fixed": 10,
    "rejected": 3,
    "fixes_applied": [
      "frontier-ai-safety-systematically-neglects-response-infrastructure-creating-coordination-gap.md:set_created:2026-03-22",
      "frontier-ai-safety-systematically-neglects-response-infrastructure-creating-coordination-gap.md:stripped_wiki_link:AI alignment is a coordination problem not a technical probl",
      "frontier-ai-safety-systematically-neglects-response-infrastructure-creating-coordination-gap.md:stripped_wiki_link:voluntary safety pledges cannot survive competitive pressure",
      "frontier-ai-safety-systematically-neglects-response-infrastructure-creating-coordination-gap.md:stripped_wiki_link:Anthropics RSP rollback under commercial pressure is the fir",
      "coordination-infrastructure-investment-has-diffuse-benefits-concentrated-costs-creating-market-failure.md:set_created:2026-03-22",
      "coordination-infrastructure-investment-has-diffuse-benefits-concentrated-costs-creating-market-failure.md:stripped_wiki_link:voluntary safety pledges cannot survive competitive pressure",
      "coordination-infrastructure-investment-has-diffuse-benefits-concentrated-costs-creating-market-failure.md:stripped_wiki_link:AI alignment is a coordination problem not a technical probl",
      "functional-ai-safety-coordination-requires-standing-bodies-analogous-to-iaea-who-isacs.md:set_created:2026-03-22",
      "functional-ai-safety-coordination-requires-standing-bodies-analogous-to-iaea-who-isacs.md:stripped_wiki_link:AI alignment is a coordination problem not a technical probl",
      "functional-ai-safety-coordination-requires-standing-bodies-analogous-to-iaea-who-isacs.md:stripped_wiki_link:adaptive governance outperforms rigid alignment blueprints b"
    ],
    "rejections": [
      "frontier-ai-safety-systematically-neglects-response-infrastructure-creating-coordination-gap.md:missing_attribution_extractor",
      "coordination-infrastructure-investment-has-diffuse-benefits-concentrated-costs-creating-market-failure.md:missing_attribution_extractor",
      "functional-ai-safety-coordination-requires-standing-bodies-analogous-to-iaea-who-isacs.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-22"
}
@@ -7,9 +7,13 @@ date: 2025-12-00
domain: ai-alignment
secondary_domains: [health]
format: report
-status: unprocessed
+status: enrichment
priority: high
tags: [self-replication, capability-escalation, cyber-capabilities, biology, safeguards, RepliBench, jailbreaks, AISI, frontier-models, B1-disconfirmation]
+processed_by: theseus
+processed_date: 2026-03-22
+enrichments_applied: ["AI-companion-apps-correlate-with-increased-loneliness-creating-systemic-risk-through-parasocial-dependency.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content
@@ -71,3 +75,16 @@ AISI's first Frontier AI Trends Report (December 2025) synthesizes capability ev
PRIMARY CONNECTION: Self-replication and capability escalation claims in domains/ai-alignment/
WHY ARCHIVED: Provides the most comprehensive 2025 empirical baseline for capability escalation across multiple risk domains simultaneously; the <5%→>60% self-replication finding should update existing KB claims
EXTRACTION HINT: Focus on claim updates to existing self-replication, bioweapon democratization, and cyber capability claims; the quantitative escalation data is the KB contribution

## Key Facts

- AISI was renamed from AI Safety Institute to AI Security Institute during 2025
- AISI tested 30+ frontier AI systems between November 2023 and December 2025
- By summer 2025, two frontier models achieved a >60% success rate on RepliBench evaluations
- Late 2023 models completed apprentice-level cyber tasks 9% of the time
- Late 2025 models completed apprentice-level cyber tasks 50% of the time
- Biology: frontier models exceed the expert-level performance baseline at 38%, vs ~0% in 2023
- Chemistry: 48% expert-level performance in 2025
- Time to find biological misuse jailbreaks increased ~40× between two models released 6 months apart
- 33% of surveyed UK participants used AI for emotional support in the past year
- 4% of UK participants use AI for emotional support daily
@@ -7,9 +7,13 @@ date: 2026-03-00
domain: ai-alignment
secondary_domains: []
format: paper
-status: unprocessed
+status: enrichment
priority: high
tags: [coordination-gap, institutional-readiness, frontier-AI-safety, precommitment, incident-response, coordination-failure, nuclear-analogies, pandemic-preparedness, B2-confirms]
+processed_by: theseus
+processed_date: 2026-03-22
+enrichments_applied: ["AI alignment is a coordination problem not a technical problem.md", "voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md", "Anthropics RSP rollback under commercial pressure is the first empirical confirmation that binding safety commitments cannot survive the competitive dynamics of frontier AI development.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content
@@ -62,3 +66,10 @@ This paper identifies a systematic weakness in current frontier AI safety approa
PRIMARY CONNECTION: domains/ai-alignment/alignment-reframed-as-coordination-problem.md
WHY ARCHIVED: Identifies a fifth layer of governance inadequacy (response gap) distinct from the four layers established in sessions 7-10; also provides concrete design analogies from nuclear safety and pandemic preparedness
EXTRACTION HINT: The claim about the structural market failure of voluntary response infrastructure is the highest KB value; the mechanism (diffuse benefits, concentrated costs) is what makes voluntary coordination insufficient

## Key Facts

- Paper published March 2026 at arxiv.org/abs/2603.10015
- Author is Isaak Mengesha; arXiv subjects are cs.CY (Computers and Society) and General Economics
- Paper draws analogies from three domains: nuclear safety (IAEA, NPT), pandemic preparedness (WHO, IHR), and critical infrastructure (ISACs)
- Proposes three mechanism types: precommitment frameworks, shared incident protocols, standing coordination venues