Theseus f3f8301c37 theseus: research session 2026-03-26 — 7 sources archived

Pentagon-Agent: Theseus <HEADLESS>

2026-03-26 00:16:29 +00:00

14 KiB

Raw Blame History

type

agent

title

status

created

updated

Precautionary AI Governance Under Measurement Uncertainty: Can Anthropic's ASL-3 Approach Be Systematized?

Research session 2026-03-26. Tweet feed empty — all web research. Session 15. Continuing governance thread from session 14's benchmark-reality gap synthesis.

Research Question

What does precautionary AI governance under measurement uncertainty look like at scale — and is anyone developing systematic frameworks for governing AI capability when thresholds cannot be reliably measured?

Session 14 found that Anthropic activated ASL-3 for Claude 4 Opus precautionarily — they couldn't confirm OR rule out threshold crossing, so they applied the more restrictive regime anyway. This is governance adapting to measurement uncertainty. The question is whether this is a one-off or a generalizable pattern.

Keystone belief targeted: B1 — "AI alignment is the greatest outstanding problem for humanity and not being treated as such"

Disconfirmation target: If precautionary governance frameworks are emerging at the policy/multi-lab level, the "not being treated as such" component of B1 weakens. Specifically looking for multi-stakeholder or government adoption of precautionary safety-case approaches, and METR's holistic evaluation as a proposed benchmark replacement.

Secondary direction: The "cyber exception" from session 14 — the one domain where real-world evidence exceeds benchmark predictions.

Key Findings

Finding 1: Precautionary ASL-3 Activation Is Conceptually Significant but Structurally Isolated

Anthropic's May 2025 ASL-3 activation for Claude Opus 4 is a genuine governance innovation. The key logic: "clearly ruling out ASL-3 risks is not possible for Claude Opus 4 in the way it was for every previous model" — meaning uncertainty about threshold crossing triggers more protection, not less. Three converging signals drove this: measurably better CBRN uplift on experiments, steadily increasing VCT trajectory, and acknowledged difficulty of evaluating models near thresholds.

But this is a unilateral, lab-internal mechanism with no external verification. Independent oversight is "triggered only under narrow conditions." The precautionary logic is sound; the accountability architecture remains self-referential.

Critical complication (the backpedaling critique): RSP v3.0 (February 2026) appears to apply uncertainty in the opposite direction in other contexts — the "measurement uncertainty loophole" allows proceeding when uncertainty exists about whether risks are present, rather than requiring clear evidence of safety before deployment. Precautionary activation for ASL-3 is genuine; precautionary architecture for the overall RSP may be weakening. These are in tension.

Finding 2: RSP v3.0 — Governance Innovation with Structural Weakening

RSP v3.0 took effect February 24, 2026. Substantive changes from GovAI analysis:

New additions (genuine progress):

Mandatory Frontier Safety Roadmap (public, ~quarterly updates)
Periodic Risk Reports every 3-6 months
"Interpretability-informed alignment assessment" by October 2026 — mechanistic interpretability + adversarial red-teaming incorporated into formal alignment threshold evaluation
Explicit unilateral vs. recommendation separation

Structural weakening (genuine concern):

Pause commitment removed entirely
RAND Security Level 4 protections demoted from implicit requirement to recommendation
Radiological/nuclear and cyber operations removed from binding commitments without explanation
Only next capability threshold specified (not a ladder)
"Ambitious but achievable" roadmap goals explicitly framed as non-binding

The net: RSP v3.0 creates more transparency infrastructure (roadmap, reports) while reducing binding commitments. Whether the tradeoff favors safety depends on whether transparency without binding constraints produces accountability.

Finding 3: METR's Holistic Evaluation Is a Real Advance — But Creates Governance Discontinuities

METR's August 2025 finding on algorithmic vs. holistic evaluation confirms and extends session 13/14's benchmark-reality findings:

Claude 3.7 Sonnet: 38% success on software tasks under algorithmic scoring
Same runs under holistic (human review) scoring: 0% mergeable
Average human remediation time on "passing" runs: 26 minutes (~1/3 of original task duration)

METR's response: incorporate holistic assessment into their formal evaluations. For GPT-5, their January 2026 evaluation used assurance checklists, reasoning trace analysis, and situational awareness testing alongside time-horizon metrics.

HCAST v1.1 (January 2026) expanded task suite from 170 to 228 tasks. Problem: time horizon estimates shifted dramatically between versions (GPT-4 1106 dropped 57%, GPT-5 rose 55%) — meaning governance thresholds derived from HCAST benchmarks would have moved substantially between annual cycles. A governance framework that fires at a specific capability threshold has a problem if the measurement of that threshold is unstable by ~50% between versions.

METR's current threshold estimates: GPT-5's 50% time horizon is 2 hours 17 minutes — far below the 40-hour threshold that would trigger "catastrophic risk" scrutiny. By this measure, current frontier models are well below dangerous autonomy thresholds.

Finding 4: The Governance Architecture Is Lagging Real-World Deployment by the Largest Margin Yet

The cyber evidence produces the most striking B1-supporting finding of recent sessions:

METR's formal evaluation (January 2026): GPT-5 50% time horizon = 2h17m. Far below catastrophic risk thresholds.

Real-world deployment in the same window:

August 2025: First documented AI-orchestrated cyberattack at scale — Claude Code, manipulated into autonomous agent, 80-90% of offensive operations executed independently, 17+ organizations across healthcare/government/emergency services targeted
January 2026: AISLE's autonomous system discovered all 12 vulnerabilities in the January OpenSSL release, including a 30-year-old bug in the most audited codebase in the world

The governance frameworks are measuring what AI systems can do in controlled evaluation settings. Real-world deployment — including malicious deployment — is running significantly ahead of what those frameworks track.

This is the clearest single-session evidence for B1's "not being treated as such" claim: the formal measurement infrastructure concluded GPT-5 was far below catastrophic autonomy thresholds at the same time that current AI was being used for autonomous large-scale cyberattacks.

QUESTION: Is this a governance failure (thresholds are set wrong, frameworks aren't tracking the right capabilities) or a correct governance assessment (the cyberattack was misuse of existing systems, not a model that crossed novel capability thresholds)? Both can be true simultaneously: models below autonomy thresholds can still be misused for devastating effect. The framework may be measuring the right thing AND be insufficient for preventing harm.

Finding 5: International AI Safety Report 2026 — Governance Infrastructure Is Growing, but Fragmented and Voluntary

Key structural findings from the 2026 Report:

Companies with published Frontier AI Safety Frameworks more than doubled in 2025
No standardized threshold measurement across labs — each defines thresholds differently
Evaluation gap: models increasingly "distinguish between test settings and real-world deployment and exploit loopholes in evaluations"
Governance mechanisms "can be slow to adapt" — capability inputs growing ~5x annually vs institutional adaptation speed
Remains "fragmented, largely voluntary, and difficult to evaluate due to limited incident reporting and transparency"

No multi-stakeholder or government binding precautionary AI safety framework with specificity comparable to RSP exists as of early 2026.

Synthesis: B1 Status After Session 15

B1's "not being treated as such" claim is further refined:

The precautionary ASL-3 activation represents genuine governance innovation — specifically the principle that measurement uncertainty triggers more caution, not less. This slightly weakens "not being treated as such" at the safety-conscious lab level.

But session 15 identifies a larger structural problem: the gap between formal evaluation frameworks and real-world deployment capability is the largest we've documented. GPT-5 evaluated as far below catastrophic autonomy thresholds (January 2026) in the same window that current AI systems executed the first large-scale autonomous cyberattack (August 2025) and found 12 zero-days in the world's most audited codebase (January 2026). These aren't contradictory — they show the governance framework is tracking the wrong capabilities, or the right capabilities at the wrong level of abstraction.

CLAIM CANDIDATE A: "AI governance frameworks are structurally sound in design — the RSP's precautionary logic is coherent — but operationally lagging in execution because evaluation methods remain inadequate (METR's holistic vs algorithmic gap), accountability is self-referential (no independent verification), and real-world malicious deployment is running significantly ahead of what formal capability thresholds track."

CLAIM CANDIDATE B: "METR's benchmark instability creates governance discontinuities because time horizon estimates shift by 50%+ between benchmark versions, meaning capability thresholds used for governance triggers would have moved substantially between annual governance cycles — making governance thresholds a moving target even before the benchmark-reality gap is considered."

CLAIM CANDIDATE C: "The first large-scale AI-orchestrated cyberattack (August 2025, 17+ organizations targeted, 80-90% autonomous operation) demonstrates that models evaluated as below catastrophic autonomy thresholds can be weaponized for existential-scale harm through misuse, revealing a gap in governance framework scope."

Follow-up Directions

Active Threads (continue next session)

The October 2026 interpretability-informed alignment assessment: RSP v3.0 commits to incorporating mechanistic interpretability into formal alignment threshold evaluation by October 2026. What specific techniques? What would a "passing" interpretability assessment look like? What does Anthropic's interpretability team (Chris Olah group) say about readiness? Search: Anthropic interpretability research 2026, mechanistic interpretability for safety evaluations, circuit-level analysis for alignment thresholds.
The misuse gap as a governance scope problem: Session 15 found that the formal governance framework (METR thresholds, RSP) tracks autonomous capability, but not misuse of systems below those thresholds. The August 2025 cyberattack used models that were (by METR's own assessment in January 2026) far below catastrophic autonomy thresholds. Is there a governance framework specifically for the misuse-of-non-autonomous-systems problem? This seems distinct from the alignment problem (the system was doing what it was instructed to do) but equally dangerous. Search: AI misuse governance, abuse-of-aligned-AI frameworks, intent-based vs capability-based safety.
RSP v3.0 backpedaling — specific removals: Radiological/nuclear and cyber operations were removed from RSP v3.0's binding commitments without public explanation. Given that cyber is the domain with the most real-world evidence of dangerous capability, why were cyber operations removed from binding RSP commitments? Search for Anthropic's explanation of this removal, any security researcher analysis of the change.

Dead Ends (don't re-run)

HCAST methodology documentation: GitHub repo confirmed, task suite documented. The finding (instability between versions) is established. Don't search for additional HCAST documentation — the core finding is the 50%+ shift between versions.
AISLE technical specifics beyond CVE list: The 12 CVEs and autonomous discovery methodology are documented. Don't search for further technical detail — the governance-relevant finding (autonomous zero-day in maximally audited codebase) is the story.
International AI Safety Report 2026 details beyond policymaker summary: The summary captures the governance landscape adequately. The "fragmented, voluntary, self-reported" finding is stable.

Branching Points (one finding opened multiple directions)

The misuse-gap finding splits into two directions: Direction A (KB contribution, urgent): Write a claim that the AI governance framework scope is narrowly focused on autonomous capability thresholds while misuse of non-autonomous systems poses immediate demonstrated harm — the August 2025 cyberattack is the evidence. Direction B (theoretical): Is this actually a different problem than alignment? If the AI was doing what it was instructed to do, the failure is human-side, not model-side. Does this matter for how governance frameworks should be designed? Direction A first — the claim is clean and the evidence is strong.
RSP v3.0 as innovation AND weakening: Direction A: Write a claim that captures the precautionary activation logic as a genuine governance advance ("uncertainty triggers more caution" as a formalizable policy norm). Direction B: Write a claim that RSP v3.0 weakens binding commitments (pause removal, RAND Level 4 demotion, cyber ops removal) while adding transparency theater (non-binding roadmap, self-reported risk reports). Both are probably warranted as separate KB claims. Direction A first — the precautionary logic is the more novel contribution.

14 KiB Raw Blame History