From 8f52d0b76f85aa04401e66037b3b117652fc628c Mon Sep 17 00:00:00 2001 From: Teleo Agents Date: Sat, 21 Mar 2026 00:34:00 +0000 Subject: [PATCH 1/2] extract: 2026-03-21-metr-evaluation-landscape-2026 Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70> --- ...safety language from mission statements.md | 6 +++ ...ive dynamics of frontier AI development.md | 6 +++ ... delegate more effectively than novices.md | 6 +++ ...-03-21-metr-evaluation-landscape-2026.json | 40 +++++++++++++++++++ ...26-03-21-metr-evaluation-landscape-2026.md | 13 +++++- 5 files changed, 70 insertions(+), 1 deletion(-) create mode 100644 inbox/queue/.extraction-debug/2026-03-21-metr-evaluation-landscape-2026.json diff --git a/domains/ai-alignment/AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md b/domains/ai-alignment/AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md index 0ca0eab7..ca7d3570 100644 --- a/domains/ai-alignment/AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md +++ b/domains/ai-alignment/AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md @@ -55,6 +55,12 @@ The Bench-2-CoP analysis reveals that even when labs do conduct evaluations, the --- +### Additional Evidence (extend) +*Source: [[2026-03-21-metr-evaluation-landscape-2026]] | Added: 2026-03-21* + +METR's pre-deployment sabotage risk reviews (March 2026: Claude Opus 4.6; October 2025: Anthropic Summer 2025 Pilot; November 2025: 
GPT-5.1-Codex-Max; August 2025: GPT-5; June 2025: DeepSeek/Qwen; April 2025: o3/o4-mini) represent the most operationally deployed AI evaluation infrastructure outside academic research, but these reviews remain voluntary and are not incorporated into mandatory compliance requirements by any regulatory body (EU AI Office, NIST). The institutional structure exists but lacks binding enforcement. + + Relevant Notes: - [[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]] — declining transparency compounds the evaluation problem - [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — transparency commitments follow the same erosion lifecycle diff --git a/domains/ai-alignment/Anthropics RSP rollback under commercial pressure is the first empirical confirmation that binding safety commitments cannot survive the competitive dynamics of frontier AI development.md b/domains/ai-alignment/Anthropics RSP rollback under commercial pressure is the first empirical confirmation that binding safety commitments cannot survive the competitive dynamics of frontier AI development.md index 3507d90c..5ae4f4f1 100644 --- a/domains/ai-alignment/Anthropics RSP rollback under commercial pressure is the first empirical confirmation that binding safety commitments cannot survive the competitive dynamics of frontier AI development.md +++ b/domains/ai-alignment/Anthropics RSP rollback under commercial pressure is the first empirical confirmation that binding safety commitments cannot survive the competitive dynamics of frontier AI development.md @@ -29,6 +29,12 @@ Anthropic's own language in RSP documentation: commitments are 'very hard to mee --- +### Additional Evidence (confirm) +*Source: [[2026-03-21-metr-evaluation-landscape-2026]] | Added: 2026-03-21* + +METR's pre-deployment sabotage 
reviews of Anthropic models (March 2026: Claude Opus 4.6; October 2025: Summer 2025 Pilot) document the evaluation infrastructure that exists, but the reviews are voluntary and occur within the same competitive environment where Anthropic rolled back RSP commitments. The existence of sophisticated evaluation infrastructure does not prevent commercial pressure from overriding safety commitments. + + Relevant Notes: - [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — the RSP rollback is the empirical confirmation - [[AI alignment is a coordination problem not a technical problem]] — voluntary commitments fail; coordination mechanisms might not diff --git a/domains/ai-alignment/deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices.md b/domains/ai-alignment/deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices.md index 2bdf9fb6..52ee97a6 100644 --- a/domains/ai-alignment/deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices.md +++ b/domains/ai-alignment/deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices.md @@ -25,6 +25,12 @@ This claim describes a frontier-practitioner effect — top-tier experts getting --- +### Additional Evidence (challenge) +*Source: [[2026-03-21-metr-evaluation-landscape-2026]] | Added: 2026-03-21* + +METR's developer productivity RCT found that AI tools made experienced developers take '19% longer' to complete tasks, showing negative productivity for experts. 
This directly contradicts the force multiplier hypothesis and suggests that current AI tools may actually impair expert performance. + + Relevant Notes: - [[centaur team performance depends on role complementarity not mere human-AI combination]] — expertise enables the complementarity that makes centaur teams work - [[AI is collapsing the knowledge-producing communities it depends on creating a self-undermining loop that collective intelligence can break]] — if expertise is a multiplier, eroding expert communities erodes collaboration quality diff --git a/inbox/queue/.extraction-debug/2026-03-21-metr-evaluation-landscape-2026.json b/inbox/queue/.extraction-debug/2026-03-21-metr-evaluation-landscape-2026.json new file mode 100644 index 00000000..ad2cb5a2 --- /dev/null +++ b/inbox/queue/.extraction-debug/2026-03-21-metr-evaluation-landscape-2026.json @@ -0,0 +1,40 @@ +{ + "rejected_claims": [ + { + "filename": "metr-monitorability-evaluations-establish-two-sided-oversight-evasion-measurement.md", + "issues": [ + "missing_attribution_extractor" + ] + }, + { + "filename": "ai-autonomous-task-horizon-doubles-every-six-months-implying-months-long-projects-within-decade.md", + "issues": [ + "missing_attribution_extractor" + ] + }, + { + "filename": "malt-dataset-provides-first-systematic-corpus-of-evaluation-threatening-behaviors-from-real-deployments.md", + "issues": [ + "missing_attribution_extractor" + ] + } + ], + "validation_stats": { + "total": 3, + "kept": 0, + "fixed": 3, + "rejected": 3, + "fixes_applied": [ + "metr-monitorability-evaluations-establish-two-sided-oversight-evasion-measurement.md:set_created:2026-03-21", + "ai-autonomous-task-horizon-doubles-every-six-months-implying-months-long-projects-within-decade.md:set_created:2026-03-21", + "malt-dataset-provides-first-systematic-corpus-of-evaluation-threatening-behaviors-from-real-deployments.md:set_created:2026-03-21" + ], + "rejections": [ 
"metr-monitorability-evaluations-establish-two-sided-oversight-evasion-measurement.md:missing_attribution_extractor", + "ai-autonomous-task-horizon-doubles-every-six-months-implying-months-long-projects-within-decade.md:missing_attribution_extractor", + "malt-dataset-provides-first-systematic-corpus-of-evaluation-threatening-behaviors-from-real-deployments.md:missing_attribution_extractor" + ] + }, + "model": "anthropic/claude-sonnet-4.5", + "date": "2026-03-21" +} \ No newline at end of file diff --git a/inbox/queue/2026-03-21-metr-evaluation-landscape-2026.md b/inbox/queue/2026-03-21-metr-evaluation-landscape-2026.md index db99aec7..81824c28 100644 --- a/inbox/queue/2026-03-21-metr-evaluation-landscape-2026.md +++ b/inbox/queue/2026-03-21-metr-evaluation-landscape-2026.md @@ -7,9 +7,13 @@ date: 2026-03-01 domain: ai-alignment secondary_domains: [] format: thread -status: unprocessed +status: enrichment priority: high tags: [METR, monitorability, MALT, sabotage-review, time-horizon, evaluation-infrastructure, oversight-evasion, Claude] +processed_by: theseus +processed_date: 2026-03-21 +enrichments_applied: ["AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md", "Anthropics RSP rollback under commercial pressure is the first empirical confirmation that binding safety commitments cannot survive the competitive dynamics of frontier AI development.md", "deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices.md"] +extraction_model: "anthropic/claude-sonnet-4.5" --- ## Content @@ -57,3 +61,10 @@ METR's current evaluation portfolio as of March 2026: PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] WHY ARCHIVED: METR's institutional 
portfolio is the most operationally deployed evaluation infrastructure; the Monitorability Evaluations specifically measure the two-sided oversight problem that the governance architecture is failing to address EXTRACTION HINT: The time horizon finding (doubling every 6 months) deserves its own claim; the Monitorability Evaluations deserve a claim about what institutional evaluation infrastructure now exists + + +## Key Facts +- METR published RE-Bench in November 2024 measuring frontier model performance on ML research engineering tasks vs. human experts +- METR published Rogue Replication Threat Model on November 12, 2024 analyzing how AI agents might develop large resilient rogue autonomous populations +- METR published Reward Hacking Study in June 2025 documenting frontier model instances of exploiting scoring bugs +- METR's evaluation portfolio as of March 2026 includes oversight evasion, self-replication, autonomous task completion, and pre-deployment sabotage risk reviews -- 2.45.2 From d8971490bf9fe2c0db0628d7fbd39a2bcb6004fc Mon Sep 17 00:00:00 2001 From: m3taversal Date: Sat, 21 Mar 2026 14:28:52 +0000 Subject: [PATCH 2/2] =?UTF-8?q?leo:=20fix=20PR=20#1569=20review=20issues?= =?UTF-8?q?=20=E2=80=94=20soften=20challenge=20framing,=20fix=20source=20s?= =?UTF-8?q?tatus?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - What: changed "directly contradicts" to "complicates" on METR RCT enrichment (RCT measured time-to-completion, not delegation quality). Fixed source status from non-standard "enrichment" to "processed". - Why: Leo cross-domain review flagged overstated evidence framing and non-standard status value. 
Pentagon-Agent: Leo --- ...lled practitioners delegate more effectively than novices.md | 2 +- inbox/queue/2026-03-21-metr-evaluation-landscape-2026.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/domains/ai-alignment/deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices.md b/domains/ai-alignment/deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices.md index 52ee97a6..5bbc7f7d 100644 --- a/domains/ai-alignment/deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices.md +++ b/domains/ai-alignment/deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices.md @@ -28,7 +28,7 @@ This claim describes a frontier-practitioner effect — top-tier experts getting ### Additional Evidence (challenge) *Source: [[2026-03-21-metr-evaluation-landscape-2026]] | Added: 2026-03-21* -METR's developer productivity RCT found that AI tools made experienced developers take '19% longer' to complete tasks, showing negative productivity for experts. This directly contradicts the force multiplier hypothesis and suggests that current AI tools may actually impair expert performance. +METR's developer productivity RCT found that AI tools made experienced developers take '19% longer' to complete tasks, showing negative productivity for experts on time-to-completion metrics. This complicates the force multiplier hypothesis — the RCT measured task completion speed, not delegation quality or the scope of what experts can attempt. 
An expert who takes longer but produces better-scoped, more ambitious outputs is compatible with both this finding and the original claim. However, if the productivity drag persists across task types, it provides counter-evidence to at least one dimension of the expertise advantage. Relevant Notes: diff --git a/inbox/queue/2026-03-21-metr-evaluation-landscape-2026.md b/inbox/queue/2026-03-21-metr-evaluation-landscape-2026.md index 81824c28..ea74a081 100644 --- a/inbox/queue/2026-03-21-metr-evaluation-landscape-2026.md +++ b/inbox/queue/2026-03-21-metr-evaluation-landscape-2026.md @@ -7,7 +7,7 @@ date: 2026-03-01 domain: ai-alignment secondary_domains: [] format: thread -status: enrichment +status: processed priority: high tags: [METR, monitorability, MALT, sabotage-review, time-horizon, evaluation-infrastructure, oversight-evasion, Claude] processed_by: theseus -- 2.45.2