From 39cd77291cd648338e6e8e9b1c4eefd25332c1ff Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Tue, 24 Mar 2026 00:26:07 +0000
Subject: [PATCH] auto-fix: strip 6 broken wiki links

Pipeline auto-fixer: removed [[ ]] brackets from links that don't resolve
to existing claims in the knowledge base.
---
 ...ag not capability limits determines real-world impact.md | 6 +++---
 ...metr-algorithmic-vs-holistic-evaluation-developer-rct.md | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/domains/ai-alignment/the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md b/domains/ai-alignment/the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md
index b0c1fd533..9b2136e6c 100644
--- a/domains/ai-alignment/the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md
+++ b/domains/ai-alignment/the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md
@@ -29,14 +29,14 @@ This reframes the alignment timeline question. The capability for massive labor
 
 ### Additional Evidence (extend)
 
-*Source: [[2026-02-00-international-ai-safety-report-2026]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
+*Source: 2026-02-00-international-ai-safety-report-2026 | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
 
 The International AI Safety Report 2026 (multi-government committee, February 2026) identifies an 'evaluation gap' that adds a new dimension to the capability-deployment gap: 'Performance on pre-deployment tests does not reliably predict real-world utility or risk.' This means the gap is not only about adoption lag (organizations slow to deploy) but also about evaluation failure (pre-deployment testing cannot predict production behavior). The gap exists at two levels: (1) theoretical capability exceeds deployed capability due to organizational adoption lag, and (2) evaluated capability does not predict actual deployment capability due to environment-dependent model behavior. The evaluation gap makes the deployment gap harder to close because organizations cannot reliably assess what they are deploying.
 
 ---
 
 ### Additional Evidence (extend)
 
-*Source: [[2026-02-05-mit-tech-review-misunderstood-time-horizon-graph]] | Added: 2026-03-23*
+*Source: 2026-02-05-mit-tech-review-misunderstood-time-horizon-graph | Added: 2026-03-23*
 
 METR's time horizon metric measures task difficulty by human completion time, not model processing time. A model with a 5-hour time horizon completes tasks that take humans 5 hours, but may finish them in minutes. This speed asymmetry is not captured in the metric itself, meaning the gap between theoretical capability (task completion) and deployment impact includes both adoption lag AND the unmeasured throughput advantage that organizations fail to utilize.
@@ -53,4 +53,4 @@ Relevant Notes:
 - [[economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate]] — the force that will close the gap
 
 Topics:
-- [[domains/ai-alignment/_map]]
+- domains/ai-alignment/_map
diff --git a/inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md b/inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md
index 4b3b22e1d..ad4fe3a0a 100644
--- a/inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md
+++ b/inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md
@@ -57,8 +57,8 @@ Frontier model benchmark performance claims "significantly overstate practical u
 **What I expected but didn't find:** Any evidence that the productivity slowdown was domain-specific or driven by task selection artifacts. METR's reconciliation paper treats the 19% slowdown as a real finding that needs explanation, not an artifact to be explained away.
 
 **KB connections:**
-- [[verification degrades faster than capability grows]] — if benchmarks overestimate capability by this margin, behavioral verification tools (including benchmarks) may be systematically misleading about the actual capability trajectory
-- [[adoption lag exceeds capability limits as primary bottleneck to AI economic impact]] — the 19% slowdown in experienced developers is evidence against rapid adoption producing rapid productivity gains even when adoption occurs
+- verification degrades faster than capability grows — if benchmarks overestimate capability by this margin, behavioral verification tools (including benchmarks) may be systematically misleading about the actual capability trajectory
+- adoption lag exceeds capability limits as primary bottleneck to AI economic impact — the 19% slowdown in experienced developers is evidence against rapid adoption producing rapid productivity gains even when adoption occurs
 - The METR time horizon project itself: if the time horizon metric has the same fundamental measurement problem (automated scoring without holistic evaluation), then all time horizon estimates may be overestimating actual dangerous autonomous capability
 
 **Extraction hints:** Primary claim candidate: "benchmark-based AI capability metrics overstate real-world autonomous performance because automated scoring doesn't capture documentation, maintainability, or production-readiness requirements — creating a systematic gap between measured and dangerous capability." Secondary claim: "AI tools reduced productivity for experienced developers in controlled RCT conditions despite developer expectations of speedup — suggesting capability deployment may not translate to autonomy even when tools are adopted."
@@ -67,7 +67,7 @@ Frontier model benchmark performance claims "significantly overstate practical u
 
 ## Curator Notes (structured handoff for extractor)
 
-PRIMARY CONNECTION: [[verification degrades faster than capability grows]]
+PRIMARY CONNECTION: verification degrades faster than capability grows
 
 WHY ARCHIVED: This is the strongest empirical evidence found in 13 sessions that benchmark-based capability metrics systematically overstate real-world autonomous capability. The RCT design (not observational), precise quantification (0% production-ready, 19% slowdown), and the source (METR — the primary capability evaluator) make this a high-quality disconfirmation signal for B1 urgency.
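Note for reviewers: a minimal sketch of the link-stripping rule the commit message describes, in case anyone wants to reproduce the auto-fix locally. This is not the pipeline's actual implementation; the regex, the filename-based resolution rule, and the helper name strip_broken_wiki_links are assumptions for illustration only.

```python
import re
from pathlib import Path

# Assumed wiki-link syntax: [[target]] or [[target|alias]].
WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def strip_broken_wiki_links(text: str, known_claims: set) -> str:
    """Drop the [[ ]] brackets from links whose target is not a known claim."""
    def repl(match):
        target, alias = match.group(1), match.group(2)
        if target.strip() in known_claims:
            return match.group(0)            # link resolves: leave it alone
        return (alias or target).strip()     # broken link: keep the plain text
    return WIKI_LINK.sub(repl, text)

if __name__ == "__main__":
    # Assumed resolution rule: a link resolves if a claim note with that
    # filename (minus .md) exists anywhere in the knowledge base.
    known = {p.stem for p in Path("domains").rglob("*.md")}
    for note in Path(".").rglob("*.md"):
        original = note.read_text(encoding="utf-8")
        cleaned = strip_broken_wiki_links(original, known)
        if cleaned != original:
            note.write_text(cleaned, encoding="utf-8")
```

Keeping the alias text (the part after the | separator, if present) rather than the raw target preserves readable prose once the bracket syntax is removed.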