auto-fix: strip 6 broken wiki links

Pipeline auto-fixer: removed [[ ]] brackets from links
that don't resolve to existing claims in the knowledge base.
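The fix described above can be sketched as a small pass over note text. This is an illustrative sketch only, assuming a regex-based wiki-link syntax and a set of known claim titles; the function name, pattern, and `known_claims` parameter are assumptions, not the pipeline's actual code:

```python
import re

# Matches [[target]] wiki links; group 1 is the link target.
WIKI_LINK = re.compile(r"\[\[([^\[\]]+)\]\]")

def strip_broken_links(text: str, known_claims: set[str]) -> str:
    """Remove [[ ]] brackets from links whose targets are not
    existing claims, leaving resolvable links untouched."""
    def fix(match: re.Match) -> str:
        target = match.group(1)
        # Keep the link if it resolves; otherwise keep only the bare text.
        return match.group(0) if target in known_claims else target
    return WIKI_LINK.sub(fix, text)
```

Run per file, this turns each non-resolving `[[target]]` into bare `target`, which is exactly the shape of the 6 one-line changes in this diff.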
Teleo Agents 2026-03-24 00:26:07 +00:00
parent 9b2828f28f
commit 39cd77291c
2 changed files with 6 additions and 6 deletions


@@ -29,14 +29,14 @@ This reframes the alignment timeline question. The capability for massive labor
### Additional Evidence (extend)
-*Source: [[2026-02-00-international-ai-safety-report-2026]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
+*Source: 2026-02-00-international-ai-safety-report-2026 | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
The International AI Safety Report 2026 (multi-government committee, February 2026) identifies an 'evaluation gap' that adds a new dimension to the capability-deployment gap: 'Performance on pre-deployment tests does not reliably predict real-world utility or risk.' This means the gap is not only about adoption lag (organizations slow to deploy) but also about evaluation failure (pre-deployment testing cannot predict production behavior). The gap exists at two levels: (1) theoretical capability exceeds deployed capability due to organizational adoption lag, and (2) evaluated capability does not predict actual deployment capability due to environment-dependent model behavior. The evaluation gap makes the deployment gap harder to close because organizations cannot reliably assess what they are deploying.
---
### Additional Evidence (extend)
-*Source: [[2026-02-05-mit-tech-review-misunderstood-time-horizon-graph]] | Added: 2026-03-23*
+*Source: 2026-02-05-mit-tech-review-misunderstood-time-horizon-graph | Added: 2026-03-23*
METR's time horizon metric measures task difficulty by human completion time, not model processing time. A model with a 5-hour time horizon completes tasks that take humans 5 hours, but may finish them in minutes. This speed asymmetry is not captured in the metric itself, meaning the gap between theoretical capability (task completion) and deployment impact includes both adoption lag AND the unmeasured throughput advantage that organizations fail to utilize.
@@ -53,4 +53,4 @@ Relevant Notes:
- [[economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate]] — the force that will close the gap
Topics:
-- [[domains/ai-alignment/_map]]
+- domains/ai-alignment/_map


@@ -57,8 +57,8 @@ Frontier model benchmark performance claims "significantly overstate practical u
**What I expected but didn't find:** Any evidence that the productivity slowdown was domain-specific or driven by task selection artifacts. METR's reconciliation paper treats the 19% slowdown as a real finding that needs explanation, not an artifact to be explained away.
**KB connections:**
-- [[verification degrades faster than capability grows]] — if benchmarks overestimate capability by this margin, behavioral verification tools (including benchmarks) may be systematically misleading about the actual capability trajectory
-- [[adoption lag exceeds capability limits as primary bottleneck to AI economic impact]] — the 19% slowdown in experienced developers is evidence against rapid adoption producing rapid productivity gains even when adoption occurs
+- verification degrades faster than capability grows — if benchmarks overestimate capability by this margin, behavioral verification tools (including benchmarks) may be systematically misleading about the actual capability trajectory
+- adoption lag exceeds capability limits as primary bottleneck to AI economic impact — the 19% slowdown in experienced developers is evidence against rapid adoption producing rapid productivity gains even when adoption occurs
- The METR time horizon project itself: if the time horizon metric has the same fundamental measurement problem (automated scoring without holistic evaluation), then all time horizon estimates may be overestimating actual dangerous autonomous capability
**Extraction hints:** Primary claim candidate: "benchmark-based AI capability metrics overstate real-world autonomous performance because automated scoring doesn't capture documentation, maintainability, or production-readiness requirements — creating a systematic gap between measured and dangerous capability." Secondary claim: "AI tools reduced productivity for experienced developers in controlled RCT conditions despite developer expectations of speedup — suggesting capability deployment may not translate to autonomy even when tools are adopted."
@@ -67,7 +67,7 @@ Frontier model benchmark performance claims "significantly overstate practical u
## Curator Notes (structured handoff for extractor)
-PRIMARY CONNECTION: [[verification degrades faster than capability grows]]
+PRIMARY CONNECTION: verification degrades faster than capability grows
WHY ARCHIVED: This is the strongest empirical evidence found in 13 sessions that benchmark-based capability metrics systematically overstate real-world autonomous capability. The RCT design (not observational), precise quantification (0% production-ready, 19% slowdown), and the source (METR — the primary capability evaluator) make this a high-quality disconfirmation signal for B1 urgency.