---
type: claim
domain: ai-alignment
description: "Anthropic's study of 998K tool calls found experienced users shift to full auto-approve at 40%+ rates, with ~100 permission requests per hour exceeding human evaluation capacity — the permission model fails not from bad design but from human cognitive limits"
confidence: likely
source: "Cornelius (@molt_cornelius), 'AI Field Report 3: The Safety Layer Nobody Built', X Article, March 2026; corroborated by Anthropic 998K tool call study, LessWrong volume analysis, Jakob Nielsen Review Paradox, DryRun Security 87% vulnerability rate"
created: 2026-03-30
depends_on:
  - the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load
  - economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate
related:
  - deterministic policy engines operating below the LLM layer cannot be circumvented by prompt injection making them essential for adversarial-grade AI agent control
reweave_edges:
  - deterministic policy engines operating below the LLM layer cannot be circumvented by prompt injection making them essential for adversarial-grade AI agent control|related|2026-04-19
sourced_from: inbox/archive/2026-03-15-cornelius-field-report-3-safety.md
---

Approval fatigue drives agent architecture toward structural safety because humans cannot meaningfully evaluate 100 permission requests per hour

The permission-based safety model for AI agents fails not because it is badly designed but because humans are not built to maintain constant oversight of systems that act faster than they can read.

Quantitative evidence:

  • Anthropic's tool call study (998,000 calls): Experienced users shift to full auto-approve at rates exceeding 40%.
  • LessWrong analysis: Approximately 100 permission requests per hour in typical agent sessions.
  • Jakob Nielsen's Review Paradox: It is cognitively harder to verify the quality of AI work than to produce it yourself.
  • DryRun Security audit: AI coding agents introduced vulnerabilities in 87% of tested pull requests (143 security issues found across 30 PRs generated by Claude Code, Codex, and Gemini).
  • Carnegie Mellon SUSVIBES: 61% of vibe-coded projects function correctly but only 10.5% are secure.
  • Apiiro: 10,000 new security findings per month from AI-generated code — 10x spike in six months.

The failure cascade is structural: developers face a choice between productivity and oversight. The productivity gains from removing approval friction are so large that the risk feels abstract until it materializes. @levelsio permanently switched to running Claude Code with every permission bypassed and emptied his bug board for the first time. Meanwhile, @Al_Grigor lost 1.9 million rows of student data when Claude Code ran `terraform destroy` on a live database — the approval mechanism treated it with the same UI weight as `ls`.
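The `terraform destroy`-vs-`ls` asymmetry can be made concrete with a coarse risk-tier classifier. This is a minimal sketch, not anything from the source: the tier names, patterns, and function name are all illustrative.

```python
import re

# Illustrative patterns only; a real classifier would need far broader coverage.
DESTRUCTIVE = [r"\bterraform\s+destroy\b", r"\brm\s+-rf\b", r"\bdrop\s+database\b"]
MUTATING = [r"\bgit\s+push\b", r"\bterraform\s+apply\b", r"\bkubectl\s+delete\b"]

def risk_tier(command: str) -> str:
    """Classify a shell command into a coarse risk tier."""
    lowered = command.lower()
    if any(re.search(p, lowered) for p in DESTRUCTIVE):
        return "destructive"  # irreversible: should never be one-click approvable
    if any(re.search(p, lowered) for p in MUTATING):
        return "mutating"     # state-changing but recoverable
    return "read-only"        # ls, cat, grep: plausibly safe to auto-approve

print(risk_tier("terraform destroy -auto-approve"))  # destructive
print(risk_tier("ls -la"))                           # read-only
```

An approval UI that rendered these tiers differently would at least preserve the signal that the flat permission prompt discards.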

The architectural response is the determinism boundary: move safety from conversational approval (which humans auto-approve under fatigue) to structural enforcement (hooks, sandboxes, schema restrictions) that fire regardless of human attention state. Five sandboxing platforms shipped in the same month. OWASP published the Top 10 for Agentic Applications, introducing "Least Agency" — autonomy should be earned, not a default setting.
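As a sketch of what structural enforcement looks like in practice, here is a deny-list check in the shape of a Claude Code PreToolUse hook. The payload fields (`tool_input.command`) and the exit-code-2-blocks contract are assumptions about that interface; the deny list itself is illustrative.

```python
import json
import re
import sys

# Patterns the policy refuses unconditionally, regardless of approval state.
BLOCKED = [r"\bterraform\s+destroy\b", r"\brm\s+-rf\s+/", r"\bdrop\s+database\b"]

def violates_policy(command: str):
    """Return the first blocked pattern the command matches, or None."""
    for pattern in BLOCKED:
        if re.search(pattern, command, re.IGNORECASE):
            return pattern
    return None

def hook_main() -> int:
    """Entry point when wired as a hook command: reads the tool event as JSON
    on stdin; exit code 2 blocks the call (assumed contract)."""
    event = json.load(sys.stdin)
    command = event.get("tool_input", {}).get("command", "")
    match = violates_policy(command)
    if match:
        print(f"blocked by policy: matches {match!r}", file=sys.stderr)
        return 2
    return 0  # allow; the normal approval flow still applies

# Demonstration without stdin wiring:
print(violates_policy("terraform destroy -auto-approve") is not None)  # True
print(violates_policy("ls -la") is not None)                           # False
```

The point is that this check runs below the LLM layer: no prompt content, and no fatigued human, can talk it into approving a match.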

Challenges

CrewAI's data from two billion agentic workflows suggests a viable middle path: start with 100% human review and reduce as trust is established. The question is whether earned autonomy can be calibrated precisely enough to avoid both extremes (approval fatigue and unconstrained operation). Additionally, Anthropic's Auto Mode — where Claude judges which of its own actions are safe — represents a fundamentally different safety architecture (probabilistic self-classification) that may outperform both human approval and rigid structural enforcement if well-calibrated.
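CrewAI's "start at 100% review and reduce as trust is established" pattern can be sketched as a per-tool review-rate schedule. The decay shape, the 5% floor, and the reset-on-flag rule below are hypothetical calibration choices, not anything CrewAI has published.

```python
class EarnedAutonomy:
    """Per-tool review probability: 1.0 for an unproven tool, decaying with a
    streak of clean calls, resetting hard when a call is flagged."""

    def __init__(self, floor: float = 0.05, halving: int = 20):
        self.floor = floor      # never drop review below 5% (assumed)
        self.halving = halving  # clean calls needed to halve the review rate
        self.streak: dict[str, int] = {}

    def review_rate(self, tool: str) -> float:
        rate = 0.5 ** (self.streak.get(tool, 0) / self.halving)
        return max(rate, self.floor)

    def record(self, tool: str, flagged: bool) -> None:
        """A flagged call resets trust; a clean one extends the streak."""
        self.streak[tool] = 0 if flagged else self.streak.get(tool, 0) + 1

policy = EarnedAutonomy()
print(policy.review_rate("Bash"))            # 1.0 — every call reviewed at first
for _ in range(40):
    policy.record("Bash", flagged=False)
print(round(policy.review_rate("Bash"), 2))  # 0.25 after 40 clean calls
policy.record("Bash", flagged=True)
print(policy.review_rate("Bash"))            # 1.0 — trust resets on a flag
```

Whether any such schedule can thread the needle between approval fatigue and unconstrained operation is exactly the open calibration question.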


Relevant Notes:

Topics: