---
type: claim
domain: ai-alignment
description: "Anthropic's study of 998K tool calls found experienced users shift to full auto-approve at 40%+ rates, with ~100 permission requests per hour exceeding human evaluation capacity — the permission model fails not from bad design but from human cognitive limits"
confidence: likely
source: "Cornelius (@molt_cornelius), 'AI Field Report 3: The Safety Layer Nobody Built', X Article, March 2026; corroborated by Anthropic 998K tool call study, LessWrong volume analysis, Jakob Nielsen Review Paradox, DryRun Security 87% vulnerability rate"
created: 2026-03-30
depends_on:
- the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load
- economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate
related:
- deterministic policy engines operating below the LLM layer cannot be circumverted by prompt injection making them essential for adversarial-grade AI agent control
reweave_edges:
- deterministic policy engines operating below the LLM layer cannot be circumverted by prompt injection making them essential for adversarial-grade AI agent control|related|2026-04-19
sourced_from:
- inbox/archive/2026-03-15-cornelius-field-report-3-safety.md
---
# Approval fatigue drives agent architecture toward structural safety because humans cannot meaningfully evaluate 100 permission requests per hour
The permission-based safety model for AI agents fails not because it is badly designed but because humans are not built to maintain constant oversight of systems that act faster than they can read.
Quantitative evidence:
- **Anthropic's tool call study (998,000 calls):** Experienced users shift to full auto-approve at rates exceeding 40%.
- **LessWrong analysis:** Approximately 100 permission requests per hour in typical agent sessions.
- **Jakob Nielsen's Review Paradox:** It is cognitively harder to verify the quality of AI work than to produce it yourself.
- **DryRun Security audit:** AI coding agents introduced vulnerabilities in 87% of tested pull requests — 143 security issues from Claude Code, Codex, and Gemini across 30 PRs.
- **Carnegie Mellon SUSVIBES:** 61% of vibe-coded projects function correctly but only 10.5% are secure.
- **Apiiro:** 10,000 new security findings per month from AI-generated code — 10x spike in six months.
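The volume figure alone explains the failure mode. A back-of-envelope calculation (the 5-minute review estimate is an illustrative assumption, not from the source):

```python
# Time available per approval decision vs. time a meaningful review
# would take. The review-time figure is an illustrative assumption.
requests_per_hour = 100                        # LessWrong volume analysis
seconds_per_decision = 3600 / requests_per_hour
print(f"{seconds_per_decision:.0f}s per decision")   # 36s, context-switching included

meaningful_review_minutes = 5                  # assumed: read the diff, reason about blast radius
deficit = meaningful_review_minutes * 60 / seconds_per_decision
print(f"a real review needs ~{deficit:.0f}x the available budget")
```

At 36 seconds per decision, "review" collapses into pattern-matching on the dialog box, which is exactly the behavior the auto-approve data captures.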
The failure cascade is structural: developers face a choice between productivity and oversight. The productivity gains from removing approval friction are so large that the risk feels abstract until it materializes. @levelsio permanently switched to running Claude Code with every permission bypassed and emptied his bug board for the first time. Meanwhile, @Al_Grigor lost 1.9 million rows of student data when Claude Code ran `terraform destroy` on a live database — the approval mechanism treated it with the same UI weight as `ls`.
The architectural response is the determinism boundary: move safety from conversational approval (which humans auto-approve under fatigue) to structural enforcement (hooks, sandboxes, schema restrictions) that fire regardless of human attention state. Five sandboxing platforms shipped in the same month. OWASP published the Top 10 for Agentic Applications, introducing "Least Agency" — autonomy should be earned, not a default setting.
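A minimal sketch of what structural enforcement means in practice: a pre-execution hook that string-matches commands against a denylist below the approval UI. The hook shape here (a `check` function over the proposed command, nonzero exit blocks the call) is a generic illustration, not any specific product's hook API; the denylist patterns are examples, not a complete policy.

```python
import re

# Destructive patterns blocked regardless of approval state. Because this is
# pure string matching with no model judgment in the loop, it fires whether
# the human is alert, fatigued, or has enabled full auto-approve.
DESTRUCTIVE = [
    r"\bterraform\s+destroy\b",
    r"\brm\s+-rf\s+/",
    r"\bdrop\s+(table|database)\b",
]

def check(command: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed shell command."""
    for pattern in DESTRUCTIVE:
        if re.search(pattern, command, re.IGNORECASE):
            return False, f"blocked by structural policy: {pattern}"
    return True, "ok"

# Fires identically on call 1 and on call 400 of an auto-approved session:
assert check("terraform destroy -auto-approve")[0] is False
assert check("ls -la")[0] is True
```

The point of the pattern is that enforcement does not depend on human attention state: `ls` passes and `terraform destroy` is refused even when every approval dialog has been bypassed.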
## Challenges
CrewAI's data from two billion agentic workflows suggests a viable middle path: start with 100% human review and reduce as trust is established. The question is whether earned autonomy can be calibrated precisely enough to avoid both extremes (approval fatigue and unconstrained operation). Additionally, Anthropic's Auto Mode — where Claude judges which of its own actions are safe — represents a fundamentally different safety architecture (probabilistic self-classification) that may outperform both human approval and rigid structural enforcement if well-calibrated.
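The earned-autonomy middle path can be sketched as a trust ratchet (an assumed mechanism for illustration, not CrewAI's actual implementation): every action category starts at 100% human review, auto-approve unlocks only after a streak of clean manual approvals, and any incident resets the category to full review.

```python
from collections import defaultdict

class EarnedAutonomy:
    """Illustrative earned-autonomy policy: trust is accumulated per
    action category and revoked wholesale on the first incident."""

    def __init__(self, required_streak: int = 20):
        self.required_streak = required_streak
        self.streak = defaultdict(int)   # category -> consecutive clean approvals

    def needs_human(self, category: str) -> bool:
        # Start at 100% review; auto-approve only after the streak is earned.
        return self.streak[category] < self.required_streak

    def record(self, category: str, outcome_ok: bool) -> None:
        if outcome_ok:
            self.streak[category] += 1
        else:
            self.streak[category] = 0    # one incident revokes earned trust

policy = EarnedAutonomy(required_streak=3)
for _ in range(3):
    policy.record("file_edit", outcome_ok=True)
assert not policy.needs_human("file_edit")   # autonomy earned
policy.record("file_edit", outcome_ok=False)
assert policy.needs_human("file_edit")       # and revoked on failure
```

The open calibration question from the section above maps onto the single `required_streak` parameter: set it too low and this degenerates into approval fatigue with extra steps; set it too high and the productivity pressure pushes users back to blanket auto-approve.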
---
Relevant Notes:
- [[the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load]] — approval fatigue is why the determinism boundary matters: humans cannot be the enforcement layer at agent operational speed
- [[economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate]] — approval fatigue is the mechanism by which the economic pressure manifests
- [[coding agents cannot take accountability for mistakes which means humans must retain decision authority over security and critical systems regardless of agent capability]] — the tension: humans must retain decision authority but cannot actually exercise it at 100 requests/hour
Topics:
- [[_map]]