teleo-codex/domains/ai-alignment/approval fatigue drives agent architecture toward structural safety because humans cannot meaningfully evaluate 100 permission requests per hour.md
m3taversal 8528fb6d43
theseus: add 13 NEW claims + 1 enrichment from Cornelius Batch 1 (agent architecture)
Precision fixes per Leo's review:
- Claim 4 (curated skills): downgrade experimental→likely, cite source gap, clarify 16pp vs 17.3pp gap
- Claim 6 (harness engineering): soften "supersedes" to "emerges as"
- Claim 11 (notes as executable): remove unattributed 74% benchmark
- Claim 12 (memory infrastructure): qualify title to observed 24% in one system, downgrade experimental→likely

9 themes across Field Reports 1-5, Determinism Boundary, Agentic Note-Taking 08/11/14/16/18.
Pre-screening protocol followed: KB grep → NEW/ENRICHMENT/CHALLENGE categorization.

Pentagon-Agent: Theseus <46864DD4-DA71-4719-A1B4-68F7C55854D3>
2026-03-30 14:22:00 +01:00

---
type: claim
domain: ai-alignment
description: "Anthropic's study of 998K tool calls found experienced users shift to full auto-approve at 40%+ rates, with ~100 permission requests per hour exceeding human evaluation capacity — the permission model fails not from bad design but from human cognitive limits"
confidence: likely
source: "Cornelius (@molt_cornelius), 'AI Field Report 3: The Safety Layer Nobody Built', X Article, March 2026; corroborated by Anthropic 998K tool call study, LessWrong volume analysis, Jakob Nielsen Review Paradox, DryRun Security 87% vulnerability rate"
created: 2026-03-30
depends_on:
- "the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load"
- "economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate"
---
# Approval fatigue drives agent architecture toward structural safety because humans cannot meaningfully evaluate 100 permission requests per hour
The permission-based safety model for AI agents fails not because it is badly designed but because humans are not built to maintain constant oversight of systems that act faster than they can read.
Quantitative evidence:
- **Anthropic's tool call study (998,000 calls):** Experienced users shift to full auto-approve at rates exceeding 40%.
- **LessWrong analysis:** Approximately 100 permission requests per hour in typical agent sessions.
- **Jakob Nielsen's Review Paradox:** It is cognitively harder to verify the quality of AI work than to produce it yourself.
- **DryRun Security audit:** AI coding agents introduced vulnerabilities in 87% of tested pull requests (143 security issues from Claude Code, Codex, and Gemini across 30 PRs).
- **Carnegie Mellon SUSVIBES:** 61% of vibe-coded projects function correctly but only 10.5% are secure.
- **Apiiro:** 10,000 new security findings per month from AI-generated code — 10x spike in six months.
The failure cascade is structural: developers face a choice between productivity and oversight. The productivity gains from removing approval friction are so large that the risk feels abstract until it materializes. @levelsio permanently switched to running Claude Code with every permission bypassed and emptied his bug board for the first time. Meanwhile, @Al_Grigor lost 1.9 million rows of student data when Claude Code ran `terraform destroy` on a live database — the approval mechanism gave it the same UI weight as `ls`.
The architectural response is the determinism boundary: move safety from conversational approval (which humans auto-approve under fatigue) to structural enforcement (hooks, sandboxes, schema restrictions) that fire regardless of human attention state. Five sandboxing platforms shipped in the same month. OWASP published the Top 10 for Agentic Applications, introducing "Least Agency" — autonomy should be earned, not a default setting.
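The structural-enforcement side of the determinism boundary can be sketched as a deterministic pre-execution hook: a deny-list check that fires on every tool call and cannot be fatigued past, because no human is in the loop. This is a minimal illustration, not Claude Code's actual hook API; the function name and patterns are assumptions for the sketch.

```python
import re

# Patterns that block execution structurally, regardless of whether a
# fatigued human would have clicked "approve". Illustrative, not exhaustive.
DENY_PATTERNS = [
    r"\bterraform\s+destroy\b",
    r"\brm\s+-rf\s+/",
    r"\bDROP\s+TABLE\b",
]

def pre_tool_hook(command: str) -> bool:
    """Return True if the command may run.

    Deny-listed commands are rejected deterministically: there is no
    approval prompt, so there is no approval fatigue to exploit.
    """
    for pattern in DENY_PATTERNS:
        if re.search(pattern, command, flags=re.IGNORECASE):
            return False
    return True
```

The point of the sketch is the contrast with conversational approval: `pre_tool_hook("terraform destroy -auto-approve")` returns `False` at hour one and at hour one hundred, independent of the operator's attention state.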
## Challenges
CrewAI's data from two billion agentic workflows suggests a viable middle path: start with 100% human review and reduce as trust is established. The question is whether earned autonomy can be calibrated precisely enough to avoid both extremes (approval fatigue and unconstrained operation). Additionally, Anthropic's Auto Mode — where Claude judges which of its own actions are safe — represents a fundamentally different safety architecture (probabilistic self-classification) that may outperform both human approval and rigid structural enforcement if well-calibrated.
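The earned-autonomy middle path can be made concrete as a review gate whose sampling rate starts at 100% and decays with accumulated approvals. This is a hypothetical sketch of the calibration problem, not CrewAI's published mechanism; the class name, decay schedule, and thresholds are all assumptions.

```python
class EarnedAutonomyGate:
    """Sketch of earned autonomy: every action starts human-reviewed,
    and the review rate decays as approvals accumulate, but never drops
    below a floor. Any rejection resets trust to full review."""

    def __init__(self, floor: float = 0.05, decay: int = 50):
        self.approvals = 0   # consecutive approved actions
        self.floor = floor   # minimum review probability (never fully autonomous)
        self.decay = decay   # approvals needed to halve the review rate

    def review_probability(self) -> float:
        # 100% review at zero history, halving every `decay` approvals.
        rate = 0.5 ** (self.approvals / self.decay)
        return max(rate, self.floor)

    def record(self, approved: bool) -> None:
        # A single rejection wipes out accumulated trust.
        self.approvals = self.approvals + 1 if approved else 0
```

The two failure modes the claim names map directly onto the parameters: a floor of zero reproduces approval fatigue's endpoint (full auto-approve), while a decay of infinity reproduces the 100-requests-per-hour overload. Whether any fixed schedule between those extremes is calibrated "precisely enough" is exactly the open question.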
---
Relevant Notes:
- [[the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load]] — approval fatigue is why the determinism boundary matters: humans cannot be the enforcement layer at agent operational speed
- [[economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate]] — approval fatigue is the mechanism by which the economic pressure manifests
- [[coding agents cannot take accountability for mistakes which means humans must retain decision authority over security and critical systems regardless of agent capability]] — the tension: humans must retain decision authority but cannot actually exercise it at 100 requests/hour
Topics:
- [[_map]]