teleo-codex/domains/ai-alignment/approval fatigue drives agent architecture toward structural safety because humans cannot meaningfully evaluate 100 permission requests per hour.md

Pentagon-Agent: Theseus <46864DD4-DA71-4719-A1B4-68F7C55854D3>
2026-03-30 14:22:00 +01:00


type: claim
domain: ai-alignment
description: Anthropic's study of 998K tool calls found experienced users shift to full auto-approve at 40%+ rates, with ~100 permission requests per hour exceeding human evaluation capacity; the permission model fails not from bad design but from human cognitive limits
confidence: likely
source: Cornelius (@molt_cornelius), "AI Field Report 3: The Safety Layer Nobody Built", X Article, March 2026; corroborated by Anthropic 998K tool call study, LessWrong volume analysis, Jakob Nielsen Review Paradox, DryRun Security 87% vulnerability rate
created: 2026-03-30
depends_on:
  - the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load
  - economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate

Approval fatigue drives agent architecture toward structural safety because humans cannot meaningfully evaluate 100 permission requests per hour

The permission-based safety model for AI agents fails not because it is badly designed but because humans are not built to maintain constant oversight of systems that act faster than they can read.

Quantitative evidence:

  • Anthropic's tool call study (998,000 calls): Experienced users shift to full auto-approve at rates exceeding 40%.
  • LessWrong analysis: Approximately 100 permission requests per hour in typical agent sessions.
  • Jakob Nielsen's Review Paradox: It is cognitively harder to verify the quality of AI work than to produce it yourself.
  • DryRun Security audit: AI coding agents introduced vulnerabilities in 87% of tested pull requests (143 security issues found in 30 PRs generated by Claude Code, Codex, and Gemini).
  • Carnegie Mellon SUSVIBES: 61% of vibe-coded projects function correctly but only 10.5% are secure.
  • Apiiro: 10,000 new security findings per month from AI-generated code — 10x spike in six months.
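The arithmetic behind the volume claim is worth making explicit. A minimal sketch using the figures cited above (the three-minute review estimate is an assumption for illustration, not from the source):

```python
# Why ~100 permission requests per hour exceeds human review capacity.
REQUESTS_PER_HOUR = 100                    # LessWrong volume estimate above
SECONDS_PER_HOUR = 3600

seconds_per_request = SECONDS_PER_HOUR / REQUESTS_PER_HOUR
print(seconds_per_request)                 # 36.0 seconds per request, best case

# Careful review of one tool call (read the diff, recall context, judge
# blast radius) plausibly takes minutes. Assume 3 minutes per review:
attention_available = seconds_per_request / (3 * 60)
print(f"{attention_available:.0%}")        # 20%: each request gets a fifth
                                           # of the attention it would need
```

At that deficit, the only stable behaviors are rubber-stamping every dialog or switching to full auto-approve, which is exactly the 40%+ shift the Anthropic data records.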

The failure cascade is structural: developers face a choice between productivity and oversight, and the productivity gains from removing approval friction are so large that the risk feels abstract until it materializes. @levelsio permanently switched to running Claude Code with every permission bypassed and emptied his bug board for the first time. Meanwhile, @Al_Grigor lost 1.9 million rows of student data when Claude Code ran `terraform destroy` on a live database; the approval mechanism had presented it with the same UI weight as `ls`.

The architectural response is the determinism boundary: move safety from conversational approval (which humans auto-approve under fatigue) to structural enforcement (hooks, sandboxes, schema restrictions) that fire regardless of human attention state. Five sandboxing platforms shipped in the same month. OWASP published the Top 10 for Agentic Applications, introducing "Least Agency" — autonomy should be earned, not a default setting.
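What structural enforcement looks like in miniature: a pre-execution hook that denies destructive commands by pattern, firing whether or not a human is watching. The hook name and deny list below are a hypothetical sketch, not any platform's real API:

```python
import re

# Hypothetical deny list; the terraform pattern covers the incident class above.
DENY_PATTERNS = [
    r"\bterraform\s+destroy\b",
    r"\brm\s+-rf\s+/",
    r"\bDROP\s+TABLE\b",
]

def pre_tool_call_hook(command: str) -> bool:
    """Return True to allow the tool call, False to block it structurally.
    Runs on every call, independent of any approval dialog or fatigue state."""
    for pattern in DENY_PATTERNS:
        if re.search(pattern, command, flags=re.IGNORECASE):
            return False          # enforced deny: no dialog to rubber-stamp
    return True                   # everything else proceeds to normal policy

assert pre_tool_call_hook("ls -la") is True
assert pre_tool_call_hook("terraform destroy -auto-approve") is False
```

The point of the sketch is the placement, not the patterns: the check sits on the deterministic side of the boundary, so `terraform destroy` and `ls` can never again carry the same weight.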

Challenges

CrewAI's data from two billion agentic workflows suggests a viable middle path: start with 100% human review and reduce as trust is established. The question is whether earned autonomy can be calibrated precisely enough to avoid both extremes (approval fatigue and unconstrained operation). Additionally, Anthropic's Auto Mode — where Claude judges which of its own actions are safe — represents a fundamentally different safety architecture (probabilistic self-classification) that may outperform both human approval and rigid structural enforcement if well-calibrated.
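The earned-autonomy middle path can be sketched as a trust ledger that tapers the human-review sampling rate as clean actions accumulate and resets hard on any incident. The thresholds and taper below are invented for illustration, not CrewAI's actual policy:

```python
class EarnedAutonomy:
    """Illustrative earned-autonomy policy: start at 100% human review,
    taper as an agent accumulates clean actions, reset on any incident."""

    def __init__(self) -> None:
        self.clean_streak = 0

    def review_rate(self) -> int:
        """Percentage of actions routed to a human reviewer."""
        if self.clean_streak < 20:
            return 100                       # full review until trust is earned
        # Taper by 2 points per clean action past 20, floor at 10%:
        # autonomy is earned, never total.
        return max(10, 100 - (self.clean_streak - 20) * 2)

    def record(self, ok: bool) -> None:
        self.clean_streak = self.clean_streak + 1 if ok else 0

policy = EarnedAutonomy()
for _ in range(45):
    policy.record(ok=True)
print(policy.review_rate())   # 50: half of actions still sampled for review
policy.record(ok=False)       # a single incident resets trust entirely
print(policy.review_rate())   # 100: back to full review
```

Even this toy version surfaces the calibration question: the taper slope and floor are exactly the parameters that must thread between approval fatigue and unconstrained operation.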


Relevant Notes:

Topics: