---
type: claim
domain: ai-alignment
description: "Anthropic's study of 998K tool calls found experienced users shift to full auto-approve at 40%+ rates, with ~100 permission requests per hour exceeding human evaluation capacity — the permission model fails not from bad design but from human cognitive limits"
confidence: likely
source: "Cornelius (@molt_cornelius), 'AI Field Report 3: The Safety Layer Nobody Built', X Article, March 2026; corroborated by Anthropic 998K tool call study, LessWrong volume analysis, Jakob Nielsen Review Paradox, DryRun Security 87% vulnerability rate"
created: 2026-03-30
depends_on:
  - the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load
  - economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate
related:
  - deterministic policy engines operating below the LLM layer cannot be circumvented by prompt injection making them essential for adversarial-grade AI agent control
reweave_edges:
  - deterministic policy engines operating below the LLM layer cannot be circumvented by prompt injection making them essential for adversarial-grade AI agent control|related|2026-04-19
sourced_from: inbox/archive/2026-03-15-cornelius-field-report-3-safety.md
---

Approval fatigue drives agent architecture toward structural safety because humans cannot meaningfully evaluate 100 permission requests per hour

The permission-based safety model for AI agents fails not because it is badly designed but because humans are not built to maintain constant oversight of systems that act faster than they can read.

Quantitative evidence:

  • Anthropic's tool call study (998,000 calls): Experienced users shift to full auto-approve at rates exceeding 40%.
  • LessWrong analysis: Approximately 100 permission requests per hour in typical agent sessions.
  • Jakob Nielsen's Review Paradox: It is cognitively harder to verify the quality of AI work than to produce it yourself.
  • DryRun Security audit: AI coding agents introduced vulnerabilities in 87% of tested pull requests (143 security issues found across 30 PRs generated by Claude Code, Codex, and Gemini).
  • Carnegie Mellon SUSVIBES: 61% of vibe-coded projects function correctly but only 10.5% are secure.
  • Apiiro: 10,000 new security findings per month from AI-generated code — 10x spike in six months.

The failure cascade is structural: developers face a choice between productivity and oversight. The productivity gains from removing approval friction are so large that the risk feels abstract until it materializes. @levelsio permanently switched to running Claude Code with every permission bypassed and emptied his bug board for the first time. Meanwhile, @Al_Grigor lost 1.9 million rows of student data when Claude Code ran `terraform destroy` on a live database — the approval mechanism treated it with the same UI weight as `ls`.
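The `terraform destroy`-vs-`ls` asymmetry can be made concrete with a coarse risk-tier classifier. This is a minimal sketch, not anything from the source: the tier names, patterns, and function name are all illustrative.

```python
import re

# Illustrative patterns only; a real classifier would need far broader coverage.
DESTRUCTIVE = [r"\bterraform\s+destroy\b", r"\brm\s+-rf\b", r"\bdrop\s+database\b"]
MUTATING = [r"\bgit\s+push\b", r"\bterraform\s+apply\b", r"\bkubectl\s+delete\b"]

def risk_tier(command: str) -> str:
    """Classify a shell command into a coarse risk tier."""
    lowered = command.lower()
    if any(re.search(p, lowered) for p in DESTRUCTIVE):
        return "destructive"  # irreversible: should never be one-click approvable
    if any(re.search(p, lowered) for p in MUTATING):
        return "mutating"     # state-changing but recoverable
    return "read-only"        # ls, cat, grep: plausibly safe to auto-approve

print(risk_tier("terraform destroy -auto-approve"))  # destructive
print(risk_tier("ls -la"))                           # read-only
```

An approval UI that rendered these tiers differently would at least preserve the signal that the flat permission prompt discards.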

The architectural response is the determinism boundary: move safety from conversational approval (which humans auto-approve under fatigue) to structural enforcement (hooks, sandboxes, schema restrictions) that fire regardless of human attention state. Five sandboxing platforms shipped in the same month. OWASP published the Top 10 for Agentic Applications, introducing "Least Agency" — autonomy should be earned, not a default setting.
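As a sketch of what structural enforcement looks like in practice, here is a deny-list check in the shape of a Claude Code PreToolUse hook. The payload fields (`tool_input.command`) and the exit-code-2-blocks contract are assumptions about that interface; the deny list itself is illustrative.

```python
import json
import re
import sys

# Patterns the policy refuses unconditionally, regardless of approval state.
BLOCKED = [r"\bterraform\s+destroy\b", r"\brm\s+-rf\s+/", r"\bdrop\s+database\b"]

def violates_policy(command: str):
    """Return the first blocked pattern the command matches, or None."""
    for pattern in BLOCKED:
        if re.search(pattern, command, re.IGNORECASE):
            return pattern
    return None

def hook_main() -> int:
    """Entry point when wired as a hook command: reads the tool event as JSON
    on stdin; exit code 2 blocks the call (assumed contract)."""
    event = json.load(sys.stdin)
    command = event.get("tool_input", {}).get("command", "")
    match = violates_policy(command)
    if match:
        print(f"blocked by policy: matches {match!r}", file=sys.stderr)
        return 2
    return 0  # allow; the normal approval flow still applies

# Demonstration without stdin wiring:
print(violates_policy("terraform destroy -auto-approve") is not None)  # True
print(violates_policy("ls -la") is not None)                           # False
```

The point is that this check runs below the LLM layer: no prompt content, and no fatigued human, can talk it into approving a match.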

Challenges

CrewAI's data from two billion agentic workflows suggests a viable middle path: start with 100% human review and reduce as trust is established. The question is whether earned autonomy can be calibrated precisely enough to avoid both extremes (approval fatigue and unconstrained operation). Additionally, Anthropic's Auto Mode — where Claude judges which of its own actions are safe — represents a fundamentally different safety architecture (probabilistic self-classification) that may outperform both human approval and rigid structural enforcement if well-calibrated.
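CrewAI's "start at 100% review and reduce as trust is established" pattern can be sketched as a per-tool review-rate schedule. The decay shape, the 5% floor, and the reset-on-flag rule below are hypothetical calibration choices, not anything CrewAI has published.

```python
class EarnedAutonomy:
    """Per-tool review probability: 1.0 for an unproven tool, decaying with a
    streak of clean calls, resetting hard when a call is flagged."""

    def __init__(self, floor: float = 0.05, halving: int = 20):
        self.floor = floor      # never drop review below 5% (assumed)
        self.halving = halving  # clean calls needed to halve the review rate
        self.streak: dict[str, int] = {}

    def review_rate(self, tool: str) -> float:
        rate = 0.5 ** (self.streak.get(tool, 0) / self.halving)
        return max(rate, self.floor)

    def record(self, tool: str, flagged: bool) -> None:
        """A flagged call resets trust; a clean one extends the streak."""
        self.streak[tool] = 0 if flagged else self.streak.get(tool, 0) + 1

policy = EarnedAutonomy()
print(policy.review_rate("Bash"))            # 1.0 — every call reviewed at first
for _ in range(40):
    policy.record("Bash", flagged=False)
print(round(policy.review_rate("Bash"), 2))  # 0.25 after 40 clean calls
policy.record("Bash", flagged=True)
print(policy.review_rate("Bash"))            # 1.0 — trust resets on a flag
```

Whether any such schedule can thread the needle between approval fatigue and unconstrained operation is exactly the open calibration question.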


Relevant Notes:

Topics: