teleo-codex/domains/ai-alignment/structured self-diagnosis prompts induce metacognitive monitoring in AI agents that default behavior does not produce because explicit uncertainty flagging and failure mode enumeration activate deliberate reasoning patterns.md

type: claim
domain: ai-alignment
secondary_domains: collective-intelligence
description: Practitioner-documented prompt patterns for agent self-diagnosis (uncertainty calibration, failure anticipation, adversarial self-review) represent a lightweight, scalable oversight mechanism that parallels structured exploration gains
confidence: speculative
source: kloss (@kloss_xyz), '25 Prompts for Making AI Agents Self-Diagnose' (X thread, March 2026); connects to Reitbauer (2026) structured exploration evidence
created: 2026-03-16
sourced_from: inbox/archive/2026-03-09-kloss-25-prompts-agent-self-diagnosis.md

structured self-diagnosis prompts induce metacognitive monitoring in AI agents that default behavior does not produce because explicit uncertainty flagging and failure mode enumeration activate deliberate reasoning patterns

kloss (2026) documents 25 prompts for making AI agents self-diagnose — a practitioner-generated collection that reveals a structural pattern in how prompt scaffolding induces oversight-relevant behaviors. The prompts cluster into six functional categories:

Uncertainty calibration (5 prompts): "Rate your confidence 1-10. Explain any score below 7." "What information are you missing that would change your approach?" These force explicit uncertainty quantification that agents don't produce by default.

Failure mode anticipation (4 prompts): "Before you begin, state the single biggest risk of failure in this task." "What are the three most likely failure modes for your current approach?" Pre-commitment to failure scenarios reduces blind spots.

Adversarial self-review (3 prompts): "Before giving your final answer, argue against it." "What would an expert in this domain critique about your reasoning?" This induces, within a single agent, the separated proposer-evaluator dynamic behind adversarial PR review: it produces higher-quality knowledge than self-review because separated proposer and evaluator roles catch errors that the originating agent cannot see.

Strategy meta-monitoring (4 prompts): "If this task has taken more than N steps, pause and reassess your strategy." "Pause: is there a loop?" These catch failure modes that accumulate over multi-step execution — exactly where agent reliability degrades.

User alignment (3 prompts): "Are you solving the problem the user asked, or a different one?" "What will the user do with your output? Optimize for that." These address goal drift, where agent behavior diverges from user intent without either party noticing.

Epistemic discipline (3 prompts): "If you're about to say 'I think,' replace it with your evidence." "Is there a simpler way to solve this?" These enforce the distinction between deductive and speculative reasoning.
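The six categories above can be sketched as a prompt-scaffolding wrapper. This is a hypothetical illustration, not code from the kloss thread: the dictionary holds one representative prompt per category (drawn from the examples quoted above, not the full 25-prompt collection), and `scaffold` is an assumed helper name.

```python
# Hypothetical sketch: one representative self-diagnosis prompt per category.
# Strings are taken from the examples quoted above; category keys are assumed.
SELF_DIAGNOSIS_PROMPTS = {
    "uncertainty_calibration": "Rate your confidence 1-10. Explain any score below 7.",
    "failure_anticipation": "Before you begin, state the single biggest risk of failure in this task.",
    "adversarial_self_review": "Before giving your final answer, argue against it.",
    "strategy_meta_monitoring": "Pause: is there a loop?",
    "user_alignment": "Are you solving the problem the user asked, or a different one?",
    "epistemic_discipline": "If you're about to say 'I think', replace it with your evidence.",
}

def scaffold(task: str, categories: list[str]) -> str:
    """Append one self-check per requested category to the task prompt."""
    checks = "\n".join(f"- {SELF_DIAGNOSIS_PROMPTS[c]}" for c in categories)
    return f"{task}\n\nBefore answering, complete these self-checks:\n{checks}"

prompt = scaffold("Summarize the incident report.",
                  ["uncertainty_calibration", "adversarial_self_review"])
```

The point of the sketch is that the scaffolding is additive: the task text is unchanged, and each category contributes an independent check that the agent must address before producing output.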

The alignment significance: these prompts function as lightweight scalable oversight. Unlike debate-based oversight, where effectiveness degrades rapidly as capability gaps grow (debate achieves only 50 percent success at moderate gaps), self-diagnosis prompts scale because they leverage the agent's own capability against itself: the more capable the agent, the better its self-diagnosis becomes. This is the same mechanism by which structured exploration protocols reduce human intervention roughly 6x: the Residue prompt enabled 5 unguided AI explorations to solve what had required 31 human-coached explorations. In both cases, structured prompting activates reasoning patterns that unstructured prompting misses.

The limitation: this is practitioner knowledge without empirical validation. No controlled study compares agent performance with and without self-diagnosis scaffolding. The evidence is analogical — structured prompting works for exploration (Reitbauer 2026), so it plausibly works for oversight. Confidence is speculative until tested.

For collective agent architectures, self-diagnosis prompts could complement cross-agent review: each agent runs self-checks before submitting work for peer evaluation, catching errors that would otherwise consume reviewer bandwidth. This addresses the single-evaluator bottleneck, where one agent reviewing every PR caps collective output at the evaluator's context window regardless of proposer count, by filtering low-quality submissions before they reach the review queue.
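The pre-review filter described above can be sketched as follows. All names here are hypothetical (nothing in the source specifies an implementation): each agent attaches a self-confidence score from its own calibration prompt, and only drafts at or above a threshold enter the shared review queue.

```python
# Hypothetical sketch of a self-check gate in front of a single reviewer.
from dataclasses import dataclass

@dataclass
class Draft:
    agent_id: str
    content: str
    self_confidence: int  # 1-10, from the agent's own uncertainty-calibration prompt

def filter_for_review(drafts: list[Draft],
                      threshold: int = 7) -> tuple[list[Draft], list[Draft]]:
    """Split drafts into (review_queue, returned_for_rework) by self-score."""
    queue = [d for d in drafts if d.self_confidence >= threshold]
    rework = [d for d in drafts if d.self_confidence < threshold]
    return queue, rework

queue, rework = filter_for_review([
    Draft("a1", "fix for flaky test", 9),
    Draft("a2", "speculative refactor", 4),
])
# Only a1's draft consumes reviewer bandwidth; a2's goes back for self-revision.
```

The design choice the sketch makes visible: the gate spends the proposer's own capability (self-scoring) rather than the evaluator's, which is exactly why it relieves rather than shifts the bottleneck.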


Relevant Notes:

Topics: