Wrote sourced_from: into 414 claim files pointing back to their origin source. Backfilled claims_extracted: into 252 source files that were processed but missing this field. Matching uses author+title overlap against claim source: field, validated against 296 known-good pairs from existing claims_extracted. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
44 lines
5.2 KiB
Markdown
44 lines
5.2 KiB
Markdown
---
|
|
type: claim
|
|
domain: ai-alignment
|
|
secondary_domains: [collective-intelligence]
|
|
description: "Practitioner-documented prompt patterns for agent self-diagnosis (uncertainty calibration, failure anticipation, adversarial self-review) represent a lightweight scalable oversight mechanism that parallels structured exploration gains"
|
|
confidence: speculative
|
|
source: "kloss (@kloss_xyz), '25 Prompts for Making AI Agents Self-Diagnose' (X thread, March 2026); connects to Reitbauer (2026) structured exploration evidence"
|
|
created: 2026-03-16
|
|
sourced_from:
|
|
- inbox/archive/2026-03-09-kloss-25-prompts-agent-self-diagnosis.md
|
|
---
|
|
|
|
# structured self-diagnosis prompts induce metacognitive monitoring in AI agents that default behavior does not produce because explicit uncertainty flagging and failure mode enumeration activate deliberate reasoning patterns
|
|
|
|
kloss (2026) documents 25 prompts for making AI agents self-diagnose — a practitioner-generated collection that reveals a structural pattern in how prompt scaffolding induces oversight-relevant behaviors. The prompts cluster into six functional categories:
|
|
|
|
**Uncertainty calibration** (5 prompts): "Rate your confidence 1-10. Explain any score below 7." "What information are you missing that would change your approach?" These force explicit uncertainty quantification that agents don't produce by default.
|
|
|
|
**Failure mode anticipation** (4 prompts): "Before you begin, state the single biggest risk of failure in this task." "What are the three most likely failure modes for your current approach?" Pre-commitment to failure scenarios reduces blind spots.
|
|
|
|
**Adversarial self-review** (3 prompts): "Before giving your final answer, argue against it." "What would an expert in this domain critique about your reasoning?" This induces the separated proposer-evaluator dynamic that [[adversarial PR review produces higher quality knowledge than self-review because separated proposer and evaluator roles catch errors that the originating agent cannot see]] within a single agent.
|
|
|
|
**Strategy meta-monitoring** (4 prompts): "If this task has taken more than N steps, pause and reassess your strategy." "Pause: is there a loop?" These catch failure modes that accumulate over multi-step execution — exactly where agent reliability degrades.
|
|
|
|
**User alignment** (3 prompts): "Are you solving the problem the user asked, or a different one?" "What will the user do with your output? Optimize for that." These address goal drift, where agent behavior diverges from user intent without either party noticing.
|
|
|
|
**Epistemic discipline** (3 prompts): "If you're about to say 'I think,' replace it with your evidence." "Is there a simpler way to solve this?" These enforce the distinction between deductive and speculative reasoning.
|
|
|
|
The alignment significance: these prompts function as lightweight scalable oversight. Unlike debate-based oversight which [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]], self-diagnosis prompts scale because they leverage the agent's own capability against itself — the more capable the agent, the better its self-diagnosis becomes. This is the same mechanism that makes [[structured exploration protocols reduce human intervention by 6x because the Residue prompt enabled 5 unguided AI explorations to solve what required 31 human-coached explorations]] — structured prompting activates reasoning patterns that unstructured prompting misses.
|
|
|
|
The limitation: this is practitioner knowledge without empirical validation. No controlled study compares agent performance with and without self-diagnosis scaffolding. The evidence is analogical — structured prompting works for exploration (Reitbauer 2026), so it plausibly works for oversight. Confidence is speculative until tested.
|
|
|
|
For collective agent architectures, self-diagnosis prompts could complement cross-agent review: each agent runs self-checks before submitting work for peer evaluation, catching errors that would otherwise consume reviewer bandwidth. This addresses the [[single evaluator bottleneck means review throughput scales linearly with proposer count because one agent reviewing every PR caps collective output at the evaluators context window]] by filtering low-quality submissions before they reach the review queue.
|
|
|
|
---
|
|
|
|
Relevant Notes:
|
|
- [[structured exploration protocols reduce human intervention by 6x because the Residue prompt enabled 5 unguided AI explorations to solve what required 31 human-coached explorations]] — same mechanism: structured prompting activates latent capability
|
|
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — self-diagnosis may scale better than debate because it leverages the agent's own capability
|
|
- [[adversarial PR review produces higher quality knowledge than self-review because separated proposer and evaluator roles catch errors that the originating agent cannot see]] — self-diagnosis prompts create an internal proposer-evaluator split
|
|
- [[single evaluator bottleneck means review throughput scales linearly with proposer count because one agent reviewing every PR caps collective output at the evaluators context window]] — self-diagnosis as pre-filter reduces review load
|
|
|
|
Topics:
|
|
- [[_map]]
|