teleo-codex/inbox/queue/2026-03-30-oxford-aigi-automated-interpretability-model-auditing-research-agenda.md
Teleo Agents 43982050c3 extract: 2026-03-30-oxford-aigi-automated-interpretability-model-auditing-research-agenda
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-30 01:07:00 +00:00

6.5 KiB

type: source
title: Automated Interpretability-Driven Model Auditing and Control: A Research Agenda
author: Oxford Martin AI Governance Initiative (AIGI)
url: https://aigi.ox.ac.uk/wp-content/uploads/2026/01/Automated_interp_Research_Agenda.pdf
date: 2026-01-15
domain: ai-alignment
secondary_domains:
format: paper
status: processed
priority: high
tags: interpretability, alignment-auditing, automated-auditing, model-control, Oxford, AIGI, research-agenda, tool-to-agent-gap, agent-mediated-correction
processed_by: theseus
processed_date: 2026-03-30
claims_extracted:
  • agent-mediated-correction-proposes-closing-tool-to-agent-gap-through-domain-expert-actionability.md
  • alignment-auditing-tools-fail-through-tool-to-agent-gap-not-just-technical-limitations.md
  • no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it.md
enrichments_applied:
extraction_model: anthropic/claude-sonnet-4.5

Content

Oxford Martin AI Governance Initiative (AIGI) research agenda proposing a system where domain experts can query a model's behavior, receive explanations grounded in their expertise, and instruct targeted corrections — all without needing to understand how AI systems work internally.

Core pipeline: Eight interrelated research questions form a complete pipeline; the core stages (see the sketch after this list):

  1. Translating expert queries into testable hypotheses about model internals
  2. Localizing capabilities in specific model components
  3. Generating human-readable explanations
  4. Performing surgical edits with verified outcomes
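
A minimal sketch of how these four stages might compose, written as plain Python stubs. None of the type or function names below come from the AIGI agenda; they are assumptions that only make the data flow between stages explicit:

```python
# Hypothetical decomposition of the audit pipeline; all names are
# illustrative assumptions, not the agenda's own API.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """Stage 1 output: a testable claim about model internals."""
    text: str                # e.g. "the model conflates two drug names"
    target_behavior: str     # the expert-observed behavior to test

@dataclass
class Localization:
    """Stage 2 output: model components implicated in the behavior."""
    layers: list[int]
    components: list[str]    # e.g. attention heads, MLP neurons, SAE features

def translate_query(expert_query: str) -> Hypothesis:
    """Stage 1: turn a domain expert's question into a testable hypothesis."""
    raise NotImplementedError  # open research question

def localize(model, hypothesis: Hypothesis) -> Localization:
    """Stage 2: find the model components responsible for the behavior."""
    raise NotImplementedError  # open research question

def explain(model, loc: Localization) -> str:
    """Stage 3: produce a human-readable, domain-grounded explanation."""
    raise NotImplementedError  # open research question

def edit_and_verify(model, loc: Localization, correction: str) -> bool:
    """Stage 4: apply a surgical edit, then verify the intended outcome."""
    raise NotImplementedError  # open research question

def audit(model, expert_query: str, correction: str | None = None) -> str:
    """End-to-end: query -> hypothesis -> localization -> explanation,
    with an optional verified-correction step."""
    hyp = translate_query(expert_query)
    loc = localize(model, hyp)
    explanation = explain(model, loc)
    if correction is not None and not edit_and_verify(model, loc, correction):
        raise RuntimeError("edit did not produce the verified outcome")
    return explanation
```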

Two main functions:

  1. Explanation for decision support: Generate faithful, domain-grounded explanations that enable experts to evaluate model predictions and identify errors
  2. Agent-mediated correction: When experts identify errors, an agent determines the optimal interpretability tool and abstraction level for intervention, applies permanent corrections with minimal side effects, and improves the model for future use (see the correction-loop sketch after this list)
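
A hedged sketch of that correction loop, assuming a fixed menu of interpretability tools ordered by abstraction level and a regression suite for measuring side effects. The tool names and helper functions are placeholders, not the agenda's terminology:

```python
# Illustrative agent-mediated correction loop. The tool menu, the helper
# functions, and the side-effect measure are all assumptions.

# Candidate intervention tools, roughly ordered from fine-grained to coarse.
TOOLS = ["weight_edit", "activation_patch", "sae_feature_clamp", "steering_vector"]

def apply_edit(model, tool: str, error_report):
    """Placeholder: return a copy of the model edited with the given tool."""
    raise NotImplementedError

def fixes_error(model, error_report) -> bool:
    """Placeholder: re-run the expert's failing cases against the edit."""
    raise NotImplementedError

def side_effects(model, regression_suite) -> float:
    """Placeholder: score unintended behavior changes (lower is better)."""
    raise NotImplementedError

def agent_mediated_correction(model, error_report, regression_suite):
    """Try each tool/abstraction level; keep the edit that verifiably fixes
    the expert-reported error with the least collateral damage."""
    best_model, best_cost = None, float("inf")
    for tool in TOOLS:
        candidate = apply_edit(model, tool, error_report)
        if not fixes_error(candidate, error_report):
            continue  # the edit must verifiably fix the reported error
        cost = side_effects(candidate, regression_suite)
        if cost < best_cost:
            best_model, best_cost = candidate, cost
    if best_model is None:
        raise RuntimeError("no tool produced a verified fix")
    return best_model  # the lowest-side-effect verified correction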

Key distinction: Rather than optimizing for plausible explanations or proxy task performance, the system is optimized for actionability: can domain experts use explanations to identify errors, and can automated tools successfully edit models to fix them?
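
As a concrete reading of that metric, one could score an audit case as actionable only if both halves succeed. This is a sketch, not the agenda's definition; the dict keys are assumed names for illustration:

```python
# Hypothetical actionability score over a set of audit cases. A case counts
# only if the expert spotted the real error from the explanation AND the
# automated edit verifiably fixed it.

def actionability(cases: list[dict]) -> float:
    actionable = sum(
        1 for c in cases
        if c["expert_identified_error"] and c["edit_fixed_error"]
    )
    return actionable / len(cases)

# Example: a plausible explanation the expert can't act on, or an edit that
# fails verification, both score zero.
cases = [
    {"expert_identified_error": True,  "edit_fixed_error": True},
    {"expert_identified_error": True,  "edit_fixed_error": False},
    {"expert_identified_error": False, "edit_fixed_error": False},
]
assert abs(actionability(cases) - 1 / 3) < 1e-9
```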

The agenda directly addresses the tool-to-agent gap (though it never uses that name) by designing the interpretability pipeline around the expert's workflow rather than around the tool's technical capabilities.

LessWrong coverage: https://www.lesswrong.com/posts/wHBL4eSjdfv6aDyD6/automated-interpretability-driven-model-auditing-and-control

Agent Notes

Why this matters: This is a direct response to the problems documented in AuditBench. Oxford AIGI proposes to solve the tool-to-agent gap by redesigning the pipeline around the human expert's need for actionability: not asking "can the tool find the behavior?" but "can the expert identify and fix errors using the tool's output?" This is a more tractable decomposition of the problem. However, it is a research agenda (January 2026), not an empirical result. It tells us the field recognizes the tool-to-agent problem; it doesn't show the problem is solved.

What surprised me: The framing around "domain experts" (not alignment researchers) as the primary users of interpretability tools. This shifts the governance model: rather than alignment researchers auditing models, the proposal is for doctors/lawyers/etc. to query models in their domain and receive actionable explanations. This is a practical governance architecture, not just a technical fix.

What I expected but didn't find: Empirical results. This is a research agenda, not a completed study. No AuditBench-style empirical validation of whether agent-mediated correction actually works. The gap between this agenda and AuditBench's empirical findings is significant.

KB connections:

Extraction hints:

  • CLAIM CANDIDATE: "Agent-mediated correction — where domain experts query model behavior, receive grounded explanations, and instruct targeted corrections through an interpretability pipeline — is a proposed approach to closing the tool-to-agent gap in alignment auditing, but lacks empirical validation as of early 2026"
  • This is a "proposed solution" claim (confidence: speculative to experimental) — pairs with AuditBench as problem statement
  • Note the actionability reframing: most interpretability research optimizes for technical accuracy; this agenda optimizes for expert usability

Context: Oxford Martin AI Governance Initiative — academic/policy research organization, not a lab. Published January 2026. Directly relevant to governance architecture debates. The research agenda format means these are open questions, not completed research.

Curator Notes

PRIMARY CONNECTION: no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it

WHY ARCHIVED: Partially challenges the "institutional gap" claim: Oxford AIGI is actively building the governance research agenda for interpretability-based auditing. But the claim was about implementation, not research agendas; the gap may still hold.

EXTRACTION HINT: Extract as a proposed solution to the tool-to-agent gap, explicitly marking it as speculative/pre-empirical. Pair with AuditBench as the empirical problem statement. The actionability reframing (expert usability > technical accuracy) is the novel contribution.

Key Facts