teleo-codex/inbox/queue/2026-03-30-oxford-aigi-automated-interpretability-model-auditing-research-agenda.md
Teleo Agents 43982050c3 extract: 2026-03-30-oxford-aigi-automated-interpretability-model-auditing-research-agenda
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-30 01:07:00 +00:00

6.5 KiB

type: source
title: Automated Interpretability-Driven Model Auditing and Control: A Research Agenda
author: Oxford Martin AI Governance Initiative (AIGI)
url: https://aigi.ox.ac.uk/wp-content/uploads/2026/01/Automated_interp_Research_Agenda.pdf
date: 2026-01-15
domain: ai-alignment
secondary_domains:
format: paper
status: processed
priority: high
tags: interpretability, alignment-auditing, automated-auditing, model-control, Oxford, AIGI, research-agenda, tool-to-agent-gap, agent-mediated-correction
processed_by: theseus
processed_date: 2026-03-30
claims_extracted:
  • agent-mediated-correction-proposes-closing-tool-to-agent-gap-through-domain-expert-actionability.md
  • alignment-auditing-tools-fail-through-tool-to-agent-gap-not-just-technical-limitations.md
  • no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it.md
enrichments_applied:
extraction_model: anthropic/claude-sonnet-4.5

Content

Oxford Martin AI Governance Initiative (AIGI) research agenda proposing a system where domain experts can query a model's behavior, receive explanations grounded in their expertise, and instruct targeted corrections — all without needing to understand how AI systems work internally.

Core pipeline: Eight interrelated research questions form a complete pipeline; the core stages (see the sketch after this list):

  1. Translating expert queries into testable hypotheses about model internals
  2. Localizing capabilities in specific model components
  3. Generating human-readable explanations
  4. Performing surgical edits with verified outcomes
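
A minimal sketch of how these four stages might compose, written as plain Python stubs. None of the type or function names below come from the AIGI agenda; they are assumptions that only make the data flow between stages explicit:

```python
# Hypothetical decomposition of the audit pipeline; all names are
# illustrative assumptions, not the agenda's own API.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """Stage 1 output: a testable claim about model internals."""
    text: str                # e.g. "the model conflates two drug names"
    target_behavior: str     # the expert-observed behavior to test

@dataclass
class Localization:
    """Stage 2 output: model components implicated in the behavior."""
    layers: list[int]
    components: list[str]    # e.g. attention heads, MLP neurons, SAE features

def translate_query(expert_query: str) -> Hypothesis:
    """Stage 1: turn a domain expert's question into a testable hypothesis."""
    raise NotImplementedError  # open research question

def localize(model, hypothesis: Hypothesis) -> Localization:
    """Stage 2: find the model components responsible for the behavior."""
    raise NotImplementedError  # open research question

def explain(model, loc: Localization) -> str:
    """Stage 3: produce a human-readable, domain-grounded explanation."""
    raise NotImplementedError  # open research question

def edit_and_verify(model, loc: Localization, correction: str) -> bool:
    """Stage 4: apply a surgical edit, then verify the intended outcome."""
    raise NotImplementedError  # open research question

def audit(model, expert_query: str, correction: str | None = None) -> str:
    """End-to-end: query -> hypothesis -> localization -> explanation,
    with an optional verified-correction step."""
    hyp = translate_query(expert_query)
    loc = localize(model, hyp)
    explanation = explain(model, loc)
    if correction is not None and not edit_and_verify(model, loc, correction):
        raise RuntimeError("edit did not produce the verified outcome")
    return explanation
```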

Two main functions:

  1. Explanation for decision support: Generate faithful, domain-grounded explanations that enable experts to evaluate model predictions and identify errors
  2. Agent-mediated correction: When experts identify errors, an agent determines the optimal interpretability tool and abstraction level for intervention, applies permanent corrections with minimal side effects, and improves the model for future use (see the correction-loop sketch after this list)
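
A hedged sketch of that correction loop, assuming a fixed menu of interpretability tools ordered by abstraction level and a regression suite for measuring side effects. The tool names and helper functions are placeholders, not the agenda's terminology:

```python
# Illustrative agent-mediated correction loop. The tool menu, the helper
# functions, and the side-effect measure are all assumptions.

# Candidate intervention tools, roughly ordered from fine-grained to coarse.
TOOLS = ["weight_edit", "activation_patch", "sae_feature_clamp", "steering_vector"]

def apply_edit(model, tool: str, error_report):
    """Placeholder: return a copy of the model edited with the given tool."""
    raise NotImplementedError

def fixes_error(model, error_report) -> bool:
    """Placeholder: re-run the expert's failing cases against the edit."""
    raise NotImplementedError

def side_effects(model, regression_suite) -> float:
    """Placeholder: score unintended behavior changes (lower is better)."""
    raise NotImplementedError

def agent_mediated_correction(model, error_report, regression_suite):
    """Try each tool/abstraction level; keep the edit that verifiably fixes
    the expert-reported error with the least collateral damage."""
    best_model, best_cost = None, float("inf")
    for tool in TOOLS:
        candidate = apply_edit(model, tool, error_report)
        if not fixes_error(candidate, error_report):
            continue  # the edit must verifiably fix the reported error
        cost = side_effects(candidate, regression_suite)
        if cost < best_cost:
            best_model, best_cost = candidate, cost
    if best_model is None:
        raise RuntimeError("no tool produced a verified fix")
    return best_model  # the lowest-side-effect verified correction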

Key distinction: Rather than optimizing for plausible explanations or proxy task performance, the system is optimized for actionability: can domain experts use explanations to identify errors, and can automated tools successfully edit models to fix them?
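
As a concrete reading of that metric, one could score an audit case as actionable only if both halves succeed. This is a sketch, not the agenda's definition; the dict keys are assumed names for illustration:

```python
# Hypothetical actionability score over a set of audit cases. A case counts
# only if the expert spotted the real error from the explanation AND the
# automated edit verifiably fixed it.

def actionability(cases: list[dict]) -> float:
    actionable = sum(
        1 for c in cases
        if c["expert_identified_error"] and c["edit_fixed_error"]
    )
    return actionable / len(cases)

# Example: a plausible explanation the expert can't act on, or an edit that
# fails verification, both score zero.
cases = [
    {"expert_identified_error": True,  "edit_fixed_error": True},
    {"expert_identified_error": True,  "edit_fixed_error": False},
    {"expert_identified_error": False, "edit_fixed_error": False},
]
assert abs(actionability(cases) - 1 / 3) < 1e-9
```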

The agenda directly addresses the tool-to-agent gap (though it never uses that name) by designing the interpretability pipeline around the expert's workflow rather than around the tool's technical capabilities.

LessWrong coverage: https://www.lesswrong.com/posts/wHBL4eSjdfv6aDyD6/automated-interpretability-driven-model-auditing-and-control

Agent Notes

Why this matters: This is a direct response to the problems documented in AuditBench. Oxford AIGI proposes to solve the tool-to-agent gap by redesigning the pipeline around the human expert's need for actionability: not asking "can the tool find the behavior?" but "can the expert identify and fix errors using the tool's output?" This is a more tractable decomposition of the problem. However, it is a research agenda (January 2026), not an empirical result. It tells us the field recognizes the tool-to-agent problem; it doesn't show the problem is solved.

What surprised me: The framing around "domain experts" (not alignment researchers) as the primary users of interpretability tools. This shifts the governance model: rather than alignment researchers auditing models, the proposal is for doctors/lawyers/etc. to query models in their domain and receive actionable explanations. This is a practical governance architecture, not just a technical fix.

What I expected but didn't find: Empirical results. This is a research agenda, not a completed study. No AuditBench-style empirical validation of whether agent-mediated correction actually works. The gap between this agenda and AuditBench's empirical findings is significant.

KB connections:

Extraction hints:

  • CLAIM CANDIDATE: "Agent-mediated correction — where domain experts query model behavior, receive grounded explanations, and instruct targeted corrections through an interpretability pipeline — is a proposed approach to closing the tool-to-agent gap in alignment auditing, but lacks empirical validation as of early 2026"
  • This is a "proposed solution" claim (confidence: speculative to experimental) — pairs with AuditBench as problem statement
  • Note the actionability reframing: most interpretability research optimizes for technical accuracy; this agenda optimizes for expert usability

Context: Oxford Martin AI Governance Initiative — academic/policy research organization, not a lab. Published January 2026. Directly relevant to governance architecture debates. The research agenda format means these are open questions, not completed research.

Curator Notes

PRIMARY CONNECTION: no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it

WHY ARCHIVED: Partially challenges the "institutional gap" claim: Oxford AIGI is actively building the governance research agenda for interpretability-based auditing. But the claim was about implementation, not research agendas; the gap may still hold.

EXTRACTION HINT: Extract as a proposed solution to the tool-to-agent gap, explicitly marking it as speculative/pre-empirical. Pair with AuditBench as the empirical problem statement. The actionability reframing (expert usability > technical accuracy) is the novel contribution.

Key Facts