| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags | processed_by | processed_date | claims_extracted | enrichments_applied | extraction_model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Automated Interpretability-Driven Model Auditing and Control: A Research Agenda | Oxford Martin AI Governance Initiative (AIGI) | https://aigi.ox.ac.uk/wp-content/uploads/2026/01/Automated_interp_Research_Agenda.pdf | 2026-01-15 | ai-alignment | | paper | processed | high | | theseus | 2026-03-30 | | | anthropic/claude-sonnet-4.5 |
Content
Oxford Martin AI Governance Initiative (AIGI) research agenda proposing a system where domain experts can query a model's behavior, receive explanations grounded in their expertise, and instruct targeted corrections — all without needing to understand how AI systems work internally.
Core pipeline: eight interrelated research questions that together cover a complete pipeline:
- Translating expert queries into testable hypotheses about model internals
- Localizing capabilities in specific model components
- Generating human-readable explanations
- Performing surgical edits with verified outcomes
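The four stages above can be read as a chain of transformations over a single audit record. The following is only a minimal sketch of that shape: the agenda does not define an API, so `AuditRecord`, the stage function names, and the stub behavior are all hypothetical placeholders, not the authors' design.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AuditRecord:
    """Hypothetical container for one expert query moving through the pipeline."""
    expert_query: str
    hypothesis: str = ""
    components: List[str] = field(default_factory=list)
    explanation: str = ""
    edit_verified: bool = False
    stages_run: List[str] = field(default_factory=list)


Stage = Callable[[AuditRecord], AuditRecord]


def translate_query(rec: AuditRecord) -> AuditRecord:
    # Stub: turn the expert's question into a testable hypothesis about internals.
    rec.hypothesis = f"hypothesis for: {rec.expert_query}"
    rec.stages_run.append("translate")
    return rec


def localize(rec: AuditRecord) -> AuditRecord:
    # Stub: a real system would search model components; this uses a placeholder.
    rec.components = ["<component placeholder>"]
    rec.stages_run.append("localize")
    return rec


def explain(rec: AuditRecord) -> AuditRecord:
    # Stub: produce a human-readable explanation from the localized components.
    rec.explanation = f"{rec.hypothesis} involves {', '.join(rec.components)}"
    rec.stages_run.append("explain")
    return rec


def apply_edit(rec: AuditRecord) -> AuditRecord:
    # Stub: "verification" here only checks that something was localized.
    rec.edit_verified = bool(rec.components)
    rec.stages_run.append("edit")
    return rec


PIPELINE: List[Stage] = [translate_query, localize, explain, apply_edit]


def run_pipeline(query: str) -> AuditRecord:
    """Run every stage in order on a fresh record."""
    rec = AuditRecord(expert_query=query)
    for stage in PIPELINE:
        rec = stage(rec)
    return rec
```

The point of the sketch is the ordering: each stage consumes what the previous one produced, which is why a weak localization step would undermine both the explanation and the edit that follow it.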
Two main functions:
- Explanation for decision support: Generate faithful, domain-grounded explanations that enable experts to evaluate model predictions and identify errors
- Agent-mediated correction: When experts identify errors, an agent determines the optimal interpretability tool and abstraction level for intervention, applies permanent corrections with minimal side effects, and improves the model for future use
Key distinction: Rather than optimizing for plausible explanations or proxy task performance, the system is optimized for actionability: can domain experts use explanations to identify errors, and can automated tools successfully edit models to fix them?
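One way to make the actionability criterion concrete is to score it as two rates: how often experts catch model errors using the explanations, and how often the resulting edits fix those errors. This is an illustrative sketch only; the `Trial` record and the `actionability` function are assumptions made for this note, and the agenda does not specify a metric.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trial:
    """Hypothetical outcome of one expert review of one model prediction."""
    model_was_wrong: bool
    expert_flagged_error: bool
    edit_fixed_error: bool


def actionability(trials: List[Trial]) -> Tuple[float, float]:
    """Return (error identification rate, edit success rate).

    Identification rate: of the cases where the model was wrong, the share
    the expert flagged. Edit success rate: of the flagged cases, the share
    where the automated edit fixed the error.
    """
    wrong = [t for t in trials if t.model_was_wrong]
    flagged = [t for t in wrong if t.expert_flagged_error]
    fixed = [t for t in flagged if t.edit_fixed_error]
    identify_rate = len(flagged) / len(wrong) if wrong else 0.0
    fix_rate = len(fixed) / len(flagged) if flagged else 0.0
    return identify_rate, fix_rate


example = [
    Trial(model_was_wrong=True, expert_flagged_error=True, edit_fixed_error=True),
    Trial(model_was_wrong=True, expert_flagged_error=True, edit_fixed_error=False),
    Trial(model_was_wrong=True, expert_flagged_error=False, edit_fixed_error=False),
    Trial(model_was_wrong=True, expert_flagged_error=False, edit_fixed_error=False),
    Trial(model_was_wrong=False, expert_flagged_error=False, edit_fixed_error=False),
]
rates = actionability(example)
```

Neither rate rewards an explanation for merely sounding plausible, which is the contrast the agenda draws with proxy-task evaluation.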
The agenda explicitly attempts to address the tool-to-agent gap (though it does not name it as such) by designing the interpretability pipeline around the expert's workflow rather than around the tool's technical capabilities.
LessWrong coverage: https://www.lesswrong.com/posts/wHBL4eSjdfv6aDyD6/automated-interpretability-driven-model-auditing-and-control
Agent Notes
Why this matters: This is a direct counter-proposal to the problems documented in AuditBench. Oxford AIGI is proposing to solve the tool-to-agent gap by redesigning the pipeline around the human expert's need for actionability — not asking "can the tool find the behavior?" but "can the expert identify and fix errors using the tool's output?" This is a more tractable decomposition of the problem. However, it's a research agenda (January 2026), not an empirical result. It tells us the field recognizes the tool-to-agent problem; it doesn't show the problem is solved.
What surprised me: The framing around "domain experts" (not alignment researchers) as the primary users of interpretability tools. This shifts the governance model: rather than alignment researchers auditing models, the proposal is for doctors/lawyers/etc. to query models in their domain and receive actionable explanations. This is a practical governance architecture, not just a technical fix.
What I expected but didn't find: Empirical results. This is a research agenda, not a completed study. No AuditBench-style empirical validation of whether agent-mediated correction actually works. The gap between this agenda and AuditBench's empirical findings is significant.
KB connections:
- scalable oversight degrades rapidly as capability gaps grow — this agenda is an attempt to build scalable oversight through interpretability; the research agenda is the constructive proposal, AuditBench is the empirical reality check
- no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it — Oxford AIGI is attempting to build the governance infrastructure; this partially addresses the "institutional gap" claim
- formal verification of AI-generated proofs provides scalable oversight — formal verification works for math; this agenda attempts to extend oversight to behavioral/value domains via interpretability
Extraction hints:
- CLAIM CANDIDATE: "Agent-mediated correction — where domain experts query model behavior, receive grounded explanations, and instruct targeted corrections through an interpretability pipeline — is a proposed approach to closing the tool-to-agent gap in alignment auditing, but lacks empirical validation as of early 2026"
- This is a "proposed solution" claim (confidence: speculative to experimental) — pairs with AuditBench as problem statement
- Note the actionability reframing: most interpretability research optimizes for technical accuracy; this agenda optimizes for expert usability
Context: Oxford Martin AI Governance Initiative — academic/policy research organization, not a lab. Published January 2026. Directly relevant to governance architecture debates. The research agenda format means these are open questions, not completed research.
Curator Notes
PRIMARY CONNECTION: no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it
WHY ARCHIVED: Partially challenges the "institutional gap" claim: Oxford AIGI is actively building the governance research agenda for interpretability-based auditing. But the claim was about implementation, not research agendas; the gap may still hold.
EXTRACTION HINT: Extract as a proposed solution to the tool-to-agent gap, explicitly marking as speculative/pre-empirical. Pair with AuditBench as the empirical problem statement. The actionability reframing (expert usability > technical accuracy) is the novel contribution.
Key Facts
- Oxford AIGI published the research agenda in January 2026
- The research agenda proposes eight interrelated research questions forming a complete pipeline
- Two main functions: explanation for decision support and agent-mediated correction
- LessWrong coverage at https://www.lesswrong.com/posts/wHBL4eSjdfv6aDyD6/automated-interpretability-driven-model-auditing-and-control