extract: 2026-03-30-oxford-aigi-automated-interpretability-model-auditing-research-agenda
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
parent
35d552785d
commit
43982050c3
4 changed files with 53 additions and 1 deletion
@@ -0,0 +1,28 @@
---
type: claim
domain: ai-alignment
description: Oxford AIGI's research agenda reframes interpretability around whether domain experts can identify and fix model errors using explanations, not whether tools can find behaviors
confidence: speculative
source: Oxford Martin AI Governance Initiative, January 2026 research agenda
created: 2026-03-30
attribution:
  extractor:
    - handle: "theseus"
  sourcer:
    - handle: "oxford-martin-ai-governance-initiative"
context: "Oxford Martin AI Governance Initiative, January 2026 research agenda"
---

# Agent-mediated correction proposes closing the tool-to-agent gap through domain-expert actionability rather than technical accuracy optimization

Oxford AIGI proposes a complete pipeline in which domain experts (not alignment researchers) query model behavior, receive explanations grounded in their own expertise, and instruct targeted corrections without understanding AI internals. The core innovation is optimizing for actionability: can experts use explanations to identify errors, and can automated tools successfully edit models to fix them? This directly addresses the tool-to-agent gap documented in AuditBench by redesigning the interpretability pipeline around the expert's workflow rather than the tool's technical capabilities. The agenda comprises eight interrelated research questions covering translation of expert queries into testable hypotheses, capability localization, human-readable explanation generation, and surgical edits with verified outcomes. However, this is a research agenda published in January 2026, not an empirical validation. The gap between this proposal and AuditBench's empirical findings (that interpretability tools fail through workflow-integration problems, not just technical limitations) remains significant. The proposal shifts the governance model from alignment researchers auditing models to domain experts (doctors, lawyers, etc.) querying models in their own domains and receiving actionable explanations.
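
To make the proposed data flow concrete, here is a minimal sketch of the pipeline's stages as the paragraph above describes them. Everything in it is hypothetical: the stage functions, types, and the triage example are illustrative assumptions, not an API or design the agenda defines.

```python
from dataclasses import dataclass

# Hypothetical stage types; none of these names come from the agenda itself.

@dataclass
class Hypothesis:
    query: str  # the expert's natural-language question about model behavior
    test: str   # a testable reformulation of that question

@dataclass
class Explanation:
    components: list[str]  # model components implicated (capability localization)
    summary: str           # phrased in the expert's domain terms, not AI internals

@dataclass
class Correction:
    edit: str       # the targeted ("surgical") model edit that was applied
    verified: bool  # whether the edit's outcome was checked after the fact

def translate(query: str) -> Hypothesis:
    # Research question: turn an expert query into a testable hypothesis (stub).
    return Hypothesis(query=query, test=f"behavioral test for: {query}")

def localize_and_explain(h: Hypothesis) -> Explanation:
    # Research questions: locate the capability, explain it in domain terms (stub).
    return Explanation(components=["placeholder-component"],
                       summary=f"model behavior relevant to: {h.query}")

def apply_and_verify_edit(e: Explanation) -> Correction:
    # Research question: apply a surgical edit and verify the outcome (stub).
    return Correction(edit=f"adjust {e.components[0]}", verified=True)

def audit_step(expert_query, expert_flags_error):
    """One expert-in-the-loop pass. The domain expert only reads the
    explanation and flags errors; they never touch model internals."""
    explanation = localize_and_explain(translate(expert_query))
    if expert_flags_error(explanation):
        return apply_and_verify_edit(explanation)
    return None

# Example: a clinician audits a triage model without knowing anything about AI.
print(audit_step("does the model overweight recent lab values in triage?",
                 expert_flags_error=lambda e: True))
```

The point of the sketch is the division of labor: the expert only judges the explanation, while translation, localization, and editing stay automated.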

---

Relevant Notes:
- [[alignment-auditing-tools-fail-through-tool-to-agent-gap-not-just-technical-limitations]]
- [[no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it]]
- [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]

Topics:
- [[_map]]
@@ -19,6 +19,12 @@ AuditBench evaluated 13 different tool configurations for uncovering hidden beha

---

### Additional Evidence (extend)
*Source: [[2026-03-30-oxford-aigi-automated-interpretability-model-auditing-research-agenda]] | Added: 2026-03-30*

Oxford AIGI's January 2026 research agenda proposes agent-mediated correction as a solution: domain experts query model behavior, receive grounded explanations, and instruct targeted corrections through an interpretability pipeline optimized for actionability (can experts identify and fix errors?) rather than technical accuracy. This is the constructive proposal to the problem AuditBench documented empirically, though it has not yet been empirically validated.
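
As a hedged illustration of what optimizing for actionability (rather than technical accuracy) could mean as an evaluation target, the sketch below scores a pipeline purely by end-to-end outcomes. The metric name and structure are assumptions for illustration, not something the agenda specifies.

```python
from dataclasses import dataclass

# Hypothetical "actionability" metric: score an interpretability pipeline by
# downstream outcomes, not by explanation faithfulness. Structure assumed.

@dataclass
class AuditOutcome:
    error_identified: bool  # did the domain expert spot the seeded error?
    fix_verified: bool      # did the automated edit pass post-hoc verification?

def actionability(outcomes: list[AuditOutcome]) -> float:
    """Fraction of seeded errors that were both identified by an expert and
    fixed with a verified edit -- the end-to-end success rate."""
    if not outcomes:
        return 0.0
    hits = sum(o.error_identified and o.fix_verified for o in outcomes)
    return hits / len(outcomes)

# A pipeline with perfectly faithful explanations can still score 0 here if
# experts cannot act on them -- that is the reframing away from accuracy.
print(actionability([AuditOutcome(True, True),
                     AuditOutcome(True, False),
                     AuditOutcome(False, False)]))  # 0.333...
```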

Relevant Notes:
- [[formal-verification-of-AI-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-AI-capability-while-human-verification-degrades]]
- [[human-verification-bandwidth-is-the-binding-constraint-on-AGI-economic-impact-not-intelligence-itself-because-the-marginal-cost-of-AI-execution-falls-to-zero-while-the-capacity-to-validate-audit-and-underwrite-responsibility-remains-finite]]

@@ -47,6 +47,12 @@ CMU researchers have built and validated a third-party AI assurance framework wi

UK AISI has built systematic evaluation infrastructure for loss-of-control capabilities (monitoring, sandbagging, self-replication, cyber attack scenarios) across 11+ papers in 2025-2026. The infrastructure gap is not in evaluation research but in collective intelligence approaches and in the governance-research translation layer that would integrate these evaluations into binding compliance requirements.

### Additional Evidence (challenge)
*Source: [[2026-03-30-oxford-aigi-automated-interpretability-model-auditing-research-agenda]] | Added: 2026-03-30*

The Oxford Martin AI Governance Initiative is actively building the governance research agenda for interpretability-based auditing through domain experts. Its January 2026 research agenda proposes infrastructure where domain experts (not just alignment researchers) can query models and receive actionable explanations. However, this is a research agenda, not implemented infrastructure, so the institutional-gap claim may still hold at the implementation level.

Relevant Notes:
- [[AI alignment is a coordination problem not a technical problem]] -- the gap in collective alignment validates the coordination framing

@@ -7,9 +7,14 @@ date: 2026-01-15
domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
status: processed
priority: high
tags: [interpretability, alignment-auditing, automated-auditing, model-control, Oxford, AIGI, research-agenda, tool-to-agent-gap, agent-mediated-correction]
processed_by: theseus
processed_date: 2026-03-30
claims_extracted: ["agent-mediated-correction-proposes-closing-tool-to-agent-gap-through-domain-expert-actionability.md"]
enrichments_applied: ["alignment-auditing-tools-fail-through-tool-to-agent-gap-not-just-technical-limitations.md", "no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

@@ -55,3 +60,10 @@ LessWrong coverage: https://www.lesswrong.com/posts/wHBL4eSjdfv6aDyD6/automated-

PRIMARY CONNECTION: [[no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it]]
WHY ARCHIVED: Partially challenges the "institutional gap" claim -- Oxford AIGI is actively building the governance research agenda for interpretability-based auditing. But the claim was about implementation, not research agendas; the gap may still hold.
EXTRACTION HINT: Extract as a proposed solution to the tool-to-agent gap, explicitly marking it as speculative/pre-empirical. Pair with AuditBench as the empirical problem statement. The actionability reframing (expert usability > technical accuracy) is the novel contribution.

## Key Facts
- Oxford AIGI published its research agenda in January 2026
- The agenda proposes eight interrelated research questions forming a complete pipeline
- Two main functions: explanation for decision support and agent-mediated correction
- LessWrong coverage at https://www.lesswrong.com/posts/wHBL4eSjdfv6aDyD6/automated-interpretability-driven-model-auditing-and-control