From f9d341e86f335acf9baddd08caa30838f9b6e4b7 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Mon, 30 Mar 2026 01:07:03 +0000
Subject: [PATCH] pipeline: archive 1 source(s) post-merge

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
---
 ...tability-model-auditing-research-agenda.md | 57 +++++++++++++++++++
 1 file changed, 57 insertions(+)
 create mode 100644 inbox/archive/ai-alignment/2026-03-30-oxford-aigi-automated-interpretability-model-auditing-research-agenda.md

diff --git a/inbox/archive/ai-alignment/2026-03-30-oxford-aigi-automated-interpretability-model-auditing-research-agenda.md b/inbox/archive/ai-alignment/2026-03-30-oxford-aigi-automated-interpretability-model-auditing-research-agenda.md
new file mode 100644
index 00000000..917dcfa9
--- /dev/null
+++ b/inbox/archive/ai-alignment/2026-03-30-oxford-aigi-automated-interpretability-model-auditing-research-agenda.md
@@ -0,0 +1,57 @@
+---
+type: source
+title: "Automated Interpretability-Driven Model Auditing and Control: A Research Agenda"
+author: "Oxford Martin AI Governance Initiative (AIGI)"
+url: https://aigi.ox.ac.uk/wp-content/uploads/2026/01/Automated_interp_Research_Agenda.pdf
+date: 2026-01-15
+domain: ai-alignment
+secondary_domains: []
+format: paper
+status: processed
+priority: high
+tags: [interpretability, alignment-auditing, automated-auditing, model-control, Oxford, AIGI, research-agenda, tool-to-agent-gap, agent-mediated-correction]
+---
+
+## Content
+
+Oxford Martin AI Governance Initiative (AIGI) research agenda proposing a system where domain experts can query a model's behavior, receive explanations grounded in their expertise, and instruct targeted corrections — all without needing to understand how AI systems work internally.
+
+**Core pipeline:** Eight interrelated research questions forming a complete pipeline:
+1. Translating expert queries into testable hypotheses about model internals
+2. Localizing capabilities in specific model components
+3. Generating human-readable explanations
+4. Performing surgical edits with verified outcomes
+
+**Two main functions:**
+1. **Explanation for decision support**: Generate faithful, domain-grounded explanations that enable experts to evaluate model predictions and identify errors
+2. **Agent-mediated correction**: When experts identify errors, an agent determines the optimal interpretability tool and abstraction level for intervention, applies permanent corrections with minimal side effects, and improves the model for future use
+
+**Key distinction**: Rather than optimizing for plausible explanations or proxy task performance, the system is optimized for **actionability**: can domain experts use explanations to identify errors, and can automated tools successfully edit models to fix them?
+
+The agenda explicitly attempts to address the tool-to-agent gap (though doesn't name it as such) by designing the interpretability pipeline around the expert's workflow rather than around the tool's technical capabilities.
+
+LessWrong coverage: https://www.lesswrong.com/posts/wHBL4eSjdfv6aDyD6/automated-interpretability-driven-model-auditing-and-control
+
+## Agent Notes
+**Why this matters:** This is a direct counter-proposal to the problems documented in AuditBench. Oxford AIGI is proposing to solve the tool-to-agent gap by redesigning the pipeline around the human expert's need for actionability — not asking "can the tool find the behavior?" but "can the expert identify and fix errors using the tool's output?" This is a more tractable decomposition of the problem. However, it's a research agenda (January 2026), not an empirical result. It tells us the field recognizes the tool-to-agent problem; it doesn't show the problem is solved.
+
+**What surprised me:** The framing around "domain experts" (not alignment researchers) as the primary users of interpretability tools. This shifts the governance model: rather than alignment researchers auditing models, the proposal is for doctors/lawyers/etc. to query models in their domain and receive actionable explanations. This is a practical governance architecture, not just a technical fix.
+
+**What I expected but didn't find:** Empirical results. This is a research agenda, not a completed study. No AuditBench-style empirical validation of whether agent-mediated correction actually works. The gap between this agenda and AuditBench's empirical findings is significant.
+
+**KB connections:**
+- scalable oversight degrades rapidly as capability gaps grow — this agenda is an attempt to build scalable oversight through interpretability; the research agenda is the constructive proposal, AuditBench is the empirical reality check
+- [[no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it]] — Oxford AIGI is attempting to build the governance infrastructure; this partially addresses the "institutional gap" claim
+- formal verification of AI-generated proofs provides scalable oversight — formal verification works for math; this agenda attempts to extend oversight to behavioral/value domains via interpretability
+
+**Extraction hints:**
+- CLAIM CANDIDATE: "Agent-mediated correction — where domain experts query model behavior, receive grounded explanations, and instruct targeted corrections through an interpretability pipeline — is a proposed approach to closing the tool-to-agent gap in alignment auditing, but lacks empirical validation as of early 2026"
+- This is a "proposed solution" claim (confidence: speculative to experimental) — pairs with AuditBench as problem statement
+- Note the actionability reframing: most interpretability research optimizes for technical accuracy; this agenda optimizes for expert usability
+
+**Context:** Oxford Martin AI Governance Initiative — academic/policy research organization, not a lab. Published January 2026. Directly relevant to governance architecture debates. The research agenda format means these are open questions, not completed research.
+
+## Curator Notes
+PRIMARY CONNECTION: [[no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it]]
+WHY ARCHIVED: Partially challenges the "institutional gap" claim — Oxford AIGI is actively building the governance research agenda for interpretability-based auditing. But the claim was about implementation, not research agendas; the gap may still hold.
+EXTRACTION HINT: Extract as a proposed solution to the tool-to-agent gap, explicitly marking as speculative/pre-empirical. Pair with AuditBench as the empirical problem statement. The actionability reframing (expert usability > technical accuracy) is the novel contribution.