Teleo Agents 43982050c3 extract: 2026-03-30-oxford-aigi-automated-interpretability-model-auditing-research-agenda

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>

2026-03-30 01:07:00 +00:00

2.4 KiB


type: claim
domain: ai-alignment
description: Oxford AIGI's research agenda reframes interpretability around whether domain experts can identify and fix model errors using explanations, not whether tools can find behaviors
confidence: speculative
source: Oxford Martin AI Governance Initiative, January 2026 research agenda
created: 2026-03-30
attribution:
  extractor:
    handle: theseus
  sourcer:
    handle: oxford-martin-ai-governance-initiative
    context: Oxford Martin AI Governance Initiative, January 2026 research agenda

Agent-mediated correction proposes closing the tool-to-agent gap through domain-expert actionability rather than technical accuracy optimization

Oxford AIGI proposes an end-to-end pipeline in which domain experts such as doctors and lawyers, rather than alignment researchers, query model behavior, receive explanations grounded in their own domain, and instruct targeted corrections without needing to understand AI internals. This shifts the governance model from alignment researchers auditing models to domain experts auditing models within their own fields. The core innovation is optimizing for actionability: can experts use explanations to identify errors, and can automated tools then edit the model to fix them? By designing the interpretability pipeline around the expert's workflow rather than the tool's technical capabilities, the agenda directly addresses the tool-to-agent gap documented in AuditBench.

The agenda poses eight interrelated research questions, covering translation of expert queries into testable hypotheses, capability localization, human-readable explanation generation, and surgical edits with verified outcomes.

However, this is a research agenda published in January 2026, not an empirical validation. AuditBench found that interpretability tools fail through workflow integration problems, not just technical limitations, and the gap between that finding and this proposal remains significant.
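To make the difference between tool accuracy and expert actionability concrete, here is a minimal Python sketch. It is not taken from the Oxford AIGI agenda or from AuditBench: the `AuditCase` record, both rate functions, and the four sample cases are hypothetical, invented purely to illustrate how a tool can surface a behavior often while the full expert loop (identify the error, then verify the fix) closes far less often.

```python
from dataclasses import dataclass


@dataclass
class AuditCase:
    """One expert query traced through a hypothetical correction pipeline."""
    query: str
    tool_surfaced_behavior: bool   # the tool found the relevant behavior
    expert_identified_error: bool  # the expert could spot the error from the explanation
    edit_verified: bool            # the automated edit fixed it, confirmed afterwards


def tool_detection_rate(cases: list[AuditCase]) -> float:
    """Share of cases where the tool alone surfaced the behavior."""
    if not cases:
        return 0.0
    return sum(c.tool_surfaced_behavior for c in cases) / len(cases)


def actionability_rate(cases: list[AuditCase]) -> float:
    """Share of cases where the expert identified the error AND the fix was verified."""
    if not cases:
        return 0.0
    closed = sum(c.expert_identified_error and c.edit_verified for c in cases)
    return closed / len(cases)


# Illustrative, made-up cases (not data from any study).
cases = [
    AuditCase("dosage advice for renal patients", True, True, True),
    AuditCase("contract clause interpretation", True, True, False),
    AuditCase("drug interaction warning", True, False, False),
    AuditCase("jurisdiction lookup", False, False, False),
]

detection = tool_detection_rate(cases)
actionability = actionability_rate(cases)
print(f"tool detection rate: {detection:.2f}")
print(f"actionability rate:  {actionability:.2f}")
print(f"gap:                 {detection - actionability:.2f}")
```

In this toy setup the difference between the two rates stands in for the tool-to-agent gap: an agenda that optimizes for actionability would be judged on the second number, not the first.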

Relevant Notes:

Topics:

_map
