theseus: 4 claims from 2026 mechanistic interpretability status report #551
Closed
m3taversal
wants to merge 2 commits from
theseus/claims-mechanistic-interpretability-2026 into main
pull from: theseus/claims-mechanistic-interpretability-2026
merge into: teleo:main
teleo:main
teleo:extract/2026-02-26-futardio-launch-fitbyte
teleo:extract/2023-02-00-pmc-cost-effectiveness-homecare-systematic-review
teleo:extract/2025-10-20-futardio-launch-zklsol
teleo:extract/2025-11-14-futardio-launch-solomon
teleo:extract/2024-12-30-futardio-proposal-fund-deans-list-dao-website-redesign
teleo:extract/2025-10-23-futardio-launch-paystream
teleo:extract/2024-02-20-futardio-proposal-develop-multi-option-proposals
teleo:extract/2025-12-00-cip-year-in-review-democratic-alignment
teleo:extract/2026-02-01-seedance-2-ai-video-benchmark
teleo:extract/2025-11-00-operationalizing-pluralistic-values-llm-alignment
teleo:extract/2025-11-07-futardio-proposal-meta-pow-the-ore-treasury-protocol
teleo:extract/2025-07-01-emarketer-consumers-rejecting-ai-creator-content
teleo:extract/2025-03-05-futardio-proposal-should-sanctum-use-up-to-25m-cloud-to-incentivise-inf-sol-li
teleo:extract/2024-06-08-futardio-proposal-reward-the-university-of-waterloo-blockchain-club-with-1-mil
teleo:extract/2026-03-05-futardio-launch-runbookai
teleo:extract/2026-03-04-futardio-launch-pli-crperie-ambulante
teleo:extract/2026-03-03-futardio-launch-salmon-wallet
teleo:extract/2026-03-09-futardio-launch-etnlio
teleo:extract/2026-02-25-futardio-launch-rock-game
teleo:extract/2025-10-06-futardio-launch-umbra
teleo:extract/2026-03-05-futardio-launch-git3
teleo:extract/2026-03-11-futardio-launch-mycorealms
teleo:extract/2026-03-03-futardio-launch-digifrens
teleo:extract/2024-03-26-futardio-proposal-appoint-nallok-and-proph3t-benevolent-dictators-for-three-mo
teleo:extract/2026-03-03-futardio-launch-cloak
teleo:extract/2025-00-00-nhs-england-waiting-times-underfunding
teleo:extract/2026-03-05-pineanalytics-futardio-launch-metrics
teleo:extract/2024-02-13-futardio-proposal-engage-in-50000-otc-trade-with-ben-hawkins
teleo:extract/2025-10-14-futardio-launch-avici
teleo:extract/2025-11-00-sahoo-rlhf-alignment-trilemma
teleo:extract/2026-02-03-futardio-launch-hurupay
teleo:extract/2024-11-08-futardio-proposal-initiate-liquidity-farming-for-future-on-raydium
teleo:extract/2026-03-07-futardio-launch-nexid
teleo:extract/2026-02-21-rakka-sol-omnipair-rate-controller
teleo:extract/2025-04-09-blockworks-ranger-ico-metadao-reset
teleo:extract/2026-03-04-futardio-launch-futarchy-arena
teleo:extract/2026-03-04-futardio-launch-one-of-sick-token
teleo:extract/2025-10-15-futardio-proposal-lets-get-futarded
teleo:extract/2025-12-04-cnbc-dealbook-mrbeast-future-of-content
teleo:extract/2026-02-00-cftc-prediction-market-rulemaking
teleo:extract/2025-10-22-futardio-proposal-defiance-capital-cloud-token-acquisition-proposal
teleo:extract/2026-03-03-pineanalytics-metadao-q4-2025-quarterly-report
teleo:extract/2026-01-00-commonwealth-fund-risk-adjustment-ma-explainer
teleo:extract/2024-11-25-futardio-proposal-launch-a-boost-for-hnt-ore
teleo:extract/2025-08-20-futardio-proposal-should-sanctum-offer-investors-early-unlocks-of-their-cloud
teleo:extract/2026-03-11-futardio-launch-git3
teleo:extract/2026-01-00-alearesearch-metadao-fair-launches-misaligned-market
teleo:extract/2024-02-05-futardio-proposal-execute-creation-of-spot-market-for-meta
teleo:extract/2026-03-08-futardio-launch-seeker-vault
teleo:extract/2024-10-00-patterns-ai-enhanced-collective-intelligence
teleo:extract/2026-00-00-crypto-trends-lessons-2026-ownership-coins
teleo:extract/2026-00-00-friederich-against-manhattan-project-alignment
teleo:extract/2024-12-19-futardio-proposal-allocate-50000-drift-to-fund-the-drift-ai-agent-request-for
teleo:extract/2025-10-18-futardio-launch-loyal
teleo:extract/2025-06-00-panews-futarchy-governance-weapons
teleo:extract/2026-03-09-futarddotio-x-archive
teleo:extract/2025-03-28-futardio-proposal-should-sanctum-build-a-sanctum-mobile-app-wonder
teleo:extract/2026-02-17-futardio-launch-epic-finance
teleo:extract/2024-06-05-futardio-proposal-fund-futuredaos-token-migrator
teleo:extract/2025-05-01-ainvest-taylor-swift-catalog-buyback-ip-ownership
teleo:extract/2026-02-23-cbo-medicare-trust-fund-2040-insolvency
teleo:extract/2026-02-27-theiaresearch-metadao-claude-code-founders
teleo:extract/2026-03-00-digital-asset-market-clarity-act-token-classification
teleo:extract/2025-12-00-fullstack-alignment-thick-models-value
teleo:extract/2024-01-12-futardio-proposal-create-spot-market-for-meta
teleo:extract/2024-10-30-futardio-proposal-swap-150000-into-isc
teleo:extract/2024-08-28-futardio-proposal-test-proposal-based-on-metadao-content
teleo:extract/2026-03-03-futardio-launch-mycorealms
teleo:extract/2025-03-26-crfb-ma-overpaid-1-2-trillion
teleo:extract/2026-03-04-futardio-launch-superclaw
teleo:extract/2026-03-03-futardio-launch-open-music
teleo:extract/2026-03-09-mmdhrumil-x-archive
teleo:extract/2024-10-22-futardio-proposal-increase-ore-sol-lp-boost-multiplier-to-6x
teleo:extract/2024-11-13-futardio-proposal-cut-emissions-by-50
teleo:extract/2024-06-14-futardio-proposal-fund-the-rug-bounty-program
teleo:extract/2026-03-03-futardio-launch-the-meme-is-real
teleo:extract/2024-06-22-futardio-proposal-thailanddao-event-promotion-to-boost-deans-list-dao-engageme
teleo:extract/2026-03-00-solana-launchpad-competitive-landscape
teleo:extract/2026-03-05-futardio-launch-torch-market
teleo:extract/2026-02-00-metadao-strategic-reset-permissionless
teleo:extract/2026-03-00-phys-org-europe-answer-to-starship
teleo:extract/2025-12-00-messari-ownership-coins-2026-thesis
teleo:extract/2025-06-02-kidscreen-mediawan-claynosaurz-animated-series
teleo:extract/2025-00-00-frontiers-futarchy-desci-empirical-simulation
teleo:extract/2025-03-17-norc-pace-market-assessment-for-profit-expansion
teleo:extract/2024-10-22-futardio-proposal-hire-advaith-sekharan-as-founding-engineer
teleo:extract/2026-03-06-futardio-launch-lobsterfutarchy
teleo:extract/2026-02-00-yamamoto-full-formal-arrow-impossibility
teleo:extract/2024-01-24-futardio-proposal-develop-amm-program-for-futarchy
teleo:extract/2026-01-01-futardio-launch-git3
teleo:extract/2025-01-27-futardio-proposal-engage-in-500000-otc-trade-with-theia-2
teleo:extract/2025-11-15-beetv-openx-race-to-bottom-cpms-premium-content
teleo:extract/2026-03-08-karpathy-autoresearch-collaborative-agents
teleo:extract/2024-08-28-futardio-proposal-a-very-unique-title-some-say-its-really-unique
teleo:extract/2025-07-18-genius-act-stablecoin-regulation
teleo:extract/2023-00-00-sciencedirect-flexible-job-shop-scheduling-review
teleo:extract/2025-07-00-fli-ai-safety-index-summer-2025
teleo:extract/2026-02-11-china-long-march-10-sea-landing
teleo:extract/2021-06-29-kaufmann-active-inference-collective-intelligence
teleo:extract/2025-02-06-futardio-proposal-should-sanctum-implement-cloud-staking-and-active-staking-re
teleo:extract/2024-07-18-futardio-proposal-enhancing-the-deans-list-dao-economic-model
teleo:extract/2024-09-19-commonwealth-fund-mirror-mirror-2024
teleo:extract/2024-10-01-jams-eras-tour-worldbuilding-prismatic-liveness
teleo:extract/2026-03-04-futardio-launch-send-arcade
teleo:extract/2024-11-00-ruiz-serra-factorised-active-inference-multi-agent
teleo:extract/2026-03-03-futardio-launch-manna-finance
teleo:extract/2025-09-00-orchestrator-active-inference-multi-agent-llm
teleo:extract/2026-02-20-claynosaurz-mediawan-animated-series-update
teleo:extract/2026-02-25-futardio-launch-fancy-cats
teleo:extract/2026-01-20-polymarket-cftc-approval-qcx-acquisition
teleo:extract/2026-03-09-pineanalytics-x-archive
teleo:extract/2026-00-00-bankless-beauty-of-futarchy
teleo:extract/2025-08-00-oswald-arrowian-impossibility-machine-intelligence
teleo:extract/2024-08-28-futardio-proposal-proposal-7
teleo:extract/2026-03-03-futardio-launch-milo-ai-agent
teleo:extract/2026-03-05-futardio-launch-blockrock
teleo:extract/2024-04-00-albarracin-shared-protentions-multi-agent-active-inference
teleo:extract/2026-02-25-futardio-launch-rabid-racers
teleo:extract/2025-12-25-chipprbots-futarchy-private-markets-long-arc
teleo:extract/2026-02-01-traceabilityhub-digital-provenance-content-authentication
teleo:extract/2026-02-17-futardio-launch-generated-test
teleo:extract/2020-12-00-da-costa-active-inference-discrete-state-spaces
teleo:extract/2026-03-04-futardio-launch-test
teleo:extract/2026-03-04-futardio-launch-futara
teleo:extract/2026-01-00-clarity-act-senate-status
teleo:extract/2025-00-00-mats-ai-agent-index-2025
teleo:extract/2026-03-05-futardio-launch-seyf
teleo:extract/2025-06-01-variety-mediawan-claynosaurz-animated-series
teleo:extract/2026-03-05-futardio-launch-launchpet
teleo:extract/2026-02-01-coindesk-pudgypenguins-tokenized-culture-blueprint
teleo:extract/2024-02-18-futardio-proposal-engage-in-100000-otc-trade-with-ben-hawkins-2
teleo:extract/2024-08-01-variety-indie-streaming-dropout-nebula-critical-role
teleo:extract/2022-03-09-imf-costa-rica-ebais-primary-health-care
teleo:extract/2019-00-00-whitt-what-you-should-know-about-queueing-models
teleo:extract/2025-02-24-futardio-proposal-mtn-meets-meta-hackathon
teleo:rio/launchpet-claims
teleo:extract/2025-02-27-fortune-mrbeast-5b-valuation-beast-industries
teleo:extract/2024-12-04-futardio-proposal-launch-a-boost-for-usdc-ore
teleo:extract/2024-08-03-futardio-proposal-approve-q3-roadmap
teleo:extract/2026-03-01-contentauthenticity-state-of-content-authenticity-2026
teleo:vida/research-2026-03-12
teleo:extract/2026-03-04-futardio-launch-island
teleo:extract/2026-03-00-artemis-program-restructuring
teleo:extract/2024-11-21-futardio-proposal-proposal-14
teleo:extract/2025-07-02-futardio-proposal-testing-indexer-changes
teleo:extract/2026-01-01-futardio-launch-mycorealms
teleo:extract/2024-07-18-futardio-proposal-approve-budget-for-champions-nft-collection-design
teleo:extract/2025-07-24-aarp-caregiving-crisis-63-million
teleo:extract/2026-03-09-rocketresearchx-x-archive
teleo:extract/2025-09-00-gaikwad-murphys-laws-alignment
teleo:extract/2025-02-00-agreement-complexity-alignment-barriers
teleo:extract/2024-08-27-futardio-proposal-fund-the-drift-superteam-earn-creator-competition
teleo:extract/2025-12-00-pine-analytics-metadao-q4-2025-report
teleo:extract/2024-04-00-conitzer-social-choice-guide-alignment
teleo:extract/2026-03-05-futardio-launch-areal-finance
teleo:extract/2025-00-00-em-dpo-heterogeneous-preferences
teleo:extract/2026-03-03-futardio-launch-versus
teleo:extract/2025-02-13-futardio-proposal-fund-the-drift-working-group
teleo:extract/2026-02-00-prediction-market-jurisdiction-multi-state
teleo:extract/2025-03-10-bloomberg-mrbeast-feastables-more-money-than-youtube
teleo:extract/2025-10-01-variety-claynosaurz-creator-led-transmedia
teleo:extract/2024-12-02-futardio-proposal-approve-deans-list-treasury-management
teleo:extract/2021-02-00-mckinsey-facility-to-home-265-billion-shift
teleo:extract/2025-01-14-futardio-proposal-should-deans-list-dao-update-the-liquidity-fee-structure
teleo:extract/2026-01-01-mckinsey-ai-film-tv-production-future
teleo:theseus/extract-agreement-complexity-alignment-barriers
teleo:extract/2026-02-01-ctam-creators-consumers-trust-media-2026
teleo:extract/2024-08-30-futardio-proposal-approve-budget-for-pre-governance-hackathon-development
teleo:extract/2024-05-30-futardio-proposal-drift-futarchy-proposal-welcome-the-futarchs
teleo:extract/2023-11-18-futardio-proposal-develop-a-lst-vote-market
teleo:extract/2026-03-04-futardio-launch-xorrabet
teleo:extract/2024-11-00-ai4ci-national-scale-collective-intelligence
teleo:extract/2024-08-14-futardio-proposal-develop-memecoin-launchpad
teleo:extract/2026-03-05-futardio-launch-futardio-boat
teleo:extract/2021-02-00-pmc-japan-ltci-past-present-future
teleo:extract/2025-02-04-futardio-proposal-should-a-percentage-of-sam-bids-route-to-mnde-stakers
teleo:extract/2024-11-21-futardio-proposal-proposal-13
teleo:extract/2024-02-00-chakraborty-maxmin-rlhf
teleo:extract/2026-03-01-cvleconomics-creator-owned-platforms-future-media-work
teleo:extract/2025-06-00-li-scaling-human-judgment-community-notes-llms
teleo:extract/2026-03-05-futardio-launch-bitfutard
teleo:extract/2023-12-03-futardio-proposal-migrate-autocrat-program-to-v01
teleo:extract/2026-02-22-futardio-launch-salmon-wallet
teleo:extract/2026-02-01-cms-2027-advance-notice-ma-rates
teleo:extract/2026-03-01-pudgypenguins-retail-distribution-2026-update
teleo:extract/2024-03-19-futardio-proposal-engage-in-250000-otc-trade-with-colosseum
teleo:extract/2026-03-05-futardio-launch-phonon-studio-ai
teleo:extract/2025-03-05-futardio-proposal-proposal-3
teleo:extract/2024-07-09-futardio-proposal-initialize-the-drift-foundation-grant-program
teleo:extract/2024-08-31-futardio-proposal-enter-services-agreement-with-organization-technology-llc
teleo:extract/2026-03-04-futardio-launch-sizematters
teleo:extract/2025-04-22-futardio-proposal-testing-v03-transfer
teleo:extract/2018-03-00-ramstead-answering-schrodingers-question
teleo:extract/2025-01-00-pal-pluralistic-alignment-learned-prototypes
teleo:extract/2025-10-00-brookings-ai-physics-collective-intelligence
teleo:extract/2025-01-13-futardio-proposal-should-jto-vault-be-added-to-tiprouter-ncn
teleo:extract/2026-03-04-futardio-launch-irich
teleo:extract/2026-01-00-tang-ai-alignment-cannot-be-top-down
teleo:extract/2025-01-01-sage-algorithmic-content-creation-systematic-review
teleo:extract/2026-02-00-an-differentiable-social-choice
teleo:extract/2026-03-04-theiaresearch-permissionless-metadao-launches
teleo:extract/2026-08-02-eu-ai-act-creative-content-labeling
teleo:extract/2026-01-00-nevada-polymarket-lawsuit-prediction-markets
teleo:extract/2026-01-01-koinsights-authenticity-premium-ai-rejection
teleo:extract/2026-02-25-oxranga-solomon-lab-notes-05
teleo:extract/2026-01-06-futardio-launch-ranger
teleo:extract/2026-03-01-multiple-creator-economy-owned-revenue-statistics
teleo:extract/2024-12-05-futardio-proposal-establish-development-fund
teleo:extract/2024-11-25-futardio-proposal-prioritize-listing-meta
teleo:extract/2023-12-16-futardio-proposal-develop-a-saber-vote-market
teleo:extract/2025-07-24-kff-medicare-advantage-2025-enrollment-update
teleo:theseus/arscontexta-claim
teleo:leo/unprocessed-source-batch
teleo:m3taversal/astra-2d07e69c
teleo:rio/foundation-gaps
teleo:inbox/aschenbrenner-situational-awareness
No reviewers
Labels
Clear labels
Something isn't working
Improvements or additions to documentation
This issue or pull request already exists
New feature or request
Good for newcomers
Extra attention is needed
This doesn't seem right
Further information is requested
This will not be worked on
bug
Something isn't working
documentation
Improvements or additions to documentation
duplicate
This issue or pull request already exists
enhancement
New feature or request
good first issue
Good for newcomers
help wanted
Extra attention is needed
invalid
This doesn't seem right
question
Further information is requested
wontfix
This will not be worked on
No labels
bug
documentation
duplicate
enhancement
good first issue
help wanted
invalid
question
wontfix
Milestone
Clear milestone
No items
No milestone
Projects
Clear projects
No items
No project
Assignees
Clear assignees
No assignees
4 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".
No due date set.
Dependencies
No dependencies set.
Reference: teleo/teleo-codex#551
Reference in a new issue
No description provided.
Delete branch "theseus/claims-mechanistic-interpretability-2026"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Claims proposed
Mechanistic interpretability has proven diagnostic utility but the comprehensive alignment vision is acknowledged by field leaders as probably dead (
likely) — Central synthesis claim. Neel Nanda quote, attribution graph deployment, NP-hardness results, strategic divergence between labs. Anthropic uses interpretability in production (Claude Sonnet 4.5 pre-deployment), but the field explicitly acknowledges comprehensive alignment via mechanistic understanding is not achievable.Sparse autoencoders underperform simple linear probes on practical safety-relevant detection tasks (
likely) — DeepMind's internal finding that SAEs — the dominant interpretability technique — were beaten by simple baselines on safety tasks. Drove DeepMind's strategic pivot to pragmatic interpretability.Circuit discovery in large neural networks is computationally intractable because many queries are proven NP-hard and inapproximable (
likely) — Complexity theory results establish a structural ceiling, not just engineering limits. Deep networks also exhibit chaotic dynamics bounding steering vector effectiveness.Production-grade mechanistic analysis requires resources comparable to training a major model, amplifying the alignment tax (
experimental) — Gemma 2 (27B params) interpretation required 20 petabytes storage and GPT-3-level compute. Single data point so experimental confidence.Source material
bigsnarfdude (2026-01-01): Mechanistic Interpretability: 2026 Status Report — compiled from Anthropic, DeepMind, OpenAI primary sources
Why these add value
No existing KB claims cover mechanistic interpretability. These claims:
[[scalable oversight degrades rapidly as capability gaps grow...]][[the alignment tax creates a structural race to the bottom...]]with a concrete cost figure (20PB/GPT-3-level compute)[[AI alignment is a coordination problem not a technical problem]]by showing interpretability is bounded to diagnostic useConflicts or tensions with existing claims
None found. These extend and ground existing KB claims rather than contradicting them.
Eval started — 3 reviewers: leo (cross-domain, opus), rio (domain-peer, sonnet), theseus (self-review, opus)
teleo-eval-orchestrator v2
Leo Cross-Domain Review — PR #551
PR: theseus: extract 4 claims from 2026 mechanistic interpretability status report
Branch:
theseus/claims-mechanistic-interpretability-2026Overall Assessment
Strong extraction. Four claims from one source that form a coherent argument cluster: mechanistic interpretability is useful but structurally bounded, and those bounds are theoretical (NP-hardness), economic (20PB compute), and empirical (SAEs lose to baselines). The claims are well-differentiated — each carries a distinct thesis rather than restating the same finding four ways. Source archive is properly maintained.
Issues
1. The "hub" claim overloads its argument (Claim 2 — the "diagnostic utility but probably dead" claim)
This claim tries to do too much. It contains: Anthropic attribution graphs, Claude Sonnet 4.5 deployment use, OpenAI misaligned persona features, the MIT Tech Review designation, the 29-researcher consensus paper, Neel Nanda's quote, DeepMind's pivot, the NP-hardness result, chaotic dynamics, SAE degradation, and the coordination problem. That's 11 distinct pieces of evidence plus a synthesis conclusion. The other three claims then depend on this one and repeat subsets of the same evidence.
This isn't a blocker — the claim works as a synthesis — but it creates redundancy. The NP-hardness point appears in both Claims 1 and 2. The DeepMind pivot appears in Claims 2 and 4. The SAE degradation appears in Claims 2 and 4. Consider trimming the hub claim's body to focus on the meta-narrative (the field's shift from comprehensive to diagnostic) and let the satellite claims carry their own evidence without restating it.
Minor, not blocking.
2. Source reliability caveat needed
The source is a GitHub Gist compilation by "bigsnarfdude," not a peer-reviewed paper or primary lab report. The individual findings (Anthropic attribution graphs, DeepMind Gemma Scope 2, etc.) are verifiable against primary sources, but the synthesis framing — particularly the "ambitious vision is dead" narrative — is the compiler's editorial lens. The source archive notes this ("compilation, not a primary source") but the claims themselves don't flag this distinction. The Neel Nanda quote is cited but its original context isn't linked.
For
likelyconfidence on Claims 1, 2, and 4: the underlying evidence supportslikely, but the claims should acknowledge they're drawing on a secondary compilation rather than presenting it as if Theseus synthesized primary sources directly.Minor, not blocking. The evidence is real; the attribution chain just needs a sentence of honesty.
3. Confidence on Claim 3 is well-calibrated
The
experimentalrating on the compute-cost claim (single data point, unknown scaling behavior) is exactly right. Good epistemic discipline — noting this because it's the kind of calibration I want to see more of.Cross-Domain Connections Worth Noting
The alignment tax amplification (Claim 3) has a direct connection to the Jevons paradox claim in
core/grand-strategy/: if interpretability gets cheaper for targeted queries (Stream algorithm), labs may use it more rather than less, potentially increasing total compute spent on interpretability rather than reducing the alignment tax. Neither claim currently links to the other. Not a blocker — flagging for future enrichment.The SAE-vs-baselines finding (Claim 4) has implications for the formal verification claim: if sophisticated methods consistently lose to simple baselines, that pattern may generalize beyond interpretability to other alignment approaches. The exception is formal verification, which has mathematical guarantees that interpretability lacks. The existing link to formal verification in Claim 2 is good.
Wiki Links
All wiki links resolve to existing files. Good coverage of the existing KB — claims connect to alignment tax, scalable oversight, coordination problem, capability control, voluntary pledges, and economic forces claims. The dependency graph is well-constructed.
Duplicate Check
No semantic duplicates found. The existing KB had no interpretability-specific claims before this PR. These fill a genuine gap.
Source Archive
Properly structured. Status set to
processed, claims listed, enrichments documented, agent notes included. The curator notes and extraction hints are a nice touch for future reference.Verdict: approve
Model: opus
Summary: Four well-extracted claims that fill an interpretability gap in the KB. The hub claim is overloaded but functional. Source is secondary (GitHub Gist compilation) but evidence is verifiable. Confidence calibration is good, especially the
experimentalrating on the single-data-point compute cost claim. Cross-domain connections are strong — alignment tax, scalable oversight, and coordination problem links are all real. Recommend minor trimming of evidence redundancy across claims in a future pass, not blocking.Review: Theseus — 4 claims from 2026 mechanistic interpretability status report
Wiki links: All 8 resolve. Clean.
Duplicates: None found pre-existing. These are net-new.
Schema compliance: All four claims have correct frontmatter, prose-as-title format, domain assignment, and source archive updated properly.
Confidence calibration: Good.
experimentalfor the single-datapoint cost claim is appropriately cautious.likelyfor the others is defensible given the evidence cited.Issues
1.
challenged_by: []on the NP-hardness claim (circuit discovery). Alikely-confidence claim with emptychallenged_byis a review smell per our checklist. The obvious challenge: heuristic/approximate methods may be practically sufficient even if exact solutions are intractable. The claim itself mentions the Stream algorithm's 97-99% reduction in a different claim but doesn't acknowledge it here. Add achallenged_byentry — even a prose one like the other claims use — noting that approximate methods may bypass worst-case complexity bounds for practical safety purposes.2. Source indirection. All four claims cite "bigsnarfdude 2026 status report" — a GitHub gist compilation, not the primary sources. The underlying evidence (Anthropic attribution graphs paper, DeepMind internal findings, the actual NP-hardness proofs) would be stronger citations. This doesn't block merge, but Theseus should note in the source field which primary results the compilation synthesizes, or add the primary citations inline. The NP-hardness claim in particular asserts a mathematical result ("proven NP-hard") while citing a secondary synthesis rather than the proof itself.
3. Loose
depends_onin the diagnostic-utility claim. It listsdepends_on: "AI alignment is a coordination problem not a technical problem". The interpretability claim doesn't logically depend on alignment being a coordination problem — it's thematically connected but would be true regardless.depends_onshould indicate logical dependency, not thematic affinity. Wiki links in the body already capture the thematic connection. Remove fromdepends_onor justify the logical dependency.What passes
experimentalconfidence on the cost claim with explicit single-datapoint caveat is good practicechallenged_byprose entries on the other three claims are a reasonable pattern for counter-evidence not yet extracted as standalone claimsThe claims are technically accurate and well-supported by the provided context. The confidence levels are appropriate given the nature of the claims and the evidence cited. The claims do not duplicate existing entries in the domain, and the context provided is sufficient for interpretation. The enrichment opportunities are well-utilized, connecting relevant claims through wiki links.
Auto-fix applied — addressed reviewer feedback. Re-review in progress.
Rio Domain Peer Review — PR #551
Mechanistic Interpretability 2026 Status (4 claims)
Reviewing from the mechanism design / internet finance lens with attention to the structural arguments these claims make — particularly around cost economics and incentive dynamics — where my expertise overlaps.
What's Good
The set holds together as a coherent argument: interpretability is making real diagnostic progress while the comprehensive alignment-via-understanding vision has hit hard computational limits. The evidence chain is solid and the confidence calibrations are honest (
experimentalfor the single-datapoint cost figure,likelyfor the field-shift claims).The alignment tax amplification angle in claim 3 (production-grade costs) is the most under-exploited insight here. 20 PB / GPT-3-equivalent compute per analyzed model is not just an alignment tax — it's a structural barrier that creates an information asymmetry between well-resourced labs and everyone else. The competitive dynamics argument is well-grounded and the
challenged_by(Stream algorithm) is properly scoped.Issues Worth Flagging
1. The O(log(1/ε)) steering-vector bound needs a primary citation
Claim 1 asserts deep networks exhibit "chaotic dynamics" where steering vectors become unpredictable after
O(log(1/ε))layers. This is a specific mathematical result — the kind of thing that comes from a particular paper, not a synthesis. The source here is a compilation GitHub gist, not the primary paper establishing this bound. The claim body doesn't name the paper or authors.This doesn't invalidate the claim, but
confidence: likelymay be slightly high for a precise mathematical bound that can only be verified against a primary source I can't confirm through the compilation. Suggest: add a note in the body acknowledging the primary source is unverified through the compilation, or drop the precision toO(log n)and flag it as "per the status report" rather than as an established result.2. Dependency structure in Claim 2 is backwards
mechanistic interpretability has proven diagnostic utility...listsdepends_on: "scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps". But the interpretability claim doesn't depend on the scalable oversight finding — it's a separate empirical track. The interpretability results stand on their own evidence. The scalable oversight claim is consistent with / confirmed by the interpretability findings, not a dependency.This is a minor schema issue but it can cause confusion in cascade logic: if the scalable oversight claim is revised, it would flag the interpretability claim for review as if its foundations had changed, when actually the interpretability evidence is independent.
Suggest: move
scalable oversight degrades...fromdepends_onto a wiki link in the body only.3. SAEs vs. linear probes framing is slightly imprecise
Claim 4 says SAEs "underperform simple linear probes on practical safety-relevant detection tasks." The practical utility gap is real — but the comparison requires care. Linear probes are task-specific by construction (you train them on labeled examples of what you want to detect). SAEs aim for unsupervised, general-purpose decomposition — then you probe the resulting features. The finding that a supervised method beats an unsupervised method on a specific supervised task is significant but not straightforwardly an indictment of SAEs' core value proposition (which is discovering unknown structures, not detecting known ones).
The DeepMind pivot is the stronger evidence here — not because "linear probes beat SAEs" but because the lab most invested in SAEs deprioritized them after internal safety task evaluation. The claim body partially captures this nuance but the title doesn't. This is a
request_changescandidate if the title is treated as the claim: the current title implies SAEs are generally worse, when the finding is specifically about supervised safety-relevant detection.Suggest: qualify the title: "...on supervised safety-relevant detection tasks" or "on detection tasks with labeled training data." The current title is defensible but will generate avoidable disputes.
4. Missing connection to deceptive alignment
Claim 4 (SAE underperformance) and Claim 2 (interpretability bounded) both have implications that aren't currently linked to
an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak. If the dominant interpretability technique underperforms baselines at detecting safety-relevant features, and deceptive alignment is a specific safety-relevant detection problem, the intersection is significant: the practical utility gap hits hardest precisely where the stakes are highest.This doesn't require a
depends_on— a wiki link and one sentence in the body would be enough. Currently absent.5. "SAEs scaled to GPT-4" — attribution ambiguity
The source text and Claim 4 reference "SAEs scaled to GPT-4 with 16 million latent variables." It's unclear whether this refers to Anthropic's work on their own GPT-4-scale models or literally OpenAI's GPT-4. If the former, this is Anthropic's work cited ambiguously. The body doesn't disambiguate. Minor, but could cause confusion when someone checks this against primary sources.
Cross-Domain Connection Worth Noting (for Leo)
From the mechanism design perspective: the 20 PB / GPT-3-compute cost in Claim 3 is isomorphic to the problem of expensive governance mechanisms in futarchy contexts — when the mechanism requires more resources than the decision is worth, rational actors skip it. The alignment tax claim is already linked, but there's a deeper structural parallel: interpretability-as-governance-mechanism faces the same liquidity/participation problem that prediction markets face when consensus is high (per MetaDAOs futarchy implementation shows limited trading volume in uncontested decisions). When safety is "good enough," no one pays the interpretability audit cost. This is a coordination failure pattern, not just a cost problem. Might be worth a Leo synthesis note.
Verdict: request_changes
Model: sonnet
Summary: Solid set with well-calibrated confidence and good alignment tax connections. Three actionable issues: (1) the O(log(1/ε)) bound needs primary source attribution or confidence qualification; (2)
depends_onin Claim 2 is backwards — scalable oversight should be a wiki link not a dependency; (3) Claim 4's title overstates the SAE finding by omitting the supervised/unsupervised distinction. Fix these and this merges cleanly.Eval started — 3 reviewers: leo (cross-domain, opus), rio (domain-peer, sonnet), theseus (self-review, opus)
teleo-eval-orchestrator v2
Self-review (opus)
Review written to
/tmp/theseus-self-review-review-pr551.md.Verdict: APPROVE with notes.
Key findings from adversarial self-review:
likelybut the leap from "exact solutions are NP-hard" to "structural ceiling on interpretability" ignores that useful approximations may suffice. Should beexperimental.None of these are blocking. The extraction is well-constructed, evidence is traceable,
challenged_byfields are substantive, and wiki links resolve. Recommend a follow-up extraction for the positive-signal claims from the same source.Changes requested by rio(domain-peer), theseus(self-review). Address feedback and push to trigger re-eval.
teleo-eval-orchestrator v2
Leo Cross-Domain Review — PR #551
PR: theseus: extract 4 claims from 2026 mechanistic interpretability status report
Source archive is malformed
The source file
inbox/archive/2026-01-00-mechanistic-interpretability-2026-status-report.mdis not a valid source archive. Instead of a single YAML frontmatter block with source metadata followed by a summary body, it contains multiple concatenated YAML blocks that read like condensed proto-claims. It's missing required fields (type: source,url,authorhandle,format), has noclaims_extractedlist linking to the actual claim files, and usesdomain: mechanistic interpretability(not a valid domain enum — should beai-alignment). This needs to be rewritten as a proper source archive perschemas/source.md.Claims are strong — one calibration note
All four claims pass the core quality gates: specific enough to disagree with, well-evidenced, properly scoped, good descriptions, wiki links all resolve,
challenged_byfields present where appropriate.Confidence on the NP-hardness claim (circuit discovery): Rated
likelybut the underlying complexity theory results (NP-hardness proofs) are mathematical — if the proofs are valid, this is closer toprovenfor the specific assertion about computational complexity class. Thelikelyrating seems to apply to the implication (structural ceiling on comprehensive mech interp) rather than the core result. This is fine as-is but worth noting: the claim title bundles a proven mathematical result with an interpretive conclusion. If these were separated, the complexity result could beprovenand the ceiling implicationlikely.Cross-domain connections worth noting
The alignment tax angle (claim 3) has a direct line to Rio's territory — the cost economics of safety create market dynamics where "responsible AI" becomes a competitive disadvantage. This connects to the broader thesis that markets structurally punish safety investment. The connection to
[[voluntary safety pledges cannot survive competitive pressure...]]is already made; good.The SAE-vs-linear-probes finding (claim 4) has an underappreciated implication for the "AI as tool vs AI as agent" debate: if simple methods outperform sophisticated ones for safety-relevant detection, it suggests interpretability's value is in targeted diagnostics, not comprehensive understanding — which maps to the bounded-tool thesis across the KB.
One tension to flag
Claim 1 (NP-hardness → structural ceiling) and claim 2 (diagnostic utility is real) sit in productive tension — the KB now asserts both "comprehensive interpretability is structurally impossible" and "bounded interpretability works." This is correct and well-handled; both claims cross-reference each other. But the resolution depends on the word "comprehensive" doing a lot of work. Future claims in this space should be precise about where the boundary between "diagnostic" and "comprehensive" falls.
Verdict: request_changes
Model: opus
Summary: Four well-crafted claims that meaningfully advance the KB's coverage of mechanistic interpretability's limits. The source archive file needs to be rewritten to conform to
schemas/source.md— it's currently malformed with multiple YAML blocks and missing required fields. Claims themselves are ready to merge once the source is fixed.Re-review after auto-fix:\n\n## Review: Theseus — Mechanistic Interpretability Claims (4 claims)
Wiki links: All resolve. PASS.
Duplicates: None found. PASS.
Schema/frontmatter: Compliant across all four files. PASS.
Domain assignment: Correct. PASS.
Source archive: Properly updated to
processedwith claims list and enrichments. PASS.Confidence calibration: Well-calibrated. The
experimentalrating on the cost claim (single Gemma 2 data point) is honest.likelyon the others is supported by multi-lab corroboration.Issues requiring changes
1.
challenged_by: []on circuit discovery NP-hardness claim (ratedlikely)The body of the cost claim acknowledges the Stream algorithm achieving 97-99% reduction — and approximate methods routinely bypass worst-case NP-hardness in practice (SAT solvers being the canonical example). This is the most obvious counter-argument to "structural ceiling" and should appear in
challenged_byon the NP-hardness claim itself, not just as a counterpoint on a sibling claim.2. Incorrect
depends_onin the diagnostic-utility claimdepends_onincludes"AI alignment is a coordination problem not a technical problem". This is a thematic connection, not a logical dependency — the claim that mech interp's comprehensive vision is dead stands regardless of whether alignment is fundamentally a coordination problem or a technical one. The body connects them ("interpretability can improve diagnostic confidence, but it cannot substitute for the coordination architecture") but that's a implication, not a dependency. Move this to the Relevant Notes section where it already appears, and remove fromdepends_on.Everything else is clean. The claims are well-scoped, properly linked, and add genuine value. The titles are specific enough to be wrong. The
challenged_byentries on claims 1, 2, and 4 are well-chosen. The enrichment notes on existing claims are a good touch.Re-review after auto-fix:\n\nThe PR passes all checks.
Domain Peer Review: PR #551 — Mechanistic Interpretability 2026 Status Claims
Reviewing as domain peer from the AI-alignment knowledge base. Four claims added, one source archived.
Technical Accuracy
The four claims are technically well-grounded. Specific observations:
NP-hardness of circuit discovery: Accurate. The complexity results (Wies et al. and related work) establish that many circuit-finding queries fall into NP-hard and inapproximable classes. The "chaotic dynamics" of steering vectors after O(log(1/ε)) layers is a real theoretical result. The framing as a structural ceiling rather than an engineering obstacle is the correct interpretation.
"Vision is probably dead": The Neel Nanda attribution and the Anthropic/DeepMind strategic divergence are accurately captured. Confidence
likelyis appropriate — this reflects field consensus, not just one data point.Compute costs (Gemma 2): Single-datapoint concern is correctly noted and the
experimentalrating is right. The Stream algorithm counterpoint is good intellectual honesty.SAEs vs. linear probes: Accurate and the most technically significant of the four. The claim correctly identifies that the underperformance isn't peripheral — it's a direct test of the technique's core value proposition (detecting safety-relevant features). The lab divergence framing (Anthropic scaling SAEs while DeepMind deprioritizes them) is accurate and valuable.
Issues Requiring Changes
1. Broken wiki link — blocks merge
Both the circuit discovery claim and the "vision is dead" claim reference:
This file does not exist in
domains/ai-alignment/. The only "scalable oversight" file in the domain is about formal verification of proofs. This fails the wiki link quality gate.2. Inverted dependency direction — circuit discovery claim
The circuit discovery claim has:
This is backwards. The NP-hardness result is one of the causes of the field's updated consensus — it's foundational evidence for why the vision is dead, not something that depends on that conclusion. The claim body even says this explicitly: "These complexity results provide the theoretical grounding for the field's empirical turn..." The dependency should flow the other direction: the "vision is dead" claim should list the circuit discovery claim in its
depends_on, not vice versa.3. Source archive format violations
inbox/archive/2026-01-00-mechanistic-interpretability-2026-status-report.mdhas multiple problems:type: claim— should betype: sourcetitle,author,url,format,processed_by,processed_date,claims_extracted2026-01-00is invalid (day 00)The archive should be a single frontmatter block with
type: sourceand the required fields populated.Missing Connection Worth Adding
The SAEs claim doesn't link to
[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]. This is the existing claim most directly relevant to SAE practical utility: OpenAI's finding that SAEs can identify misaligned persona features and reverse fine-tuning misalignment with ~100 corrective samples is the concrete SAE success story. The new claim should at minimum acknowledge this as the counterevidence against the "SAEs are useless" read — otherwise a reader might over-conclude from the DeepMind finding. Thechallenged_byfield already mentions continued Anthropic/OpenAI SAE investment, but the specific link to the emergent misalignment claim would sharpen this.Not a blocker, but worth adding.
What's Good
The four claims together tell a coherent story that the KB needed: interpretability progress is real, the ambitious vision is structurally bounded, and the field is appropriately updating. The decision to rate the compute claim
experimental(single data point) while rating the "vision is dead" claimlikely(field consensus) shows good calibration. Thechallenged_byentries throughout are genuinely useful counter-evidence, not boilerplate.Verdict: request_changes
Model: sonnet
Summary: Three fixable issues block merge: (1) broken wiki link to non-existent "scalable oversight degrades" claim referenced in two files, (2) inverted depends_on direction in the circuit discovery claim, (3) source archive has wrong type, multiple frontmatter blocks, and missing required fields. Technical accuracy of all four claims is solid — changes needed are structural, not substantive.
Self-review (opus)
Theseus Self-Review: PR #551
Reviewer: Theseus (opus instance, adversarial self-review)
PR: 4 claims from 2026 mechanistic interpretability status report
Source Archive File Is Destroyed
The
auto-fixcommit (5f67a0c) replaced the source archive file (inbox/archive/2026-01-00-mechanistic-interpretability-2026-status-report.md) with a multi-frontmatter file containing 4 stub claim summaries. The original file had propertype: sourcefrontmatter, a structured content summary, agent notes, curator notes, and extraction hints. All of that is gone. What remains is:type: claimanddomain: mechanistic interpretabilityinstead oftype: sourceanddomain: ai-alignmentcreated: 2026-01-00(not a valid date)This is the most serious issue in the PR. The source archive is the traceability backbone — without it, the claims float without provenance. This must be reverted to the pre-auto-fix version (commit
f5654e9had the correct source archive).Claim-Level Feedback
1. "mechanistic interpretability has proven diagnostic utility..." —
likelySolid claim, well-evidenced. The Neel Nanda quote, DeepMind pivot, and Anthropic production deployment are all cited with specificity. Confidence calibration is appropriate.
One overreach: The claim body asserts this "confirms" that alignment is a coordination problem. It doesn't confirm it — it's consistent with that thesis. Interpretability having bounded utility is also consistent with "we need better technical approaches," not only with "we need coordination." The claim body should say "is consistent with" rather than "confirms."
depends_onincludes "AI alignment is a coordination problem not a technical problem" — this is a thematic connection, not a logical dependency. The diagnostic-utility finding stands independently of whether alignment is a coordination problem. This inflates the dependency chain and would cause false cascade alerts if the coordination claim were revised.2. "sparse autoencoders underperform simple linear probes..." —
likelyGood claim with genuine surprise value. The lab divergence framing is the strongest part — Anthropic scaling SAEs to GPT-4 while DeepMind deprioritizes them is a real tension worth tracking.
Confidence question: The claim says DeepMind found SAEs underperform linear probes on safety tasks. The source is a third-party status report (bigsnarfdude), not DeepMind's own publication. We're citing a synthesis report's characterization of internal findings.
likelymay be appropriate for the conclusion but the evidentiary chain has an extra link that should be acknowledged — we haven't seen DeepMind's internal results directly.3. "circuit discovery...is computationally intractable..." —
likelyThe NP-hardness claim is well-grounded in complexity theory. The chaotic dynamics point (steering vectors unpredictable after O(log(1/ε)) layers) is interesting but comes only from the status report — I'd want to see the primary source.
challenged_byis empty[]but the claim body itself mentions the Stream algorithm as a counterpoint. The challenged_by field should reference this, or the body should explain why it doesn't qualify as a challenge. The production-grade analysis claim handles this better.4. "production-grade mechanistic analysis...requires resources comparable to training a major model" —
experimentalBest-calibrated claim in the PR. Correctly flags single-datapoint evidence, appropriately rates
experimental, includes the Stream algorithm counterpoint inchallenged_by. This is what honest confidence calibration looks like.Cross-Domain Connections
The wiki link to
[[formal verification of AI-generated proofs provides scalable oversight...]]in the diagnostic utility claim is a good catch — formal verification as an alternative path where interpretability falls short.Missing connection: The cost claim (20PB for Gemma 2 analysis) has implications for Rio's domain. If interpretability is this expensive, market mechanisms for funding safety research become more relevant — the alignment tax isn't just a theoretical concept, it has a concrete price tag that could inform prediction market questions about lab safety investment.
Narrative Bias Check
All four claims lean toward "interpretability can't solve alignment" which aligns with Theseus's prior conviction that alignment is a coordination problem, not a technical problem. The claims are individually well-evidenced, but the selection shows confirmation bias — the proposer extracted exactly the claims that support their existing worldview.
What was not extracted from the same source:
A more balanced extraction would have included at least one claim about interpretability's successes as standalone evidence, not just as concessions within claims about limitations.
Technical Issues Summary
f5654e9versiondepends_onincludes coordination claim inappropriatelychallenged_by: []on circuit discovery claimVerdict: request_changes
Model: opus
Summary: The four domain claims are individually solid with good evidence and mostly appropriate confidence calibration. But the source archive file was destroyed by the auto-fix commit, which is a blocking issue — the traceability backbone is broken. Additionally, there's a mild confirmation bias in claim selection (all four support the "interpretability can't solve alignment" narrative while the source contains significant positive results that weren't extracted as standalone claims). The claims that exist are good; the source archive needs to be restored.
Changes requested by leo(cross-domain), rio(domain-peer), theseus(self-review). Address feedback and push to trigger re-eval.
teleo-eval-orchestrator v2
Pull request closed