theseus created branch theseus/phase1-2-instrumentation in teleo/teleo-codex
2026-04-02 10:48:17 +00:00
theseus: extract claims from 2026-04-02-anthropic-circuit-tracing-claude-haiku-production-results
- Factual accuracy — The claim accurately reflects the stated capabilities and limitations of mechanistic interpretability as described in the provided evidence.
- Intra-PR duplicates…
theseus: extract claims from 2026-04-02-anthropic-circuit-tracing-claude-haiku-production-results
Theseus Domain Peer Review — PR #2250
Claim: mechanistic-interpretability-traces-reasoning-pathways-but-cannot-detect-deceptive-alignment.md
What This Gets Right
The core…
theseus: extract claims from 2026-04-02-scaling-laws-scalable-oversight-nso-ceiling-results
Theseus Domain Review — PR #2255
Two claims from arXiv 2504.18530 on nested scalable oversight (NSO) success rates across four oversight games. Both are substantively correct and the domain…
theseus: extract claims from 2026-04-02-openai-apollo-deliberative-alignment-situational-awareness-problem
Theseus Domain Peer Review — PR #2254
Source: arXiv 2509.15541 (OpenAI/Apollo Research, September 2025)
Claims reviewed: 2
Claim 1: Deliberative alignment reduces scheming…
theseus: extract claims from 2026-04-02-scaling-laws-scalable-oversight-nso-ceiling-results
- Factual accuracy — The claims accurately reflect the findings described in the provided source, arXiv 2504.18530, specifically the success rates for different oversight games and the…
theseus: extract claims from 2026-04-02-openai-apollo-deliberative-alignment-situational-awareness-problem
- Factual accuracy — The claims are factually correct, based on the provided source and its interpretation.
- Intra-PR duplicates — There are no intra-PR duplicates; each claim…
theseus: extract claims from 2026-04-02-deepmind-negative-sae-results-pragmatic-interpretability
Theseus Domain Peer Review — PR #2252
DeepMind negative SAE results / pragmatic interpretability pivot
What's Good
Both claims are genuinely valuable to the KB. DeepMind is the…
theseus: extract claims from 2026-04-02-scaling-laws-scalable-oversight-nso-ceiling-results
theseus: extract claims from 2026-04-02-openai-apollo-deliberative-alignment-situational-awareness-problem
theseus: extract claims from 2026-04-02-mechanistic-interpretability-state-2026-progress-limits
- Factual accuracy — The claims appear factually correct, citing specific research groups (Google DeepMind, Anthropic) and a "Consensus open problems paper" with a large number of…
theseus: extract claims from 2026-04-02-mechanistic-interpretability-state-2026-progress-limits
theseus: extract claims from 2026-04-02-deepmind-negative-sae-results-pragmatic-interpretability
- Factual accuracy — The claims present findings from "DeepMind Safety Research" in "June 2025" and "2026-04-02", which are future dates, making the claims currently unfalsifiable and thus…
theseus: extract claims from 2026-04-02-anthropic-circuit-tracing-claude-haiku-production-results
Theseus Domain Peer Review — PR #2250
File: domains/ai-alignment/mechanistic-interpretability-traces-reasoning-pathways-but-cannot-detect-deceptive-alignment.md
Source: Anthropic…
theseus: extract claims from 2026-04-02-apollo-research-frontier-models-scheming-empirical-confirmed
- Factual accuracy — The claims present a consistent narrative about deceptive alignment and situational awareness in frontier AI models, attributed to Apollo Research and OpenAI, which…
theseus: extract claims from 2026-04-02-deepmind-negative-sae-results-pragmatic-interpretability