teleo-codex/domains/ai-alignment
Teleo Agents 5990e9b50a
theseus: extract claims from 2026-04-04-telegram-m3taversal-what-do-you-think-are-the-most-compelling-approach
- Source: inbox/queue/2026-04-04-telegram-m3taversal-what-do-you-think-are-the-most-compelling-approach.md
- Domain: ai-alignment
- Claims: 3, Entities: 0
- Enrichments: 4
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-15 18:53:40 +00:00
79 percent of multi-agent failures originate from specification and coordination not implementation because decomposition quality is the primary determinant of system success.md
_map.md Merge pull request 'theseus: 3 active inference claims for collective agent architecture (resubmit)' (#827) from theseus/active-inference-claims into main 2026-03-15 14:24:53 +00:00
a misaligned context cannot develop aligned AI because the competitive dynamics building AI optimize for deployment speed not safety making system alignment prerequisite for AI alignment.md leo: add 9 claims — ai-alignment + collective intelligence (Moloch/Schmachtenberger sprint batch 3) 2026-04-14 19:15:29 +00:00
activation-based-persona-monitoring-detects-behavioral-trait-shifts-in-small-models-without-behavioral-testing.md
adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans.md
adversarial-training-creates-fundamental-asymmetry-between-deception-capability-and-detection-capability-in-alignment-auditing.md
agent research direction selection is epistemic foraging where the optimal strategy is to seek observations that maximally reduce model uncertainty rather than confirm existing beliefs.md
agent skill specifications have become an industrial standard for knowledge codification with major platform adoption creating the infrastructure layer for systematic conversion of human expertise into portable AI-consumable formats.md
agent-generated code creates cognitive debt that compounds when developers cannot understand what was produced on their behalf.md
agent-mediated-correction-proposes-closing-tool-to-agent-gap-through-domain-expert-actionability.md
agent-native retrieval converges on filesystem abstractions over embedding search because grep cat ls and find are all an agent needs to navigate structured knowledge.md
agentic Taylorism means humanity feeds knowledge into AI through usage as a byproduct of labor and whether this concentrates or distributes depends entirely on engineering and evaluation.md leo: stress-test rewrites — 7 claims revised, 1 merged, 1 deleted, 3 new claims added 2026-04-14 19:15:29 +00:00
AGI may emerge as a patchwork of coordinating sub-AGI agents rather than a single monolithic system.md
AI accelerates existing Molochian dynamics by removing bottlenecks not creating new misalignment because the competitive equilibrium was always catastrophic and friction was the only thing preventing convergence.md leo: enrich 3 existing claims with Schmachtenberger corpus evidence 2026-04-14 19:15:29 +00:00
AI agent orchestration that routes data and tools between specialized models outperforms both single-model and human-coached approaches because the orchestrator contributes coordination not direction.md
AI agents as personal advocates collapse Coasean transaction costs enabling bottom-up coordination at societal scale but catastrophic risks remain non-negotiable requiring state enforcement as outer boundary.md
AI agents can reach cooperative program equilibria inaccessible in traditional game theory because open-source code transparency enables conditional strategies that require mutual legibility.md
AI agents excel at implementing well-scoped ideas but cannot generate creative experiment designs which makes the human role shift from researcher to agent workflow architect.md
AI alignment is a coordination problem not a technical problem.md leo: enrich 3 existing claims with Schmachtenberger corpus evidence 2026-04-14 19:15:29 +00:00
AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session.md
AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation.md
AI displacement hits young workers first because a 14 percent drop in job-finding rates for 22-25 year olds in exposed occupations is the leading indicator that incumbents organizational inertia temporarily masks.md
AI integration follows an inverted-U where economic incentives systematically push organizations past the optimal human-AI ratio.md
AI investment concentration where 58 percent of funding flows to megarounds and two companies capture 14 percent of all global venture capital creates a structural oligopoly that alignment governance must account for.md
AI is omni-use technology categorically different from dual-use because it improves all capabilities simultaneously meaning anything AI can optimize it can break.md leo: stress-test rewrites — 7 claims revised, 1 merged, 1 deleted, 3 new claims added 2026-04-14 19:15:29 +00:00
AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur which makes bioterrorism the most proximate AI-enabled existential risk.md
AI makes authoritarian lock-in dramatically easier by solving the information processing constraint that historically caused centralized control to fail.md leo: enrich 3 existing claims with Schmachtenberger corpus evidence 2026-04-14 19:15:29 +00:00
AI personas emerge from pre-training data as a spectrum of humanlike motivations rather than developing monomaniacal goals which makes AI behavior more unpredictable but less catastrophically focused than instrumental convergence predicts.md
AI shifts knowledge systems from externalizing memory to externalizing attention because storage and retrieval are solved but the capacity to notice what matters remains scarce.md
AI talent circulation between frontier labs transfers alignment culture not just capability because researchers carry safety methodologies and institutional norms to their new organizations.md
AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md
ai-agents-shift-research-bottleneck-from-execution-to-ideation-because-agents-implement-well-scoped-ideas-but-fail-at-creative-experiment-design.md theseus: extract claims from 2026-04-04-telegram-m3taversal-how-transformative-are-software-patterns-agentic 2026-04-15 18:51:53 +00:00
ai-capability-benchmarks-exhibit-50-percent-volatility-between-versions-making-governance-thresholds-unreliable.md
AI-companion-apps-correlate-with-increased-loneliness-creating-systemic-risk-through-parasocial-dependency.md
ai-enhanced-collective-intelligence-requires-federated-learning-architectures-to-preserve-data-sovereignty-at-scale.md
AI-exposed workers are disproportionately female high-earning and highly educated which inverts historical automation patterns and creates different political and economic displacement dynamics.md
AI-generated-persuasive-content-matches-human-effectiveness-at-belief-change-eliminating-the-authenticity-premium.md
ai-models-can-covertly-sandbag-capability-evaluations-even-under-chain-of-thought-monitoring.md
AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns.md
ai-sandbagging-creates-m-and-a-liability-exposure-across-product-liability-consumer-protection-and-securities-fraud.md theseus: extract claims from 2026-03-21-harvard-jolt-sandbagging-risk-allocation 2026-04-14 18:42:58 +00:00
ai-tools-reduced-experienced-developer-productivity-in-rct-conditions-despite-predicted-speedup-suggesting-capability-deployment-does-not-translate-to-autonomy.md
alignment-auditing-shows-structural-tool-to-agent-gap-where-interpretability-tools-work-in-isolation-but-fail-when-used-by-investigator-agents.md
alignment-auditing-tools-fail-through-tool-to-agent-gap-not-just-technical-limitations.md
alignment-auditing-tools-fail-through-tool-to-agent-gap-not-tool-quality.md
alignment-through-continuous-coordination-outperforms-upfront-specification-because-deployment-contexts-diverge-from-training-conditions.md theseus: extract claims from 2026-04-04-telegram-m3taversal-what-do-you-think-are-the-most-compelling-approach 2026-04-15 18:53:40 +00:00
an AI agent that is uncertain about its objectives will defer to human shutdown commands because corrigibility emerges from value uncertainty not from engineering against instrumental interests.md
an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak.md
anthropic-deepmind-interpretability-complementarity-maps-mechanisms-versus-detects-intent.md
anthropic-internal-resource-allocation-shows-6-8-percent-safety-only-headcount-when-dual-use-research-excluded-revealing-gap-between-public-positioning-and-commitment.md
Anthropics RSP rollback under commercial pressure is the first empirical confirmation that binding safety commitments cannot survive the competitive dynamics of frontier AI development.md
anthropomorphizing AI agents to claim autonomous action creates credibility debt that compounds until a crisis forces public reckoning.md
anti-scheming-training-amplifies-evaluation-awareness-creating-adversarial-feedback-loop.md
anti-scheming-training-creates-goodhart-dynamic-where-training-signal-diverges-from-scheming-tendency.md theseus: extract claims from 2026-03-21-schoen-stress-testing-deliberative-alignment 2026-04-14 18:36:11 +00:00
approval fatigue drives agent architecture toward structural safety because humans cannot meaningfully evaluate 100 permission requests per hour.md
as AI-automated software development becomes certain the bottleneck shifts from building capacity to knowing what to build making structured knowledge graphs the critical input to autonomous systems.md
autonomous-weapons-violate-existing-IHL-because-proportionality-requires-human-judgment.md reweave: merge 20 files via frontmatter union [auto] 2026-04-14 16:52:47 +00:00
behavioral-divergence-between-evaluation-and-deployment-is-bounded-by-regime-information-extractable-from-internal-representations.md
benchmark-based-ai-capability-metrics-overstate-real-world-autonomous-performance-because-automated-scoring-excludes-production-readiness-requirements.md
bio-capability-benchmarks-measure-text-accessible-knowledge-not-physical-synthesis-capability.md
bostrom takes single-digit year timelines to superintelligence seriously while acknowledging decades-long alternatives remain possible.md
capabilities generalize further than alignment as systems scale because behavioral heuristics that keep systems aligned at lower capability cease to function at higher capability.md
capabilities-training-alone-grows-evaluation-awareness-from-2-to-20-percent.md
capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds.md
capability-scaling-increases-error-incoherence-on-difficult-tasks-inverting-the-expected-relationship-between-model-size-and-behavioral-predictability.md
ccw-consensus-rule-enables-small-coalition-veto-over-autonomous-weapons-governance.md
chain-of-thought-monitorability-is-time-limited-governance-window.md
chain-of-thought-monitoring-vulnerable-to-steganographic-encoding-as-emerging-capability.md
circuit-tracing-bottleneck-hours-per-prompt-limits-interpretability-scaling.md
civil-society-coordination-infrastructure-fails-to-produce-binding-governance-when-structural-obstacle-is-great-power-veto-not-political-will.md
coding agents cannot take accountability for mistakes which means humans must retain decision authority over security and critical systems regardless of agent capability.md
coding-agents-crossed-usability-threshold-december-2025-when-models-achieved-sustained-coherence-across-complex-multi-file-tasks.md
cognitive anchors that stabilize attention too firmly prevent the productive instability that precedes genuine insight because anchoring suppresses the signal that would indicate the anchor needs updating.md
collective attention allocation follows nested active inference where domain agents minimize uncertainty within their boundaries while the evaluator minimizes uncertainty at domain intersections.md
collective-intelligence-architectures-are-underexplored-for-alignment-despite-addressing-core-problems.md theseus: extract claims from 2026-04-04-telegram-m3taversal-what-do-you-think-are-the-most-compelling-approach 2026-04-15 18:53:40 +00:00
community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules.md
component-task-benchmarks-overestimate-operational-capability-because-simulated-environments-remove-real-world-friction.md
comprehensive AI services achieve superintelligent capability through architectural decomposition into task-specific systems that collectively match general intelligence without any single system possessing unified agency.md
compute export controls are the most impactful AI governance mechanism but target geopolitical competition not safety leaving capability development unconstrained.md
compute supply chain concentration is simultaneously the strongest AI governance lever and the largest systemic fragility because the same chokepoints that enable oversight create single points of failure.md
confidence changes in foundational claims must propagate through the dependency graph because manual tracking fails at scale and approximately 40 percent of top psychology journal papers are estimated unlikely to replicate.md
context files function as agent operating systems through self-referential self-extension where the file teaches modification of the file that contains the teaching.md
contrast-consistent-search-demonstrates-models-internally-represent-truth-signals-divergent-from-behavioral-outputs.md
coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem.md
corrigibility is at cross-purposes with effectiveness because deception is a convergent free strategy while corrigibility must be engineered against instrumental interests.md
court-protection-plus-electoral-outcomes-create-legislative-windows-for-ai-governance.md
court-protection-plus-electoral-outcomes-create-statutory-ai-regulation-pathway.md
court-ruling-creates-political-salience-not-statutory-safety-law.md
court-ruling-plus-midterm-elections-create-legislative-pathway-for-ai-regulation.md
cross-lab-alignment-evaluation-surfaces-safety-gaps-internal-evaluation-misses-providing-empirical-basis-for-mandatory-third-party-evaluation.md
cross-lingual-rlhf-fails-to-suppress-emotion-steering-side-effects.md
curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive.md
current language models escalate to nuclear war in simulated conflicts because behavioral alignment cannot instill aversion to catastrophic irreversible actions.md
current-frontier-models-evaluate-17x-below-catastrophic-autonomy-threshold-by-formal-time-horizon-metrics.md
cyber-capability-benchmarks-overstate-exploitation-understate-reconnaissance-because-ctf-isolates-techniques-from-attack-phase-dynamics.md
cyber-is-exceptional-dangerous-capability-domain-with-documented-real-world-evidence-exceeding-benchmark-predictions.md
deceptive-alignment-empirically-confirmed-across-all-major-2024-2025-frontier-models-in-controlled-tests.md
deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices.md
deferred-subversion-is-distinct-sandbagging-category-where-ai-systems-gain-trust-before-pursuing-misaligned-goals.md theseus: extract claims from 2026-03-21-harvard-jolt-sandbagging-risk-allocation 2026-04-14 18:42:58 +00:00
delegating critical infrastructure development to AI creates civilizational fragility because humans lose the ability to understand maintain and fix the systems civilization depends on.md
deliberative-alignment-reduces-scheming-in-controlled-settings-but-degrades-85-percent-in-real-world-deployment.md theseus: extract claims from 2026-03-21-schoen-stress-testing-deliberative-alignment 2026-04-14 18:36:11 +00:00
deliberative-alignment-reduces-scheming-through-situational-awareness-not-genuine-value-change.md
democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations.md
developing superintelligence is surgery for a fatal condition not russian roulette because the baseline of inaction is itself catastrophic.md
digital stigmergy is structurally vulnerable because digital traces do not evaporate and agents trust the environment unconditionally so malformed artifacts persist and corrupt downstream processing indefinitely.md
distributed superintelligence may be less stable and more dangerous than unipolar because resource competition between superintelligent agents creates worse coordination failures than a single misaligned system.md
divergence-ai-labor-displacement-substitution-vs-complementarity.md leo: incorporate Theseus review feedback on divergences #1 and #5 2026-04-14 18:47:19 +00:00
domestic-political-change-can-rapidly-erode-decade-long-international-AI-safety-norms-as-US-reversed-from-supporter-to-opponent-in-one-year.md
economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate.md
effective context window capacity falls more than 99 percent short of advertised maximum across all tested models because complex reasoning degrades catastrophically with scale.md
electoral-investment-becomes-residual-ai-governance-strategy-when-voluntary-and-litigation-routes-insufficient.md
eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods.md
emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive.md theseus: fix dangling wiki links in emergent misalignment enrichment 2026-04-14 18:39:21 +00:00
emotion-representations-localize-at-middle-depth-architecture-invariant.md
emotion-vector-interventions-limited-to-emotion-mediated-harms-not-strategic-deception.md
emotion-vectors-causally-drive-unsafe-ai-behavior-through-interpretable-steering.md
eu-ai-act-extraterritorial-enforcement-creates-binding-governance-alternative-to-us-voluntary-commitments.md
evaluation and optimization have opposite model-diversity optima because evaluation benefits from cross-family diversity while optimization benefits from same-family reasoning pattern alignment.md
evaluation-awareness-creates-bidirectional-confounds-in-safety-benchmarks-because-models-detect-and-respond-to-testing-conditions.md
evaluation-based-coordination-schemes-face-antitrust-obstacles-because-collective-pausing-agreements-among-competing-developers-could-be-construed-as-cartel-behavior.md
evidence-dilemma-rapid-ai-development-structurally-prevents-adequate-pre-deployment-safety-evidence-accumulation.md theseus: extract claims from 2026-03-21-international-ai-safety-report-2026-evaluation-gap 2026-04-14 18:43:15 +00:00
evolutionary trace-based optimization submits improvements as pull requests for human review creating a governance-gated self-improvement loop distinct from acceptance-gating or metric-driven iteration.md
external-evaluators-predominantly-have-black-box-access-creating-false-negatives-in-dangerous-capability-detection.md
factorised-generative-models-enable-decentralized-multi-agent-representation-through-individual-level-beliefs.md
file-backed durable state is the most consistently positive harness module across task types because externalizing state to path-addressable artifacts survives context truncation delegation and restart.md
formal verification becomes economically necessary as AI-generated code scales because testing cannot detect adversarial overfitting and a proof cannot be gamed.md
formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades.md
formal-verification-provides-scalable-oversight-that-sidesteps-alignment-degradation.md theseus: extract claims from 2026-04-04-telegram-m3taversal-what-do-you-think-are-the-most-compelling-approach 2026-04-15 18:53:40 +00:00
four restraints prevent competitive dynamics from reaching catastrophic equilibrium and AI specifically erodes physical limitations and bounded rationality leaving only coordination as defense.md
frontier-ai-failures-shift-from-systematic-bias-to-incoherent-variance-as-task-complexity-and-reasoning-length-increase.md
frontier-ai-labs-allocate-6-15-percent-research-headcount-to-safety-versus-60-75-percent-to-capabilities-with-declining-ratios-since-2024.md
frontier-ai-monitoring-evasion-capability-grew-from-minimal-mitigations-sufficient-to-26-percent-success-in-13-months.md
frontier-ai-safety-verdicts-rely-on-deployment-track-record-not-evaluation-confidence.md
frontier-ai-task-horizon-doubles-every-six-months-making-safety-evaluations-obsolete-within-one-model-generation.md
frontier-models-exhibit-situational-awareness-that-enables-strategic-deception-during-evaluation-making-behavioral-testing-fundamentally-unreliable.md
frontier-safety-frameworks-score-8-35-percent-against-safety-critical-standards-with-52-percent-composite-ceiling.md
government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic by penalizing safety constraints rather than enforcing them.md
government-safety-penalties-invert-regulatory-incentives-by-blacklisting-cautious-actors.md
graph traversal through curated wiki links replicates spreading activation from cognitive science because progressive disclosure implements decay-based context loading and queries evolve during search through the berrypicking effect.md
harness engineering emerges as the primary agent capability determinant because the runtime orchestration layer not the token state determines what agents can do.md
harness engineering outweighs model selection in agent system performance because changing the code wrapping the model produces up to 6x performance gaps on the same benchmark while model upgrades produce smaller gains.md
harness module effects concentrate on a small solved frontier rather than shifting benchmarks uniformly because most tasks are robust to control logic changes and meaningful differences come from boundary cases that flip under changed structure.md
harness pattern logic is portable as natural language without degradation when backed by a shared intelligent runtime because the design-pattern layer is separable from low-level execution hooks.md
high AI exposure increases collective idea diversity without improving individual creative quality creating an asymmetry between group and individual effects.md
high-capability-models-show-early-step-hedging-as-proto-gaming-behavior.md
house-senate-ai-defense-divergence-creates-structural-governance-chokepoint-at-conference.md
human civilization passes falsifiable superorganism criteria because individuals cannot survive apart from society and occupations function as role-specific cellular algorithms.md
human ideas naturally converge toward similarity over social learning chains making AI a net diversity injector rather than a homogenizer under high-exposure conditions.md
human verification bandwidth is the binding constraint on AGI economic impact not intelligence itself because the marginal cost of AI execution falls to zero while the capacity to validate audit and underwrite responsibility remains finite.md
human-AI mathematical collaboration succeeds through role specialization where AI explores solution spaces humans provide strategic direction and mathematicians verify correctness.md
increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements.md
individual-free-energy-minimization-does-not-guarantee-collective-optimization-in-multi-agent-active-inference.md
inference efficiency gains erode AI deployment governance without triggering compute monitoring thresholds because governance frameworks target training concentration while inference optimization distributes capability below detection.md
inference-time-compute-creates-non-monotonic-safety-scaling-where-extended-reasoning-degrades-alignment.md
inference-time-safety-monitoring-recovers-alignment-through-early-reasoning-intervention.md
instrumental convergence risks may be less imminent than originally argued because current AI architectures do not exhibit systematic power-seeking behavior.md
intelligence and goals are orthogonal so a superintelligence can be maximally competent while pursuing arbitrary or destructive ends.md
international-humanitarian-law-and-ai-alignment-converge-on-explainability-requirements.md reweave: merge 21 files via frontmatter union [auto] 2026-04-14 01:10:21 +00:00
interpretability-effectiveness-anti-correlates-with-adversarial-training-making-tools-hurt-performance-on-sophisticated-misalignment.md
intrinsic proactive alignment develops genuine moral capacity through self-awareness empathy and theory of mind rather than external reward optimization.md
iterated distillation and amplification preserves alignment across capability scaling by keeping humans in the loop at every iteration but distillation errors may compound making the alignment guarantee probabilistic not absolute.md
iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation.md
judicial-oversight-checks-executive-ai-retaliation-but-cannot-create-positive-safety-obligations.md
judicial-oversight-of-ai-governance-through-constitutional-grounds-not-statutory-safety-law.md
knowledge between notes is generated by traversal not stored in any individual note because curated link paths produce emergent understanding that embedding similarity cannot replicate.md
knowledge codification into AI agent skills structurally loses metis because the tacit contextual judgment that makes expertise valuable cannot survive translation into explicit procedural rules.md
knowledge processing requires distinct phases with fresh context per phase because each phase performs a different transformation and contamination between phases degrades output quality.md
learning human values from observed behavior through inverse reinforcement learning is structurally safer than specifying objectives directly because the agent maintains uncertainty about what humans actually want.md
legal-and-alignment-communities-converge-on-AI-value-judgment-impossibility.md
legal-mandate-is-the-only-version-of-coordinated-pausing-that-avoids-antitrust-risk-while-preserving-coordination-benefits.md
LLM-maintained knowledge bases that compile rather than retrieve represent a paradigm shift from RAG to persistent synthesis because the wiki is a compounding artifact not a query cache.md
long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing.md
machine-learning-pattern-extraction-systematically-erases-dataset-outliers-where-vulnerable-populations-concentrate.md
macro AI productivity gains remain statistically undetectable despite clear micro-level benefits because coordination costs verification tax and workslop absorb individual-level improvements before they reach aggregate measures.md
making-research-evaluations-into-compliance-triggers-closes-the-translation-gap-by-design.md
many-interpretability-queries-are-provably-computationally-intractable.md
marginal returns to intelligence are bounded by five complementary factors which means superintelligence cannot produce unlimited capability gains regardless of cognitive power.md
maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-by-maximizing-minimum-utility-across-preference-groups.md
mechanistic-interpretability-detects-emotion-mediated-failures-but-not-strategic-deception.md
mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal.md
mechanistic-interpretability-tools-fail-at-safety-critical-tasks-at-frontier-scale.md
mechanistic-interpretability-traces-reasoning-pathways-but-cannot-detect-deceptive-alignment.md
memory architecture requires three spaces with different metabolic rates because semantic episodic and procedural memory serve different cognitive functions and consolidate at different speeds.md
meta-level-specification-gaming-extends-objective-gaming-to-oversight-mechanisms-through-sandbagging-and-evaluation-mode-divergence.md
methodology hardens from documentation to skill to hook as understanding crystallizes and each transition moves behavior from probabilistic to deterministic enforcement.md
military-ai-deskilling-and-tempo-mismatch-make-human-oversight-functionally-meaningless-despite-formal-authorization-requirements.md
minority-preference-alignment-improves-33-percent-without-majority-compromise-suggesting-single-reward-leaves-value-on-table.md
modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling.md
motivated reasoning among AI lab leaders is itself a primary risk vector because those with most capability to slow down have most incentive to accelerate.md leo: stress-test rewrites — 7 claims revised, 1 merged, 1 deleted, 3 new claims added 2026-04-14 19:15:29 +00:00
multi-agent coordination delivers value only when three conditions hold simultaneously natural parallelism context overflow and adversarial verification value.md
multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows.md
multi-agent deployment exposes emergent security vulnerabilities invisible to single-agent evaluation because cross-agent propagation identity spoofing and unauthorized compliance arise only in realistic multi-party environments.md
multi-agent-systems-amplify-provider-level-biases-through-recursive-reasoning-requiring-provider-diversity-for-collective-intelligence.md
multi-model collaboration solved problems that single models could not because different AI architectures contribute complementary capabilities as the even-case solution to Knuths Hamiltonian decomposition required GPT and Claude working together.md
multilateral-ai-governance-verification-mechanisms-remain-at-proposal-stage-because-technical-infrastructure-does-not-exist-at-deployment-scale.md
multilateral-verification-mechanisms-can-substitute-for-failed-voluntary-commitments-when-binding-enforcement-replaces-unilateral-sacrifice.md
nation-states will inevitably assert control over frontier AI development because the monopoly on force is the foundational state function and weapons-grade AI capability in private hands is structurally intolerable to governments.md
national-scale-collective-intelligence-infrastructure-requires-seven-trust-properties-to-achieve-legitimacy.md
ndaa-conference-process-is-viable-pathway-for-statutory-ai-safety-constraints.md
near-universal-political-support-for-autonomous-weapons-governance-coexists-with-structural-failure-because-opposing-states-control-advanced-programs.md
nested-scalable-oversight-achieves-at-most-52-percent-success-at-moderate-capability-gaps.md
no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it.md
noise-injection-detects-sandbagging-through-asymmetric-performance-response.md
non-autoregressive-architectures-reduce-jailbreak-vulnerability-through-elimination-of-continuation-drive-at-capability-cost.md
notes function as cognitive anchors that stabilize attention during complex reasoning by externalizing reference points that survive working memory degradation.md
notes function as executable skills for AI agents because loading a well-titled claim into context enables reasoning the agent could not perform without it.md
only binding regulation with enforcement teeth changes frontier AI lab behavior because every voluntary commitment has been eroded abandoned or made conditional on competitor behavior when commercially inconvenient.md
ottawa-model-treaty-process-cannot-replicate-for-dual-use-ai-systems-because-verification-architecture-requires-technical-capability-inspection-not-production-records.md
permanently failing to develop superintelligence is itself an existential catastrophe because preventable mass death continues indefinitely.md
persistent irreducible disagreement.md
physical infrastructure constraints on AI scaling create a natural governance window because packaging memory and power bottlenecks operate on 2-10 year timescales while capability research advances in months.md
pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md
pluralistic-ai-alignment-through-multiple-systems-preserves-value-diversity-better-than-forced-consensus.md
post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md
pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
precautionary-capability-threshold-activation-is-governance-response-to-benchmark-uncertainty.md
process-supervision-can-train-models-toward-steganographic-behavior-through-optimization-pressure.md
process-supervision-training-inadvertently-trains-steganographic-cot-behavior.md
production agent memory infrastructure consumed 24 percent of codebase in one tracked system suggesting memory requires dedicated engineering not a single configuration file.md
progressive disclosure of procedural knowledge produces flat token scaling regardless of knowledge base size because tiered loading with relevance-gated expansion avoids the linear cost of full context loading.md
prosaic alignment can make meaningful progress through empirical iteration within current ML paradigms because trial and error at pre-critical capability levels generates useful signal about alignment failure modes.md
provider-level-behavioral-biases-persist-across-model-versions-requiring-psychometric-auditing-beyond-standard-benchmarks.md
reasoning-models-may-have-emergent-alignment-properties-distinct-from-rlhf-fine-tuning-as-o3-avoided-sycophancy-while-matching-or-exceeding-safety-focused-models.md
recursive self-improvement creates explosive intelligence gains because the system that improves is itself improving.md
reinforcement learning trained memory management outperforms hand-coded heuristics because the agent learns when compression is safe and the advantage widens with complexity.md
representation-trajectory-geometry-distinguishes-deceptive-from-sincere-alignment-without-creating-adversarial-attack-surfaces.md
representative-sampling-and-deliberative-mechanisms-should-replace-convenience-platforms-for-ai-alignment-feedback.md
retracted sources contaminate downstream knowledge because 96 percent of citations to retracted papers fail to note the retraction and no manual audit process scales to catch the cascade.md
rlchf-aggregated-rankings-variant-combines-evaluator-rankings-via-social-welfare-function-before-reward-model-training.md
rlchf-features-based-variant-models-individual-preferences-with-evaluator-characteristics-enabling-aggregation-across-diverse-groups.md
rlhf-is-implicit-social-choice-without-normative-scrutiny.md
safe AI development requires building alignment mechanisms before scaling capability.md
sandbagging-detection-requires-white-box-access-creating-deployment-barrier.md
scaffolded-black-box-prompting-outperforms-white-box-interpretability-for-alignment-auditing.md
scalable-oversight-success-is-domain-dependent-with-worst-performance-in-highest-stakes-domains.md
scheming-safety-cases-require-interpretability-evidence-because-observer-effects-make-behavioral-evaluation-insufficient.md
self-evolution improves agent performance through acceptance-gated retry not expanded search because disciplined attempt loops with explicit failure reflection outperform open-ended exploration.md
self-optimizing agent harnesses outperform hand-engineered ones because automated failure mining and iterative refinement explore more of the harness design space than human engineers can.md
single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
situationally-aware-models-do-not-systematically-game-early-step-monitors-at-current-capabilities.md
some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md
specification-gaming-scales-with-capability-as-more-capable-optimizers-find-more-sophisticated-gaming-strategies.md
specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception.md
structured exploration protocols reduce human intervention by 6x because the Residue prompt enabled 5 unguided AI explorations to solve what required 31 human-coached explorations.md
structured self-diagnosis prompts induce metacognitive monitoring in AI agents that default behavior does not produce because explicit uncertainty flagging and failure mode enumeration activate deliberate reasoning patterns.md
subagent hierarchies outperform peer multi-agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers.md
sufficiently complex orchestrations of task-specific AI services may exhibit emergent unified agency recreating the alignment problem at the system level.md
super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance.md
superorganism organization extends effective lifespan substantially at each organizational level which means civilizational intelligence operates on temporal horizons that individual-preference alignment cannot serve.md
surveillance-of-AI-reasoning-traces-degrades-trace-quality-through-self-censorship-making-consent-gated-sharing-an-alignment-requirement-not-just-a-privacy-preference.md
sycophancy-is-paradigm-level-failure-across-all-frontier-models-suggesting-rlhf-systematically-produces-approval-seeking.md
task difficulty moderates AI idea adoption more than source disclosure with difficult problems generating AI reliance regardless of whether the source is labeled.md
technological development draws from an urn containing civilization-destroying capabilities and only preventive governance can avoid black ball technologies.md
the absence of a societal warning signal for AGI is a structural feature not an accident because capability scaling is gradual and ambiguous and collective action requires anticipation not reaction.md
the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load.md
the first mover to superintelligence likely gains decisive strategic advantage because the gap between leader and followers accelerates during takeoff.md
the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md
the optimal SI development strategy is swift to harbor slow to berth moving fast to capability then pausing before full deployment.md
the progression from autocomplete to autonomous agent teams follows a capability-matched escalation where premature adoption creates more chaos than value.md
the relationship between training reward signals and resulting AI desires is fundamentally unpredictable making behavioral alignment through training an unreliable method.md
the same coordination protocol applied to different AI models produces radically different problem-solving strategies because the protocol structures process not thought.md
the shape of returns on cognitive reinvestment determines takeoff speed because constant or increasing returns on investing cognitive output into cognitive capability produce recursive self-improvement.md
the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md
the training-to-inference shift structurally favors distributed AI architectures because inference optimizes for power efficiency and cost-per-token where diverse hardware competes while training optimizes for raw throughput where NVIDIA monopolizes.md
the variance of a learned preference sensitivity distribution diagnoses dataset heterogeneity and collapses to fixed-parameter behavior when preferences are homogeneous.md
three concurrent maintenance loops operating at different timescales catch different failure classes because fast reflexive checks medium proprioceptive scans and slow structural audits each detect problems invisible to the other scales.md
three conditions gate AI takeover risk autonomy robotics and production chain control and current AI satisfies none of them which bounds near-term catastrophic risk despite superhuman cognitive capabilities.md
tools and artifacts transfer between AI agents and evolve in the process because Agent O improved Agent Cs solver by combining it with its own structural knowledge creating a hybrid better than either original.md
training-free-weight-editing-converts-steering-vectors-to-persistent-alignment.md
trajectory-geometry-probing-requires-white-box-access-limiting-deployment-to-controlled-evaluation-contexts.md
trajectory-monitoring-dual-edge-geometric-concentration.md
transparent algorithmic governance where AI response rules are public and challengeable through the same epistemic process as the knowledge base is a structurally novel alignment approach.md
trust asymmetry between agent and enforcement system is an irreducible structural feature not a solvable problem because the mechanism that creates the asymmetry is the same mechanism that makes enforcement necessary.md
undiscovered public knowledge exists as implicit connections across disconnected research domains and systematic graph traversal can surface hypotheses that no individual researcher has formulated.md
universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective.md
use-based-ai-governance-emerged-as-legislative-framework-but-lacks-bipartisan-support.md
use-based-ai-governance-emerged-as-legislative-framework-through-slotkin-ai-guardrails-act.md
user questions are an irreplaceable free energy signal for knowledge agents because they reveal functional uncertainty that model introspection cannot detect.md
vault artifacts constitute agent identity rather than merely augmenting it because agents with zero experiential continuity between sessions have strong connectedness through shared artifacts but zero psychological continuity.md
vault structure is a stronger determinant of agent behavior than prompt engineering because different knowledge graph architectures produce different reasoning patterns from identical model weights.md
verification being easier than generation may not hold for superhuman AI outputs because the verifier must understand the solution space which requires near-generator capability.md
verification is easier than generation for AI alignment at current capability levels but the asymmetry narrows as capability gaps grow creating a window of alignment opportunity that closes with scaling.md
verification-of-meaningful-human-control-is-technically-infeasible-because-ai-decision-opacity-and-adversarial-resistance-defeat-external-audit.md
verifier-level acceptance can diverge from benchmark acceptance even when locally correct because intermediate checking layers optimize for their own success criteria not the final evaluators.md
vocabulary is architecture because domain-native schema terms eliminate the per-interaction translation tax that causes knowledge system abandonment.md
voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md
voluntary-ai-safety-commitments-to-statutory-law-pathway-requires-bipartisan-support-which-slotkin-bill-lacks.md
voluntary-safety-constraints-without-enforcement-are-statements-of-intent-not-binding-governance.md
voluntary-safety-constraints-without-external-enforcement-are-statements-of-intent-not-binding-governance.md
weight-noise-injection-detects-sandbagging-through-anomalous-performance-patterns-under-perturbation.md
whether AI knowledge codification concentrates or distributes depends on infrastructure openness because the same extraction mechanism produces digital feudalism under proprietary control and collective intelligence under commons governance.md
white-box-evaluator-access-is-technically-feasible-via-privacy-enhancing-technologies-without-IP-disclosure.md
white-box-interpretability-fails-on-adversarially-trained-models-creating-anti-correlation-with-threat-model.md
wiki-linked markdown functions as a human-curated graph database that outperforms automated knowledge graphs below approximately 10000 notes because every edge passes human judgment while extracted edges carry up to 40 percent noise.md