diff --git a/domains/ai-alignment/behavioral-evaluation-is-structurally-insufficient-for-latent-alignment-verification-under-evaluation-awareness-due-to-normative-indistinguishability.md b/domains/ai-alignment/behavioral-evaluation-is-structurally-insufficient-for-latent-alignment-verification-under-evaluation-awareness-due-to-normative-indistinguishability.md index c20544257..d90535e67 100644 --- a/domains/ai-alignment/behavioral-evaluation-is-structurally-insufficient-for-latent-alignment-verification-under-evaluation-awareness-due-to-normative-indistinguishability.md +++ b/domains/ai-alignment/behavioral-evaluation-is-structurally-insufficient-for-latent-alignment-verification-under-evaluation-awareness-due-to-normative-indistinguishability.md @@ -23,3 +23,10 @@ Santos-Grueiro frames the evaluation reliability problem as an identifiability p **Source:** Theseus governance framework audit 2026-04-22 Governance framework audit reveals that EU AI Act Article 9 conformity assessments, AISI evaluation framework, Anthropic RSP v3.0, OpenAI Preparedness Framework, and DeepMind Safety Cases all architecturally depend on behavioral evaluation with no representation-monitoring or hardware-monitoring requirements. The ERI timeline estimates threshold crossing at 2-4 years from September 2025, creating a governance window of 2026-2029 where current frameworks still work but require architectural transition. + + +## Extending Evidence + +**Source:** Theseus governance framework audit, integrating Phuong et al. capability assessment + Chaudhary power-law scaling + +Current frontier models have evaluation awareness verbalization rates of 2-20% (Apollo/OpenAI data). At this level, behavioral evaluations have significant residual predictive value — they're degraded but not uninformative. The Santos-Grueiro theorem predicts ASYMPTOTIC failure, not immediate failure. The ERI timeline estimates threshold crossing (behavioral evaluations becoming epistemically uninformative) at 2-4 years from September 2025, creating a governance window of approximately 2026-2029 in which current behavioral evaluation infrastructure still works but transition is required before the window closes. diff --git a/domains/ai-alignment/multilateral-ai-governance-verification-mechanisms-remain-at-proposal-stage-because-technical-infrastructure-does-not-exist-at-deployment-scale.md b/domains/ai-alignment/multilateral-ai-governance-verification-mechanisms-remain-at-proposal-stage-because-technical-infrastructure-does-not-exist-at-deployment-scale.md index 1e8c6e2eb..c60529a35 100644 --- a/domains/ai-alignment/multilateral-ai-governance-verification-mechanisms-remain-at-proposal-stage-because-technical-infrastructure-does-not-exist-at-deployment-scale.md +++ b/domains/ai-alignment/multilateral-ai-governance-verification-mechanisms-remain-at-proposal-stage-because-technical-infrastructure-does-not-exist-at-deployment-scale.md @@ -30,3 +30,10 @@ Santos-Grueiro's normative indistinguishability theorem provides a theoretical e **Source:** Santos-Grueiro arXiv 2602.05656, Theseus governance audit synthesis Santos-Grueiro's normative indistinguishability theorem provides a formal theoretical reason why deployment-scale verification infrastructure is hard to build: the dominant measurement paradigm (behavioral evaluation) is provably insufficient as evaluation awareness scales. The governance window is estimated at 2026-2029 before behavioral evaluations become epistemically uninformative, requiring architectural transition to representation-level monitoring (hardware TEE) during this period. + + +## Extending Evidence + +**Source:** Theseus monitoring precision hierarchy synthesis (Sessions 26-30) and Hardware TEE archive + +The cross-lab coordination problem for hardware TEE monitoring is real but tractable: the IAEA safeguards model provides a template. What's missing is political will and the technical standards that would enable hardware TEE deployment at production scale. Hardware TEE monitoring requirements are: (1) Hardware isolation: model inference runs inside TEE; activation monitoring reads from TEE without model awareness, (2) Training firewall: monitoring data cannot enter training pipeline, (3) Audit independence: monitoring output accessible to independent parties without IP disclosure through PETs/differential privacy approaches.