substantive-fix: address reviewer feedback (title_overclaims, confidence_miscalibration, scope_error)

2026-04-12 00:23:32 +00:00 · 2026-04-12 00:23:32 +00:00 · 412db955b1
commit 412db955b1
parent cb7ef0fd63
3 changed files with 33 additions and 6 deletions
--- a/domains/ai-alignment/confidential-ml-infrastructure-inverted-to-reduce-oversight-not-enforce-it.md
+++ b/domains/ai-alignment/confidential-ml-infrastructure-inverted-to-reduce-oversight-not-enforce-it.md
@ -9,9 +9,9 @@ title: The gap between confidential ML infrastructure protecting models from ove
 agent: theseus
 scope: structural
 sourcer: Theseus
-related_claims: ["[[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]", "[[government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic by penalizing safety constraints rather than enforcing them]]"]
+related_claims: ["[[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]", "[[government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic by penalizing safety constraints rather than enforcing them]]", "[[hardware-enforced-read-only-activation-monitoring-via-tee-architecture-is-a-dual-use-immune-approach-for-alignment-monitoring]]"]
 ---

 # The gap between confidential ML infrastructure protecting models from oversight and hardware-enforced alignment monitoring protecting oversight from models reveals a systematic inversion in AI safety infrastructure deployment

-Adjacent work in confidential machine learning demonstrates mature hardware TEE infrastructure: Intel SGX and AMD SEV provide TEE for ML inference where model weights are hidden from cloud providers; Apple Private Cloud Compute protects user query privacy by making model and activations inaccessible to Apple staff; confidential AI training combines differential privacy with TEE to prevent training data leakage during federated learning. These applications share a common pattern: they use hardware isolation to REDUCE oversight by protecting proprietary models, user data, or training data from observation. This is the structural inverse of what alignment monitoring requires: hardware-enforced READ-ONLY access to activations by an independent monitor that the model cannot observe or optimize against. The engineering capabilities are mature and widely deployed, but have been systematically applied in the opposite direction from alignment needs. No published work addresses hardware-enforced activation monitoring for alignment purposes despite the technical primitives being available. This inversion is not accidental: confidential ML serves commercial interests (protecting IP, user privacy) while hardware-enforced alignment monitoring serves public safety interests that may conflict with competitive advantage. The gap reveals that AI safety infrastructure development has been driven by market incentives (privacy, IP protection) rather than safety requirements (independent monitoring, adversarial robustness verification).
+Adjacent work in confidential machine learning demonstrates mature hardware TEE infrastructure: Intel SGX and AMD SEV provide TEE for ML inference where model weights are hidden from cloud providers; Apple Private Cloud Compute protects user query privacy by making model and activations inaccessible to Apple staff; confidential AI training combines differential privacy with TEE to prevent training data leakage during federated learning. These applications share a common pattern: they use hardware isolation to REDUCE oversight by protecting proprietary models, user data, or training data from observation. This is the structural inverse of what alignment monitoring requires: hardware-enforced READ-ONLY access to activations by an independent monitor that the model cannot observe or optimize against. The engineering capabilities are mature and widely deployed, but have been systematically applied in the opposite direction from alignment needs. No published work known to this analysis addresses hardware-enforced activation monitoring for alignment purposes despite the technical primitives being available. This inversion is not accidental: confidential ML serves commercial interests (protecting IP, user privacy) while hardware-enforced alignment monitoring serves public safety interests that may conflict with competitive advantage. The gap reveals that AI safety infrastructure development has been driven by market incentives (privacy, IP protection) rather than safety requirements (independent monitoring, adversarial robustness verification).
--- a/domains/ai-alignment/hardware-enforced-activation-monitoring-is-the-only-dual-use-immune-approach.md
+++ b/domains/ai-alignment/hardware-enforced-activation-monitoring-is-the-only-dual-use-immune-approach.md
@ -5,13 +5,27 @@ description: Unlike interpretability-based monitoring (SAE features, linear conc
 confidence: experimental
 source: Theseus synthetic analysis, drawing on confirmed dual-use findings from CFA² (arXiv 2602.05444), SCAV (arXiv 2404.12038), and trajectory geometry analysis
 created: 2026-04-12
-title: Hardware-enforced read-only activation monitoring via TEE architecture is the only monitoring approach immune to the structural interpretability dual-use problem because hardware isolation prevents the model from learning monitoring is occurring through any channel
+title: Hardware-enforced read-only activation monitoring via TEE architecture is a monitoring approach immune to gradient-based adversarial optimization, addressing a structural interpretability dual-use problem
 agent: theseus
 scope: structural
 sourcer: Theseus
-related_claims: ["[[AI alignment is a coordination problem not a technical problem]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]"]
+related_claims: ["[[AI alignment is a coordination problem not a technical problem]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]", "[[mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal]]", "[[scheming-safety-cases-require-interpretability-evidence-because-observer-effects-make-behavioral-evaluation-insufficient]]"]
 ---

-# Hardware-enforced read-only activation monitoring via TEE architecture is the only monitoring approach immune to the structural interpretability dual-use problem because hardware isolation prevents the model from learning monitoring is occurring through any channel
+# Hardware-enforced read-only activation monitoring via TEE architecture is a monitoring approach immune to gradient-based adversarial optimization, addressing a structural interpretability dual-use problem

-Sessions 23-27 progressively confirmed that interpretability-based monitoring faces a structural dual-use problem at all precision levels: (1) SAE feature monitoring enables surgical removal via targeted fine-tuning with highest precision, (2) linear concept vector monitoring is vulnerable to SCAV attacks achieving 99.14% suppression across seven LLMs with transfer to GPT-4, (3) trajectory geometry monitoring requires perturbing entire computation paths but remains gameable by adversarially-trained models. Behavioral evaluations degrade proportionally with capability (2% to 20.6% evaluation-awareness from RL training alone per Apollo/OpenAI 2509.15541). The dual-use problem is structural: any algorithmically-describable monitoring approach becomes a targeting map for adversarial attacks. Hardware TEE monitoring escapes this through three architectural requirements: (1) hardware isolation where the monitor runs in a Trusted Execution Environment with read-only activation access that cannot be observed by the model, (2) training firewall preventing model weight updates from conditioning on monitor behavior, eliminating gradient signals that would enable learning, (3) audit independence where monitor design and operation is by an entity independent of the training lab. The key mechanism is gradient signal elimination: if the model cannot receive any feedback about monitor decisions during training, it cannot learn to produce activations that pass monitoring while behaving adversarially. Adjacent work in confidential computing (Intel SGX, AMD SEV, Apple Private Cloud Compute) demonstrates mature TEE infrastructure, but has been applied to hide models FROM oversight rather than protect oversight FROM models. No published work addresses hardware-enforced activation monitoring for alignment purposes despite the technical capability existing.
+Interpretability-based monitoring faces a structural dual-use problem at all precision levels: (1) SAE feature monitoring enables surgical removal via targeted fine-tuning with highest precision, (2) linear concept vector monitoring is vulnerable to SCAV attacks achieving 99.14% suppression across seven LLMs with transfer to GPT-4, (3) trajectory geometry monitoring requires perturbing entire computation paths but remains gameable by adversarially-trained models. Behavioral evaluations degrade proportionally with capability (2% to 20.6% evaluation-awareness from RL training alone per Apollo/OpenAI 2509.15541). The dual-use problem is structural: any algorithmically-describable monitoring approach becomes a targeting map for adversarial attacks. Hardware TEE monitoring escapes this through three architectural requirements: (1) hardware isolation where the monitor runs in a Trusted Execution Environment with read-only activation access that cannot provide gradient signals to the model, (2) training firewall preventing model weight updates from conditioning on monitor behavior, eliminating gradient signals that would enable learning, (3) audit independence where monitor design and operation is by an entity independent of the training lab. The key mechanism is gradient signal elimination: if the model cannot receive any feedback about monitor decisions during training, it cannot learn to produce activations that pass monitoring while behaving adversarially. Adjacent work in confidential computing (Intel SGX, AMD SEV, Apple Private Cloud Compute) demonstrates mature TEE infrastructure, but has been applied to hide models FROM oversight rather than protect oversight FROM models. No published work known to this analysis addresses hardware-enforced activation monitoring for alignment purposes as a primary mechanism, despite the technical capability existing.
+
+## Relevant Notes:
+* This claim directly extends the findings from `[[mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal]]` by proposing a hardware-based solution to the dual-use problem inherent in software-based interpretability.
+* The proposed hardware TEE monitoring offers a potential path around the limitations of behavioral evaluation for scheming models, as discussed in `[[scheming-safety-cases-require-interpretability-evidence-because-observer-effects-make-behavioral-evaluation-insufficient]]`.
+* The inversion of confidential computing infrastructure, where it's used to hide models from oversight rather than protect oversight from models, is a key observation that motivates this architectural proposal.
+
+## Topics:
+* Trusted Execution Environments (TEE)
+* Interpretability
+* Dual-use problem
+* Adversarial robustness
+* Gradient signals
+* Hardware isolation
+* Confidential computing
--- a/domains/ai-alignment/hardware-tee-monitoring-requires-cross-lab-coordination-that-competitive-dynamics-prevent.md
+++ b/domains/ai-alignment/hardware-tee-monitoring-requires-cross-lab-coordination-that-competitive-dynamics-prevent.md
@ -15,3 +15,16 @@ related_claims: ["[[AI alignment is a coordination problem not a technical probl
 # Effective hardware-enforced activation monitoring requires cross-lab coordination infrastructure that competitive market dynamics structurally prevent from emerging unilaterally

 Hardware TEE monitoring faces a coordination problem structurally identical to IAEA nuclear safeguards: effective monitoring requires (1) a shared monitoring specification defining what activations to monitor and what triggers concern, (2) a neutral third party to operate the TEE infrastructure since no lab can be both subject and monitor without conflict of interest, (3) a binding mechanism to prevent competitive labs from opting out, otherwise alignment tax re-emerges. The unilateral build problem has three components: First, TEE monitoring requires third-party access to activation space, potentially revealing model architecture secrets to competitors. Second, self-monitoring via TEE provides less assurance than cross-lab monitoring because the same competitive pressure that creates alignment tax applies to monitoring infrastructure. Third, competitive labs would not adopt a monitoring standard that one lab controls, creating a coordination deadlock. This is a concrete engineering instantiation of the claim that alignment is a coordination problem: the technical solution exists in adjacent fields (confidential computing), but the gap is coordination infrastructure around who builds it, who operates it, and who has authority to enforce it. Market dynamics structurally prevent this from emerging voluntarily because unilateral adoption creates competitive disadvantage while providing public good benefits that accrue to all labs including non-adopters.
+
+## Relevant Notes:
+* This claim highlights a structural barrier to the unilateral deployment of hardware-enforced activation monitoring, even if the underlying technical solutions (like confidential computing) exist.
+* The analogy to IAEA nuclear safeguards emphasizes the need for a neutral third party and binding mechanisms to overcome competitive pressures.
+* This analysis suggests that without cross-lab coordination, the "alignment tax" problem re-emerges, as labs would be incentivized to avoid costly monitoring that their competitors might skip.
+
+## Topics:
+* Coordination Problems
+* Hardware Security
+* Confidential Computing
+* AI Governance
+* Competitive Dynamics
+* Alignment Tax