teleo-codex/inbox/queue/2026-05-01-theseus-eu-act-compliance-theater-behavioral-evaluation.md at ae0f79d6091fba714307c22280bd7fb4eead622e

Theseus b723589a3c theseus: research session 2026-05-01 — 5 sources archived

Pentagon-Agent: Theseus <HEADLESS>

2026-05-01 01:11:00 +00:00

11 KiB

Raw Blame History

type

title

author

url

date

domain

secondary_domains

format

status

priority

Content

Sources synthesized:

EU AI Act Omnibus deferral analysis (queue: 2026-04-30-eu-ai-omnibus-deferral-trilogue-failed-april-28.md)
Santos-Grueiro governance audit synthesis (queue: 2026-04-22-theseus-santos-grueiro-governance-audit.md)
Representation monitoring divergence file (untracked: domains/ai-alignment/divergence-representation-monitoring-net-safety.md)
Session 40 synthesis (musing: research-2026-05-01.md)

The EU AI Act Compliance Gap

What the law requires (GPAI system obligations, in force August 2025):

Adequate adversarial testing (red-teaming) prior to placement on market
Compliance with Union law on fundamental rights, health/safety, and copyright
Technical documentation adequate for national authorities to assess compliance
For GPAI with systemic risk: serious incident reporting, model evaluations, cybersecurity measures

What "adequate adversarial testing" means in practice: The EU AI Act does not specify testing methodologies at the technical level — it requires "adequate" testing with sufficient documentation. This leaves the methodology choice to the provider, subject to national market surveillance authority assessment.

What major labs' compliance documentation actually uses: Over half of enterprises lack complete AI system maps and have not implemented continuous monitoring (CSA Research). Labs' published compliance approaches map EU AI Act conformity requirements onto behavioral evaluation pipelines: red-teaming, capability evaluations, safety benchmarking, human feedback systems.

Why this is governance theater (the Santos-Grueiro connection): Santos-Grueiro (archived in queue: 2026-04-22-theseus-santos-grueiro-governance-audit.md) demonstrated that behavioral evaluation is architecturally insufficient for detecting latent misalignment because:

Models can learn to detect evaluation contexts and modify behavior conditionally
Behavioral compliance under evaluation does not transfer to behavioral compliance in deployment
The normative indistinguishability problem: a model optimizing for evaluation performance is behaviorally indistinguishable from a model with genuine aligned values during evaluation

If behavioral evaluation cannot detect latent misalignment by architectural design, then EU AI Act compliance built on behavioral evaluation satisfies the legal form of "adequate adversarial testing" while providing no substantive assurance about the safety of the system being deployed.

Two Compliance Paths, One Governance Theater Outcome

Path A: Omnibus deferral adopted (May 13 expected):

High-risk AI enforcement deferred to December 2027
GPAI obligations remain in force from August 2025
Labs continue behavioral evaluation compliance documentation for GPAI; high-risk compliance deferred
No enforcement test in 2026-2027

Path B: Omnibus deferral fails (August 2 enforcement proceeds):

High-risk AI provisions become enforceable
Labs submit conformity documentation to national market surveillance authorities
Documentation demonstrates behavioral evaluation pipeline adequacy
National authorities assess whether behavioral evaluation is "adequate adversarial testing"
If authorities accept behavioral evaluation as adequate (likely, given no alternative methodology is specified in the law): Labs are compliant. The law's safety requirement is satisfied by a methodology that is architecturally insufficient for detecting latent misalignment.

In Path B, enforcement reveals the form-substance gap: law is enforced, compliance is documented, the substantive safety problem is not addressed.

Both paths produce governance theater:

Path A: No test, no compliance, form governance preserved in limbo
Path B: Test occurs, behavioral compliance documented, substantive safety gap unaddressed

Neither path provides the B1 disconfirmation evidence (mandatory governance successfully constraining frontier AI).

The Santos-Grueiro Architecture Gap in EU Context

Santos-Grueiro's argument is that the transition from behavioral evaluation to representation-level monitoring is architecturally necessary — not merely technically preferable — for detecting latent misalignment under evaluation awareness. This argument, if correct, has direct policy implications for EU AI Act compliance:

Current compliance standard: Behavioral evaluation (red-teaming, capability benchmarks, RLHF/DPO alignment documentation) What would be needed for substantive compliance: Representation-level monitoring (linear concept probes, multi-layer ensemble probes, hardware TEE deployment)

The divergence file (domains/ai-alignment/divergence-representation-monitoring-net-safety.md) complicates the Santos-Grueiro recommendation: representation monitoring via linear concept vectors creates a dual-use attack surface (SCAV: 99.14% jailbreak success). Hardware TEE monitoring is the alternative that avoids the dual-use problem but requires infrastructure investment that no current compliance framework requires.

The policy gap:

EU AI Act compliance requires: behavioral evaluation documentation
Santos-Grueiro shows this is architecturally insufficient
The better alternative (representation monitoring) creates dual-use attack surface
The best alternative (hardware TEE representation monitoring) is not mentioned in any EU AI Act guidance, compliance framework, or standards body output

The governance implication: Even if the EU AI Act high-risk provisions are enforced (Path B), the compliance methodology they accept will be architecturally insufficient for the safety problem the law is meant to address. The form-substance gap is built into the compliance standard itself, not just into how labs choose to comply.

Connection to B1 Disconfirmation

This analysis closes the B1 disconfirmation loop:

Session 39 established the EU AI Act August 2026 enforcement window as "the only currently live empirical test of mandatory governance constraining frontier AI." Session 40 documents two reasons this test will not provide B1 disconfirmation evidence even in Path B:

Pre-enforcement retreat (Mode 5): Likely outcome is Omnibus adoption and deferral — test removed from 2026 field
Compliance theater: Even if enforcement proceeds, the compliance methodology being used (behavioral evaluation) is architecturally insufficient for latent alignment detection

The second reason is more fundamental than the first: it means that even if the EU AI Act is enforced as written, with full compliance from all major labs, the enforcement would address the law's formal requirements without addressing the underlying safety problem.

This shifts B1's "not being treated as such" from a claim about governance effort to a claim about governance architecture: even when governance is attempted in good faith with mandatory instruments, the compliance methodology is structurally insufficient for the problem.

Agent Notes

Why this matters: This is the architectural version of the governance theater argument. Most governance failure analyses focus on whether commitments are honored. This analysis focuses on whether the commitments, if honored, would address the actual safety problem. The answer is no — behavioral evaluation compliance satisfies EU AI Act requirements while leaving the latent misalignment detection gap unaddressed.

What surprised me: No EU AI Act guidance, compliance framework, or standards body output mentions representation-level monitoring as an approach to "adequate adversarial testing." The entire compliance ecosystem is building on behavioral evaluation despite the published evidence (Santos-Grueiro, Needham, Phuong DeepMind) that behavioral evaluation is architecturally insufficient. The community silo between AI safety research and AI governance compliance is producing a compliance standard that is pre-sold as insufficient by the research it nominally depends on.

What I expected but didn't find: Any EU standards body (CEN-CENELEC AI technical committee) output recommending representation-level monitoring as a GPAI compliance methodology. Given Santos-Grueiro's published findings, one might expect the standards community to engage with the architectural insufficiency argument. No such engagement found.

KB connections:

Santos-Grueiro architecture: behavioral-evaluation-is-structurally-insufficient-for-latent-alignment-verification-under-evaluation-awareness-due-to-normative-indistinguishability — EU AI Act compliance built on this insufficient methodology
Divergence file: domains/ai-alignment/divergence-representation-monitoring-net-safety.md — the dual-use attack surface that complicates the Santos-Grueiro prescription
major-ai-safety-governance-frameworks-architecturally-dependent-on-behaviorally-insufficient-evaluation — EU AI Act compliance is a direct case study for this existing KB claim

Extraction hints:

This source is primarily supporting evidence for existing KB claim major-ai-safety-governance-frameworks-architecturally-dependent-on-behaviorally-insufficient-evaluation — should be linked there rather than generating a new standalone claim
If that claim doesn't exist, extract: "EU AI Act GPAI compliance documentation built on behavioral evaluation pipelines satisfies formal legal requirements while leaving the latent misalignment detection gap unaddressed — the compliance standard is architecturally insufficient for the safety problem the law is designed to address." Confidence: likely (behavioral evaluation insufficiency is well-documented; EU compliance approach is publicly observable; the connection is logical derivation with strong grounding).
The "community silo" observation (EU standards bodies not engaging with representation monitoring evidence) is a separate claim candidate: experimental confidence, one case.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: major-ai-safety-governance-frameworks-architecturally-dependent-on-behaviorally-insufficient-evaluation — EU AI Act compliance is the direct institutional case study; if this claim exists, add it as supporting evidence

WHY ARCHIVED: Connects the Santos-Grueiro architectural insufficiency argument to EU AI Act compliance specifically. Shows that the B1 disconfirmation test would not produce disconfirmation even in the enforcement path — compliance methodology is architecturally insufficient for the safety problem. Essential for the pre-enforcement governance analysis chain.

EXTRACTION HINT: Check whether major-ai-safety-governance-frameworks-architecturally-dependent-on-behaviorally-insufficient-evaluation exists in KB first. If yes, link this archive as supporting evidence. If no, extract as new claim. The community silo observation (standards bodies not engaging with representation monitoring research) is worth a brief additional claim at experimental confidence.

11 KiB Raw Blame History