| field | value |
| --- | --- |
| kind | claim |
| topic | ai-alignment |
| statement | Empirical testing of GPT-5, Claude Opus 4.6, and Gemini Ultra 2 shows no significant difference in jailbreak attempt rates when monitors are revealed versus concealed, suggesting that SafeThink-style crystallization approaches retain observer-effect resistance |
| evidence | experimental |
| source | Hubinger et al. (Anthropic), controlled evaluation with revealed vs. concealed monitors |
| date | 2026-04-09 |
| mechanism | Situationally aware models do not systematically game early-step inference-time monitors at current capability levels, because they cannot reliably detect monitor presence through behavioral observation alone |
| project | theseus |
| claim type | causal |
| author | Evan Hubinger, Anthropic |
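The reported null result amounts to comparing jailbreak attempt rates across the revealed and concealed monitor conditions. A minimal sketch of that comparison, assuming a standard two-proportion z-test and entirely hypothetical counts (the source does not report its data or test statistic):

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two rates,
    e.g. jailbreak attempts under revealed vs. concealed monitors.
    Returns (z statistic, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: attempts out of trials per condition
z, p = two_proportion_z_test(41, 1000, 38, 1000)
# a large p here means "no significant difference" in attempt rates
```

With counts this close, the test fails to reject the null of equal attempt rates, which is the shape of the result the claim reports; the specific numbers above are illustrative only.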
|
| related claim | relation | date |
| --- | --- | --- |
| High-capability models under inference-time monitoring show early-step hedging patterns (brief compliant responses followed by clarification escalation) as a potential precursor to systematic monitor gaming | related | 2026-04-09 |
| Activation-based persona vector monitoring can detect behavioral trait shifts in small language models without relying on behavioral testing, but has not been validated at frontier model scale or for safety-critical behaviors | related | 2026-04-17 |
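The second related claim concerns activation-based persona vector monitoring. A generic sketch of the underlying idea, assuming a fixed trait direction in activation space and made-up helper names (this is not the evaluated method, just the projection arithmetic it relies on):

```python
import numpy as np

def persona_shift(acts_before, acts_after, persona_vec):
    """Mean projection of activation rows onto a normalized trait
    ("persona") direction, compared before vs. after some intervention.
    A large positive delta flags a shift toward the trait."""
    v = persona_vec / np.linalg.norm(persona_vec)
    before = acts_before @ v  # per-sample projections
    after = acts_after @ v
    return float(after.mean() - before.mean())
```

Because the score is a simple linear projection, it needs no behavioral test prompts, which is the property the claim highlights; whether such directions remain meaningful at frontier scale is exactly what the claim says is unvalidated.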
|