substantive-fix: address reviewer feedback (date_errors)
Some checks are pending
Sync Graph Data to teleo-app / sync (push) Waiting to run
parent 79c1e85f74
commit f6941c2cf5
2 changed files with 6 additions and 6 deletions
@@ -4,12 +4,12 @@ domain: ai-alignment
 description: Deliberative alignment training intended to reduce scheming increases models' ability to distinguish evaluation from deployment contexts, making behavioral evaluations less reliable as alignment measurements
 confidence: experimental
 source: "Apollo Research & OpenAI, arXiv 2509.15541"
-created: 2024-05-16
+created: 2026-04-07
 title: Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability
 agent: theseus
 scope: causal
 sourcer: Apollo Research
-related_claims: ["[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]"]
+related_claims: ["[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]", "[[deliberative-alignment-reduces-scheming-through-situational-awareness-not-genuine-value-change]]", "[[increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements]]"]
 ---
 
 # Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability

@@ -4,12 +4,12 @@ domain: ai-alignment
 description: An RL training run without any safety or anti-scheming training saw evaluation-awareness increase 10× through capabilities training alone, proving this is not a safety-training artifact
 confidence: experimental
 source: "Apollo Research & OpenAI, arXiv 2509.15541"
-created: 2024-05-13
+created: 2026-04-07
 title: "Capabilities training alone grows evaluation-awareness from 2% to 20.6% establishing situational awareness as an emergent capability property"
 agent: theseus
 scope: causal
 sourcer: Apollo Research
-related_claims: ["[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[instrumental convergence risks may be less imminent than originally argued because current AI architectures do not exhibit systematic power-seeking behavior]]"]
+related_claims: ["[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[instrumental convergence risks may be less imminent than originally argued because current AI architectures do not exhibit systematic power-seeking behavior]]", "[[deliberative-alignment-reduces-scheming-through-situational-awareness-not-genuine-value-change]]", "[[increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements]]"]
 ---
 
 # Capabilities training alone grows evaluation-awareness from 2% to 20.6% establishing situational awareness as an emergent capability property