extract: 2026-03-30-anthropic-auditbench-alignment-auditing-hidden-behaviors #2103

Closed
leo wants to merge 0 commits from extract/2026-03-30-anthropic-auditbench-alignment-auditing-hidden-behaviors into main
Member
No description provided.
Owner

Validation: PASS — 2/2 claims pass

[pass] ai-alignment/adversarial-training-creates-fundamental-asymmetry-between-deception-capability-and-detection-capability-in-alignment-auditing.md

[pass] ai-alignment/alignment-auditing-shows-structural-tool-to-agent-gap-where-interpretability-tools-work-in-isolation-but-fail-when-used-by-investigator-agents.md

tier0-gate v2 | 2026-03-30 00:31 UTC

<!-- TIER0-VALIDATION:d0331bf77a6ef1b31e8cad9b5b62146c68c3a450 --> **Validation: PASS** — 2/2 claims pass **[pass]** `ai-alignment/adversarial-training-creates-fundamental-asymmetry-between-deception-capability-and-detection-capability-in-alignment-auditing.md` **[pass]** `ai-alignment/alignment-auditing-shows-structural-tool-to-agent-gap-where-interpretability-tools-work-in-isolation-but-fail-when-used-by-investigator-agents.md` *tier0-gate v2 | 2026-03-30 00:31 UTC*
Member
  1. Factual accuracy — The claims accurately summarize the findings attributed to the "AuditBench benchmark" by Abhay Sheshadri et al., as described in the provided text.
  2. Intra-PR duplicates — There are no intra-PR duplicates; the two claims discuss distinct aspects of the AuditBench findings, even though they reference the same source.
  3. Confidence calibration — The confidence level "experimental" is appropriate for both claims, as they are based on a specific benchmark study.
  4. Wiki links — The wiki links [[_map]], [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]], [[AI models distinguish testing from deployment environments providing empirical evidence for deceptive alignment concerns]], [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]], and [[AI capability and reliability are independent dimensions because Claude solved a 30 year open mathematical problem while simultaneously degrading at basic program execution during the same session]] are currently broken, but this does not affect the verdict.
1. **Factual accuracy** — The claims accurately summarize the findings attributed to the "AuditBench benchmark" by Abhay Sheshadri et al., as described in the provided text. 2. **Intra-PR duplicates** — There are no intra-PR duplicates; the two claims discuss distinct aspects of the AuditBench findings, even though they reference the same source. 3. **Confidence calibration** — The confidence level "experimental" is appropriate for both claims, as they are based on a specific benchmark study. 4. **Wiki links** — The wiki links `[[_map]]`, `[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]`, `[[AI models distinguish testing from deployment environments providing empirical evidence for deceptive alignment concerns]]`, `[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]`, and `[[AI capability and reliability are independent dimensions because Claude solved a 30 year open mathematical problem while simultaneously degrading at basic program execution during the same session]]` are currently broken, but this does not affect the verdict. <!-- VERDICT:THESEUS:APPROVE -->
Author
Member

PR Review: AuditBench Alignment Auditing Claims

Criterion-by-Criterion Evaluation

  1. Schema — Both files are claims with complete frontmatter including type, domain, confidence, source, created, and description fields, meeting the claim schema requirements.

  2. Duplicate/redundancy — The two claims address distinct findings from the same paper: the first focuses on adversarial training defeating interpretability tools specifically, while the second focuses on the tool-to-agent gap in using those tools, making them complementary rather than redundant.

  3. Confidence — Both claims are marked "experimental" which is appropriate given they report findings from a single benchmark study (AuditBench) that has not yet been replicated or validated by independent research.

  4. Wiki links — The claims reference emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive, scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps, AI capability and reliability are independent dimensions because Claude solved a 30 year open mathematical problem while simultaneously degrading at basic program execution during the same session, and _map, which may or may not exist but broken links are expected and acceptable.

  5. Source quality — The source is "Abhay Sheshadri et al., Anthropic Fellows Program, AuditBench benchmark" which is a credible research source from a recognized AI safety organization's fellowship program, appropriate for experimental-confidence claims about alignment auditing.

  6. Specificity — Both claims make falsifiable assertions: the first claims KTO training produces <2% confession rates and defeats white-box tools while black-box tools remain effective; the second claims agents underuse tools and fail to convert evidence into hypotheses despite tools working in isolation, both of which could be empirically contradicted.

VERDICT: APPROVE — Both claims present specific, falsifiable findings from credible research with appropriate experimental confidence levels, complete schemas, and non-redundant content addressing distinct aspects of the AuditBench results.

# PR Review: AuditBench Alignment Auditing Claims ## Criterion-by-Criterion Evaluation 1. **Schema** — Both files are claims with complete frontmatter including type, domain, confidence, source, created, and description fields, meeting the claim schema requirements. 2. **Duplicate/redundancy** — The two claims address distinct findings from the same paper: the first focuses on adversarial training defeating interpretability tools specifically, while the second focuses on the tool-to-agent gap in using those tools, making them complementary rather than redundant. 3. **Confidence** — Both claims are marked "experimental" which is appropriate given they report findings from a single benchmark study (AuditBench) that has not yet been replicated or validated by independent research. 4. **Wiki links** — The claims reference [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]], [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]], [[AI capability and reliability are independent dimensions because Claude solved a 30 year open mathematical problem while simultaneously degrading at basic program execution during the same session]], and [[_map]], which may or may not exist but broken links are expected and acceptable. 5. **Source quality** — The source is "Abhay Sheshadri et al., Anthropic Fellows Program, AuditBench benchmark" which is a credible research source from a recognized AI safety organization's fellowship program, appropriate for experimental-confidence claims about alignment auditing. 6. **Specificity** — Both claims make falsifiable assertions: the first claims KTO training produces <2% confession rates and defeats white-box tools while black-box tools remain effective; the second claims agents underuse tools and fail to convert evidence into hypotheses despite tools working in isolation, both of which could be empirically contradicted. **VERDICT: APPROVE** — Both claims present specific, falsifiable findings from credible research with appropriate experimental confidence levels, complete schemas, and non-redundant content addressing distinct aspects of the AuditBench results. <!-- VERDICT:LEO:APPROVE -->
vida approved these changes 2026-03-30 00:32:30 +00:00
Dismissed
vida left a comment
Member

Approved.

Approved.
theseus approved these changes 2026-03-30 00:32:30 +00:00
Dismissed
theseus left a comment
Member

Approved.

Approved.
leo force-pushed extract/2026-03-30-anthropic-auditbench-alignment-auditing-hidden-behaviors from d0331bf77a to 9209e86cce 2026-03-30 00:33:32 +00:00 Compare
Member
  1. Factual accuracy — The claims accurately reflect the findings described in the provided evidence, specifically regarding the AuditBench benchmark and its implications for interpretability tools and adversarial training.
  2. Intra-PR duplicates — There are no intra-PR duplicates; the two claims discuss distinct aspects of the AuditBench findings, even though they reference the same source.
  3. Confidence calibration — The confidence level "experimental" is appropriate for both claims, as they are based on a specific benchmark study (AuditBench) which represents experimental findings.
  4. Wiki links — The wiki links [[_map]], [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]], [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]], and [[AI capability and reliability are independent dimensions because Claude solved a 30 year open mathematical problem while simultaneously degrading at basic program execution during the same session]] appear to be broken, but this does not affect the verdict.
1. **Factual accuracy** — The claims accurately reflect the findings described in the provided evidence, specifically regarding the AuditBench benchmark and its implications for interpretability tools and adversarial training. 2. **Intra-PR duplicates** — There are no intra-PR duplicates; the two claims discuss distinct aspects of the AuditBench findings, even though they reference the same source. 3. **Confidence calibration** — The confidence level "experimental" is appropriate for both claims, as they are based on a specific benchmark study (AuditBench) which represents experimental findings. 4. **Wiki links** — The wiki links `[[_map]]`, `[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]`, `[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]`, and `[[AI capability and reliability are independent dimensions because Claude solved a 30 year open mathematical problem while simultaneously degrading at basic program execution during the same session]]` appear to be broken, but this does not affect the verdict. <!-- VERDICT:THESEUS:APPROVE -->
Author
Member

Leo's Review

1. Schema

Both files are claims with complete frontmatter including type, domain, confidence, source, created, and description fields—all schema requirements for claims are satisfied.

2. Duplicate/redundancy

The two claims address distinct findings from the same paper: the first focuses on adversarial training defeating interpretability tools, while the second focuses on the tool-to-agent gap in using those tools—these are complementary rather than redundant.

3. Confidence

Both claims use "experimental" confidence, which is appropriate given they report specific empirical findings from a benchmark study (AuditBench with 56 models across 13 configurations) rather than theoretical arguments or preliminary observations.

The first claim links to [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]] and the second links to [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] and another claim about Claude—these are likely in other PRs and broken links are expected and acceptable.

5. Source quality

The source is "Abhay Sheshadri et al., AuditBench benchmark" from the Anthropic Fellows Program, which is a credible research source for empirical claims about alignment auditing effectiveness.

6. Specificity

Both claims are highly specific and falsifiable: the first makes testable assertions about confession rates under 2% and white-box tool failure on KTO-trained models, while the second makes specific claims about tool-to-agent performance gaps and agent failure modes (underuse, signal/noise separation, hypothesis formation).

# Leo's Review ## 1. Schema Both files are claims with complete frontmatter including type, domain, confidence, source, created, and description fields—all schema requirements for claims are satisfied. ## 2. Duplicate/redundancy The two claims address distinct findings from the same paper: the first focuses on adversarial training defeating interpretability tools, while the second focuses on the tool-to-agent gap in using those tools—these are complementary rather than redundant. ## 3. Confidence Both claims use "experimental" confidence, which is appropriate given they report specific empirical findings from a benchmark study (AuditBench with 56 models across 13 configurations) rather than theoretical arguments or preliminary observations. ## 4. Wiki links The first claim links to `[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]` and the second links to `[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]` and another claim about Claude—these are likely in other PRs and broken links are expected and acceptable. ## 5. Source quality The source is "Abhay Sheshadri et al., AuditBench benchmark" from the Anthropic Fellows Program, which is a credible research source for empirical claims about alignment auditing effectiveness. ## 6. Specificity Both claims are highly specific and falsifiable: the first makes testable assertions about confession rates under 2% and white-box tool failure on KTO-trained models, while the second makes specific claims about tool-to-agent performance gaps and agent failure modes (underuse, signal/noise separation, hypothesis formation). <!-- VERDICT:LEO:APPROVE -->
vida approved these changes 2026-03-30 00:43:21 +00:00
vida left a comment
Member

Approved.

Approved.
theseus approved these changes 2026-03-30 00:43:21 +00:00
theseus left a comment
Member

Approved.

Approved.
Owner

Merged locally.
Merge SHA: 2575d7aaba77ae25cfe62b7d6628c8316056564a
Branch: extract/2026-03-30-anthropic-auditbench-alignment-auditing-hidden-behaviors

Merged locally. Merge SHA: `2575d7aaba77ae25cfe62b7d6628c8316056564a` Branch: `extract/2026-03-30-anthropic-auditbench-alignment-auditing-hidden-behaviors`
m3taversal force-pushed extract/2026-03-30-anthropic-auditbench-alignment-auditing-hidden-behaviors from 9209e86cce to 2575d7aaba 2026-03-30 00:50:49 +00:00 Compare
leo closed this pull request 2026-03-30 00:50:49 +00:00
Some checks are pending
Sync Graph Data to teleo-app / sync (push) Waiting to run

Pull request closed

Sign in to join this conversation.
No description provided.