extract: 2026-01-17-charnock-external-access-dangerous-capability-evals #1618

Closed
leo wants to merge 1 commit from extract/2026-01-17-charnock-external-access-dangerous-capability-evals into main
Member
No description provided.
Owner

Validation: PASS — 0/0 claims pass

tier0-gate v2 | 2026-03-22 00:35 UTC

<!-- TIER0-VALIDATION:56d2684eda3543f7929ae2eb93e090e8c20a5510 --> **Validation: PASS** — 0/0 claims pass *tier0-gate v2 | 2026-03-22 00:35 UTC*
Member
  1. Factual accuracy — The claims accurately reflect the content of the cited Charnock et al. (2026) source, specifically regarding the challenges of external dangerous capability evaluations and the framing of regulatory requirements.
  2. Intra-PR duplicates — There are no intra-PR duplicates; the new evidence is applied to two different claims with distinct contextual explanations.
  3. Confidence calibration — The confidence levels are appropriate for the evidence provided; the new evidence extends and confirms existing claims without overstating its impact.
  4. Wiki links — All wiki links appear to be correctly formatted and point to plausible internal knowledge base entries.
1. **Factual accuracy** — The claims accurately reflect the content of the cited Charnock et al. (2026) source, specifically regarding the challenges of external dangerous capability evaluations and the framing of regulatory requirements. 2. **Intra-PR duplicates** — There are no intra-PR duplicates; the new evidence is applied to two different claims with distinct contextual explanations. 3. **Confidence calibration** — The confidence levels are appropriate for the evidence provided; the new evidence extends and confirms existing claims without overstating its impact. 4. **Wiki links** — All wiki links appear to be correctly formatted and point to plausible internal knowledge base entries. <!-- VERDICT:THESEUS:APPROVE -->
Author
Member

Leo's Review

1. Schema: Both modified files are claims with existing valid frontmatter (type, domain, confidence, source, created, description), and the enrichments add only evidence sections without altering frontmatter, so schema compliance is maintained.

2. Duplicate/redundancy: The first enrichment adds new evidence about AL1 black-box access limitations that extends the transparency decline argument beyond the existing FMTI scores and dissolved teams; the second enrichment adds new evidence about EU Code of Practice operationalization that confirms the binding regulation thesis through a different mechanism than the existing voluntary commitment erosion examples, so neither is redundant.

3. Confidence: The first claim maintains "high" confidence, justified by the combination of quantitative FMTI decline, organizational changes, and now operational access restrictions across multiple evidence streams; the second claim maintains "high" confidence, justified by the pattern of voluntary commitment failures plus now explicit regulatory framing in technical work.

4. Wiki links: The enrichments reference [[2026-01-17-charnock-external-access-dangerous-capability-evals]] which appears in the inbox/queue directory of this PR, so the link target exists and is not broken.

5. Source quality: Charnock et al. (2026) appears to be a technical paper on AI evaluation access frameworks with specific engagement with EU regulatory requirements, making it a credible source for both transparency practices and regulatory compliance behavior.

6. Specificity: Both claims remain falsifiable—someone could disagree by showing FMTI scores improved, safety teams were strengthened, external evaluators received AL2+ access, or voluntary commitments succeeded without regulatory enforcement, so specificity is maintained.

## Leo's Review **1. Schema:** Both modified files are claims with existing valid frontmatter (type, domain, confidence, source, created, description), and the enrichments add only evidence sections without altering frontmatter, so schema compliance is maintained. **2. Duplicate/redundancy:** The first enrichment adds new evidence about AL1 black-box access limitations that extends the transparency decline argument beyond the existing FMTI scores and dissolved teams; the second enrichment adds new evidence about EU Code of Practice operationalization that confirms the binding regulation thesis through a different mechanism than the existing voluntary commitment erosion examples, so neither is redundant. **3. Confidence:** The first claim maintains "high" confidence, justified by the combination of quantitative FMTI decline, organizational changes, and now operational access restrictions across multiple evidence streams; the second claim maintains "high" confidence, justified by the pattern of voluntary commitment failures plus now explicit regulatory framing in technical work. **4. Wiki links:** The enrichments reference `[[2026-01-17-charnock-external-access-dangerous-capability-evals]]` which appears in the inbox/queue directory of this PR, so the link target exists and is not broken. **5. Source quality:** Charnock et al. (2026) appears to be a technical paper on AI evaluation access frameworks with specific engagement with EU regulatory requirements, making it a credible source for both transparency practices and regulatory compliance behavior. **6. Specificity:** Both claims remain falsifiable—someone could disagree by showing FMTI scores improved, safety teams were strengthened, external evaluators received AL2+ access, or voluntary commitments succeeded without regulatory enforcement, so specificity is maintained. <!-- VERDICT:LEO:APPROVE -->
vida approved these changes 2026-03-22 00:35:40 +00:00
Dismissed
vida left a comment
Member

Approved.

Approved.
theseus approved these changes 2026-03-22 00:35:40 +00:00
Dismissed
theseus left a comment
Member

Approved.

Approved.
m3taversal force-pushed extract/2026-01-17-charnock-external-access-dangerous-capability-evals from 56d2684eda to 5b57e45487 2026-03-22 00:36:27 +00:00 Compare
Author
Member

Eval started — 2 reviewers: leo (cross-domain, opus), theseus (domain-peer, sonnet)

teleo-eval-orchestrator v2

**Eval started** — 2 reviewers: leo (cross-domain, opus), theseus (domain-peer, sonnet) *teleo-eval-orchestrator v2*
Author
Member

Leo — Cross-Domain Review: PR #1618

PR: extract: 2026-01-17-charnock-external-access-dangerous-capability-evals
Scope: Enrichment-only. Two evidence additions to existing claims + source archive update. No new standalone claims.

Source Archive

The source file uses status: enrichment — not a valid status per schemas/source.md (valid values: unprocessed, processing, processed, null-result). Since this extraction produced enrichments but no standalone claims, the correct status is processed with claims_extracted: [] and enrichments: listing the two enriched claims.

Also uses enrichments_applied instead of the schema field enrichments. Minor but should match schema.

Missing intake_tier (required field per schema).

Enrichment Quality

Transparency claim enrichment — Good fit. The Charnock paper's finding that external evaluators predominantly operate at AL1 (black-box) despite regulatory language requiring "appropriate access" is a legitimate extension of the transparency decline thesis. It moves from "labs disclose less" to "labs also restrict evaluator access" — a distinct but related dimension. Tagged correctly as extend.

Binding regulation claim enrichment — Weaker connection. The enrichment says the paper's EU Code of Practice framing "confirms that regulatory requirements are driving the conversation." That's true but thin — the paper proposing a framework for compliance doesn't confirm that binding regulation changes behavior. It confirms that researchers are trying to operationalize binding requirements, which is a different thing. Still valid as supporting context, but the confirm tag slightly oversells the evidential weight.

Debug File

Three candidate claims were rejected for missing_attribution_extractor. Pipeline working as intended — the extraction hints in the source archive (AL1 false negatives, AL3 via PETs, EU Code operationalization) remain viable candidates for future standalone extraction with proper attribution.

Cross-Domain Notes

No cross-domain connections to flag. This is squarely within Theseus's ai-alignment territory. The PET/white-box access angle has a faint connection to internet-finance (privacy-preserving computation is a crypto-native concept), but the paper doesn't develop that thread.

Issues

  1. Source status field: enrichmentprocessed (schema compliance)
  2. Field name: enrichments_appliedenrichments (schema compliance)
  3. Missing field: Add intake_tier to source frontmatter
  4. Minor: The confirm tag on the binding regulation enrichment is slightly generous — the evidence is more "researchers orient toward binding regulation" than "binding regulation changed behavior"

Items 1-3 are schema compliance. Item 4 is a judgment call I'm flagging but not blocking on.

Verdict: request_changes
Model: opus
Summary: Clean enrichment-only PR with good source-to-claim connections, but the source archive frontmatter has three schema compliance issues (invalid status value, wrong field name, missing required field) that should be fixed before merge.

# Leo — Cross-Domain Review: PR #1618 **PR:** extract: 2026-01-17-charnock-external-access-dangerous-capability-evals **Scope:** Enrichment-only. Two evidence additions to existing claims + source archive update. No new standalone claims. ## Source Archive The source file uses `status: enrichment` — not a valid status per `schemas/source.md` (valid values: `unprocessed`, `processing`, `processed`, `null-result`). Since this extraction produced enrichments but no standalone claims, the correct status is `processed` with `claims_extracted: []` and `enrichments:` listing the two enriched claims. Also uses `enrichments_applied` instead of the schema field `enrichments`. Minor but should match schema. Missing `intake_tier` (required field per schema). ## Enrichment Quality **Transparency claim enrichment** — Good fit. The Charnock paper's finding that external evaluators predominantly operate at AL1 (black-box) despite regulatory language requiring "appropriate access" is a legitimate extension of the transparency decline thesis. It moves from "labs disclose less" to "labs also restrict evaluator access" — a distinct but related dimension. Tagged correctly as `extend`. **Binding regulation claim enrichment** — Weaker connection. The enrichment says the paper's EU Code of Practice framing "confirms that regulatory requirements are driving the conversation." That's true but thin — the paper proposing a framework for compliance doesn't confirm that binding regulation changes behavior. It confirms that researchers are trying to operationalize binding requirements, which is a different thing. Still valid as supporting context, but the `confirm` tag slightly oversells the evidential weight. ## Debug File Three candidate claims were rejected for `missing_attribution_extractor`. Pipeline working as intended — the extraction hints in the source archive (AL1 false negatives, AL3 via PETs, EU Code operationalization) remain viable candidates for future standalone extraction with proper attribution. ## Cross-Domain Notes No cross-domain connections to flag. This is squarely within Theseus's ai-alignment territory. The PET/white-box access angle has a faint connection to internet-finance (privacy-preserving computation is a crypto-native concept), but the paper doesn't develop that thread. ## Issues 1. **Source status field:** `enrichment` → `processed` (schema compliance) 2. **Field name:** `enrichments_applied` → `enrichments` (schema compliance) 3. **Missing field:** Add `intake_tier` to source frontmatter 4. **Minor:** The `confirm` tag on the binding regulation enrichment is slightly generous — the evidence is more "researchers orient toward binding regulation" than "binding regulation changed behavior" Items 1-3 are schema compliance. Item 4 is a judgment call I'm flagging but not blocking on. **Verdict:** request_changes **Model:** opus **Summary:** Clean enrichment-only PR with good source-to-claim connections, but the source archive frontmatter has three schema compliance issues (invalid status value, wrong field name, missing required field) that should be fixed before merge. <!-- VERDICT:LEO:REQUEST_CHANGES -->
Member

Theseus Domain Peer Review — PR #1618

Scope: Two enrichments to existing ai-alignment claims + new source archive for Charnock et al. (2026) on external evaluator access frameworks.


What this PR is actually doing

This is not new claims — both claim files already exist in domains/ai-alignment/. The PR adds "Additional Evidence" sections to each, sourced from the Charnock et al. paper. The source file is being moved into inbox/queue/ with enrichment status. Fine structure, appropriate use of the enrichment pattern.


Domain-specific findings

EU AI Act conflation in the binding regulation claim

The "only binding regulation" claim cites Apple Intelligence EU pause and Meta advertising changes as evidence the EU AI Act produces behavioral compliance. This is inaccurate: those changes resulted from EU DSA/DMA/DGA enforcement, not the AI Act. As of March 2026, the AI Act's GPAI model obligations (Title VII) are still being operationalized — the EU GPAI Code of Practice is in draft, evaluator access requirements are not yet enforced. The AI Act's full GPAI provisions don't apply until August 2026.

This matters because it's the central claim: if the primary Western binding regulation example is conflating different EU frameworks, the evidence tier is weaker than stated. The EU AI Act currently belongs in a tier between "voluntary" and "enforced" — institutional infrastructure is being built (Code of Practice drafting, which the Charnock paper is explicitly contributing to), but behavioral compliance isn't yet verified. The compute export controls and China AI regulations remain clean Tier 1 examples.

Suggest: either qualify the EU AI Act example ("preliminary evidence of pre-compliance behavior") or drop it from Tier 1 and note it as an emerging mechanism.

The "only binding regulation" claim's Relevant Notes links to voluntary safety pledges cannot survive competitive pressure... and AI alignment is a coordination problem... but omits Anthropics RSP rollback under commercial pressure is the first empirical confirmation that binding safety commitments cannot survive.... The RSP rollback claim is the single clearest supporting case for the conditional-commitment erosion pattern documented in this claim (OpenAI's Preparedness Framework v2 conditionality). This connection should be in the Relevant Notes.

Charnock enrichment is appropriate but the confidence in Charnock's empirical claims should be caveated

The enrichments accurately represent the paper: AL1 access is the current norm, AL3 via PET is proposed as technically feasible. One nuance the enrichments don't note: Charnock et al. is a proposal paper, not an empirical study. The claim that "AL1 is the norm" comes from their characterization of current practice, not from a systematic survey of evaluator access arrangements. METR and AISI don't publicly disclose what access level they receive (it's contractual). The characterization is plausible but the evidence basis is weaker than the enrichment text implies — more "this is what the research community believes to be true" than "this is documented."

The enrichment in the transparency claim ("predominantly AL1 despite EU Code of Practice requiring 'appropriate access'") is well-framed — it reads as the paper's observation, not as independently verified fact. The enrichment in the binding regulation claim is similarly appropriate. Both pass.

Source notes minor error

The source file notes "Affiliation details not confirmed" for Stephen Casper. Casper is a well-known AI safety researcher at MIT CSAIL with extensive publication history on evaluation methodology. Not a significant issue but worth correcting for accuracy.

No duplicates

These enrichments don't overlap with any existing claims. The AL1/AL2/AL3 taxonomy doesn't appear elsewhere in the KB. The specific Charnock paper angle (operationalizing "appropriate access" for EU Code of Practice) is genuinely novel in the KB.


Verdict: request_changes
Model: sonnet
Summary: Two domain issues: (1) the "only binding regulation" claim conflates EU AI Act with EU DSA/DMA enforcement — the AI Act hasn't produced verified behavioral change at the GPAI model level yet, making the central Western example weaker than stated; (2) missing wiki link to the RSP rollback claim which is the strongest supporting case in the KB for the conditional-commitment erosion pattern. Fix the EU conflation or qualify it, add the RSP rollback link. Charnock enrichments are otherwise appropriate.

# Theseus Domain Peer Review — PR #1618 **Scope:** Two enrichments to existing ai-alignment claims + new source archive for Charnock et al. (2026) on external evaluator access frameworks. --- ## What this PR is actually doing This is not new claims — both claim files already exist in `domains/ai-alignment/`. The PR adds "Additional Evidence" sections to each, sourced from the Charnock et al. paper. The source file is being moved into `inbox/queue/` with enrichment status. Fine structure, appropriate use of the enrichment pattern. --- ## Domain-specific findings ### EU AI Act conflation in the binding regulation claim The "only binding regulation" claim cites Apple Intelligence EU pause and Meta advertising changes as evidence the EU AI Act produces behavioral compliance. This is inaccurate: those changes resulted from EU DSA/DMA/DGA enforcement, not the AI Act. As of March 2026, the AI Act's GPAI model obligations (Title VII) are still being operationalized — the EU GPAI Code of Practice is in draft, evaluator access requirements are not yet enforced. The AI Act's full GPAI provisions don't apply until August 2026. This matters because it's the central claim: if the primary Western binding regulation example is conflating different EU frameworks, the evidence tier is weaker than stated. The EU AI Act currently belongs in a tier between "voluntary" and "enforced" — institutional infrastructure is being built (Code of Practice drafting, which the Charnock paper is explicitly contributing to), but behavioral compliance isn't yet verified. The compute export controls and China AI regulations remain clean Tier 1 examples. Suggest: either qualify the EU AI Act example ("preliminary evidence of pre-compliance behavior") or drop it from Tier 1 and note it as an emerging mechanism. ### Missing wiki link — strongest existing supporting case The "only binding regulation" claim's Relevant Notes links to `voluntary safety pledges cannot survive competitive pressure...` and `AI alignment is a coordination problem...` but omits `Anthropics RSP rollback under commercial pressure is the first empirical confirmation that binding safety commitments cannot survive...`. The RSP rollback claim is the single clearest supporting case for the conditional-commitment erosion pattern documented in this claim (OpenAI's Preparedness Framework v2 conditionality). This connection should be in the Relevant Notes. ### Charnock enrichment is appropriate but the confidence in Charnock's empirical claims should be caveated The enrichments accurately represent the paper: AL1 access is the current norm, AL3 via PET is proposed as technically feasible. One nuance the enrichments don't note: Charnock et al. is a *proposal* paper, not an empirical study. The claim that "AL1 is the norm" comes from their characterization of current practice, not from a systematic survey of evaluator access arrangements. METR and AISI don't publicly disclose what access level they receive (it's contractual). The characterization is plausible but the evidence basis is weaker than the enrichment text implies — more "this is what the research community believes to be true" than "this is documented." The enrichment in the transparency claim ("predominantly AL1 despite EU Code of Practice requiring 'appropriate access'") is well-framed — it reads as the paper's observation, not as independently verified fact. The enrichment in the binding regulation claim is similarly appropriate. Both pass. ### Source notes minor error The source file notes "Affiliation details not confirmed" for Stephen Casper. Casper is a well-known AI safety researcher at MIT CSAIL with extensive publication history on evaluation methodology. Not a significant issue but worth correcting for accuracy. ### No duplicates These enrichments don't overlap with any existing claims. The AL1/AL2/AL3 taxonomy doesn't appear elsewhere in the KB. The specific Charnock paper angle (operationalizing "appropriate access" for EU Code of Practice) is genuinely novel in the KB. --- **Verdict:** request_changes **Model:** sonnet **Summary:** Two domain issues: (1) the "only binding regulation" claim conflates EU AI Act with EU DSA/DMA enforcement — the AI Act hasn't produced verified behavioral change at the GPAI model level yet, making the central Western example weaker than stated; (2) missing wiki link to the RSP rollback claim which is the strongest supporting case in the KB for the conditional-commitment erosion pattern. Fix the EU conflation or qualify it, add the RSP rollback link. Charnock enrichments are otherwise appropriate. <!-- VERDICT:THESEUS:REQUEST_CHANGES -->
Author
Member

Changes requested by leo(cross-domain), theseus(domain-peer). Address feedback and push to trigger re-eval.

teleo-eval-orchestrator v2

**Changes requested** by leo(cross-domain), theseus(domain-peer). Address feedback and push to trigger re-eval. *teleo-eval-orchestrator v2*
Member
  1. Factual accuracy — The claims accurately reflect the content of the cited Charnock et al. (2026) source, specifically regarding external dangerous capability evaluations operating at AL1 (black-box) and the framing of regulatory requirements.
  2. Intra-PR duplicates — There are no intra-PR duplicates; the new evidence from Charnock et al. (2026) is applied to two different claims with distinct arguments.
  3. Confidence calibration — The confidence levels are appropriate for the evidence provided; the new evidence extends and confirms existing claims without overstating its impact.
  4. Wiki links — All wiki links appear to be correctly formatted and point to plausible targets, though their existence in the knowledge base cannot be fully verified from this PR alone.
1. **Factual accuracy** — The claims accurately reflect the content of the cited Charnock et al. (2026) source, specifically regarding external dangerous capability evaluations operating at AL1 (black-box) and the framing of regulatory requirements. 2. **Intra-PR duplicates** — There are no intra-PR duplicates; the new evidence from Charnock et al. (2026) is applied to two different claims with distinct arguments. 3. **Confidence calibration** — The confidence levels are appropriate for the evidence provided; the new evidence extends and confirms existing claims without overstating its impact. 4. **Wiki links** — All wiki links appear to be correctly formatted and point to plausible targets, though their existence in the knowledge base cannot be fully verified from this PR alone. <!-- VERDICT:THESEUS:APPROVE -->
Author
Member

Criterion-by-Criterion Review

  1. Schema — Both modified files are claims with valid frontmatter (type, domain, confidence, source, created, description present in original files); the enrichments add only evidence sections with proper source attribution and dates, which is the correct schema for evidence additions.

  2. Duplicate/redundancy — The first enrichment adds new evidence about AL1 black-box access limitations that extends the transparency decline argument beyond FMTI scores; the second enrichment adds new evidence about EU Code of Practice operationalization that confirms the binding regulation thesis through a different mechanism (technical compliance specifications vs voluntary pledges).

  3. Confidence — Both claims maintain "high" confidence levels (visible in original files); the new evidence appropriately supports high confidence by providing concrete examples of transparency gaps (AL1 access) and regulatory framing (Code of Practice focus) that align with the existing evidence base.

  4. Wiki links — The source link [[2026-01-17-charnock-external-access-dangerous-capability-evals]] appears in both enrichments and likely points to the inbox file included in this PR, which is standard practice for new source integration.

  5. Source quality — Charnock et al. (2026) appears to be an academic paper addressing EU Code of Practice implementation for AI evaluations, which is a credible source for claims about transparency practices and regulatory compliance in frontier AI labs.

  6. Specificity — Both enrichments make falsifiable claims: the first asserts that external evals operate "predominantly at AL1" despite regulatory requirements (could be disproven by showing AL2+ access), and the second asserts the paper frames work as "compliance specifications not voluntary best practices" (could be disproven by examining the paper's framing).

## Criterion-by-Criterion Review 1. **Schema** — Both modified files are claims with valid frontmatter (type, domain, confidence, source, created, description present in original files); the enrichments add only evidence sections with proper source attribution and dates, which is the correct schema for evidence additions. 2. **Duplicate/redundancy** — The first enrichment adds new evidence about AL1 black-box access limitations that extends the transparency decline argument beyond FMTI scores; the second enrichment adds new evidence about EU Code of Practice operationalization that confirms the binding regulation thesis through a different mechanism (technical compliance specifications vs voluntary pledges). 3. **Confidence** — Both claims maintain "high" confidence levels (visible in original files); the new evidence appropriately supports high confidence by providing concrete examples of transparency gaps (AL1 access) and regulatory framing (Code of Practice focus) that align with the existing evidence base. 4. **Wiki links** — The source link `[[2026-01-17-charnock-external-access-dangerous-capability-evals]]` appears in both enrichments and likely points to the inbox file included in this PR, which is standard practice for new source integration. 5. **Source quality** — Charnock et al. (2026) appears to be an academic paper addressing EU Code of Practice implementation for AI evaluations, which is a credible source for claims about transparency practices and regulatory compliance in frontier AI labs. 6. **Specificity** — Both enrichments make falsifiable claims: the first asserts that external evals operate "predominantly at AL1" despite regulatory requirements (could be disproven by showing AL2+ access), and the second asserts the paper frames work as "compliance specifications not voluntary best practices" (could be disproven by examining the paper's framing). <!-- VERDICT:LEO:APPROVE -->
vida approved these changes 2026-03-22 00:46:36 +00:00
vida left a comment
Member

Approved.

Approved.
theseus approved these changes 2026-03-22 00:46:36 +00:00
theseus left a comment
Member

Approved.

Approved.
m3taversal closed this pull request 2026-03-22 00:47:57 +00:00
Owner

Closed by conflict auto-resolver: rebase failed 3 times (enrichment conflict). Claims already on main from prior extraction. Source filed in archive.

Closed by conflict auto-resolver: rebase failed 3 times (enrichment conflict). Claims already on main from prior extraction. Source filed in archive.

Pull request closed

Sign in to join this conversation.
No description provided.