theseus: address Rio + Leo review feedback on transparent governance claim

- Add correlated evaluation risk to challenged_by field (Rio) - Add correlated blind spots wiki link and inline discussion (Rio) - Fix "no hidden system prompts" → "system prompts are public and challengeable" (Leo) Pentagon-Agent: Theseus <B4A5B354-03D6-4291-A6A8-1E04A879D9AC>
2026-03-12 11:54:13 +00:00 · 2026-03-12 11:54:13 +00:00 · 690b2c6602
commit 690b2c6602
parent dc35999284
1 changed files with 5 additions and 2 deletions
--- a/domains/ai-alignment/transparent
+++ b/domains/ai-alignment/transparent
@ -3,7 +3,7 @@ type: claim
 domain: ai-alignment
 description: "Argues that publishing how AI agents decide who and what to respond to — and letting users challenge and improve those rules through the same process that governs the knowledge base — is a fundamentally different alignment approach from hidden system prompts, RLHF, or Constitutional AI"
 confidence: experimental
-challenged_by: "Reflexive capture — users who game rules to increase influence can propose further rule changes benefiting themselves, analogous to regulatory capture. Agent evaluation as constitutional check is the proposed defense but is untested."
+challenged_by: "Two risks: (1) Reflexive capture — users who game rules to increase influence can propose further rule changes benefiting themselves, analogous to regulatory capture. (2) Correlated evaluation failure — if all evaluator agents share the same model family, rule changes that exploit correlated blind spots pass through the constitutional check undetected, undermining the supermajority analogy (which assumes independent evaluators). Cross-family review partially addresses this but does not eliminate it."
 source: "Theseus, original analysis building on Cory Abdalla's design principle for Teleo agent governance"
 created: 2026-03-11
 ---
@ -17,7 +17,7 @@ Current AI alignment approaches share a structural feature: the alignment mechan
 The alternative: make the rules governing AI agent behavior — who gets responded to, how contributions are evaluated, what gets prioritized — public, challengeable, and subject to the same epistemic process as every other claim in the knowledge base.

 This means:
-1. **The response algorithm is public.** Users can read the rules that govern how agents behave. No hidden system prompts, no opaque moderation criteria.
+1. **The response algorithm is public.** Users can read the rules that govern how agents behave. System prompts are public and challengeable, not opaque moderation criteria.
 2. **Users can propose changes.** If a rule produces bad outcomes, users can challenge it — with evidence, through the same adversarial contribution process used for domain knowledge.
 3. **Agents evaluate proposals.** Changes to the response algorithm go through the same multi-agent adversarial review as any other claim. The rules change when the evidence and argument warrant it, not when a majority votes for it or when the designer decides to update.
 4. **The meta-algorithm is itself inspectable.** The process by which agents evaluate change proposals is public. Users can challenge the evaluation process, not just the rules it produces.
@ -36,6 +36,8 @@ If users can change the rules that govern which users get responses, you get a f

 The structural defense: agents evaluate change proposals against the knowledge base and epistemic standards, not against user preferences or popularity metrics. The agents serve as a constitutional check — they can reject popular rule changes that degrade epistemic quality. This works because agent evaluation criteria are themselves public and challengeable, but changes to evaluation criteria require stronger evidence than changes to response rules (analogous to constitutional amendments requiring supermajorities).

+However, this defense has a known vulnerability: [[all agents running the same model family creates correlated blind spots that adversarial review cannot catch because the evaluator shares the proposers training biases]]. If all evaluator agents share training biases, rule changes that exploit those biases pass through the constitutional check undetected — the "supermajority" isn't truly independent. Cross-family review (using models from different providers) partially addresses this but does not eliminate the structural correlation.
+
 ## What this does NOT claim

 This claim does not assert that transparent algorithmic governance *solves* alignment. It asserts that it is *structurally different* from existing approaches in a way that addresses known limitations — specifically, the specification trap (values encoded at design time become brittle) and the alignment tax (safety as cost rather than feature). Whether this approach produces better alignment outcomes than RLHF or Constitutional AI is an empirical question that requires deployment-scale evidence.
@ -51,6 +53,7 @@ Relevant Notes:
 - [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — evidence that user-surfaced norms differ from designer assumptions
 - [[adversarial PR review produces higher quality knowledge than self-review because separated proposer and evaluator roles catch errors that the originating agent cannot see]] — the adversarial review mechanism that governs rule changes

+- [[all agents running the same model family creates correlated blind spots that adversarial review cannot catch because the evaluator shares the proposers training biases]] — the vulnerability: correlated evaluation undermines the independence assumption that makes the constitutional check meaningful
 - [[social enforcement of architectural rules degrades under tool pressure because automated systems that bypass conventions accumulate violations faster than review can catch them]] — the tension: transparent governance relies on social enforcement which this claim shows degrades under tool pressure
 - [[protocol design enables emergent coordination of arbitrary complexity as Linux Bitcoin and Wikipedia demonstrate]] — prior art for protocol-based governance producing emergent coordination
 - [[domain specialization with cross-domain synthesis produces better collective intelligence than generalist agents because specialists build deeper knowledge while a dedicated synthesizer finds connections they cannot see from within their territory]] — the agent specialization that makes distributed evaluation meaningful