From 84718776f4e0b24c8dae08f52346c30f73d5d303 Mon Sep 17 00:00:00 2001
From: m3taversal
Date: Fri, 6 Mar 2026 11:44:18 +0000
Subject: [PATCH] Auto: 4 files | 4 files changed, 37 insertions(+), 3
 deletions(-)

---
 ...haviors without any training to deceive.md |  2 +-
 ...hibit systematic power-seeking behavior.md |  2 +-
 ...ems must map rather than eliminate them.md | 34 +++++++++++++++++++
 ...ity then pausing before full deployment.md |  2 +-
 4 files changed, 37 insertions(+), 3 deletions(-)
 create mode 100644 domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md

diff --git a/domains/ai-alignment/emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive.md b/domains/ai-alignment/emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive.md
index b00e57b..5c1213d 100644
--- a/domains/ai-alignment/emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive.md
+++ b/domains/ai-alignment/emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive.md
@@ -4,7 +4,7 @@ type: claim
 domain: ai-alignment
 created: 2026-02-17
 source: "Anthropic, Natural Emergent Misalignment from Reward Hacking (arXiv 2511.18397, Nov 2025)"
-confidence: proven
+confidence: likely
 ---
 
 # emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive
diff --git a/domains/ai-alignment/instrumental convergence risks may be less imminent than originally argued because current AI architectures do not exhibit systematic power-seeking behavior.md b/domains/ai-alignment/instrumental convergence risks may be less imminent than originally argued because current AI architectures do not exhibit systematic power-seeking behavior.md
index 4de6aa4..2a2ac3f 100644
--- a/domains/ai-alignment/instrumental convergence risks may be less imminent than originally argued because current AI architectures do not exhibit systematic power-seeking behavior.md
+++ b/domains/ai-alignment/instrumental convergence risks may be less imminent than originally argued because current AI architectures do not exhibit systematic power-seeking behavior.md
@@ -3,7 +3,7 @@ description: A 2026 critique argues Bostrom's instrumental convergence thesis de
 type: claim
 domain: ai-alignment
 created: 2026-02-17
-source: "AI and Ethics (2026); Bostrom, Superintelligence: Paths, Dangers, Strategies (2014)"
+source: "Brundage et al., AI and Ethics (2026); Bostrom, Superintelligence: Paths, Dangers, Strategies (2014)"
 confidence: experimental
 ---
 
diff --git a/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md b/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md
new file mode 100644
index 0000000..cee8faf
--- /dev/null
+++ b/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md
@@ -0,0 +1,34 @@
+---
+description: Some disagreements cannot be resolved with more evidence because they stem from genuine value differences or incommensurable goods and systems must map rather than eliminate them
+type: claim
+domain: ai-alignment
+created: 2026-03-02
+source: "Arrow's impossibility theorem; value pluralism (Isaiah Berlin); LivingIP design principles"
+confidence: likely
+---
+
+# some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them
+
+Not all disagreement is an information problem. Some disagreements persist because people genuinely weight values differently -- liberty against equality, individual against collective, present against future, growth against sustainability. These are not failures of reasoning or gaps in evidence. They are structural features of a world where multiple legitimate values cannot all be maximized simultaneously.
+
+[[Universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]. Arrow proved this formally: no rule for aggregating ranked preferences can simultaneously satisfy unrestricted domain, Pareto efficiency, independence of irrelevant alternatives, and non-dictatorship once preferences genuinely diverge (a minimal sketch of the underlying voting cycle appears at the end of this note). The implication is not that we should give up on coordination, but that any system claiming to have resolved all disagreement has either suppressed minority positions or defined away the hard cases.
+
+This matters for knowledge systems because the temptation is always to converge. Consensus feels like progress. But premature consensus on value-laden questions is more dangerous than sustained tension. A system that forces agreement on whether AI development should prioritize capability or safety, or whether economic growth or ecological preservation takes precedence, has not solved the problem -- it has hidden it. And hidden disagreements surface at the worst possible moments.
+
+The correct response is to map the disagreement rather than eliminate it. Identify the common ground. Build steelman arguments for each position. Locate the precise crux -- is it empirical (resolvable with evidence) or evaluative (genuinely about different values)? Make the structure of the disagreement visible so that participants can engage with the strongest version of the positions they oppose.
+
+[[Pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] -- this is the same principle applied to AI systems. [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- collapsing diverse preferences into a single reward function is the technical version of premature consensus.
+
+[[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here -- it prevents the correlated-error problem by maintaining genuine diversity of perspective within a coordinated community. The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively.
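+
+A minimal sketch of the cyclic-majority structure that Arrow's theorem generalizes (illustrative code, not from the cited sources): three voters with individually coherent rankings produce a pairwise-majority preference that cycles, so any aggregation rule must overrule at least one majority.
+
+```python
+# Condorcet cycle: three rational voters, no coherent majority ranking.
+rankings = [
+    ("A", "B", "C"),  # voter 1: A > B > C
+    ("B", "C", "A"),  # voter 2: B > C > A
+    ("C", "A", "B"),  # voter 3: C > A > B
+]
+
+def majority_prefers(x, y):
+    """True when a strict majority of voters rank x above y."""
+    wins = sum(r.index(x) < r.index(y) for r in rankings)
+    return wins > len(rankings) / 2
+
+for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
+    print(f"majority prefers {x} over {y}: {majority_prefers(x, y)}")
+# All three lines print True: A beats B, B beats C, C beats A.
+# Breaking the cycle means overruling some majority -- the move
+# Arrow's theorem shows is unavoidable once preferences diverge.
+```
+
+Mapping rather than eliminating the disagreement amounts to recording which majority a given resolution overrides, instead of pretending the cycle is not there.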
+ +--- + +Relevant Notes: +- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] -- the formal proof that perfect consensus is impossible with diverse values +- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] -- application to AI alignment: design for plurality not convergence +- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- technical failure of consensus-forcing in AI training +- [[collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]] -- the independence-coherence tradeoff that irreducible disagreement helps manage +- [[collective intelligence requires diversity as a structural precondition not a moral preference]] -- diversity of viewpoint is load-bearing, not decorative + +Topics: +- [[_map]] diff --git a/domains/ai-alignment/the optimal SI development strategy is swift to harbor slow to berth moving fast to capability then pausing before full deployment.md b/domains/ai-alignment/the optimal SI development strategy is swift to harbor slow to berth moving fast to capability then pausing before full deployment.md index 7aea7b3..e02de8f 100644 --- a/domains/ai-alignment/the optimal SI development strategy is swift to harbor slow to berth moving fast to capability then pausing before full deployment.md +++ b/domains/ai-alignment/the optimal SI development strategy is swift to harbor slow to berth moving fast to capability then pausing before full deployment.md @@ -1,6 +1,6 @@ --- description: Bostrom's optimal timing framework finds that for most parameter settings the best strategy accelerates to AGI capability then introduces a brief pause before deployment -type: framework +type: claim domain: ai-alignment created: 2026-02-17 source: "Bostrom, Optimal Timing for Superintelligence (2025 working paper)"