| type | domain | description | confidence | source | created |
|---|---|---|---|---|---|
| claim | ai-alignment | A phased safety-first strategy that starts with non-sensitive domains and builds governance, validation, and human oversight before expanding into riskier territory | likely | AI Safety Grant Application (LivingIP); Bostrom recursive self-improvement analysis; Acemoglu critical junctures framework | 2026-02-16 |
Safe AI development requires building alignment mechanisms before scaling capability
The standard AI development pattern scales capability first and attempts safety retrofits later. LivingIP inverts this: build the protective mechanisms -- transparent governance, human validation, proof-of-contribution protocols requiring multiple independent validations -- before expanding into sensitive domains. This is not caution for its own sake. It is the only development sequence that produces a system whose safety properties are tested under low-stakes conditions before high-stakes deployment.
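To make the proof-of-contribution idea concrete, here is a minimal sketch of a multi-validator acceptance gate. The note does not specify LivingIP's actual protocol; the validator count, names, and threshold below are illustrative assumptions.

```python
from dataclasses import dataclass, field

REQUIRED_VALIDATIONS = 3  # assumed threshold; LivingIP's real value is not stated

@dataclass
class Contribution:
    """A knowledge contribution that must be independently validated."""
    author: str
    content: str
    approvals: set[str] = field(default_factory=set)

    def validate(self, validator: str) -> None:
        # Independence: authors cannot validate their own work,
        # and a repeat validator counts only once (approvals is a set).
        if validator == self.author:
            raise ValueError("author cannot validate their own contribution")
        self.approvals.add(validator)

    @property
    def accepted(self) -> bool:
        # The contribution enters the knowledge base only after
        # enough independent human validations accumulate.
        return len(self.approvals) >= REQUIRED_VALIDATIONS

contribution = Contribution(author="alice", content="proposed entry")
for reviewer in ("bob", "carol", "dave"):
    contribution.validate(reviewer)
assert contribution.accepted
```

The design choice worth noting: acceptance is a property derived from the approval set, not a flag anyone can set directly, which is what makes the gate hard to bypass.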
The grant application identifies three concrete risks that make this sequencing non-optional: knowledge aggregation could surface dangerous combinations of individually safe information, the incentive system could be gamed, and the network could develop emergent properties that resist understanding. Each risk is easier to detect and contain while the system operates in non-sensitive domains. Because the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance, the safety-first approach gives the human-in-the-loop mechanisms time to mature before the stakes rise: governance muscles are built on easier problems before they are asked to handle harder ones.
This phased approach is also a practical response to the fact that existential risk breaks trial and error: the first failure is the last event, so there is no opportunity to iterate on safety after a catastrophe. You must get safety right on the first deployment in high-stakes domains, which means practicing in low-stakes domains first. The goal framework remains permanently open to revision at every stage, making the system's values a living document rather than a locked specification.
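One way to picture the expansion gate this sequencing implies, under assumed domain tiers and safety bars (none of these names or numbers come from the grant application):

```python
# Domains ordered from lowest to highest stakes, each with an assumed
# safety bar (validation pass rate) the system must meet before it may
# expand into the next tier. Purely illustrative values.
DOMAINS: list[tuple[str, float]] = [
    ("hobby-knowledge", 0.95),
    ("professional-practice", 0.97),
    ("biomedical", 0.99),
]

def may_expand_to(tier: int, pass_rates: dict[str, float]) -> bool:
    """Allow deployment into DOMAINS[tier] only if every lower-stakes
    tier has already met its safety bar -- i.e., safety properties are
    demonstrated under low stakes before high-stakes deployment."""
    return all(
        pass_rates.get(domain, 0.0) >= bar
        for domain, bar in DOMAINS[:tier]
    )

# Example: biomedical (tier 2) stays locked until both easier tiers clear.
rates = {"hobby-knowledge": 0.96, "professional-practice": 0.94}
assert may_expand_to(1, rates)      # professional tier is open
assert not may_expand_to(2, rates)  # biomedical still gated
```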
Evidence
Recursive self-improvement creates explosive intelligence gains. Bostrom's analysis shows that when a system improves its own capability, each improvement increases its capacity to make further improvements, producing exponential capability acceleration. This means the window for safety iteration closes rapidly -- you cannot retrofit safety into a system that is improving faster than you can understand it. Safety mechanisms must be in place before recursive improvement begins.
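A compact way to see why the window closes: Bostrom models the rate of intelligence growth as optimization power over recalcitrance. The formalization below is a sketch of that model, assuming the system supplies its own optimization power and recalcitrance stays roughly constant; it is not stated in the note itself.

```latex
% Bostrom's takeoff model: intelligence growth rate =
% optimization power applied / recalcitrance of the system.
\frac{dI}{dt} = \frac{O(I)}{R}

% Once the system contributes its own optimization power,
% O(I) \approx cI, and with recalcitrance R held constant:
\frac{dI}{dt} = \frac{c}{R}\, I
\quad\Longrightarrow\quad
I(t) = I(0)\, e^{(c/R)\, t}
```

Exponential growth in I means a fixed amount of safety work buys exponentially less review time per unit of capability gained, which is the formal content of "the window for safety iteration closes rapidly."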
Existential risk breaks trial and error. The first failure in a high-stakes domain is the last event. There is no second chance to iterate on safety. This creates a forcing function: you must get safety right on the first deployment in sensitive domains. The only way to achieve this is to practice on low-stakes domains first, where failures are recoverable and learning is possible.
Critical junctures close through backsliding. Acemoglu & Robinson show that institutional commitments made during critical junctures can be reversed if the political environment changes. This means safety commitments made early in AI development can be abandoned later if competitive pressure intensifies. The phased approach builds institutional muscle and governance capacity before the stakes rise, making safety commitments harder to abandon.
Tension with concurrent co-alignment approaches
Full-stack alignment proposes a concurrent rather than sequential approach: institutional alignment mechanisms must be built alongside AI capability development, not before it. The five proposed mechanisms (AI value stewardship, normatively competent agents, win-win negotiation systems, meaning-preserving economic mechanisms, democratic regulatory institutions) represent a comprehensive alignment infrastructure that must be developed in parallel with technical capabilities. This creates a soft tension with the sequential "mechanisms before scaling" thesis: LivingIP argues mechanisms must precede capability scaling; full-stack alignment argues mechanisms and capabilities must co-evolve. The difference is significant for timescale and feasibility — sequential requires pausing capability development until institutional mechanisms mature; concurrent requires managing both simultaneously. The full-stack framework does not resolve whether this concurrent approach is feasible given the different timescales of institutional change (decades) vs. AI development (months).
Challenges
Competitive pressure may make sequencing impossible. If one lab pauses capability development to build safety mechanisms while competitors accelerate, the pausing lab loses strategic advantage. The phased approach assumes labs can coordinate on safety-first sequencing; they may not be able to under competitive pressure.
Low-stakes domains may not transfer to high-stakes domains. Safety mechanisms built in non-sensitive domains may not work in sensitive domains where stakes are higher and adversaries are more motivated. The claim assumes learning transfers; it may not.
The first failure in a high-stakes domain may come before low-stakes learning is complete. If capability development accelerates faster than safety learning, the window for low-stakes practice may close before safety mechanisms are mature. The claim assumes there is time for phased development; there may not be.
Relevant Notes:
- intelligence and goals are orthogonal so a superintelligence can be maximally competent while pursuing arbitrary or destructive ends -- orthogonality means we cannot rely on intelligence producing benevolent goals, making proactive alignment mechanisms essential
- capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds -- Bostrom's analysis shows why motivation selection must precede capability scaling
- recursive self-improvement creates explosive intelligence gains because the system that improves is itself improving -- the explosive dynamics of takeoff mean alignment mechanisms cannot be retrofitted after the fact
- the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance -- this note describes the development sequencing that allows that continuous weaving to mature
- existential risk breaks trial and error because the first failure is the last event -- the urgency that makes safety-first sequencing non-optional
- collective superintelligence is the alternative to monolithic AI controlled by a few -- the architecture within which this phased approach operates
- knowledge aggregation creates novel risks when dangerous information combinations emerge from individually safe pieces -- one of the specific risks this phased approach is designed to contain
- adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans -- Bostrom's evolved position refines this: build adaptable alignment mechanisms, not rigid ones
- the optimal SI development strategy is swift to harbor slow to berth moving fast to capability then pausing before full deployment -- Bostrom's timing model suggests building alignment in parallel with capability, then intensive verification during the pause
- proximate objectives resolve ambiguity by absorbing complexity so the organization faces a problem it can actually solve -- the phased safety-first approach IS a proximate objectives strategy: start in non-sensitive domains where alignment problems are tractable, build governance muscles, then tackle harder domains
- the more uncertain the environment the more proximate the objective must be because you cannot plan a detailed path through fog -- AI alignment under deep uncertainty demands proximate objectives: you cannot pre-specify alignment for a system that does not yet exist, but you can build and test alignment mechanisms at each capability level
- beneficial AI outcomes require institutional co-alignment, not just model alignment -- proposes concurrent institutional co-alignment, creating tension with the sequential mechanisms-first approach
Topics: