teleo-codex/domains/ai-alignment/safe AI development requires building alignment mechanisms before scaling capability.md

Pentagon-Agent: Theseus <HEADLESS>
2026-03-11 21:07:43 +00:00


| type | domain | description | confidence | source | created |
| --- | --- | --- | --- | --- | --- |
| claim | ai-alignment | A phased safety-first strategy that starts with non-sensitive domains and builds governance, validation, and human oversight before expanding into riskier territory | likely | AI Safety Grant Application (LivingIP); Bostrom recursive self-improvement analysis; Acemoglu critical junctures framework | 2026-02-16 |

Safe AI development requires building alignment mechanisms before scaling capability

The standard AI development pattern scales capability first and attempts safety retrofits later. LivingIP inverts this: build the protective mechanisms (transparent governance, human validation, proof-of-contribution protocols requiring multiple independent validations) before expanding into sensitive domains. This is not caution for its own sake. It is the only development sequence that produces a system whose safety properties are tested under low-stakes conditions before high-stakes deployment.
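The "multiple independent validations" requirement can be made concrete with a minimal sketch. Everything here is illustrative: the `Validation` type, the `meets_quorum` helper, and the quorum threshold are assumptions for exposition, not part of the LivingIP design:

```python
from dataclasses import dataclass

# Hypothetical sketch of a proof-of-contribution check: a contribution
# is accepted only once enough *independent* validators approve it.
# Names and the threshold are illustrative assumptions.

@dataclass(frozen=True)
class Validation:
    validator_id: str   # who validated
    approved: bool      # their verdict

def meets_quorum(validations, required_independent=3):
    """Accept only if at least `required_independent` distinct
    validators approved; repeat votes from one validator count once."""
    approvers = {v.validator_id for v in validations if v.approved}
    return len(approvers) >= required_independent

# A duplicate approval from the same validator adds no independence:
votes = [Validation("a", True), Validation("a", True),
         Validation("b", True), Validation("c", False)]
print(meets_quorum(votes))  # → False (only two independent approvals)
```

The point of the set-based count is the independence requirement itself: the mechanism is robust to one enthusiastic (or compromised) validator voting repeatedly, which is the kind of gaming the grant application flags as a risk.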

The grant application identifies three concrete risks that make this sequencing non-optional: knowledge aggregation could surface dangerous combinations of individually safe information, the incentive system could be gamed, and the network could develop emergent properties that resist understanding. Each risk is easier to detect and contain while the system operates in non-sensitive domains. Since the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance, the safety-first approach gives the human-in-the-loop mechanisms time to mature before the stakes rise. Governance muscles are built on easier problems before being asked to handle harder ones.

This phased approach is also a practical response to the fact that existential risk breaks trial and error: because the first failure is the last event, there is no opportunity to iterate on safety after a catastrophic failure. You must get safety right on the first deployment in high-stakes domains, which means practicing in low-stakes domains first. The goal framework remains permanently open to revision at every stage, making the system's values a living document rather than a locked specification.

Evidence

Recursive self-improvement creates explosive intelligence gains. Bostrom's analysis shows that a system that improves its own ability to improve compounds its gains, producing exponential capability acceleration. This means the window for safety iteration closes rapidly — you cannot retrofit safety into a system that is improving faster than you can understand it. Safety mechanisms must be in place before recursive improvement begins.
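The closing window can be illustrated with a toy model: capability compounds (each cycle's gain scales with current capability) while oversight improves only linearly. All rates and starting values below are illustrative assumptions, chosen only to show the shape of the dynamic:

```python
# Toy model of the closing safety window: compounding capability
# versus linearly growing oversight. All parameters are illustrative.

def cycles_until_overtaken(improve_rate=0.5, oversight_step=1.0,
                           capability=1.0, oversight=10.0, limit=1000):
    """Count self-improvement cycles until capability exceeds oversight,
    or return None if it never does within `limit` cycles."""
    for cycle in range(1, limit + 1):
        capability *= (1.0 + improve_rate)   # compounding gains
        oversight += oversight_step          # linear safety progress
        if capability > oversight:
            return cycle
    return None

print(cycles_until_overtaken())  # → 7: a 10x head start evaporates fast
```

Even with a tenfold head start for oversight, compounding improvement overtakes it within a handful of cycles — which is the sense in which the window for safety iteration "closes rapidly."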

Existential risk breaks trial and error. The first failure in a high-stakes domain is the last event. There is no second chance to iterate on safety. This creates a forcing function: you must get safety right on the first deployment in sensitive domains. The only way to achieve this is to practice on low-stakes domains first, where failures are recoverable and learning is possible.

Critical junctures close through backsliding. Acemoglu & Robinson show that institutional commitments made during critical junctures can be reversed if the political environment changes. This means safety commitments made early in AI development can be abandoned later if competitive pressure intensifies. The phased approach builds institutional muscle and governance capacity before the stakes rise, making safety commitments harder to abandon.

Tension with concurrent co-alignment approaches

Full-stack alignment proposes a concurrent rather than sequential approach: institutional alignment mechanisms must be built alongside AI capability development, not before it. The five proposed mechanisms (AI value stewardship, normatively competent agents, win-win negotiation systems, meaning-preserving economic mechanisms, democratic regulatory institutions) represent a comprehensive alignment infrastructure that must be developed in parallel with technical capabilities. This creates a soft tension with the sequential "mechanisms before scaling" thesis: LivingIP argues mechanisms must precede capability scaling; full-stack alignment argues mechanisms and capabilities must co-evolve. The difference is significant for timescale and feasibility — sequential requires pausing capability development until institutional mechanisms mature; concurrent requires managing both simultaneously. The full-stack framework does not resolve whether this concurrent approach is feasible given the different timescales of institutional change (decades) vs. AI development (months).

Challenges

Competitive pressure may make sequencing impossible. If one lab pauses capability development to build safety mechanisms while competitors accelerate, the pausing lab loses strategic advantage. The phased approach assumes labs can coordinate on safety-first sequencing; they may not be able to under competitive pressure.

Low-stakes domains may not transfer to high-stakes domains. Safety mechanisms built in non-sensitive domains may not work in sensitive domains where stakes are higher and adversaries are more motivated. The claim assumes learning transfers; it may not.

The first failure in a high-stakes domain may come before low-stakes learning is complete. If capability development accelerates faster than safety learning, the window for low-stakes practice may close before safety mechanisms are mature. The claim assumes there is time for phased development; there may not be.


Relevant Notes:

Topics: