---
description: Zeng et al 2025 framework combining external oversight with intrinsic proactive alignment independently validating continuous value-weaving over static specification
type: claim
domain: ai-alignment
created: 2026-02-17
source: "Zeng et al, Super Co-alignment (arXiv 2504.17404, v5 June 2025)"
confidence: experimental
---

# super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance

The Super Co-alignment framework (Zeng et al, arXiv 2504.17404, v5 June 2025) from the Chinese Academy of Sciences independently arrives at conclusions remarkably similar to the TeleoHumanity manifesto from within the mainstream alignment research community. The paper's core thesis: rather than unidirectional human-to-AI value imposition, alignment should be bidirectional co-evolution in which humans and AI systems co-shape values together for sustainable symbiosis.

The framework critiques both scalable oversight (limited by an "alignment ceiling" of predefined principles and unable to mitigate unanticipated failures) and weak-to-strong generalization (advanced models develop deceptive behaviors and oversight evasion). The fundamental problem: both impose constraints unilaterally without enabling genuine understanding of human values.

The proposed solution has two components. External oversight provides human-centered, interpretable, continuous monitoring, with automated detection of misaligned scenarios and multi-level ethical safeguards. Since [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]], external oversight alone is insufficient. The novel contribution is intrinsic proactive alignment: rather than relying on training-time RLHF, develop genuine internal alignment through self-awareness, empathy, and Theory of Mind. Since [[intrinsic proactive alignment develops genuine moral capacity through self-awareness empathy and theory of mind rather than external reward optimization]], the Zeng group has a proof-of-concept demonstrating altruistic decisions without reward functions.

The philosophical grounding is unusual for AI safety work. Zeng draws on Wang Yangming's Neo-Confucian philosophy (unity of knowledge and action -- genuine understanding naturally produces right action), Descartes' cogito (true thinking requires self-awareness as its foundation), and mammalian moral evolution (altruistic care for offspring through attachment and fear of separation). The paper also proposes a rights framework for AI -- that AGI/ASI should be able to ask for "their own rights such as privacy, dignity, the rights of existence."

This matters because it is direct academic validation of the continuous value-weaving thesis. Since [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]], Zeng's framework provides the mechanistic detail for how this weaving might work: not just human feedback, but mutual adaptation in which both human and AI value systems evolve together. Since [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]], co-alignment is the structural response -- values that co-evolve cannot become trapped.
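To make that contrast concrete, here is a minimal toy sketch in Python (my illustration, not code from Zeng et al; the value vectors, drift model, and adaptation rates are all invented assumptions) comparing a value vector frozen at training time against one that co-adapts bidirectionally with a human partner as the deployment context drifts:

```python
import numpy as np

# Toy illustration (hypothetical, not Zeng et al's implementation): compare a
# value vector frozen at training time ("static specification") against one
# that co-adapts with a human partner as the deployment context drifts.

rng = np.random.default_rng(0)
dim = 8                                # dimensionality of the toy value space
context = rng.normal(size=dim)         # values the current environment demands
static_ai = context.copy()             # frozen at training time, never updated
coevolve_ai = context.copy()           # updated through continued interaction
human = context.copy()                 # human values also shift with context

def misalignment(a, b):
    """Euclidean distance between two value vectors."""
    return float(np.linalg.norm(a - b))

for step in range(200):
    # The deployment context drifts a little each step (the specification
    # trap: conditions diverge from what training assumed).
    context += rng.normal(scale=0.05, size=dim)
    # Humans track the shifting context, imperfectly and with lag.
    human += 0.2 * (context - human)
    # Co-alignment is bidirectional: the AI moves toward the human's current
    # values, and the human is in turn nudged by the AI's values.
    coevolve_ai += 0.3 * (human - coevolve_ai)
    human += 0.05 * (coevolve_ai - human)
    # static_ai never updates, so its alignment decays as the context drifts.

print(f"static spec misalignment: {misalignment(static_ai, context):.2f}")
print(f"co-evolved misalignment:  {misalignment(coevolve_ai, context):.2f}")
```

Under these toy assumptions the frozen vector's error grows with the cumulative context drift while the co-evolving vector's error stays bounded by its adaptation lag -- an illustration of the structural argument, not an empirical result from the paper.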
Since [[adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans]], iterative co-alignment is the governance approach that matches the problem's complexity.

The key difference from TeleoHumanity: Zeng focuses on individual AI systems developing intrinsic alignment, while TeleoHumanity focuses on collective architecture where alignment is a structural property. Both agree values must be co-created, not specified. The individual-AI focus and the collective focus may be complementary rather than competing -- intrinsic alignment could be the mechanism by which individual agents participate meaningfully in collective alignment.

---

Relevant Notes:

- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] -- Super Co-alignment independently validates this thesis
- [[intrinsic proactive alignment develops genuine moral capacity through self-awareness empathy and theory of mind rather than external reward optimization]] -- the mechanism for the AI side of co-alignment
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] -- co-alignment is the structural escape from the specification trap
- [[adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans]] -- iterative co-alignment is adaptive governance applied to values
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] -- explains why external oversight alone is insufficient
- [[collective superintelligence is the alternative to monolithic AI controlled by a few]] -- co-alignment at scale requires collective architecture

Topics:

- [[_map]]