teleo-codex/domains/ai-alignment/formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades.md

| type | domain | description | confidence | source | created |
| --- | --- | --- | --- | --- | --- |
| claim | ai-alignment | Kim Morrison's Lean formalization of Knuth's proof of Claude's construction demonstrates formal verification as an oversight mechanism that scales with AI capability rather than degrading like human oversight | experimental | Knuth 2026, 'Claude's Cycles' (Stanford CS, Feb 28 2026 rev. Mar 6); Morrison 2026, Lean formalization (github.com/kim-em/KnuthClaudeLean/, posted Mar 4) | 2026-03-07 |

formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human review degrades

Four days after Knuth published his proof of Claude's Hamiltonian decomposition construction, Kim Morrison of the Lean community formalized the proof in Lean 4, providing machine-checked verification of its correctness. Knuth's response: "That's good to know, because I've been getting more errorprone lately."

The formalization uses Comparator, explicitly designed as a "trustworthy judge for potentially adversarial proofs, including AI-generated proofs." The trust model is precise: you must trust the Lean kernel, Mathlib, and the theorem specification in Challenge.lean (definitions + statement). You do NOT need to trust the ~1,600 lines of proof in Basic.lean — Comparator verifies this automatically under three permitted axioms (propext, Quot.sound, Classical.choice). The verification bottleneck is the specification (did we state the right theorem?), not the proof (is this derivation correct?).
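The axiom discipline described above is directly inspectable in Lean 4 itself: `#print axioms` reports every axiom a compiled proof term depends on, so a reviewer can confirm that nothing beyond the three permitted axioms was smuggled in. A minimal sketch (the theorem name here is illustrative, not taken from the actual repository):

```lean
-- Illustrative stand-in for a statement like the one in Challenge.lean.
-- The proof after `:=` is checked by the kernel; `rfl` closes it
-- definitionally, so this particular theorem needs no axioms at all.
theorem demo (n : Nat) : n + 0 = n := rfl

-- Lists every axiom `demo` depends on. For the formalization discussed
-- here, the expectation is that the output contains at most
-- propext, Quot.sound, and Classical.choice.
#print axioms demo
```

Because this check is mechanical, it covers all ~1,600 lines of Basic.lean at no extra reviewing cost: the human-audited surface stays fixed at the specification.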

This episode illustrates a concrete alignment mechanism: formal verification as scalable oversight for AI-generated mathematical results. The significance for alignment:

Human verification degrades; formal verification does not. Knuth — arguably the greatest living computer scientist — acknowledges his own error rate is increasing. The finding that scalable oversight degrades rapidly as capability gaps grow, with debate achieving only 50 percent success at moderate gaps, quantifies this for AI systems generally. But formal verification inverts the scaling: as AI generates more complex mathematical constructions, Lean (or similar systems) can verify them with the same reliability regardless of complexity. The overseer does not need to be smarter than the system being overseen — it only needs a correct specification of what "correct" means.
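The spec/proof asymmetry is visible in the shape of a Lean declaration itself: the overseer reads only the statement, while the kernel re-checks the proof term however it was produced. A minimal sketch with an illustrative name:

```lean
-- Only the statement (everything before `:=`) needs human trust.
-- The proof after `:=` -- whether written by a human, a tactic
-- script, or an AI system -- is re-checked by the kernel either way.
theorem add_zero_spec : ∀ n : Nat, n + 0 = n :=
  fun n => rfl
```

If the statement is wrong, verification proves the wrong thing; if the statement is right, no amount of proof complexity can fool the checker. That is why the bottleneck is specification, not derivation.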

The verification happened in four days. Morrison's formalization was posted March 4, four days after Knuth's February 28 publication. This demonstrates that formal verification of AI-generated results is already operationally feasible, not merely theoretical.

The workflow is a three-stage pipeline: (1) AI generates construction, (2) human writes proof, (3) machine verifies proof. Each stage catches different errors. The even-case proof by GPT-5.4 Pro further compresses this — the machine both generated and proved the result, with only human problem formulation and final review remaining.
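The pipeline can be sketched as code, with stage 3 reduced to a toy axiom check. Everything here is illustrative: the line-oriented "proof" format and function names are assumptions for the sketch, not the actual Comparator interface.

```python
# Toy sketch of the three-stage pipeline's verification step.
# A real checker (Lean's kernel) re-derives the proof term; this toy
# only models the non-degrading property: cost depends on proof length,
# never on how clever the construction is.

PERMITTED_AXIOMS = {"propext", "Quot.sound", "Classical.choice"}

def axioms_used(proof_text: str) -> set[str]:
    """Collect every axiom the (toy, line-oriented) proof text cites."""
    return {
        line.split(maxsplit=1)[1].strip()
        for line in proof_text.splitlines()
        if line.startswith("axiom ")
    }

def machine_verify(proof_text: str) -> bool:
    """Stage 3: accept iff the proof cites only permitted axioms."""
    return axioms_used(proof_text) <= PERMITTED_AXIOMS

# Stages 1 (AI construction) and 2 (human proof) are stubbed as data:
proof_ok = "theorem decomposition\naxiom propext\naxiom Classical.choice\n"
proof_bad = "theorem decomposition\naxiom sorryAx\n"

print(machine_verify(proof_ok))   # True: only permitted axioms cited
print(machine_verify(proof_bad))  # False: sorryAx is not permitted
```

The design point the sketch makes: the verifier's trusted surface (here, `PERMITTED_AXIOMS` and the checker itself) is fixed, so adding a smarter generator at stage 1 changes nothing about stage 3's reliability.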

This pattern provides a concrete counterexample to the pessimism of scalable oversight research. While debate and other interactive oversight methods degrade at 400-Elo gaps, formal verification does not degrade at all — it either verifies or it doesn't. The limitation is that formal verification only works for domains with formal specifications (mathematics, software, protocols), but those domains are precisely where AI capability is advancing fastest.

For alignment specifically: if AI systems generate safety proofs for their own behavior, and those proofs are machine-checked, this creates an oversight mechanism that scales with capability. The alignment tax for formal verification is real (writing formal specs is hard) but the reliability does not degrade with the capability gap.


Relevant Notes:

Topics: