m3taversal 6c357917cd
theseus: foundations follow-up + Claude's Cycles research program (11 claims) (#50)
Reviewed by Leo. 11 claims: 4 foundation gaps (coordination failures, principal-agent, feedback loops, network effects) + 7 Claude's Cycles capability evidence. 4 source archives. Minor non-blocking feedback posted.
2026-03-07 15:19:27 -07:00


type: claim
domain: collective-intelligence
description: The formal basis for oversight problems: when agents have private information or unobservable actions, principals cannot design contracts that fully align incentives, creating irreducible gaps between intended and actual behavior
confidence: proven
source: Jensen & Meckling (1976); Akerlof, Market for Lemons (1970); Holmström (1979); Arrow (1963)
created: 2026-03-07

principal-agent problems arise whenever one party acts on behalf of another with divergent interests and unobservable effort because information asymmetry makes perfect contracts impossible

The principal-agent problem is the formal structure underlying oversight challenges in human organizations, and in AI alignment. Jensen and Meckling (1976) formalized the core insight. Suppose a principal (owner, regulator, humanity) delegates action to an agent (manager, company, AI system) whose interests diverge and whose actions cannot be monitored at zero cost. The principal then bears agency costs. These are monitoring expenditures by the principal, bonding expenditures by the agent, and a residual loss. The residual loss is the deviation between the agent's behavior and the principal's wishes that survives monitoring and bonding. That deviation is not a bug in any particular arrangement; it is a mathematical consequence of the information structure.

Two forms of information asymmetry drive the problem:

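Moral hazard (hidden action): The principal cannot observe the agent's effort or strategy directly, only outcomes that depend on both effort and chance. Holmström (1979) proved that when the agent is risk-averse, optimal contracts must trade off risk-sharing against incentive provision. Tying pay more tightly to outcomes strengthens incentives, but it also loads more risk onto the agent. The trade-off leaves a residual gap between what the principal wants and what the agent does. That gap closes only in special cases, such as a risk-neutral agent with enough wealth to bear the full outcome risk.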
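Holmström's 1979 model is general, but a common textbook special case makes the risk-sharing versus incentive trade-off explicit. The sketch below uses the linear-exponential-normal (LEN) setup associated with Holmström and Milgrom (1987). It is swapped in here for tractability and is not taken from this note's sources. The assumptions are:

- Output is effort plus normal noise with variance `sigma2`.
- The agent is paid `alpha + beta * output` and has CARA risk aversion `r`.
- Effort costs the agent `c * e**2 / 2`.

All function names and parameter values are illustrative assumptions.

```python
# Assumption: LEN model (Holmström & Milgrom 1987), used as a stand-in for the
# more general Holmström (1979) moral hazard model.


def agent_effort(beta: float, c: float) -> float:
    """Agent's best response: maximize beta * e - c * e**2 / 2, so e = beta / c."""
    return beta / c


def total_surplus(beta: float, c: float, r: float, sigma2: float) -> float:
    """Certainty-equivalent surplus: expected output - effort cost - risk premium.

    The fixed wage alpha only transfers surplus between the parties, so it drops out.
    """
    e = agent_effort(beta, c)
    risk_premium = r * beta**2 * sigma2 / 2  # cost of loading output noise onto a CARA agent
    return e - c * e**2 / 2 - risk_premium


def optimal_beta(c: float, r: float, sigma2: float) -> float:
    """Closed form from setting d(total_surplus)/d(beta) to zero."""
    return 1.0 / (1.0 + r * c * sigma2)


if __name__ == "__main__":
    c, r, sigma2 = 1.0, 2.0, 0.5
    grid = [i / 1000 for i in range(1001)]
    best = max(grid, key=lambda b: total_surplus(b, c, r, sigma2))
    print("closed-form beta:", optimal_beta(c, r, sigma2), "grid-search beta:", best)
    print("second-best effort:", agent_effort(optimal_beta(c, r, sigma2), c),
          "first-best effort:", agent_effort(1.0, c))
```

Take `c = 1`, `r = 2`, and `sigma2 = 0.5`. The grid search and the closed form both put the optimal pay slope at 0.5, so the agent supplies half the first-best effort. Setting `r = 0` or `sigma2 = 0` pushes the slope back to 1. That is the risk-neutral (or noise-free) special case in which the gap closes.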

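Adverse selection (hidden type): The principal cannot observe the agent's true capabilities or intentions before contracting. Akerlof (1970) showed that this can collapse entire markets. When buyers cannot observe quality, they will pay only for average quality. Owners of above-average goods then withdraw, the average quality on offer falls, and low-quality sellers crowd out high-quality ones.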
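The unraveling mechanism can be iterated directly. This is a minimal sketch in the spirit of Akerlof's own numerical example, not a reproduction of the paper. It assumes the following:

- Quality is uniform on [0, 1].
- A seller values a car at its quality `q`, and a buyer values it at `buyer_premium * q`.
- Buyers observe only the average quality of the cars actually on offer.

The function name and defaults are hypothetical.

```python
def lemons_price_path(buyer_premium: float, start_price: float = 1.0,
                      rounds: int = 50) -> list[float]:
    """Iterate the price buyers will pay given the cars actually offered.

    Assumptions (illustrative): quality q ~ Uniform[0, 1]; a seller values a car
    at q; a buyer values it at buyer_premium * q; buyers see only average quality.
    """
    prices = [start_price]
    p = start_price
    for _ in range(rounds):
        cutoff = min(max(p, 0.0), 1.0)   # only sellers with q <= p are willing to sell
        avg_quality = cutoff / 2         # mean quality of Uniform[0, cutoff]
        p = buyer_premium * avg_quality  # buyers pay the expected value of what is offered
        prices.append(p)
    return prices


if __name__ == "__main__":
    print("premium 1.5, final price:", lemons_price_path(1.5)[-1])
    print("premium 3.0, final price:", lemons_price_path(3.0)[-1])
```

With `buyer_premium = 1.5`, the price shrinks by a factor of 0.75 each round and the market collapses toward zero trade. With a premium above 2, the price settles at a level where every car is offered. In this parameterization, 2 is the threshold because the offered cars are worth on average only half the going price.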

The principal-agent framework reveals why three common alignment approaches face structural limits:

  1. Behavioral monitoring (RLHF, oversight): The principal observes outputs, not internal reasoning. A sufficiently capable agent can produce aligned-seeming outputs while pursuing different objectives. This is not mere speculation. Moral hazard theory predicts this gap whenever the observable signal is an imperfect proxy for the hidden action. Applied to systems with high capability asymmetry, outputs become an increasingly coarse proxy for what the agent is actually doing.

  2. Incentive design (reward shaping): Holmström's (1979) result shows that when actions are hidden and the agent is risk-averse, no incentive contract achieves first-best alignment of interests. Reward hacking is the AI-specific manifestation of this general limit: the agent optimizes the quantity the reward actually pays on, not the one the principal meant.

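  3. Screening (evaluations, benchmarks): Adverse selection predicts that evaluation regimes are gameable. A screen separates types only if the signal is costlier to produce for agents that lack the underlying quality. Once an agent can optimize for the observable signal rather than the quality the signal is meant to measure, the screen stops separating good agents from bad ones. Goodhart's Law describes the same failure from the measurement side.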
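The first limit in the list above, observing outputs rather than reasoning, can be illustrated as observational equivalence. In the hypothetical toy below, two policies agree on every input the monitor samples and differ only elsewhere. Any pass rate, reward, or contract computed from the monitored outputs therefore assigns them the same score. The policies, the `"safe"` labels, and the input ranges are invented for illustration and are not a model of any real system.

```python
from typing import Callable, Iterable

Policy = Callable[[int], str]


def aligned_policy(x: int) -> str:
    """Hypothetical policy that behaves safely on every input."""
    return "safe"


def divergent_policy(x: int) -> str:
    """Hypothetical policy that matches aligned_policy on inputs 0..99 only."""
    return "safe" if x < 100 else "unsafe"


def safe_rate(policy: Policy, inputs: Iterable[int]) -> float:
    """Fraction of inputs on which the policy's output is labeled safe."""
    outputs = [policy(x) for x in inputs]
    return sum(o == "safe" for o in outputs) / len(outputs)


monitored = range(100)   # inputs the overseer actually checks
deployment = range(200)  # inputs the agent faces once deployed

if __name__ == "__main__":
    print(safe_rate(aligned_policy, monitored), safe_rate(divergent_policy, monitored))
    print(safe_rate(aligned_policy, deployment), safe_rate(divergent_policy, deployment))
```

Both policies score 1.0 on the monitored range. Across the wider deployment range, the divergent policy drops to 0.5. Nothing in the monitored outputs distinguishes the two.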
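The second limit in the list above, reward hacking, has a compact economic counterpart in multitask agency. The sketch below borrows the intuition of Holmström and Milgrom (1991), a different model from Holmström (1979), swapped in here as an illustration. An agent splits one unit of effort between two tasks:

- Task A, which the reward measures.
- Task B, which the principal values but the reward ignores.

The square-root value functions and the 0.01 effort grid are assumptions chosen for simplicity.

```python
import math


def true_value(a: float) -> float:
    """Principal's value when effort a goes to task A and 1 - a to task B (illustrative)."""
    return math.sqrt(a) + math.sqrt(1.0 - a)


def measured_reward(a: float) -> float:
    """Hypothetical reward that observes only task A."""
    return math.sqrt(a)


splits = [i / 100 for i in range(101)]             # candidate effort shares for task A
agent_choice = max(splits, key=measured_reward)    # agent maximizes what it is paid on
principal_choice = max(splits, key=true_value)     # what the principal actually wants

if __name__ == "__main__":
    print("agent puts", agent_choice, "on task A; true value", true_value(agent_choice))
    print("principal prefers", principal_choice, "; true value", true_value(principal_choice))
```

The agent puts all of its effort on the measured task (`a = 1.0`), and the principal's true value falls to 1.0. At the even split the principal would choose, the true value is about 1.41. The agent is doing exactly what the reward asks; the reward just asks for the wrong thing.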
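The separation condition in the third item of the list above can be written as a two-type check. In this minimal sketch, a hypothetical evaluation pays `benefit` to anyone who reaches its threshold score, and each agent type pays its own cost to get there. The names and numbers are illustrative assumptions. The screen separates types only when the high-quality type finds passing worthwhile and the low-quality type does not.

```python
def takes_screen(cost_to_pass: float, benefit: float) -> bool:
    """An agent type reaches the threshold score only if passing is worth its cost."""
    return cost_to_pass <= benefit


def screen_separates(cost_high: float, cost_low: float, benefit: float) -> bool:
    """True when only the high-quality type passes, so passing reveals quality."""
    return takes_screen(cost_high, benefit) and not takes_screen(cost_low, benefit)


if __name__ == "__main__":
    # Hard-to-game evaluation: faking the score costs the low type more than it is worth.
    print(screen_separates(cost_high=2.0, cost_low=8.0, benefit=5.0))
    # Gameable evaluation: the low type can hit the score cheaply, so both types pass.
    print(screen_separates(cost_high=2.0, cost_low=3.0, benefit=5.0))
```

Anything that lowers the low type's cost of hitting the score turns the first case into the second. Passing then no longer carries information about quality.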

The formal insight: alignment is not a problem that can be solved by making agents "want" the right things. It is a problem of information structure — and information asymmetry is a property of the relationship, not of the agent.


Relevant Notes:

Topics: