teleo-codex/inbox/archive/2026-01-00-mixdpo-preference-strength-pluralistic.md

---
type: source
title: "MixDPO: Modeling Preference Strength for Pluralistic Alignment"
author: Various (arXiv 2601.06180)
url: https://arxiv.org/html/2601.06180
date: 2026-01-01
domain: ai-alignment
secondary_domains:
format: paper
status: unprocessed
priority: high
tags:
  - pluralistic-alignment
  - DPO
  - preference-strength
  - distributional-modeling
  - heterogeneity
---

## Content

MixDPO generalizes Direct Preference Optimization (DPO) by treating the preference-strength parameter β as a learned distribution rather than a fixed scalar.

Mechanism:

  • Standard DPO: fixed β controls preference signal strength across all examples
  • MixDPO: β drawn from a distribution p(β), optimized jointly with policy parameters θ
  • Two distributional families: LogNormal (Monte Carlo, K=16 samples) and Gamma (closed-form via Lerch transcendent)
  • Learned variance reflects dataset-level preference heterogeneity
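The Monte Carlo (LogNormal) variant above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the function name, tensor shapes, and the reparameterization of β are my assumptions.

```python
import torch
import torch.nn.functional as F

def mixdpo_loss(logratio_chosen, logratio_rejected, mu, log_sigma, K=16):
    """Sketch of a MixDPO-style objective with LogNormal beta (hypothetical API).

    logratio_*: log pi_theta(y|x) - log pi_ref(y|x) for chosen/rejected responses
    mu, log_sigma: learned parameters of a LogNormal distribution over beta
    K: number of Monte Carlo samples (the paper reports K=16)
    """
    # Sample K betas via the reparameterization trick so gradients
    # flow into mu and log_sigma alongside the policy parameters.
    eps = torch.randn(K, 1)
    beta = torch.exp(mu + torch.exp(log_sigma) * eps)   # (K, 1), beta > 0

    margin = logratio_chosen - logratio_rejected        # (batch,)

    # Average the standard DPO log-sigmoid loss over sampled strengths;
    # broadcasting gives a (K, batch) matrix of per-sample losses.
    per_sample = -F.logsigmoid(beta * margin)
    return per_sample.mean()
```

A low learned `log_sigma` would recover something close to fixed-β DPO, matching the collapse behavior described below.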

Key Results:

  • PRISM (high heterogeneity): +11.2 win rate points on Pythia-2.8B
  • Macro-averaged preference margins improve while micro-averaged remain competitive
  • Anthropic HH (low heterogeneity): converges to low variance, minimal gains — self-adaptive
  • Computational overhead: 1.02× (LogNormal), 1.1× (Gamma)

Key Property: Naturally collapses to fixed-strength behavior when preferences are homogeneous. This provides interpretability: the learned distribution diagnoses whether a dataset has diverse preferences without requiring demographic labels.
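To make the "diversity diagnostic" idea concrete, here is a minimal sketch of reading heterogeneity off the learned LogNormal parameters. The coefficient-of-variation heuristic and its threshold are my assumptions for illustration; only the closed-form LogNormal moments are standard.

```python
import math

def lognormal_stats(mu, sigma):
    """Mean and variance of LogNormal(mu, sigma) (standard closed forms)."""
    mean = math.exp(mu + sigma**2 / 2)
    var = (math.exp(sigma**2) - 1) * math.exp(2 * mu + sigma**2)
    return mean, var

def diagnose(mu, sigma, cv_threshold=0.25):
    """Hypothetical heuristic: a large coefficient of variation of the
    learned beta distribution suggests heterogeneous preferences; a small
    one indicates collapse toward fixed-strength (homogeneous) behavior.
    The 0.25 threshold is illustrative, not from the paper."""
    mean, var = lognormal_stats(mu, sigma)
    cv = math.sqrt(var) / mean
    return "heterogeneous" if cv > cv_threshold else "homogeneous"
```

On a PRISM-like dataset one would expect the learned sigma (and hence the CV) to stay large; on Anthropic HH it would shrink toward zero, with no demographic labels needed either way.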

## Agent Notes

**Why this matters:** Unlike PAL, which requires explicit mixture modeling, MixDPO adapts to heterogeneity automatically. The self-adaptive property means you don't need to know in advance whether your data is diverse; the method discovers it.

**What surprised me:** The negligible computational overhead (1.02-1.1×). Pluralistic alignment doesn't have to be expensive.

**What I expected but didn't find:** No comparison with PAL or RLCF. No analysis of what the learned distribution reveals about real-world preference structures.

**KB connections:** Constructively addresses "RLHF and DPO both fail at preference diversity". The self-adaptive property is relevant to "complexity is earned, not designed": start simple (standard DPO) and earn complexity (distributional β) only when the data warrants it.

**Extraction hints:** Extract claims about (1) preference heterogeneity being learnable from data without demographic labels, and (2) self-adaptive methods that collapse to simpler behavior when complexity isn't needed.

**Context:** January 2026 preprint, part of the explosion of DPO variants addressing heterogeneity.

## Curator Notes (structured handoff for extractor)

**PRIMARY CONNECTION:** RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values.

**WHY ARCHIVED:** Demonstrates that preference heterogeneity can be handled with minimal overhead and without prior knowledge of user demographics.

**EXTRACTION HINT:** Focus on the self-adaptive property and on the interpretability of learned variance as a diversity diagnostic.