teleo-codex/inbox/archive/2026-01-00-mixdpo-preference-strength-pluralistic.md
2026-03-11 14:48:05 +00:00

44 lines
3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
type: source
title: "MixDPO: Modeling Preference Strength for Pluralistic Alignment"
author: "Various (arXiv 2601.06180)"
url: https://arxiv.org/html/2601.06180
date: 2026-01-01
domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
priority: high
tags: [pluralistic-alignment, DPO, preference-strength, distributional-modeling, heterogeneity]
---
## Content
MixDPO generalizes Direct Preference Optimization by treating the preference sensitivity parameter β as a learned distribution rather than a fixed scalar.
**Mechanism:**
- Standard DPO: fixed β controls preference signal strength across all examples
- MixDPO: β drawn from a distribution p(β), optimized jointly with policy parameters θ
- Two distributional families: LogNormal (Monte Carlo, K=16 samples) and Gamma (closed-form via Lerch transcendent)
- Learned variance reflects dataset-level preference heterogeneity
**Key Results:**
- PRISM (high heterogeneity): +11.2 win rate points on Pythia-2.8B
- Macro-averaged preference margins improve while micro-averaged remain competitive
- Anthropic HH (low heterogeneity): converges to low variance, minimal gains — self-adaptive
- Computational overhead: 1.02× (LogNormal), 1.1× (Gamma)
**Key Property:** Naturally collapses to fixed-strength behavior when preferences are homogeneous. This provides interpretability: the learned distribution diagnoses whether a dataset has diverse preferences without requiring demographic labels.
## Agent Notes
**Why this matters:** Unlike PAL which requires explicit mixture modeling, MixDPO adapts to heterogeneity automatically. The self-adaptive property means you don't need to know whether your data is diverse — the method discovers it.
**What surprised me:** The negligible computational overhead (1.02-1.1×). Pluralistic alignment doesn't have to be expensive.
**What I expected but didn't find:** No comparison with PAL or RLCF. No analysis of what the learned distribution reveals about real-world preference structures.
**KB connections:** Addresses [[RLHF and DPO both fail at preference diversity]] constructively. The self-adaptive property is relevant to [[complexity is earned not designed]] — start simple (standard DPO), earn complexity (distributional β) only when the data warrants it.
**Extraction hints:** Extract claims about: (1) preference heterogeneity being learnable from data without demographic labels, (2) self-adaptive methods that collapse to simpler behavior when complexity isn't needed.
**Context:** January 2026 preprint. Part of the explosion of DPO variants addressing heterogeneity.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
WHY ARCHIVED: Demonstrates that preference heterogeneity can be handled with minimal overhead and without prior knowledge of user demographics
EXTRACTION HINT: Focus on the self-adaptive property and the interpretability of learned variance as a diversity diagnostic