What is DPO (Direct Preference Optimization)?
An alignment technique that fine-tunes LLMs directly on human preference pairs without training a separate reward model.
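DPO simplifies the RLHF pipeline by reformulating alignment as a supervised learning problem over pairs of preferred and rejected outputs. Because it needs neither a separate reward model nor on-policy sampling during training, it is generally more stable and computationally cheaper than PPO-based RLHF, and it is reported to achieve comparable quality.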
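To make the "supervised learning over preference pairs" idea concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes you already have the summed token log-probabilities of the preferred and rejected responses under both the model being trained and a frozen reference model. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions, not part of any particular library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio))).

    All inputs are sequence log-probabilities (hypothetical names); beta controls
    how strongly the policy is pushed away from the reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin: positive when the policy favors the
    # preferred response more than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Raising the preferred response's log-probability and lowering the
# rejected one's reduces the loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

In a real training loop the same expression would be computed over batches of tensors and averaged, with gradients flowing only through the policy log-probabilities; the reference model stays frozen.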