Loading…
Loading…
Written by Max Zeshut
Founder at Agentmelt
A training technique that aligns AI models with human preferences by directly optimizing on preference data (pairs of responses where one is preferred over the other) without requiring a separate reward model. DPO is simpler and more stable than RLHF while achieving similar alignment quality. It's increasingly used to train models that follow instructions accurately, refuse harmful requests, and maintain helpful behavior—all of which directly affect AI agent quality and safety.