Direct Preference Optimization (DPO)
Definition
A simplified alternative to RLHF for aligning language models with human preferences. DPO eliminates the need to train a separate reward model by optimizing the language model directly on preference pairs: it exploits the fact that the RLHF objective has a closed-form optimal policy, which lets the reward be expressed in terms of the policy itself. DPO is computationally cheaper and more stable than RLHF while achieving comparable alignment results.
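The resulting loss can be computed directly from sequence log-probabilities under the policy and a frozen reference model. A minimal sketch (the function name, inputs, and numbers below are illustrative, not from any particular library):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss from summed response log-probabilities.

    beta scales the implicit reward, i.e. the log-ratio of the
    policy to the reference model on each response.
    """
    # Implicit rewards: beta-scaled policy/reference log-ratios.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # Negative log-sigmoid of the margin: small when the policy
    # prefers the chosen response more than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probabilities for one preference pair.
loss_good = dpo_loss(-10.0, -30.0, -20.0, -20.0)  # policy favors chosen
loss_bad = dpo_loss(-30.0, -10.0, -20.0, -20.0)   # policy favors rejected
```

Because no reward model is sampled from or trained, each update is a plain supervised gradient step on this loss over a batch of preference pairs.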