Techniques

Direct Preference Optimization (DPO)

Definition

A simplified alternative to RLHF for aligning language models with human preferences. DPO eliminates the need for a separately trained reward model by optimizing the language model directly on preference pairs, treating the policy's log-probability ratios against a frozen reference model as implicit rewards. DPO is computationally cheaper and more stable to train than RLHF while achieving comparable alignment quality.
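The core of DPO is a single classification-style loss over preference pairs. A minimal sketch of that loss for one pair, assuming the caller has already computed summed token log-probabilities under the policy and the frozen reference model (the function and argument names here are illustrative, not from any specific library):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response
    (chosen = preferred, rejected = dispreferred) under either the
    trainable policy or the frozen reference model.
    """
    # Implicit rewards: beta-scaled log-ratios of policy vs. reference.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the reward margin (a binary logistic loss
    # that pushes the policy to prefer the chosen response).
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy matches the reference exactly, the margin is zero and
# the loss is log(2); raising the chosen response's likelihood lowers it.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # zero margin case
```

The `beta` hyperparameter plays the role of the KL penalty strength in RLHF: larger values keep the policy closer to the reference model.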
