
RLHF (Reinforcement Learning from Human Feedback)

Definition

A training technique where a language model is aligned with human preferences by training a reward model on human rankings of model outputs, then optimizing the language model against that reward.

RLHF was a breakthrough in making language models helpful, harmless, and honest. The process typically has three stages: (1) supervised fine-tuning on high-quality demonstrations; (2) training a reward model on human preference data, where humans rank multiple model outputs; and (3) optimizing the language model with reinforcement learning (typically PPO, Proximal Policy Optimization) to maximize the reward model's score. OpenAI popularized RLHF with InstructGPT and ChatGPT, demonstrating that it dramatically improves model behavior and user experience. Variants include DPO (Direct Preference Optimization), which simplifies the pipeline by eliminating the separate reward model, and RLAIF (Reinforcement Learning from AI Feedback), where AI-generated feedback replaces human labels. RLHF remains a critical step in producing commercial-quality AI assistants.
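The preference-learning objectives behind stages (2) and (3) can be sketched concretely. Below is a minimal illustration, not a full training loop: the rewards and log-probabilities are scalar stand-ins for values a real model would produce, and the function names are our own. The reward model is typically trained with a Bradley-Terry pairwise loss, and DPO reuses the same pairwise idea directly on policy log-probabilities, which is how it avoids the separate reward model.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function used by both pairwise losses."""
    return 1.0 / (1.0 + math.exp(-x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss for stage (2).

    r_chosen / r_rejected are the reward model's scalar scores for the
    human-preferred and human-rejected responses. Minimizing
    -log(sigmoid(r_chosen - r_rejected)) pushes the chosen score above
    the rejected one.
    """
    return -math.log(sigmoid(r_chosen - r_rejected))


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO variant: the same pairwise loss, applied directly to the policy.

    Instead of reward-model scores, the implicit "reward" is the policy's
    log-probability of each response relative to a frozen reference
    (SFT) model, scaled by beta.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    return -math.log(sigmoid(margin))
```

When the two responses score equally, both losses equal log 2 (maximum uncertainty), and each loss shrinks as the margin in favor of the chosen response grows, which is what drives the preference signal into the model.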
