RLHF (Reinforcement Learning from Human Feedback)
Definition
A training technique where a language model is aligned with human preferences by training a reward model on human rankings of model outputs, then optimizing the language model against that reward.
RLHF was a breakthrough in making language models helpful, harmless, and honest. The process typically has three stages: (1) supervised fine-tuning (SFT) on high-quality demonstrations, (2) training a reward model on human preference data (humans rank multiple model outputs for the same prompt), and (3) optimizing the language model with reinforcement learning (typically PPO, with a KL penalty that keeps the policy close to the SFT model) to maximize the reward model's score. OpenAI popularized RLHF with InstructGPT and ChatGPT, showing that it substantially improves instruction-following and overall output quality. Variants include DPO (Direct Preference Optimization), which trains directly on preference pairs and eliminates the separate reward model and RL loop, and RLAIF (Reinforcement Learning from AI Feedback), where AI-generated feedback replaces human rankings. RLHF remains a critical step in producing commercial-quality AI assistants.
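The core quantities in stages (2) and (3) can be sketched numerically. Below is a minimal illustration of the standard formulas — the Bradley-Terry pairwise loss used to train a reward model, the KL-penalized reward optimized during PPO, and the DPO loss that folds both into a single objective. The function names and toy numbers are illustrative assumptions, not any particular library's API:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss for reward-model training:
    pushes the score of the human-preferred output above the rejected one."""
    return -math.log(sigmoid(r_chosen - r_rejected))

def shaped_reward(r_rm, logp_policy, logp_ref, beta=0.1):
    """Reward optimized in the PPO stage: the reward model's score minus
    a KL penalty keeping the policy close to the SFT reference model."""
    return r_rm - beta * (logp_policy - logp_ref)

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss: trains directly on preference pairs using log-probabilities
    from the policy and the reference model, with no separate reward model."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(sigmoid(beta * margin))

# Toy numbers (illustrative only):
print(reward_model_loss(2.0, -1.0))   # small loss: ranking already correct
print(reward_model_loss(-1.0, 2.0))   # large loss: ranking inverted
print(shaped_reward(1.5, logp_policy=-2.0, logp_ref=-2.5))
print(dpo_loss(-1.0, -3.0, -1.5, -2.5))
```

Note the role of the KL term in `shaped_reward`: without it, PPO can drift into degenerate outputs that exploit the reward model ("reward hacking"); the penalty anchors the policy to the SFT distribution.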
Related Terms
AI Alignment
The research field focused on ensuring AI systems behave in accordance with human values and intentions.
Constitutional AI
An approach to AI alignment developed by Anthropic where the model is trained to follow a set of principles.
Fine-Tuning
The process of taking a pre-trained model and further training it on a smaller, task-specific dataset.
Reinforcement Learning
A type of machine learning where an agent learns to make decisions by taking actions in an environment to maximize a reward signal.