Back to GlossarySafety

AI Alignment

Definition

The research field focused on ensuring AI systems behave in accordance with human values, intentions, and goals, especially as systems become more capable.

AI alignment is considered one of the most important challenges in AI safety. The core problem is that as AI systems become more powerful, it becomes increasingly critical — and difficult — to ensure they do what humans actually want rather than pursuing misspecified objectives. Simple examples include reward hacking, where an RL agent finds unintended shortcuts to maximize a reward signal without achieving the intended goal. More concerning scenarios involve highly capable systems that might resist correction or pursue goals misaligned with human welfare. Research approaches include RLHF, constitutional AI, debate, scalable oversight, and interpretability. Organizations like Anthropic, OpenAI, DeepMind, and MIRI dedicate significant resources to alignment research. The field intersects with philosophy, cognitive science, and governance.

Companies in Safety

View Safety companies →