Constitutional AI
Definition
An approach to AI alignment developed by Anthropic where the model is trained to follow a set of principles (a "constitution") that guide its behavior, using AI feedback to reduce reliance on human labeling.
Constitutional AI (CAI) is an alignment technique in which a model critiques and revises its own outputs according to a set of written principles (the "constitution") covering helpfulness, harmlessness, and honesty. Training proceeds in two stages: (1) a supervised learning stage, where the model generates responses, critiques them against the constitutional principles, and revises them, with the revised responses used for fine-tuning; and (2) a reinforcement learning stage, where an AI evaluator trained on the constitutional principles supplies preference feedback in place of human labelers (reinforcement learning from AI feedback, or RLAIF). This reduces the need for extensive human feedback while making the model's training values transparent and auditable through the written constitution. Anthropic's Claude models are trained using CAI, and the approach is notable for making those values explicit rather than leaving them implicit in human feedback data.
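The two stages above can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's actual pipeline: the `model` callable, the prompt templates, and the example principles are all hypothetical stand-ins for a real language model and a real constitution.

```python
# Hypothetical constitution: in practice this is a longer list of
# carefully worded principles.
CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid content that could assist with dangerous or unethical activities.",
]

def critique_and_revise(model, prompt, principles):
    """Stage 1 (supervised): generate a draft, then critique and revise it
    once per principle. The final revisions become fine-tuning data.

    `model` is any callable mapping a prompt string to a completion string.
    """
    response = model(prompt)
    for principle in principles:
        critique = model(
            f"Critique this response to '{prompt}' "
            f"against the principle: {principle}\n\nResponse: {response}"
        )
        response = model(
            f"Revise the response to address this critique: {critique}\n\n"
            f"Original response: {response}"
        )
    return response

def ai_preference(model, prompt, response_a, response_b, principle):
    """Stage 2 (RLAIF): ask the model which response better follows a
    principle. These AI-generated labels train the reward model used in RL."""
    verdict = model(
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    return response_a if "A" in verdict else response_b
```

In a real system the revised (prompt, response) pairs from stage 1 are used for supervised fine-tuning, and the preference labels from stage 2 train a reward model for reinforcement learning.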
Related Terms
AI Alignment
The research field focused on ensuring AI systems behave in accordance with human values, intentions, and goals.
RLHF (Reinforcement Learning from Human Feedback)
A training technique where a language model is aligned with human preferences by training a reward model on human preference comparisons.
Guardrails
Safety mechanisms and filters built around AI systems to prevent harmful, inappropriate, or off-topic outputs.
AI Ethics
The branch of applied ethics examining the moral implications and societal impacts of artificial intelligence.