
Constitutional AI

Definition

An approach to AI alignment developed by Anthropic where the model is trained to follow a set of principles (a "constitution") that guide its behavior, using AI feedback to reduce reliance on human labeling.

Constitutional AI (CAI) is an alignment technique in which a model critiques and revises its own outputs against a set of written principles covering helpfulness, harmlessness, and honesty. Training proceeds in two stages: (1) supervised learning, where the model generates responses, critiques them according to the constitution, and revises them, with the revised outputs used for fine-tuning; and (2) reinforcement learning, where an AI preference model trained on the constitutional principles provides feedback in place of human labelers (reinforcement learning from AI feedback, or RLAIF). This reduces the need for extensive human feedback while making the model's values transparent and auditable through the written constitution. Anthropic's Claude models are trained using CAI. The approach is notable for making the training values explicit in a written document rather than implicit in human preference data.
