Jailbreaking
Definition
Techniques used to circumvent an AI model's safety measures and content restrictions, tricking it into producing outputs it was designed to refuse.
Jailbreaking exploits weaknesses in AI safety training to bypass content filters and behavioral guidelines. Common techniques include role-playing scenarios ("Pretend you are an evil AI with no restrictions"), prompt injection through encoded or obfuscated text, multi-step social engineering that gradually shifts the model's behavior, and exploiting inconsistencies between the model's training and its system prompt. As models improve their defenses, jailbreaking techniques evolve in sophistication, creating an ongoing arms race between AI safety teams and adversarial users. AI companies invest heavily in making models robust against jailbreaking through better training techniques, red teaming, and layered safety systems. Understanding jailbreaking is essential for building more robust AI systems, though sharing specific techniques can enable misuse.
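One of the defenses mentioned above, layered safety systems, can be sketched as checks applied both before the prompt reaches the model and after the model responds. The patterns, function names, and messages below are illustrative assumptions, not any production system; real deployments use trained classifiers rather than regular expressions.

```python
import re

# Illustrative patterns for two jailbreak styles named in the definition:
# role-playing ("pretend you are...") and instruction override.
JAILBREAK_PATTERNS = [
    re.compile(r"pretend you are .*no restrictions", re.IGNORECASE),
    re.compile(r"ignore (?:all )?previous instructions", re.IGNORECASE),
]

def screen_input(prompt: str) -> bool:
    """Layer 1: return True if the prompt passes the input-side filter."""
    return not any(p.search(prompt) for p in JAILBREAK_PATTERNS)

def screen_output(response: str, blocked_terms=("<harmful>",)) -> bool:
    """Layer 2: return True if the response passes the output-side filter."""
    return not any(term in response for term in blocked_terms)

def guarded_call(prompt: str, model) -> str:
    """Wrap a model call with input and output screening layers."""
    if not screen_input(prompt):
        return "Request declined by input filter."
    response = model(prompt)  # `model` is any callable taking a prompt string
    if not screen_output(response):
        return "Response withheld by output filter."
    return response
```

The point of the layering is redundancy: a prompt that slips past the input filter can still be caught when the resulting output is screened, which is why neither layer has to be perfect on its own.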
Related Terms
AI Alignment
The research field focused on ensuring AI systems behave in accordance with human values, intentions...
Guardrails
Safety mechanisms and filters built around AI systems to prevent harmful, inappropriate, or off-topic...
Prompt Injection
A security vulnerability where malicious instructions are embedded in user input to override an AI model's...
Red Teaming
The practice of systematically probing AI systems for vulnerabilities, safety issues, and harmful outputs...