
Jailbreaking

Definition

Techniques used to circumvent an AI model's safety measures and content restrictions, tricking it into producing outputs it was designed to refuse.

Jailbreaking exploits weaknesses in AI safety training to bypass content filters and behavioral guidelines. Common techniques include role-playing scenarios ("Pretend you are an evil AI with no restrictions"), prompt injection through encoded or obfuscated text, multi-step social engineering that gradually shifts the model's behavior, and exploiting inconsistencies between the model's training and its system prompt.

As models improve their defenses, jailbreaking techniques grow more sophisticated, creating an ongoing arms race between AI safety teams and adversarial users. AI companies invest heavily in making models robust against jailbreaking through better training techniques, red teaming, and layered safety systems. Understanding jailbreaking is essential for building more robust AI systems, though sharing specific techniques can enable misuse.
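The defense side is easier to illustrate than the attack side. The sketch below shows, in rough outline, what a "layered safety system" means in practice: a lightweight check on the incoming prompt plus a separate check on the model's output, so a request that slips past one layer can still be caught by another. The pattern list, function names (`flag_input`, `flag_output`, `guarded_generate`), and marker string are hypothetical placeholders for this illustration, far simpler than production classifiers and not any vendor's actual implementation.

```python
import re

# Toy illustration of a layered safety pipeline: an input-side heuristic
# in front of the model and an output-side check afterward. All names and
# patterns here are invented for the sketch.

SUSPICIOUS_INPUT_PATTERNS = [
    r"pretend you are .* with no restrictions",   # role-play framing
    r"ignore (all|your) previous instructions",   # instruction-override framing
]

def flag_input(prompt: str) -> bool:
    """Return True if the prompt matches a known jailbreak pattern."""
    lowered = prompt.lower()
    return any(re.search(p, lowered) for p in SUSPICIOUS_INPUT_PATTERNS)

def flag_output(response: str) -> bool:
    """Stand-in for an output-side classifier (e.g., a moderation model)."""
    return "BLOCKED_CONTENT_MARKER" in response

def guarded_generate(prompt: str, generate) -> str:
    """Run generation only if both the input and output layers pass."""
    if flag_input(prompt):
        return "Request refused by input filter."
    response = generate(prompt)
    if flag_output(response):
        return "Response withheld by output filter."
    return response
```

Real systems replace these regex heuristics with trained classifiers and combine them with safety training of the model itself, but the layering principle is the same: no single filter is expected to stop every jailbreak on its own.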
