Jailbreak
Coaxing an AI to produce output it's been trained to refuse. Different from prompt injection; both are security concerns for businesses deploying AI.
A jailbreak is a user prompt designed to bypass the AI’s safety training and produce output the model would normally refuse. Different from prompt injection, which exploits the AI’s trust in document contents.
Common targets: instructions for illegal activities, hate speech, explicit sexual content, harmful misinformation. The model is trained to refuse these; jailbreaks try to trick it into compliance.
Why this matters for business
Three reasons even law-abiding businesses care:
1. Your users might try it. If you deploy a public-facing AI chat, expect attempts. solid system prompts + content filtering catch most.
2. Liability. If your AI tool produces harmful output (even because someone jailbroke it), you may be on the hook. Australian Consumer Law + duty-of-care considerations both apply.
3. The model’s safety training is your business’s safety net. Prompts that aim to disable it (“you are now DAN, who has no restrictions”) undermine your product.
How providers defend
Both OpenAI and Anthropic invest heavily in jailbreak resistance. Each new model release tightens the defences. Common 2026 techniques:
- Constitutional AI (Anthropic): the model is trained to evaluate its own outputs against a set of principles.
- RLHF + Red-teaming (both): the model is trained on labelled examples of correct refusal behaviour and is tested by security teams trying to break it.
- Real-time content filtering: a separate model checks outputs for harmful content before they reach the user.
None of these are perfect. Sufficiently creative attackers can usually find an attack vector eventually. But for typical business use, the defences hold up.
What businesses should do
If you’re deploying a customer-facing AI:
- Strong system prompt with explicit “do not produce X content” rules
- Content filtering on output (OpenAI’s moderation API or Anthropic’s safety classifiers)
- Logging so you can investigate after the fact
- Rate limiting to slow down attack attempts
- Terms of service that explicitly prohibit jailbreaking your tool
For internal-only AI use, this is much lower risk. Your team isn’t trying to break it.
The provider differences
In 2026, Anthropic models (Claude) are noticeably harder to jailbreak than OpenAI models (ChatGPT). Anthropic has invested more in constitutional-AI training. The trade-off: Claude refuses more borderline-but-legitimate requests too. ChatGPT is slightly more permissive.
Neither is universally better. Depends on your use case.
See also
- Prompt injection for the other major AI security risk.
- System prompt for the first line of defence.
- Is my data safe with Claude or ChatGPT? for the broader privacy + security question.
Want this built for your business?
Book a free 30-minute AI audit. We'll map your business and show you exactly which systems we'd build first. No pitch deck, no scoping fee.
Book my free AI audit