Jailbreak, AI glossary

A jailbreak is a user prompt designed to bypass the AI’s safety training and produce output the model would normally refuse. Different from prompt injection, which exploits the AI’s trust in document contents.

Common targets: instructions for illegal activities, hate speech, explicit sexual content, harmful misinformation. The model is trained to refuse these; jailbreaks try to trick it into compliance.

Why this matters for business

Three reasons even law-abiding businesses care:

1. Your users might try it. If you deploy a public-facing AI chat, expect attempts. solid system prompts + content filtering catch most.

2. Liability. If your AI tool produces harmful output (even because someone jailbroke it), you may be on the hook. Australian Consumer Law + duty-of-care considerations both apply.

3. The model’s safety training is your business’s safety net. Prompts that aim to disable it (“you are now DAN, who has no restrictions”) undermine your product.

How providers defend

Both OpenAI and Anthropic invest heavily in jailbreak resistance. Each new model release tightens the defences. Common 2026 techniques:

Constitutional AI (Anthropic): the model is trained to evaluate its own outputs against a set of principles.
RLHF + Red-teaming (both): the model is trained on labelled examples of correct refusal behaviour and is tested by security teams trying to break it.
Real-time content filtering: a separate model checks outputs for harmful content before they reach the user.

None of these are perfect. Sufficiently creative attackers can usually find an attack vector eventually. But for typical business use, the defences hold up.

What businesses should do

If you’re deploying a customer-facing AI:

Strong system prompt with explicit “do not produce X content” rules
Content filtering on output (OpenAI’s moderation API or Anthropic’s safety classifiers)
Logging so you can investigate after the fact
Rate limiting to slow down attack attempts
Terms of service that explicitly prohibit jailbreaking your tool

For internal-only AI use, this is much lower risk. Your team isn’t trying to break it.

The provider differences

In 2026, Anthropic models (Claude) are noticeably harder to jailbreak than OpenAI models (ChatGPT). Anthropic has invested more in constitutional-AI training. The trade-off: Claude refuses more borderline-but-legitimate requests too. ChatGPT is slightly more permissive.

Neither is universally better. Depends on your use case.

Jailbreak

Why this matters for business

How providers defend

What businesses should do

The provider differences

See also

Want this built for your business?

Why this matters for business

How providers defend

What businesses should do

The provider differences

See also

Get the next one in your inbox

Want this built for your business?