Prompt caching
A feature that lets language models reuse previously-processed prompt content at a discounted rate, usually ~90% off the input price. Major cost-saver on multi-turn conversations.
Prompt caching is a feature that lets a language model reuse previously-processed content from a prior request at a heavily discounted rate. Anthropic, OpenAI and others all support some form of it.
On Anthropic specifically, cached prompt tokens are billed at roughly 10% of the standard input rate. For long multi-turn sessions with a stable system prompt + tools + early conversation content, this cuts effective input cost dramatically.
How it works (Anthropic)
When you make a request to the Anthropic API (including via Claude Code), you can mark portions of your prompt as cacheable. Anthropic caches those portions for the next 5 minutes (or longer if hit again). Subsequent requests that share the same cached prefix pay only 10% of the normal input rate for that content.
In Claude Code, caching is on by default, you don’t need to think about it. The harness automatically marks the system prompt, tool definitions, and early conversation as cacheable.
What this means in AUD
A typical Claude Code session has a large constant prefix: your CLAUDE.md (~3-5k tokens), tool definitions (~3-8k tokens), and conversation history that grows over time.
Without caching:
- Turn 1: ~10k input tokens × $4.65/M (AUD on Sonnet 4.6) = $0.047
- Turn 5: ~30k input tokens × $4.65/M = $0.140
- Turn 20: ~80k input tokens × $4.65/M = $0.372
With caching:
- Turn 1: same, since nothing’s cached yet
- Turn 5: most of the input is a cache hit, ~$0.014 cost
- Turn 20: ~$0.037
The savings compound. Over a real working day, prompt caching cuts the input portion of your bill by 50-80%.
How to maximise cache hits
- Keep your CLAUDE.md stable. Don’t restructure mid-session.
- Use
/compactinstead of/clearwhen you can. Compact preserves cacheable structure; clear nukes the cache. - Don’t constantly add/remove tools. Tool definitions are part of the cached prefix.
- Long sessions beat short sessions for caching economics. The cache amortises.
What doesn’t cache
- Your most recent message
- The model’s most recent response
- Anything before the most recent edit to your CLAUDE.md or tool list
The cached portion is whatever’s stable + identical to a recent prior request.
Related terms
Want this built for your business?
Book a free 30-minute AI audit. We'll map your business and show you exactly which systems we'd build first. No pitch deck, no scoping fee.
Book my free AI audit