Guide

What is a context window, and why does it matter for your business?

Context windows in plain English. What they are, how they limit AI, why 1 million tokens matters, and the practical consequences for Australian small business use.

In short

A context window is how much information an AI can hold in working memory at once. In May 2026, Claude Sonnet 4.6 holds 1 million tokens (~750,000 words). GPT-5 holds 256k tokens (~190,000 words). Gemini 2.5 Pro holds 1 million tokens. Bigger context lets AI read whole books, audit whole codebases, and remember more of a long conversation. For most business chat, you don’t need the biggest; for analysing long documents, you do.

The 60-second mental model

Imagine you’re a consultant who can read very fast but has zero permanent memory. Every meeting, you start from scratch. Whatever someone tells you or hands you, you can hold in your head for the meeting, but once the meeting ends, it’s gone.

That’s an AI model. The “head holding things” capacity is the context window.

  • A small context (4k tokens) = you can read a brief of about 8 pages
  • A medium context (32k tokens) = you can read a document of about 65 pages
  • A large context (200k tokens) = you can read a report of about 400 pages
  • A very large context (1M tokens) = you can read about 2,000 pages, or several full-length books

The bigger the window, the more you can hand the model at once. But billing is per token, so the more context a query uses, the more it costs and the slower it runs.

What “token” actually means

A token is a chunk of text the model processes. Roughly:

  • 1 token ≈ 0.75 words in English
  • 1 page of text ≈ 500 tokens
  • 1 short email ≈ 100-200 tokens
  • 1 PDF book (300 pages) ≈ 150,000 tokens

The exact ratio varies by language and content type. Code uses more tokens per character. Common English words use fewer.

For most practical purposes: “tokens are how AI is billed and how much it can hold”. You don’t need to count them precisely.
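If you do want a ballpark figure, the rule of thumb above can be turned into a few lines of Python. This is a minimal sketch, not a real tokenizer: the function name `estimate_tokens` is made up for illustration, and the result assumes ordinary English prose at about 0.75 words per token.

```python
import math

# Rule of thumb from this guide: 1 token is about 0.75 English words.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    word_count = len(text.split())
    return math.ceil(word_count / WORDS_PER_TOKEN)


# Stand-in for a short 150-word email.
email = "word " * 150
print(estimate_tokens(email))
```

Expect the true count to differ, especially for code or non-English text. Each provider's own token counter is the place to go when precision matters.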

Why bigger isn’t always better

Three practical limits:

1. Cost scales linearly. A 1M-token query costs roughly 8x a 128k-token query (and a hundred times a 10k-token query). For everyday chat, this is overkill spending.

2. Speed scales inversely. Bigger context = slower response. A 10k-token query returns in 5-10 seconds. A 500k-token query might take 30-90 seconds.

3. Accuracy degrades in the middle. Models suffer from “lost in the middle” syndrome: they pay close attention to the start and end of context, but mid-context information sometimes gets glossed over. Stuffing 500k tokens because you can doesn’t mean the model uses them effectively.

The right move: match context to task. Use small context for short queries; large context for genuinely long documents.
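The cost point is simple arithmetic, sketched below. The helper `input_cost` is hypothetical, the price of $4.50 AUD per million input tokens is the rough Claude Sonnet 4.6 figure quoted in this guide, and a flat per-token rate is assumed (real pricing may be tiered or change).

```python
def input_cost(tokens: int, price_per_million: float) -> float:
    """Input cost under flat per-token billing (assumes no tiering)."""
    return tokens / 1_000_000 * price_per_million


PRICE = 4.50  # illustrative AUD per million input tokens

for size in (10_000, 128_000, 1_000_000):
    print(f"{size:>9,} tokens -> ${input_cost(size, PRICE):.3f}")

# Linear scaling: the ratio of costs equals the ratio of token counts.
ratio = input_cost(1_000_000, PRICE) / input_cost(128_000, PRICE)
print(f"1M vs 128k: {ratio:.1f}x")
```

Because the scaling is linear, the ratio between two query sizes is the same whatever the price per token.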

What 1 million tokens actually unlocks

The 1M-token context (Claude Sonnet 4.6 + Gemini 2.5 Pro) makes new things possible:

Whole-book analysis. Upload a 300-page legal contract. Ask “what are the indemnification clauses and how do they compare to the industry standard?” The model reads it all.

Whole-codebase audits. Paste an entire 50,000-line codebase. Ask “find security vulnerabilities”. The model reviews the lot.

Multi-month conversation logs. Past clients’ Slack history for context on their preferences. Past emails for tone training. Past customer support tickets for pattern recognition.

Bulk SKU analysis. Paste 5,000 product descriptions. Ask “find duplicates, identify SKUs without images, flag inconsistent pricing”.

All of these were impossible at 32k tokens. They’re routine at 1M.

What you’d care about for normal business use

Most Australian SMB AI use sits well below the context limits:

| Task | Typical tokens | Limit you’ll hit |
| --- | --- | --- |
| Drafting an email | 500-2,000 | Never |
| Long-form blog post | 2,000-10,000 | Never |
| Reviewing a contract (10 pages) | 5,000-15,000 | Never |
| Analysing a quarterly report (50 pages) | 30,000-50,000 | 32k limit if on older models |
| Whole-website audit (50 pages) | 50,000-100,000 | 128k limit on older models |
| Full codebase review | 100k-500k+ | Any window below 200k |
| Reading a full novel | 100k-300k | Most 2025-era models |

Practical guidance: pay attention to context limits when you’re doing a specific, document-heavy task. Ignore them for chat-based work.
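For a document-heavy task, a quick check like the one below tells you whether the input is likely to fit. It is a sketch under assumptions: `fits_in_window` is a hypothetical helper, and the 4,000-token allowance for the model's reply is an illustrative default, since the reply has to fit in the same window as the input.

```python
def fits_in_window(input_tokens: int, window_tokens: int,
                   reserve_for_reply: int = 4_000) -> bool:
    """True if the input plus a reserved reply allowance fits in the window."""
    return input_tokens + reserve_for_reply <= window_tokens


# A 50-page quarterly report at the upper estimate of 50,000 tokens.
report_tokens = 50_000
print(fits_in_window(report_tokens, 32_000))   # older 32k model
print(fits_in_window(report_tokens, 256_000))  # 256k model
```

The first check fails and the second passes, which matches the table: the report only becomes a problem on older, smaller-window models.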

How to use context efficiently

Three patterns that save money and improve accuracy:

1. Start fresh for different tasks. Don’t try to have one long-running chat that covers everything. Start a new conversation for each distinct task. Smaller context, faster response, better accuracy.

2. Summarise long inputs first. Instead of pasting 100 pages and asking a question, first ask the AI to summarise the document into 5 bullet points. Then start a new conversation with just the summary + your question.

3. Put your question at the end. Models pay closest attention to the most-recent content. If you paste context and then ask a question, put the question at the very bottom.
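If you build prompts in code rather than in a chat box, pattern 3 looks something like this. The function name `build_prompt` and the labels "Reference material" and "Question" are illustrative choices, not a required format.

```python
def build_prompt(document: str, question: str) -> str:
    """Place the reference material first and the question last."""
    return (
        "Reference material:\n"
        f"{document.strip()}\n\n"
        f"Question: {question.strip()}"
    )


prompt = build_prompt(
    document="Clause 14: The supplier indemnifies the customer against...",
    question="Which clauses deal with indemnification?",
)
print(prompt)
```

The same ordering applies when pasting by hand: document first, question at the very bottom.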

The provider differences

Mid-2026 state:

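| Provider | Model | Context | Pricing (rough AUD per M tokens in) |
| --- | --- | --- | --- |
| Anthropic | Claude Sonnet 4.6 | 1M | ~$4.50 |
| Anthropic | Claude Opus 4.7 | 1M | ~$22 |
| Anthropic | Claude Haiku 4.5 | 1M | ~$1.20 |
| OpenAI | GPT-5 | 256k | ~$30 |
| OpenAI | GPT-5 mini | 256k | ~$1.50 |
| Google | Gemini 2.5 Pro | 1M | ~$5 |
| Google | Gemini 2.5 Flash | 1M | ~$0.75 |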

If long context matters and budget matters too, Claude Sonnet 4.6 is the sweet spot. If cheap-and-fast matters more, Gemini 2.5 Flash. If you’re on ChatGPT and don’t need the absolute biggest, GPT-5 at 256k is plenty for 95% of business tasks.

The “prompt caching” bonus

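Anthropic introduced prompt caching in 2024 (now widely supported across providers). The idea: if you reuse the same context across many queries (e.g. a 200-page knowledge base that every query references), the provider caches that context after the first query, so the repeated portion is billed at roughly 90% less from the second query onwards. Only the cached portion gets the discount; each new question is still billed at the normal rate.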

This is how we run agents that touch large contexts repeatedly without breaking the budget. See Prompt caching for the deeper explanation.
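To see what that discount does to a bill, here is a rough worked example. All the numbers are assumptions for illustration: a 200-page knowledge base at about 500 tokens per page, 200-token questions, 50 queries, $4.50 AUD per million input tokens, and a flat 90% discount on the cached portion. The helper `billable_tokens` is hypothetical.

```python
def billable_tokens(shared_tokens: int, question_tokens: int, queries: int,
                    cached_discount: float = 0.0) -> float:
    """Full-price-equivalent input tokens across a run of queries.

    The first query pays full price for everything. Later queries pay
    (1 - cached_discount) of the normal rate on the shared context.
    """
    if queries <= 0:
        return 0.0
    first = shared_tokens + question_tokens
    later = shared_tokens * (1 - cached_discount) + question_tokens
    return first + (queries - 1) * later


PRICE = 4.50      # illustrative AUD per million input tokens
SHARED = 100_000  # 200 pages at roughly 500 tokens per page
QUESTION = 200    # a short question
QUERIES = 50

uncached = billable_tokens(SHARED, QUESTION, QUERIES) / 1_000_000 * PRICE
cached = billable_tokens(SHARED, QUESTION, QUERIES, cached_discount=0.90) / 1_000_000 * PRICE
print(f"without caching: ${uncached:.2f}")
print(f"with caching:    ${cached:.2f}")
```

This sketch ignores cache expiry and any surcharge for writing to the cache, both of which vary by provider, so treat the result as an upper bound on the saving.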

Common questions

What's a token?
Roughly three-quarters of an English word. 'Australian small business' is 4-5 tokens. A 500-word email is ~650 tokens. A 100-page PDF might be 50,000 tokens. Tokens are how AI providers bill (per million tokens in + per million tokens out).
Should I always use the model with the biggest context?
No. Bigger context = higher per-query cost and often slower response time. Match the model to the task. For most chat conversations, even 32k tokens is plenty. For analysing long documents, 200k+ matters.
How does context relate to memory?
Models have no persistent memory between conversations by default. Each conversation starts fresh. The 'memory' inside a conversation IS the context window. Once it fills up, older content gets compressed or dropped.
What happens when I exceed the context window?
It depends on the tool. Some refuse the message. Some silently truncate the oldest content (the worst case, because you don't know what was lost). Some compress older content into summaries (Claude does this in long Claude Code sessions). It is always preferable to start a fresh session for a very different task.
Does context window affect accuracy?
Yes, indirectly. Models can use information from anywhere in the context window, but accuracy degrades in the middle of the window (the 'lost in the middle' problem), while the very start and very end get the closest attention. For best accuracy, put the most important context near the top or bottom and the question at the end.
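For the curious, the "truncate the oldest content" behaviour described in the answers above can be sketched in a few lines. This is an illustration of the general idea, not how any particular product works: `rough_tokens` and `trim_to_budget` are hypothetical helpers, and token counts use the 0.75 words-per-token rule of thumb.

```python
import math


def rough_tokens(text: str) -> int:
    """Rough token estimate: about 0.75 English words per token."""
    return math.ceil(len(text.split()) / 0.75)


def trim_to_budget(messages: list[str], budget_tokens: int) -> list[str]:
    """Keep the most recent messages that fit the budget, dropping the oldest."""
    kept: list[str] = []
    used = 0
    # Walk backwards from the newest message and stop when the budget is full.
    for message in reversed(messages):
        cost = rough_tokens(message)
        if used + cost > budget_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["first message here", "second message here", "third message here"]
print(trim_to_budget(history, budget_tokens=8))
```

Nothing in the output tells you the first message was dropped, which is why silent truncation is the worst of the three behaviours.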
