Latency, AI glossary

Latency is the time it takes from sending a prompt to receiving the response. For AI APIs, latency is typically measured as:

Time to first token (TTFT): how long until the response starts streaming back
Total response time: how long until the full response is complete

For Claude in 2026, from a Sydney-based client:

So a typical Sonnet response of 500 output tokens lands in 5-13 seconds total.

When latency matters

Background agents that run overnight (60 seconds vs 6 seconds is irrelevant if you’re asleep)
Document drafts where you’ll edit before sending
Batch processing where you’re handing the model 100 tasks
One-off research queries

Pick a faster model. Haiku < Sonnet < Opus by 2-3x.
Reduce input tokens. Smaller prompts process faster.
Reduce output tokens. Shorter responses finish sooner.
Use streaming. Even though total time is the same, TTFT is when “something happens”, better UX.
Pre-warm with prompt caching. Subsequent calls in a session start faster.
Geographic proximity. If you’re serving global users, deploy via Bedrock or Vertex in the right region.

Quality. If you’re getting wrong answers fast, fixing latency makes them wrong faster. Get the right answer first; optimise for speed second.