Multimodal, AI glossary

A multimodal model accepts more than one type of input, typically text plus images, sometimes audio and video too. As of mid-2026, all the major frontier models (Claude 4.x, GPT-5, Gemini 2.5) are multimodal by default.

In practice, “multimodal” usually means you can upload a screenshot and ask Claude to describe what’s in it, or paste a chart and ask for analysis, or hand it a photo of a whiteboard and ask for the equation to be transcribed.

What it unlocks for Australian small business

Receipt + invoice OCR: paste a photo of a receipt, get structured data back (vendor, date, items, AUD total, GST)
Screenshot Q&A: take a screenshot of your Shopify analytics, ask “what’s going on with conversion this week?”
Document analysis: PDFs and scanned documents become readable + analysable
Diagram understanding: feed a hand-drawn architecture sketch, get a written summary
Brand image review: paste a draft social post graphic, ask “does this match our brand voice notes?”

Cost implications

Image input is billed at a different rate than text:

Claude charges by image at roughly the equivalent of ~1,500 tokens per typical image (USD ~$0.005 on Sonnet 4.6, AUD ~$0.008)
Larger images cost more (up to ~7k tokens for a high-resolution image)
Resize images before uploading where possible, 1024px wide is usually enough

For occasional use, multimodal is negligible. For heavy batch image processing (e.g. OCR-ing 1,000 receipts), the cost adds up; pre-process with cheaper OCR (Tesseract, AWS Textract) first and only hand Claude the structured output.

What multimodal still doesn’t do well

Reading dense, small text in screenshots (transcribe with a dedicated OCR first)
Counting precisely (Claude is okay-but-not-great at “how many people are in this photo?”)
Generating images (frontier text models like Claude don’t generate images; for that you want Midjourney, DALL·E 3, or Higgsfield)
Real-time video understanding (still emerging; voice + frame-by-frame works but isn’t smooth)

What it unlocks for Australian small business

Cost implications

What multimodal still doesn’t do well

Related terms

Get the next one in your inbox

Want this built for your business?

Keep reading

Large Language Model

Context window

Embedding