Glossary

Multimodal

A model that accepts more than one type of input (text + images, sometimes audio + video). Modern Claude, GPT, Gemini are all multimodal.

A multimodal model accepts more than one type of input, typically text plus images, sometimes audio and video too. As of mid-2026, all the major frontier models (Claude 4.x, GPT-5, Gemini 2.5) are multimodal by default.

In practice, “multimodal” usually means you can upload a screenshot and ask Claude to describe what’s in it, or paste a chart and ask for analysis, or hand it a photo of a whiteboard and ask for the equation to be transcribed.

What it unlocks for Australian small business

  • Receipt + invoice OCR: paste a photo of a receipt, get structured data back (vendor, date, items, AUD total, GST)
  • Screenshot Q&A: take a screenshot of your Shopify analytics, ask “what’s going on with conversion this week?”
  • Document analysis: PDFs and scanned documents become readable + analysable
  • Diagram understanding: feed a hand-drawn architecture sketch, get a written summary
  • Brand image review: paste a draft social post graphic, ask “does this match our brand voice notes?”

Cost implications

Image input is billed at a different rate than text:

  • Claude charges by image at roughly the equivalent of ~1,500 tokens per typical image (USD ~$0.005 on Sonnet 4.6, AUD ~$0.008)
  • Larger images cost more (up to ~7k tokens for a high-resolution image)
  • Resize images before uploading where possible, 1024px wide is usually enough

For occasional use, multimodal is negligible. For heavy batch image processing (e.g. OCR-ing 1,000 receipts), the cost adds up; pre-process with cheaper OCR (Tesseract, AWS Textract) first and only hand Claude the structured output.

What multimodal still doesn’t do well

  • Reading dense, small text in screenshots (transcribe with a dedicated OCR first)
  • Counting precisely (Claude is okay-but-not-great at “how many people are in this photo?”)
  • Generating images (frontier text models like Claude don’t generate images; for that you want Midjourney, DALL·E 3, or Higgsfield)
  • Real-time video understanding (still emerging; voice + frame-by-frame works but isn’t smooth)
Related terms

Want this built for your business?

Book a free 30-minute AI audit. We'll map your business and show you exactly which systems we'd build first. No pitch deck, no scoping fee.

Book my free AI audit