Multimodal
A model that accepts more than one type of input (text + images, sometimes audio + video). Modern Claude, GPT, Gemini are all multimodal.
A multimodal model accepts more than one type of input, typically text plus images, sometimes audio and video too. As of mid-2026, all the major frontier models (Claude 4.x, GPT-5, Gemini 2.5) are multimodal by default.
In practice, “multimodal” usually means you can upload a screenshot and ask Claude to describe what’s in it, or paste a chart and ask for analysis, or hand it a photo of a whiteboard and ask for the equation to be transcribed.
What it unlocks for Australian small business
- Receipt + invoice OCR: paste a photo of a receipt, get structured data back (vendor, date, items, AUD total, GST)
- Screenshot Q&A: take a screenshot of your Shopify analytics, ask “what’s going on with conversion this week?”
- Document analysis: PDFs and scanned documents become readable + analysable
- Diagram understanding: feed a hand-drawn architecture sketch, get a written summary
- Brand image review: paste a draft social post graphic, ask “does this match our brand voice notes?”
Cost implications
Image input is billed at a different rate than text:
- Claude charges by image at roughly the equivalent of ~1,500 tokens per typical image (USD ~$0.005 on Sonnet 4.6, AUD ~$0.008)
- Larger images cost more (up to ~7k tokens for a high-resolution image)
- Resize images before uploading where possible, 1024px wide is usually enough
For occasional use, multimodal is negligible. For heavy batch image processing (e.g. OCR-ing 1,000 receipts), the cost adds up; pre-process with cheaper OCR (Tesseract, AWS Textract) first and only hand Claude the structured output.
What multimodal still doesn’t do well
- Reading dense, small text in screenshots (transcribe with a dedicated OCR first)
- Counting precisely (Claude is okay-but-not-great at “how many people are in this photo?”)
- Generating images (frontier text models like Claude don’t generate images; for that you want Midjourney, DALL·E 3, or Higgsfield)
- Real-time video understanding (still emerging; voice + frame-by-frame works but isn’t smooth)
Related terms
Want this built for your business?
Book a free 30-minute AI audit. We'll map your business and show you exactly which systems we'd build first. No pitch deck, no scoping fee.
Book my free AI audit