Multimodal
A model that accepts more than one type of input (text + images, sometimes audio + video). Modern Claude, GPT, Gemini are all multimodal.
A multimodal model accepts more than one type of input, typically text plus images, sometimes audio and video too. As of mid-2026, all the major frontier models (Claude 4.x, GPT-5, Gemini 2.5) are multimodal by default.
In practice, “multimodal” usually means you can upload a screenshot and ask Claude to describe what’s in it, or paste a chart and ask for analysis, or hand it a photo of a whiteboard and ask for the equation to be transcribed.
What it unlocks for Australian small business
- Receipt + invoice OCR: paste a photo of a receipt, get structured data back (vendor, date, items, AUD total, GST)
- Screenshot Q&A: take a screenshot of your Shopify analytics, ask “what’s going on with conversion this week?”
- Document analysis: PDFs and scanned documents become readable + analysable
- Diagram understanding: feed a hand-drawn architecture sketch, get a written summary
- Brand image review: paste a draft social post graphic, ask “does this match our brand voice notes?”
Cost implications
Image input is billed at a different rate than text:
- Claude charges by image at roughly the equivalent of ~1,500 tokens per typical image (USD ~$0.005 on Sonnet 4.6, AUD ~$0.008)
- Larger images cost more (up to ~7k tokens for a high-resolution image)
- Resize images before uploading where possible, 1024px wide is usually enough
For occasional use, multimodal is negligible. For heavy batch image processing (e.g. OCR-ing 1,000 receipts), the cost adds up; pre-process with cheaper OCR (Tesseract, AWS Textract) first and only hand Claude the structured output.
What multimodal still doesn’t do well
- Reading dense, small text in screenshots (transcribe with a dedicated OCR first)
- Counting precisely (Claude is okay-but-not-great at “how many people are in this photo?”)
- Generating images (frontier text models like Claude don’t generate images; for that you want Midjourney, DALL·E 3, or Higgsfield)
- Real-time video understanding (still emerging; voice + frame-by-frame works but isn’t smooth)
Related terms
Want this built for your business?
Book a free 30-minute AI audit. We'll map your business and show you exactly which systems we'd build first. No pitch deck, no scoping fee.
Book my free AI auditOr have us run it for you, end to end: On Autopilot is Australia's outsourced AI department.