Most professionals have tasks that require AI to understand images. Screenshots of error messages. PDFs that are actually scans. Product photos to describe and tag. Diagrams to explain. Charts to analyze. For every one of these tasks, using a cloud vision API means sending those images to servers you do not control.
Ollama’s vision model support changes this. You can run image-understanding models entirely locally — analyzing sensitive screenshots, confidential documents, and proprietary diagrams without any data leaving your machine.
Vision capability has reached the threshold of practical usefulness on consumer hardware. Gemma 4 9B runs on a gaming GPU, analyzes images accurately, and supports tool calling — the first local vision model that genuinely handles professional-grade visual tasks.
This guide covers every vision model available via Ollama, practical workflows for the most common use cases, and how to build vision pipelines for automated image processing.
🔗 This is Post #8 in the Ollama Unlocked series. For hardware requirements, see Hardware Guide (Post #3). For building automated vision pipelines, see Building AI Apps With Ollama and Python (Post #11).
Available Vision Models in Ollama
Gemma 4 (Google — April 2026) — Recommended
ollama pull gemma4:9b # Best quality-to-hardware ratio
ollama pull gemma4:27b # Higher quality, 18GB VRAM
Gemma 4 is the current best vision model for local use. Key capabilities:
- Multimodal from the ground up — vision is not bolted on, it is native to the architecture
- Tool calling — can call functions based on image content
- Document understanding — strong on PDFs, forms, tables, charts
- Multiple images — handles multiple images in a single conversation
- Languages — multilingual image understanding including text in images
VRAM: Gemma 4 9B requires ~7GB; Gemma 4 27B requires ~18GB
Llama 3.2 Vision (Meta)
ollama pull llama3.2-vision:11b # Best Llama vision option
ollama pull llama3.2-vision:90b # High quality, 48GB+ VRAM
Meta’s Llama 3.2 Vision was the first widely-available high-quality local vision model. It remains a strong option, particularly for users already running Llama models who want vision capability in the same model family.
Strengths: Strong on natural image description, scene understanding, object recognition
Weaknesses: Less capable than Gemma 4 on document-heavy tasks and structured data in images
Moondream (Lightweight Vision)
ollama pull moondream # 1.7B — runs on anything
ollama pull moondream2 # Improved version
Moondream is a tiny vision model — 1.7B parameters — designed to run on minimal hardware including CPU-only systems. Quality is limited compared to larger models but it handles basic image description, object detection, and simple captioning tasks.
Best for: Embedded applications, constrained hardware, high-volume captioning where speed matters over depth
LLaVA
ollama pull llava:7b # Classic vision model
ollama pull llava:13b # Better quality
ollama pull llava:34b # Best quality LLaVA
ollama pull llava-phi3 # Microsoft Phi-3 based variant
LLaVA (Large Language and Vision Assistant) is an older but still functional vision architecture. Gemma 4 and Llama 3.2 Vision outperform it on most benchmarks, but LLaVA 34B remains capable for users with sufficient VRAM who want a proven model.
Basic Vision Usage
Command Line Image Analysis
# Pass an image file directly
ollama run gemma4:9b "Describe what you see in this image" /path/to/image.jpg
# Or use the interactive mode and paste image path
ollama run gemma4:9b
# Then in the prompt: [image path] followed by your question
API Image Analysis
import ollama
import base64
from pathlib import Path

def analyze_image(image_path: str, question: str, model: str = "gemma4:9b") -> str:
    """Analyze an image with a local vision model."""
    # Read and encode the image
    image_data = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = ollama.chat(
        model=model,
        messages=[
            {
                "role": "user",
                "content": question,
                "images": [image_data]
            }
        ]
    )
    return response["message"]["content"]

# Usage examples
description = analyze_image("product_photo.jpg", "Describe this product for an e-commerce listing")
print(description)
error_analysis = analyze_image("error_screenshot.png",
                               "What error is shown in this screenshot and what is likely causing it?")
print(error_analysis)
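If you would rather not depend on the Python client, the same request can be sent to Ollama's HTTP API with the standard library alone. This is a minimal sketch that assumes the server is listening on its default local address; `build_chat_payload` and `chat_with_image` are illustrative names, not part of any library.

```python
import base64
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # assumed default local address

def build_chat_payload(image_bytes: bytes, question: str, model: str = "gemma4:9b") -> dict:
    """Build the JSON body for a single-image chat request."""
    return {
        "model": model,
        "stream": False,  # ask for one complete response instead of streamed chunks
        "messages": [{
            "role": "user",
            "content": question,
            # Images travel as base64 strings alongside the text prompt
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
    }

def chat_with_image(image_bytes: bytes, question: str, model: str = "gemma4:9b") -> str:
    """POST the payload to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(image_bytes, question, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_CHAT_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())["message"]["content"]
```

Nothing here touches the network until `chat_with_image` is called, so the payload builder can be inspected or unit-tested without a running server.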
Practical Vision Workflows
Workflow 1: Screenshot and Error Analysis
One of the highest-value use cases — analyzing error screenshots, UI bugs, and console output as images:
import json
import re

def analyze_error_screenshot(screenshot_path: str) -> dict:
    """Extract error information from a screenshot."""
    prompt = """Analyze this screenshot and extract:
1. The exact error message (if visible)
2. The error type/category
3. The likely root cause
4. Recommended debugging steps
Return as JSON:
{
"error_message": "...",
"error_type": "...",
"likely_cause": "...",
"debug_steps": ["step1", "step2", ...]
}"""
    response = analyze_image(screenshot_path, prompt)
    # Extract the outermost {...} span from the response
    json_match = re.search(r'\{.*\}', response, re.DOTALL)
    if json_match:
        try:
            return json.loads(json_match.group())
        except json.JSONDecodeError:
            pass  # Malformed JSON: fall back to the raw response below
    return {"raw_response": response}
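A greedy `\{.*\}` match runs from the first opening brace to the last closing brace, so it breaks when the model adds prose containing braces after the JSON. One alternative, sketched here with a hypothetical helper name, is to let the standard `json` decoder try each opening brace in turn and stop at the first object that parses.

```python
import json

def extract_first_json_object(text: str):
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows
            value, _ = decoder.raw_decode(text, start)
            if isinstance(value, dict):
                return value
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    return None
```

Returning `None` instead of raising keeps the caller's fallback path (storing the raw response) simple.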
Workflow 2: Document and PDF Analysis
For scanned documents, forms, and image-based PDFs:
def extract_document_data(document_image_path: str, schema: str) -> str:
    """Extract structured data from a document image."""
    prompt = f"""Extract data from this document image.
Expected fields to find:
{schema}
For each field:
- If found, provide the exact value as shown
- If not found, indicate 'not found'
- If unclear/ambiguous, indicate 'unclear: [what you see]'
Return as JSON with field names as keys."""
    return analyze_image(document_image_path, prompt)

# Example: Invoice processing
invoice_schema = """
- invoice_number
- date
- vendor_name
- total_amount
- line_items (array of description + amount)
- payment_terms
"""
result = extract_document_data("invoice_scan.jpg", invoice_schema)
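Because the prompt asks the model to mark fields as 'not found' or 'unclear: ...', a small post-processing step can flag which fields need a human look. The sketch below assumes the reply has already been reduced to a plain JSON string; `review_extraction` is a hypothetical helper, not part of Ollama.

```python
import json

def review_extraction(raw_json: str, expected_fields: list) -> dict:
    """Sort expected fields into found / missing / unclear buckets."""
    data = json.loads(raw_json)
    report = {"found": [], "missing": [], "unclear": []}
    for field in expected_fields:
        value = data.get(field)
        if value is None or (isinstance(value, str) and value.strip().lower() == "not found"):
            report["missing"].append(field)  # absent key or explicit 'not found'
        elif isinstance(value, str) and value.strip().lower().startswith("unclear"):
            report["unclear"].append(field)  # model saw something but could not read it
        else:
            report["found"].append(field)
    return report
```

For invoices, anything in the `unclear` bucket (amounts especially) is a good candidate for manual verification before the data goes anywhere important.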
Workflow 3: Bulk Image Captioning
For e-commerce, media libraries, or content organization:
import ollama
import base64
from pathlib import Path
import json
import re
import time

def caption_images_bulk(image_folder: str, output_file: str):
    """Generate captions for all images in a folder."""
    image_folder = Path(image_folder)
    results = {}
    image_extensions = {'.jpg', '.jpeg', '.png', '.webp', '.gif'}
    images = [f for f in image_folder.iterdir()
              if f.suffix.lower() in image_extensions]
    print(f"Processing {len(images)} images...")
    for i, image_path in enumerate(images):
        print(f"  [{i+1}/{len(images)}] {image_path.name}")
        try:
            # Keep the request inside the try block so one failed image does not stop the batch
            image_data = base64.b64encode(image_path.read_bytes()).decode()
            response = ollama.chat(
                model="gemma4:9b",
                messages=[{
                    "role": "user",
                    "content": """Describe this image for:
1. alt_text: Concise accessibility description (1 sentence)
2. caption: Descriptive caption for display (1-2 sentences)
3. tags: 5-8 relevant tags (comma-separated)
4. category: Most fitting category from:
product, person, nature, architecture, document, chart, other
Return as JSON.""",
                    "images": [image_data]
                }]
            )
            content = response["message"]["content"]
            json_match = re.search(r'\{.*\}', content, re.DOTALL)
            if json_match:
                results[image_path.name] = json.loads(json_match.group())
            else:
                results[image_path.name] = {"raw": content}
        except Exception as e:
            results[image_path.name] = {"error": str(e)}
        time.sleep(0.2)  # Brief pause between requests
    # Save results
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)
    print(f"\nDone. Results saved to {output_file}")
    return results

# Usage
caption_images_bulk("product_photos/", "captions_output.json")
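For large folders it helps to make the batch resumable, so an interrupted run does not start from zero. A minimal sketch, assuming results are stored in the same name-to-dict shape as above; `pending_image_names` is a hypothetical helper that decides which files still need work.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".gif"}

def pending_image_names(image_names: list, previous_results: dict) -> list:
    """Return image file names that have no successful entry yet."""
    todo = []
    for name in sorted(image_names):
        if Path(name).suffix.lower() not in IMAGE_EXTENSIONS:
            continue  # skip non-image files
        earlier = previous_results.get(name)
        if earlier is None or "error" in earlier:
            todo.append(name)  # new file, or one that failed last time
    return todo
```

To use it, load the previous output file with `json.load` (if it exists), pass the folder's file names through this filter, and write the results file after each image rather than only at the end.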
Workflow 4: Chart and Graph Analysis
def analyze_chart(chart_image_path: str) -> str:
    """Extract insights from a chart or graph image."""
    prompt = """Analyze this chart/graph and provide:
1. Chart type (bar, line, pie, scatter, etc.)
2. What metric(s) are being measured
3. Time period or categories shown (if applicable)
4. Key trend or finding (the main insight)
5. Notable outliers or anomalies
6. Data values for the 3 most important data points
Be specific about numbers where they are readable from the chart."""
    return analyze_image(chart_image_path, prompt)
Workflow 5: Multi-Image Comparison
Gemma 4 supports multiple images in a single request — useful for before/after comparisons, A/B design reviews, or document version comparison:
def compare_images(image_path_1: str, image_path_2: str, question: str) -> str:
    """Compare two images side by side."""
    img1_data = base64.b64encode(Path(image_path_1).read_bytes()).decode()
    img2_data = base64.b64encode(Path(image_path_2).read_bytes()).decode()
    response = ollama.chat(
        model="gemma4:9b",
        messages=[{
            "role": "user",
            "content": f"Compare these two images. {question}",
            "images": [img1_data, img2_data]
        }]
    )
    return response["message"]["content"]

# Usage
comparison = compare_images(
    "design_v1.png",
    "design_v2.png",
    "What changed between version 1 and version 2? Which version has better visual hierarchy?"
)
Vision Models in Open WebUI
In Open WebUI, vision model usage is seamless:
- Select a vision model (gemma4:9b) from the model picker
- Click the image icon or drag-and-drop an image into the chat
- Type your question about the image
- Conversation continues with image context maintained
Open WebUI’s vision integration handles base64 encoding automatically — no code required for interactive vision work.
Hardware Requirements for Vision
Vision models require slightly more VRAM than text-only models of the same base size because image tokens (typically 256–1024 per image) add to the context:
| Model | Text VRAM | With Image | Notes |
|---|---|---|---|
| Gemma 4 9B | 7 GB | 8–9 GB | Recommended minimum |
| Gemma 4 27B | 18 GB | 20 GB | High quality |
| Llama 3.2 Vision 11B | 9 GB | 10–11 GB | Good alternative |
| LLaVA 13B | 9 GB | 10 GB | Older but capable |
| Moondream | 2 GB | 2.5 GB | Any hardware |
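Multiple images: Each additional image adds to context memory. As a rough planning figure, budget approximately 1–2GB per extra image; actual usage depends on the model, image resolution, and context length. A conversation with 4 images may require 12–13GB for Gemma 4 9B.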
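The arithmetic above is easy to turn into a quick planning helper. This sketch hard-codes the figures from the table (using the upper end of each range) and the low end of the per-image estimate; the numbers are this article's rough estimates, not measurements, and the function names are illustrative.

```python
# (text-only GB, GB with one image), taken from the table above
MODEL_VRAM_GB = {
    "gemma4:9b": (7.0, 9.0),
    "gemma4:27b": (18.0, 20.0),
    "llama3.2-vision:11b": (9.0, 11.0),
    "llava:13b": (9.0, 10.0),
    "moondream": (2.0, 2.5),
}
EXTRA_IMAGE_GB = 1.0  # low end of the 1-2GB per additional image estimate

def estimate_vram_gb(model: str, image_count: int) -> float:
    """Rough VRAM estimate for a conversation holding image_count images."""
    text_only, with_image = MODEL_VRAM_GB[model]
    if image_count <= 0:
        return text_only
    return with_image + EXTRA_IMAGE_GB * (image_count - 1)

def models_that_fit(available_gb: float, image_count: int = 1) -> list:
    """List models whose estimate fits within the available VRAM."""
    return [m for m in MODEL_VRAM_GB if estimate_vram_gb(m, image_count) <= available_gb]
```

With these figures, `estimate_vram_gb("gemma4:9b", 4)` comes out at 12.0, the low end of the 12–13GB range quoted above.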
Common Vision Model Mistakes
Mistake 1: Using a text-only model for vision tasks
Sending an image to a text-only model such as llama3.1:8b does not work: depending on the Ollama version, the request returns an error or the image is silently ignored. Always verify you are using a vision-capable model.
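You can check this programmatically before sending any images. The sketch below assumes a recent Ollama version whose `/api/show` response includes a `capabilities` list; treat that field name as something to verify against your installed version. `supports_vision` and `fetch_model_info` are illustrative names.

```python
import json
import urllib.request

OLLAMA_SHOW_URL = "http://localhost:11434/api/show"  # assumed default local address

def supports_vision(show_response: dict) -> bool:
    """True if a model-info response lists the 'vision' capability."""
    return "vision" in show_response.get("capabilities", [])

def fetch_model_info(model: str) -> dict:
    """Ask a running Ollama server for details about an installed model."""
    body = json.dumps({"model": model}).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_SHOW_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())
```

A pipeline can call `supports_vision(fetch_model_info(model))` once at startup and fail fast with a clear message instead of producing confusing output later.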
Mistake 2: Very high-resolution images
Most vision models downscale or tile images to a fixed internal resolution (a few hundred pixels per side in older models, such as 336×336 for LLaVA 1.5; newer models often accept larger inputs or split the image into tiles). Sending a 4K image does not produce 4K-quality analysis; it gets scaled down. For text extraction from images, ensure text is large enough to read at the scaled resolution.
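A quick calculation shows why small text disappears. This sketch assumes the simplest case, where the whole image is shrunk so its longer side matches the model's input size; real preprocessing varies by model (cropping, padding, tiling), so treat the result as a rough lower bound. The helper names are illustrative.

```python
def scaled_size(width: int, height: int, target_side: int) -> tuple:
    """Size after shrinking so the longer side equals target_side (never upscales)."""
    scale = min(1.0, target_side / max(width, height))
    return round(width * scale), round(height * scale)

def text_height_after_scaling(text_px: float, width: int, height: int, target_side: int) -> float:
    """Height in pixels of text that was text_px tall in the original image."""
    scale = min(1.0, target_side / max(width, height))
    return text_px * scale
```

Under this assumption, a 3840×2160 screenshot reduced to a 336-pixel input leaves 24-pixel text only about 2 pixels tall, which is unreadable. Cropping to the region of interest before sending keeps the text legible.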
Mistake 3: Expecting OCR quality from vision models
Vision models are better at understanding and explaining images than precisely transcribing every character. For strict OCR (exact text extraction), dedicated OCR tools produce more reliable results. For understanding documents and extracting key information, vision models work well.
Mistake 4: Not specifying output format
“What does this document say?” produces a prose description. “Extract all text from this document as JSON, preserving structure” produces something you can actually use programmatically.
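Beyond prompt wording, Ollama's chat API accepts a `format` field that constrains the reply to JSON, either as the string "json" or as a JSON schema. The sketch below shows one way to attach it to a request body and check the reply; the schema and helper names are hypothetical, chosen only for illustration.

```python
import json

DOCUMENT_SCHEMA = {  # hypothetical schema for illustration
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "paragraphs": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "paragraphs"],
}

def with_json_format(payload: dict, schema: dict = None) -> dict:
    """Return a copy of a chat payload that requests JSON output."""
    constrained = dict(payload)  # shallow copy so the caller's payload is untouched
    constrained["format"] = schema if schema is not None else "json"
    return constrained

def missing_required_keys(reply_text: str, schema: dict) -> list:
    """List the schema's required keys that are absent from a JSON reply."""
    data = json.loads(reply_text)
    return [key for key in schema.get("required", []) if key not in data]
```

With the Python client, the same value can be passed as the `format` argument to `ollama.chat`. It is still worth stating the expected structure in the prompt, since the constraint shapes the output but does not tell the model what each field means.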
Conclusion
Local vision models in 2026 have crossed the usefulness threshold for most professional image tasks. Gemma 4 9B on a gaming GPU handles document analysis, screenshot debugging, chart interpretation, and image captioning at quality levels that justify replacing cloud vision APIs for privacy-sensitive work.
The combination of local processing and Gemma 4’s native multimodal architecture means the questions you could not previously ask about sensitive images — contract scans, medical documents, proprietary diagrams, internal screenshots — can now be answered without any data leaving your machine.
Your next step: ollama pull gemma4:9b. Take a screenshot of something on your screen and run: ollama run gemma4:9b "Describe what is shown in this image and identify any problems" /path/to/screenshot.png. The response will immediately demonstrate what local vision capability looks like today.
Last updated: May 2026. Vision model capabilities and supported formats update with Ollama releases. Verify current vision model availability at ollama.com/library filtered by “vision” capability.