
Vision Models Locally: Gemma 4 and Llama 3.2 Vision for Private Image Understanding



Most professionals have tasks that require AI to understand images. Screenshots of error messages. PDFs that are actually scans. Product photos to describe and tag. Diagrams to explain. Charts to analyze. For every one of these tasks, using a cloud vision API means sending those images to servers you do not control.

Ollama’s vision model support changes this. You can run image-understanding models entirely locally — analyzing sensitive screenshots, confidential documents, and proprietary diagrams without any data leaving your machine.

Vision capability has reached the threshold of practical usefulness on consumer hardware. Gemma 4 9B runs on a gaming GPU, analyzes images accurately, and supports tool calling — the first local vision model that genuinely handles professional-grade visual tasks.

This guide covers every vision model available via Ollama, practical workflows for the most common use cases, and how to build vision pipelines for automated image processing.

🔗 This is Post #8 in the Ollama Unlocked series. For hardware requirements, see Hardware Guide (Post #3). For building automated vision pipelines, see Building AI Apps With Ollama and Python (Post #11).


Available Vision Models in Ollama

Gemma 4 (Google)

ollama pull gemma4:9b    # Best quality-to-hardware ratio
ollama pull gemma4:27b   # Higher quality, 18GB VRAM

Gemma 4 is the current best vision model for local use. Key capabilities:

  • Multimodal from the ground up — vision is not bolted on, it is native to the architecture
  • Tool calling — can call functions based on image content
  • Document understanding — strong on PDFs, forms, tables, charts
  • Multiple images — handles multiple images in a single conversation
  • Languages — multilingual image understanding including text in images

VRAM: Gemma 4 9B requires ~7GB; Gemma 4 27B requires ~18GB


Llama 3.2 Vision (Meta)

ollama pull llama3.2-vision:11b    # Best Llama vision option
ollama pull llama3.2-vision:90b    # High quality, 48GB+ VRAM

Meta’s Llama 3.2 Vision was the first widely-available high-quality local vision model. It remains a strong option, particularly for users already running Llama models who want vision capability in the same model family.

Strengths: Strong on natural image description, scene understanding, and object recognition.

Weaknesses: Less capable than Gemma 4 on document-heavy tasks and structured data in images.


Moondream (Lightweight Vision)

ollama pull moondream      # 1.7B — runs on anything
ollama pull moondream2     # Improved version

Moondream is a tiny vision model — 1.7B parameters — designed to run on minimal hardware including CPU-only systems. Quality is limited compared to larger models but it handles basic image description, object detection, and simple captioning tasks.

Best for: Embedded applications, constrained hardware, high-volume captioning where speed matters over depth


LLaVA

ollama pull llava:7b       # Classic vision model
ollama pull llava:13b      # Better quality
ollama pull llava:34b      # Best quality LLaVA
ollama pull llava-phi3     # Microsoft Phi-3 based variant

LLaVA (Large Language and Vision Assistant) is an older but still functional vision architecture. Gemma 4 and Llama 3.2 Vision outperform it on most benchmarks, but LLaVA 34B remains capable for users with sufficient VRAM who want a proven model.


Basic Vision Usage

Command Line Image Analysis

# Pass an image file directly
ollama run gemma4:9b "Describe what you see in this image" /path/to/image.jpg

# Or use the interactive mode and paste image path
ollama run gemma4:9b
# Then in the prompt: [image path] followed by your question

API Image Analysis

import ollama
import base64
from pathlib import Path

def analyze_image(image_path: str, question: str, model: str = "gemma4:9b") -> str:
    """Analyze an image with a local vision model."""
    
    # Read and encode the image
    image_data = base64.b64encode(Path(image_path).read_bytes()).decode()
    
    response = ollama.chat(
        model=model,
        messages=[
            {
                "role": "user",
                "content": question,
                "images": [image_data]
            }
        ]
    )
    
    return response["message"]["content"]

# Usage examples
description = analyze_image("product_photo.jpg", "Describe this product for an e-commerce listing")
print(description)

error_analysis = analyze_image("error_screenshot.png", 
    "What error is shown in this screenshot and what is likely causing it?")
print(error_analysis)

Practical Vision Workflows

Workflow 1: Screenshot and Error Analysis

One of the highest-value use cases — analyzing error screenshots, UI bugs, and console output as images:

def analyze_error_screenshot(screenshot_path: str) -> dict:
    """Extract error information from a screenshot."""
    
    prompt = """Analyze this screenshot and extract:
1. The exact error message (if visible)
2. The error type/category
3. The likely root cause
4. Recommended debugging steps

Return as JSON:
{
  "error_message": "...",
  "error_type": "...",
  "likely_cause": "...",
  "debug_steps": ["step1", "step2", ...]
}"""

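    import json
    import re

    response = analyze_image(screenshot_path, prompt)

    # Extract the JSON object from the response (greedy: first '{' to last '}')
    json_match = re.search(r'\{.*\}', response, re.DOTALL)
    if json_match:
        try:
            return json.loads(json_match.group())
        except json.JSONDecodeError:
            pass  # Matched text was not valid JSON; fall back to the raw response
    return {"raw_response": response}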
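
The greedy pattern r'\{.*\}' used above matches from the first '{' to the last '}' in the reply, so it can capture invalid text when the model adds braces in surrounding prose. A minimal alternative sketch uses the standard library's json.JSONDecoder.raw_decode, which parses one JSON value starting at a given index. The helper name extract_first_json and the sample reply are hypothetical, for illustration only.

```python
import json
from typing import Any, Optional


def extract_first_json(text: str) -> Optional[Any]:
    """Return the first JSON object that parses in `text`, or None.

    Hypothetical helper (not part of Ollama): it tries each '{' as a
    possible start and lets the JSON decoder find where the object ends.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None


# Made-up reply with stray braces after the JSON object.
reply = 'Result: {"error_type": "ImportError", "debug_steps": ["pip list"]} Hope that helps {really}.'
print(extract_first_json(reply))
```

With this reply, the greedy regex would hand json.loads a span ending at "{really}" and fail, while the decoder-based scan stops at the end of the first valid object.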

Workflow 2: Document and PDF Analysis

For scanned documents, forms, and image-based PDFs:

def extract_document_data(document_image_path: str, schema: str) -> str:
    """Extract structured data from a document image."""
    
    prompt = f"""Extract data from this document image.
    
Expected fields to find:
{schema}

For each field:
- If found, provide the exact value as shown
- If not found, indicate 'not found'
- If unclear/ambiguous, indicate 'unclear: [what you see]'

Return as JSON with field names as keys."""

    return analyze_image(document_image_path, prompt)

# Example: Invoice processing
invoice_schema = """
- invoice_number
- date
- vendor_name
- total_amount
- line_items (array of description + amount)
- payment_terms
"""

result = extract_document_data("invoice_scan.jpg", invoice_schema)
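
Because the prompt asks the model to mark fields as 'not found' or 'unclear: ...', the parsed result can be triaged before it enters any downstream system. The following is a sketch under the assumption that the reply has already been parsed into a dict; triage_fields and the sample values are hypothetical.

```python
import json
from typing import Dict, List


def triage_fields(extracted: Dict[str, object], expected: List[str]) -> Dict[str, List[str]]:
    """Sort expected fields into found / missing / unclear buckets."""
    report: Dict[str, List[str]] = {"found": [], "missing": [], "unclear": []}
    for field in expected:
        value = extracted.get(field)
        text = str(value).strip().lower() if value is not None else ""
        if value is None or text in ("", "not found"):
            report["missing"].append(field)
        elif text.startswith("unclear"):
            report["unclear"].append(field)
        else:
            report["found"].append(field)
    return report


# Made-up model output, for illustration only.
sample = json.loads("""{
  "invoice_number": "INV-001",
  "date": "unclear: 03/04 or 04/03",
  "vendor_name": "not found",
  "total_amount": "120.00"
}""")
fields = ["invoice_number", "date", "vendor_name", "total_amount", "payment_terms"]
print(triage_fields(sample, fields))
```

Fields in the "unclear" and "missing" buckets are natural candidates for human review rather than automatic processing.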

Workflow 3: Bulk Image Captioning

For e-commerce, media libraries, or content organization:

import ollama
import base64
from pathlib import Path
import json
import re
import time

def caption_images_bulk(image_folder: str, output_file: str):
    """Generate captions for all images in a folder."""
    
    image_folder = Path(image_folder)
    results = {}
    
    image_extensions = {'.jpg', '.jpeg', '.png', '.webp', '.gif'}
    images = [f for f in image_folder.iterdir() 
              if f.suffix.lower() in image_extensions]
    
    print(f"Processing {len(images)} images...")
    
    for i, image_path in enumerate(images):
        print(f"  [{i+1}/{len(images)}] {image_path.name}")
        
        image_data = base64.b64encode(image_path.read_bytes()).decode()
        
        response = ollama.chat(
            model="gemma4:9b",
            messages=[{
                "role": "user",
                "content": """Describe this image for:
1. alt_text: Concise accessibility description (1 sentence)
2. caption: Descriptive caption for display (1-2 sentences)  
3. tags: 5-8 relevant tags (comma-separated)
4. category: Most fitting category from: 
   product, person, nature, architecture, document, chart, other

Return as JSON.""",
                "images": [image_data]
            }]
        )
        
        try:
            content = response["message"]["content"]
            # Pull the JSON object out of the reply; keep the raw text if none is found
            json_match = re.search(r'\{.*\}', content, re.DOTALL)
            if json_match:
                results[image_path.name] = json.loads(json_match.group())
            else:
                results[image_path.name] = {"raw": content}
        except Exception as e:
            # Includes invalid JSON in the matched text
            results[image_path.name] = {"error": str(e)}
        
        time.sleep(0.2)  # Brief pause between requests
    
    # Save results
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)
    
    print(f"\nDone. Results saved to {output_file}")
    return results

# Usage
caption_images_bulk("product_photos/", "captions_output.json")
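
Long captioning runs can be interrupted, and the function above only writes its results at the end. One possible way to make a run resumable, sketched here under the assumption that results are saved periodically and keyed by file name as above, is to skip images that already have an entry in the output file. The helper pending_images is hypothetical.

```python
import json
from pathlib import Path
from typing import List

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".gif"}


def pending_images(image_folder: str, output_file: str) -> List[Path]:
    """List images in the folder that have no entry in the results file yet."""
    done = set()
    out = Path(output_file)
    if out.exists():
        # The results file is a JSON object keyed by image file name.
        done = set(json.loads(out.read_text(encoding="utf-8")))
    return sorted(
        p for p in Path(image_folder).iterdir()
        if p.suffix.lower() in IMAGE_EXTENSIONS and p.name not in done
    )
```

A captioning loop could iterate over pending_images(...) instead of the full folder listing and rewrite the output file every few images.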

Workflow 4: Chart and Graph Analysis

def analyze_chart(chart_image_path: str) -> str:
    """Extract insights from a chart or graph image."""
    
    prompt = """Analyze this chart/graph and provide:
1. Chart type (bar, line, pie, scatter, etc.)
2. What metric(s) are being measured
3. Time period or categories shown (if applicable)
4. Key trend or finding (the main insight)
5. Notable outliers or anomalies
6. Data values for the 3 most important data points

Be specific about numbers where they are readable from the chart."""
    
    return analyze_image(chart_image_path, prompt)
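
Since the prompt asks for six numbered items, the reply can often be split back into sections for storage or display. This is a rough sketch that assumes the model answers with lines starting "1. ", "2. ", and so on; models vary in formatting (bold markers, headings), so treat split_numbered_answer as a hypothetical starting point rather than a reliable parser.

```python
import re
from typing import Dict


def split_numbered_answer(text: str) -> Dict[int, str]:
    """Split a reply formatted as '1. ...', '2. ...' into {number: text}."""
    sections: Dict[int, str] = {}
    current = None
    for line in text.splitlines():
        # Require whitespace after the marker so "3.5 million" is not a new item.
        match = re.match(r"\s*(\d+)[.)]\s+(.*)", line)
        if match:
            current = int(match.group(1))
            sections[current] = match.group(2).strip()
        elif current is not None and line.strip():
            # Continuation line: append to the section being built.
            sections[current] += " " + line.strip()
    return sections
```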

Workflow 5: Multi-Image Comparison

Gemma 4 supports multiple images in a single request — useful for before/after comparisons, A/B design reviews, or document version comparison:

def compare_images(image_path_1: str, image_path_2: str, question: str) -> str:
    """Compare two images side by side."""
    
    img1_data = base64.b64encode(Path(image_path_1).read_bytes()).decode()
    img2_data = base64.b64encode(Path(image_path_2).read_bytes()).decode()
    
    response = ollama.chat(
        model="gemma4:9b",
        messages=[{
            "role": "user",
            "content": f"Compare these two images. {question}",
            "images": [img1_data, img2_data]
        }]
    )
    
    return response["message"]["content"]

# Usage
comparison = compare_images(
    "design_v1.png", 
    "design_v2.png",
    "What changed between version 1 and version 2? Which version has better visual hierarchy?"
)
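
The same pattern extends to more than two images. As a sketch, the message can be built by a small helper that also names the images in order, so the question can refer to "image 1" or "image 3". The helper build_multi_image_message is hypothetical; only the message shape (role, content, images) comes from the examples above.

```python
import base64
from pathlib import Path
from typing import Dict, List


def build_multi_image_message(image_paths: List[str], question: str) -> Dict[str, object]:
    """Build one user message carrying several base64-encoded images."""
    if not image_paths:
        raise ValueError("At least one image path is required")
    labels = ", ".join(
        f"image {i + 1} = {Path(p).name}" for i, p in enumerate(image_paths)
    )
    images = [base64.b64encode(Path(p).read_bytes()).decode() for p in image_paths]
    return {
        "role": "user",
        # Naming the images in order gives the question something to refer to.
        "content": f"Images in order: {labels}. {question}",
        "images": images,
    }
```

The returned dict can be passed as the single entry of the messages list in ollama.chat, in the same way as the hand-built message in compare_images.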

Vision Models in Open WebUI

In Open WebUI, vision model usage is seamless:

  1. Select a vision model (gemma4:9b) from the model picker
  2. Click the image icon or drag-and-drop an image into the chat
  3. Type your question about the image
  4. Conversation continues with image context maintained

Open WebUI’s vision integration handles base64 encoding automatically — no code required for interactive vision work.


Hardware Requirements for Vision

Vision models require slightly more VRAM than text-only models of the same base size because image tokens (typically 256–1024 per image) add to the context:

Model | Text-only VRAM | VRAM with image | Notes
Gemma 4 9B | 7 GB | 8–9 GB | Recommended minimum
Gemma 4 27B | 18 GB | 20 GB | High quality
Llama 3.2 Vision 11B | 9 GB | 10–11 GB | Good alternative
LLaVA 13B | 9 GB | 10 GB | Older but capable
Moondream | 2 GB | 2.5 GB | Any hardware

Multiple images: Each additional image adds approximately 1–2GB of context VRAM usage. A conversation with 4 images may require 12–13GB for Gemma 4 9B.
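
The figures above can be turned into a rough planning calculation. The sketch below copies the "with image" values from the table and the 1–2 GB per additional image estimate; the model tags are assumed to match the pull commands in this article, and the results are only as accurate as those estimates.

```python
from typing import Dict, Tuple

# (low, high) VRAM in GB with one image, copied from the table above.
VRAM_WITH_ONE_IMAGE: Dict[str, Tuple[float, float]] = {
    "gemma4:9b": (8.0, 9.0),
    "gemma4:27b": (20.0, 20.0),
    "llama3.2-vision:11b": (10.0, 11.0),
    "llava:13b": (10.0, 10.0),
    "moondream": (2.5, 2.5),
}
EXTRA_IMAGE_GB = (1.0, 2.0)  # Per additional image, per the estimate above.


def estimate_vram_gb(model: str, n_images: int) -> Tuple[float, float]:
    """Return a (low, high) VRAM estimate in GB for n_images in one conversation."""
    if n_images < 1:
        raise ValueError("n_images must be at least 1")
    low, high = VRAM_WITH_ONE_IMAGE[model]
    extra = n_images - 1
    return low + extra * EXTRA_IMAGE_GB[0], high + extra * EXTRA_IMAGE_GB[1]


def fits(model: str, n_images: int, available_gb: float) -> bool:
    """Conservative check: the high end of the estimate must fit."""
    return estimate_vram_gb(model, n_images)[1] <= available_gb


print(estimate_vram_gb("gemma4:9b", 4))  # (11.0, 15.0)
```

The 12–13 GB figure quoted above for four images on Gemma 4 9B falls inside this 11–15 GB range.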


Common Vision Model Mistakes

Mistake 1: Using a text-only model for vision tasks
Sending an image to a text-only model such as llama3.2:3b does not work: depending on the Ollama version, the request fails or the image is ignored. Always verify that the model you are using is vision-capable.
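
One way to catch this mistake before sending a request is a simple name-based check. The sketch below uses a hand-maintained allowlist built from the pull commands in this article; it does not query Ollama, so the set and the helper name is_listed_vision_model are assumptions to adjust for the models you have actually pulled.

```python
# Model families taken from the pull commands in this article (an assumption,
# not an authoritative list of vision-capable models).
VISION_FAMILIES = {
    "gemma4",
    "llama3.2-vision",
    "llava",
    "llava-phi3",
    "moondream",
    "moondream2",
}


def is_listed_vision_model(tag: str) -> bool:
    """True if the model family (the part before ':') is in the allowlist."""
    family = tag.split(":", 1)[0].strip().lower()
    return family in VISION_FAMILIES


print(is_listed_vision_model("llama3.2-vision:11b"))  # True
print(is_listed_vision_model("mistral:7b"))           # False
```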

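Mistake 2: Very high-resolution images
Most vision models process images at a fixed or capped resolution internally. Older models such as LLaVA 1.5 use 336×336 pixels; newer models accept larger inputs or split the image into tiles, but still have an upper limit. Sending a 4K image does not produce 4K-quality analysis, because the image is scaled down first. For text extraction from images, ensure the text is large enough to read at the scaled resolution, or crop to the region of interest before sending.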
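
A quick calculation shows why small text disappears. The sketch below assumes an aspect-preserving downscale to a maximum side length, which is a simplification (some models pad, crop, or tile instead), and uses the 336-pixel figure mentioned above purely as an illustration. Both function names are hypothetical.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_side, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # Already small enough; no upscaling.
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def scaled_text_height(text_px: float, width: int, height: int, max_side: int) -> float:
    """Approximate height in pixels of text after the image is downscaled."""
    new_width, _ = fit_within(width, height, max_side)
    return text_px * new_width / width


# Illustrative numbers only: a 4K screenshot scaled to a 336-pixel longest side.
print(fit_within(3840, 2160, 336))               # (336, 189)
print(scaled_text_height(20, 3840, 2160, 336))   # 1.75
```

Text that is 20 pixels tall in the original ends up under 2 pixels tall in this scenario, which is why cropping to the relevant region first tends to help.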

Mistake 3: Expecting OCR quality from vision models
Vision models are better at understanding and explaining images than precisely transcribing every character. For strict OCR (exact text extraction), dedicated OCR tools produce more reliable results. For understanding documents and extracting key information, vision models work well.

Mistake 4: Not specifying output format
“What does this document say?” produces a prose description. “Extract all text from this document as JSON, preserving structure” produces something you can actually use programmatically.


Conclusion

Local vision models in 2026 have crossed the usefulness threshold for most professional image tasks. Gemma 4 9B on a gaming GPU handles document analysis, screenshot debugging, chart interpretation, and image captioning at quality levels that justify replacing cloud vision APIs for privacy-sensitive work.

The combination of local processing and Gemma 4’s native multimodal architecture means the questions you could not previously ask about sensitive images — contract scans, medical documents, proprietary diagrams, internal screenshots — can now be answered without any data leaving your machine.

Your next step: ollama pull gemma4:9b. Take a screenshot of something on your screen and run: ollama run gemma4:9b "Describe what is shown in this image and identify any problems" /path/to/screenshot.png. The response will immediately demonstrate what local vision capability looks like today.




Last updated: May 2026. Vision model capabilities and supported formats update with Ollama releases. Verify current vision model availability at ollama.com/library filtered by “vision” capability.

Frequently Asked Questions (FAQ)

Can local vision models handle handwritten text?
With limitations. Clear, neat handwriting is often readable. Cursive or messy handwriting frequently fails. For handwriting recognition specifically, dedicated handwriting OCR tools are more reliable than general vision models.

What image formats does Ollama support?
JPEG, PNG, WebP, and GIF (static). For PDFs, convert pages to PNG first using a tool such as pdf2image (a Python wrapper around poppler).

How do I process many images efficiently?
Use the async Python client for parallel processing, or add delays between sequential requests to avoid overwhelming your GPU. For large batches (1000+ images), Ollama does not yet offer an equivalent of the batch-processing APIs that some cloud providers have, so sequential processing with rate control is the practical approach.

Are local vision models private enough for medical images?
The privacy aspect is covered: no data leaves your machine. Whether local vision models are sufficiently accurate for medical use depends on your specific requirements and applicable regulations. Never rely on AI for clinical diagnosis without qualified medical review.
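
For the question on processing many images, parallelism is usually worth capping so the GPU is not flooded with requests. The sketch below shows the general pattern with asyncio.Semaphore; run_bounded is a hypothetical helper, and fake_caption is a stand-in for a real request (for example through the Ollama async Python client). Whether requests actually run in parallel also depends on how the Ollama server is configured.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_bounded(
    items: Sequence[T],
    worker: Callable[[T], Awaitable[R]],
    limit: int = 2,
) -> List[R]:
    """Run worker(item) for every item with at most `limit` in flight.

    Results are returned in the same order as `items`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: T) -> R:
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_caption(name: str) -> str:
    # Stand-in for a real vision request; replace with an actual client call.
    await asyncio.sleep(0.01)
    return f"caption for {name}"


print(asyncio.run(run_bounded(["a.jpg", "b.jpg", "c.jpg"], fake_caption, limit=2)))
```

Starting with a limit of 1 or 2 and raising it only while throughput improves is a reasonable way to find what a given GPU can sustain.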
