
Ollama on Docker and Production Deployment: Run Local AI at Scale



Running Ollama locally for personal use is straightforward. Running it reliably as a service — for a team, for a production application, or as a shared server — requires more deliberate configuration. Models need to be pre-loaded on startup. GPU resources need to be correctly allocated. The service needs to restart automatically after failures. Access needs to be controlled. Performance needs to be monitored.

This guide covers the production-grade Ollama deployment stack: Docker Compose with GPU passthrough, automated model management, nginx reverse proxy with authentication, health monitoring, and the operational patterns that keep a shared Ollama server running reliably.

🔗 This is Post #12 in the Ollama Unlocked series. For the applications that run on top of this infrastructure, see Building AI Apps With Ollama (Post #11). For team-specific configuration, see Ollama for Business (Post #14).


Docker Compose: The Production Foundation

Basic Ollama Service

# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    
    ports:
      - "127.0.0.1:11434:11434"  # Only expose locally (nginx handles external)
    
    volumes:
      - ollama_models:/root/.ollama  # Persist models between restarts
    
    environment:
      - OLLAMA_KEEP_ALIVE=30m        # Keep models loaded 30 minutes
      - OLLAMA_NUM_PARALLEL=2        # Handle 2 concurrent requests
      - OLLAMA_MAX_LOADED_MODELS=3   # Max models in memory simultaneously
    
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    depends_on:
      - ollama
    
    ports:
      - "127.0.0.1:3000:8080"
    
    volumes:
      - open-webui-data:/app/backend/data
    
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=your-secret-key-here  # Change this!

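  # Model initializer: pulls required models on startup
  model-init:
    image: ollama/ollama:latest
    depends_on:
      - ollama
    # Poll until the server answers instead of sleeping a fixed time,
    # then pull each model. Confirm every tag exists in the Ollama library first.
    entrypoint: >
      sh -c "
        until ollama list > /dev/null 2>&1; do sleep 2; done &&
        ollama pull llama4:scout &&
        ollama pull nomic-embed-text &&
        ollama pull qwen3:7b &&
        echo 'Models ready.'
      "
    environment:
      - OLLAMA_HOST=http://ollama:11434
    restart: "no"  # Run once, then exit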

volumes:
  ollama_models:
  open-webui-data:

Start the stack:

docker compose up -d

# Follow startup logs
docker compose logs -f ollama

# Check model init progress
docker compose logs model-init
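
Once model-init has exited, it can help to confirm that every required model is really installed. The sketch below is one way to do that, assuming the API is reachable on 127.0.0.1:11434 as in the Compose file above. The helper names (`normalize`, `missing_models`, `fetch_installed`) are illustrative and not part of Ollama.

```python
import json
import urllib.request


def normalize(name: str) -> str:
    """Append ':latest' when a model name carries no explicit tag."""
    return name if ":" in name else f"{name}:latest"


def missing_models(required, installed):
    """Return the required models that are absent from the installed list."""
    have = {normalize(n) for n in installed}
    return [m for m in required if normalize(m) not in have]


def fetch_installed(base_url: str = "http://127.0.0.1:11434"):
    """Read installed model names from /api/tags (assumes the server is up)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        data = json.load(resp)
    return [m["name"] for m in data.get("models", [])]


# Offline demonstration with a hand-written "installed" list.
required = ["llama4:scout", "nomic-embed-text", "qwen3:7b"]
installed = ["llama4:scout", "nomic-embed-text:latest"]
print(missing_models(required, installed))
```

Against a live server, `missing_models(required, fetch_installed())` performs the same comparison with real data. The normalization step matters because /api/tags reports untagged pulls with a `:latest` suffix.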

GPU Configuration: NVIDIA

For NVIDIA GPU passthrough in Docker:

# Install NVIDIA Container Toolkit (Ubuntu/Debian)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Verify GPU passthrough:

# The toolkit injects nvidia-smi into the container, so a plain ubuntu image is enough
docker run --rm --gpus all ubuntu nvidia-smi

Multi-GPU configuration:

# Use specific GPUs (by index)
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ['0', '1']  # Use GPU 0 and 1
          capabilities: [gpu]

AMD GPU Configuration

# AMD ROCm GPU passthrough
services:
  ollama:
    image: ollama/ollama:rocm  # Use ROCm image
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
      - render
    environment:
      # Only needed when ROCm does not officially support your GPU;
      # the value depends on the GPU architecture
      - HSA_OVERRIDE_GFX_VERSION=11.0.0

Nginx Reverse Proxy and Authentication

Expose Ollama securely to your network with nginx handling authentication and HTTPS:

# /etc/nginx/conf.d/ollama.conf

# Rate limiting zone
limit_req_zone $binary_remote_addr zone=ollama_limit:10m rate=30r/m;

server {
    listen 443 ssl;
    server_name ollama.yourteam.internal;  # Or your domain
    
    ssl_certificate     /etc/ssl/certs/ollama.crt;
    ssl_certificate_key /etc/ssl/private/ollama.key;
    
    # Basic auth for team access
    auth_basic "Ollama API";
    auth_basic_user_file /etc/nginx/.htpasswd;
    
    # Rate limiting
    limit_req zone=ollama_limit burst=60 nodelay;
    
    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        
        # Required for streaming responses
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 600s;
        proxy_send_timeout 600s;
        
        # Larger buffer for reading the upstream response
        # (proxy_buffers has no effect while proxy_buffering is off)
        proxy_buffer_size 128k;
    }
}

# Redirect HTTP to HTTPS
server {
    listen 80;
    server_name ollama.yourteam.internal;
    return 301 https://$server_name$request_uri;
}
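
To see what `rate=30r/m` with `burst=60 nodelay` means in practice, the following sketch approximates the leaky-bucket accounting that nginx documents for `limit_req`. It is a simplified model for building intuition, not nginx's implementation (nginx tracks time in milliseconds and keeps one bucket per client key). The class name `LeakyBucket` is hypothetical.

```python
class LeakyBucket:
    """Simplified model of limit_req with 'nodelay' for a single client."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec   # requests drained per second
        self.burst = burst         # allowed excess above the steady rate
        self.excess = 0.0          # current backlog, in requests
        self.last = None           # time of the last accepted request

    def allow(self, now: float) -> bool:
        if self.last is None:      # first request is always accepted
            self.last = now
            return True
        # Drain the backlog for the elapsed time, then add this request.
        excess = max(0.0, self.excess - self.rate * (now - self.last) + 1.0)
        if excess > self.burst:
            return False           # rejected; state is left unchanged
        self.excess = excess
        self.last = now
        return True


bucket = LeakyBucket(rate_per_sec=30 / 60, burst=60)
accepted_at_once = sum(bucket.allow(0.0) for _ in range(100))
print(accepted_at_once)    # prints 61: one request plus the burst of 60
print(bucket.allow(1.0))   # prints False: only half a slot has drained
print(bucket.allow(2.0))   # prints True: one slot frees up every 2 seconds
```

Under this model a single client can send about 61 requests in a burst, after which it is held to one request every two seconds until the backlog drains.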

Create user accounts:

# Install apache2-utils for htpasswd
sudo apt install apache2-utils

# Create password file and add users
sudo htpasswd -c /etc/nginx/.htpasswd user1
sudo htpasswd /etc/nginx/.htpasswd user2
sudo htpasswd /etc/nginx/.htpasswd user3
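
With accounts in place, clients must send HTTP Basic credentials on every request. Here is a minimal standard-library sketch of such a client, assuming the proxy is reachable at a URL like https://ollama.yourteam.internal. The function names, and the `ca_file` parameter for trusting a self-signed certificate, are illustrative.

```python
import base64
import json
import ssl
import urllib.request


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of the Authorization header for HTTP Basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_request(base_url: str, user: str, password: str, payload: dict):
    """Create a POST request for /api/generate with credentials attached."""
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": basic_auth_header(user, password),
            "Content-Type": "application/json",
        },
        method="POST",
    )


def generate(base_url, user, password, model, prompt, ca_file=None):
    """Send a non-streaming generate request through the proxy."""
    # ca_file lets the client trust an internal or self-signed certificate.
    ctx = ssl.create_default_context(cafile=ca_file)
    payload = {"model": model, "prompt": prompt, "stream": False}
    req = build_request(base_url, user, password, payload)
    with urllib.request.urlopen(req, timeout=600, context=ctx) as resp:
        return json.load(resp)["response"]
```

A call such as `generate("https://ollama.yourteam.internal", "user1", password, "llama4:scout", "Hello", ca_file="/etc/ssl/certs/ollama.crt")` would go through nginx rather than hitting port 11434 directly. Read the password from an environment variable or secret store instead of hard-coding it.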

Self-signed certificate for internal use:

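# Modern clients validate the subjectAltName, not just the CN (needs OpenSSL 1.1.1+)
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout /etc/ssl/private/ollama.key \
  -out /etc/ssl/certs/ollama.crt \
  -subj "/C=US/CN=ollama.yourteam.internal" \
  -addext "subjectAltName=DNS:ollama.yourteam.internal"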

Systemd Service: Non-Docker Production

For production Linux servers where Docker overhead is undesirable:

# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network-online.target
Wants=network-online.target

[Service]
Type=exec
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3

# Security
NoNewPrivileges=yes
PrivateTmp=yes

# Bind to localhost only; nginx handles external access
Environment="OLLAMA_HOST=127.0.0.1:11434"

# Performance settings
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_FLASH_ATTN=1"
Environment="OLLAMA_MODELS=/var/lib/ollama/models"

# GPU (NVIDIA)
Environment="CUDA_VISIBLE_DEVICES=0"

[Install]
WantedBy=multi-user.target

Create the service user, then enable and start the unit:

# Create ollama system user
sudo useradd -r -s /bin/false -m -d /var/lib/ollama ollama
sudo usermod -a -G video ollama  # For GPU access

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
sudo systemctl status ollama

Model Management in Production

Automated Model Sync Script

#!/bin/bash
# /usr/local/bin/ollama-sync-models.sh
# Run via cron to ensure required models are always available

set -e

REQUIRED_MODELS=(
    "llama4:scout"
    "qwen3:7b"
    "nomic-embed-text"
    "gemma4:9b"
)

LOG_FILE="/var/log/ollama-model-sync.log"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

log "Starting model sync..."

# Check Ollama is running
if ! curl -sf http://localhost:11434/api/tags > /dev/null; then
    log "ERROR: Ollama is not running"
    exit 1
fi

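# Get list of installed models (names always carry a tag, e.g. nomic-embed-text:latest)
INSTALLED=$(curl -sf http://localhost:11434/api/tags | \
    python3 -c "import json,sys; data=json.load(sys.stdin); \
    print('\n'.join(m['name'] for m in data.get('models',[])))")

for model in "${REQUIRED_MODELS[@]}"; do
    # Untagged names resolve to :latest, so normalize before comparing
    case "$model" in
        *:*) wanted="$model" ;;
        *)   wanted="$model:latest" ;;
    esac

    # -x matches the whole line, -F treats the name as a fixed string (no regex)
    if grep -qxF "$wanted" <<< "$INSTALLED"; then
        log "  ✓ $model (already installed)"
    else
        log "  ↓ Pulling $model..."
        ollama pull "$model"
        log "  ✓ $model (pulled successfully)"
    fi
done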

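log "Model sync complete."

Make the script executable and schedule it:

# Make executable and add to cron
chmod +x /usr/local/bin/ollama-sync-models.sh

# Append to root's crontab. Piping only the new line into `crontab -` would
# replace every existing entry. Runs daily at 3 AM to pull any missing models.
(sudo crontab -l 2>/dev/null; echo "0 3 * * * /usr/local/bin/ollama-sync-models.sh") | sudo crontab -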

Health Monitoring

Simple Health Check Script

#!/usr/bin/env python3
# health_check.py

import requests
import time
import json
from datetime import datetime

OLLAMA_HOST = "http://localhost:11434"

def check_ollama_health() -> dict:
    """Comprehensive Ollama health check."""
    results = {
        "timestamp": datetime.now().isoformat(),
        "status": "unknown",
        "checks": {}
    }
    
    # Check 1: API reachability
    try:
        resp = requests.get(f"{OLLAMA_HOST}/api/tags", timeout=5)
        resp.raise_for_status()  # Treat HTTP error codes as failures
        models = resp.json().get("models", [])
        results["checks"]["api"] = {
            "status": "pass",
            # /api/tags lists installed models; /api/ps lists those in memory
            "models_installed": len(models)
        }
    except Exception as e:
        results["checks"]["api"] = {"status": "fail", "error": str(e)}
        results["status"] = "down"
        return results
    
    # Check 2: Test inference (fast model)
    try:
        start = time.time()
        resp = requests.post(
            f"{OLLAMA_HOST}/api/generate",
            json={"model": "qwen3:7b", "prompt": "Hi", "stream": False},
            timeout=120  # Allow for a cold model load
        )
        resp.raise_for_status()
        duration = time.time() - start
        data = resp.json()
        # eval_duration is reported in nanoseconds and excludes model load time,
        # so it gives a truer generation speed than wall-clock duration
        eval_s = data.get("eval_duration", 0) / 1e9
        tokens_per_sec = data.get("eval_count", 0) / eval_s if eval_s > 0 else 0.0
        
        results["checks"]["inference"] = {
            "status": "pass",
            "response_time_s": round(duration, 2),
            "tokens_per_second": round(tokens_per_sec, 1)
        }
    except Exception as e:
        results["checks"]["inference"] = {"status": "fail", "error": str(e)}
    
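    # Check 3: GPU memory (NVIDIA only; one entry per GPU)
    try:
        import subprocess
        gpu_info = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used,memory.free",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=5
        )
        if gpu_info.returncode == 0:
            gpus = []
            # nvidia-smi prints one "used, free" line per GPU
            for line in gpu_info.stdout.strip().splitlines():
                used, free = line.split(", ")
                gpus.append({"used_mb": int(used), "free_mb": int(free)})
            results["checks"]["gpu_memory"] = {"status": "pass", "gpus": gpus}
        else:
            results["checks"]["gpu_memory"] = {"status": "skip", "reason": "nvidia-smi returned an error"}
    except FileNotFoundError:
        results["checks"]["gpu_memory"] = {"status": "skip", "reason": "nvidia-smi not available"}
    except Exception as e:
        results["checks"]["gpu_memory"] = {"status": "skip", "reason": str(e)}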
    
    # Overall status
    failed = [k for k, v in results["checks"].items() if v.get("status") == "fail"]
    results["status"] = "unhealthy" if failed else "healthy"
    
    return results

if __name__ == "__main__":
    health = check_ollama_health()
    print(json.dumps(health, indent=2))
    
    if health["status"] != "healthy":
        raise SystemExit(1)  # Non-zero exit code for cron or monitoring wrappers

Add Health Check to Docker Compose

services:
  ollama:
    image: ollama/ollama:latest
    healthcheck:
      # The ollama image does not ship curl, so probe with the bundled CLI
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s  # Give time for model loading
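
Deployment scripts running outside Docker often need the same "wait until healthy" behaviour. The sketch below loosely mirrors the healthcheck parameters above (`interval`, `retries`, `start_period`). It is an approximation for illustration, not Docker's exact state machine. The function names are hypothetical, and `sleep` and `clock` are injectable only to make the logic easy to test.

```python
import time
import urllib.error
import urllib.request


def api_is_up(base_url: str = "http://127.0.0.1:11434") -> bool:
    """Probe /api/tags; any network or HTTP error counts as 'not up'."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, retries=3, interval=30.0, start_period=60.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Return True once probe() succeeds, False after `retries` counted failures.

    Failures during the start period are ignored, which gives the server
    time to start before failures begin to count.
    """
    started = clock()
    failures = 0
    while True:
        if probe():
            return True
        if clock() - started >= start_period:
            failures += 1
            if failures >= retries:
                return False
        sleep(interval)
```

Calling `wait_until_ready(api_is_up)` before sending traffic avoids routing requests to a server that is still starting.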

Performance Tuning for Production

Key Environment Variables

# Concurrent requests per model (default depends on Ollama version and available memory)
# Higher = more throughput but more VRAM for each model's KV cache
OLLAMA_NUM_PARALLEL=2

# Max models kept in memory (default: 3 per GPU, or 3 for CPU inference)
OLLAMA_MAX_LOADED_MODELS=3

# How long to keep models loaded after last request
OLLAMA_KEEP_ALIVE=30m

# Flash Attention (faster, lower memory usage)
OLLAMA_FLASH_ATTN=1

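# K/V cache quantization (reduces VRAM at slight quality cost; requires Flash Attention)
OLLAMA_KV_CACHE_TYPE=q8_0

# CPU threads: set per model with the num_thread option
# (API "options" field or a Modelfile PARAMETER), not a server environment variable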

# Restrict to specific GPU
CUDA_VISIBLE_DEVICES=0
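
To judge how much VRAM OLLAMA_NUM_PARALLEL and OLLAMA_KV_CACHE_TYPE trade against each other, a back-of-envelope KV cache estimate is useful. The sketch assumes a standard transformer with grouped-query attention and that the context allocation scales with the number of parallel requests. The model dimensions below are hypothetical, so substitute the values for your model. Real usage also includes weights, compute buffers, and runtime overhead.

```python
# Approximate storage cost per cached element.
BYTES_F16 = 2.0
BYTES_Q8_0 = 34 / 32  # 32 int8 values plus one f16 scale per block


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   num_parallel=1, bytes_per_element=BYTES_F16):
    """Estimate KV cache size: one K and one V tensor per layer."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * context_len * num_parallel


# Hypothetical 32-layer model with 8 KV heads of dimension 128, 8K context.
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)

gib = 2 ** 30
print(kv_cache_bytes(**dims) / gib)                                # 1.0
print(kv_cache_bytes(**dims, num_parallel=2) / gib)                # 2.0
print(kv_cache_bytes(**dims, num_parallel=2,
                     bytes_per_element=BYTES_Q8_0) / gib)          # 1.0625
```

For these assumed dimensions, doubling parallelism doubles the cache, and q8_0 quantization brings it back to roughly the single-request f16 footprint.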

Response Queue Management

For shared servers handling multiple users, implement request queuing:

# request_queue.py: simple request queue for multiple users
import asyncio

from fastapi import FastAPI, HTTPException
from ollama import AsyncClient
from pydantic import BaseModel

MAX_CONCURRENT = 2  # Based on your VRAM
MAX_QUEUED = 50     # Max requests waiting for a slot

app = FastAPI()
client = AsyncClient()  # Async client, so inference does not block the event loop
slots = asyncio.Semaphore(MAX_CONCURRENT)
waiting = 0  # Requests currently queued behind the semaphore


class GenerateRequest(BaseModel):
    prompt: str
    model: str = "llama4:scout"


@app.post("/generate")
async def generate(req: GenerateRequest):
    global waiting
    if waiting >= MAX_QUEUED:
        # Raise so the client really receives a 503 (returning a tuple would send 200)
        raise HTTPException(status_code=503, detail="Server busy, try again in a moment")

    waiting += 1
    try:
        await slots.acquire()  # Wait in line for a free inference slot
    finally:
        waiting -= 1

    try:
        response = await client.generate(model=req.model, prompt=req.prompt)
        return {"response": response["response"]}
    except Exception as e:
        raise HTTPException(status_code=502, detail=str(e))
    finally:
        slots.release()
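
A queue that can answer "server busy" works best when clients retry politely. The sketch below shows one common pattern, exponential backoff with jitter, retrying only on HTTP 503. The `send` callable is a stand-in for whatever function performs the HTTP request and returns `(status_code, body)`. All names here are illustrative.

```python
import random
import time


def backoff_delays(retries: int, base: float = 0.5, cap: float = 8.0):
    """Delays that double on each retry, capped at `cap` seconds."""
    return [min(cap, base * 2 ** i) for i in range(retries)]


def call_with_retry(send, retries: int = 5, sleep=time.sleep, jitter=random.random):
    """Call send() and retry while it reports 503, up to `retries` extra times.

    send() must return a (status_code, body) tuple.
    """
    status, body = send()
    for delay in backoff_delays(retries):
        if status != 503:
            break
        # Jitter spreads retries out so clients do not all return at once.
        sleep(delay * (0.5 + 0.5 * jitter()))
        status, body = send()
    return status, body


# Demonstration with a fake server that is busy twice, then succeeds.
responses = iter([(503, "busy"), (503, "busy"), (200, "ok")])
print(call_with_retry(lambda: next(responses), sleep=lambda s: None))
```

Capping the delay keeps worst-case latency bounded, while the jitter prevents synchronized retry storms when many users hit a busy server together.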

Backup and Disaster Recovery

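#!/bin/bash
# backup-ollama.sh: back up model metadata and configurations
# Note: weights pulled from ollama.com can be re-pulled, so they are not copied.
# Models built from local files (e.g. an imported GGUF) need those files backed up separately.

set -euo pipefail

BACKUP_DIR="/backup/ollama"
DATE=$(date +%Y%m%d)
# Model store on the host. The systemd unit above uses /var/lib/ollama/models;
# in the Docker setup the store lives inside the ollama_models volume instead.
MODELS_DIR="${OLLAMA_MODELS:-$HOME/.ollama/models}"
# Compose usually prefixes volume names with the project name; check `docker volume ls`
WEBUI_VOLUME="${WEBUI_VOLUME:-open-webui-data}"

mkdir -p "$BACKUP_DIR/$DATE"

# Backup model manifests (fast: just metadata, not weights)
cp -r "$MODELS_DIR/manifests" "$BACKUP_DIR/$DATE/manifests"

# Backup custom Modelfiles
if [ -d "$HOME/modelfiles" ]; then
    cp -r "$HOME/modelfiles" "$BACKUP_DIR/$DATE/modelfiles"
fi

# Backup Open WebUI data if applicable
if docker volume inspect "$WEBUI_VOLUME" &>/dev/null; then
    docker run --rm \
        -v "$WEBUI_VOLUME":/data:ro \
        -v "$BACKUP_DIR/$DATE":/backup \
        alpine tar czf /backup/open-webui-data.tar.gz -C /data .
fi

# Keep last 7 days (-mindepth 1 protects $BACKUP_DIR itself)
find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -mtime +7 -exec rm -rf {} +

echo "Backup complete: $BACKUP_DIR/$DATE"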

Common Production Issues

Issue: Model loads slowly on first request
Fix: Add a warmup request to the startup script:

# After starting Ollama, pre-load the primary model
curl -s http://localhost:11434/api/generate \
  -d '{"model":"llama4:scout","prompt":"warm","stream":false,"keep_alive":-1}' > /dev/null

Issue: Out of VRAM with multiple models
Fix: Set OLLAMA_MAX_LOADED_MODELS=1 and increase OLLAMA_KEEP_ALIVE for your primary model. Let secondary models unload and reload.

Issue: Slow responses under concurrent load
Fix: Some slowdown is expected. Ollama processes up to OLLAMA_NUM_PARALLEL requests per model at once and queues the rest, so extra requests wait their turn. Increase OLLAMA_NUM_PARALLEL only if you have VRAM headroom, because each parallel request requires additional VRAM for the KV cache.

Issue: Ollama crashes after system update
Fix: On non-Docker installs, re-run the install script after major system updates so that GPU drivers and Ollama stay compatible:

curl -fsSL https://ollama.com/install.sh | sh


Conclusion

Production Ollama is straightforward with Docker Compose as the foundation. The Docker Compose configuration in this guide — with GPU passthrough, persistent storage, health checks, and model initialization — gives you a reliable base in under 30 minutes.

Your next step: Copy the Docker Compose configuration, adjust the model list to match your use case, run docker compose up -d, and verify the health check passes. You have a production Ollama service.



Last updated: May 2026. Docker and NVIDIA Container Toolkit versions update regularly. Verify GPU passthrough compatibility at docs.nvidia.com/datacenter/cloud-native/container-toolkit.

Frequently Asked Questions (FAQ)

Can I run Ollama in Kubernetes?
Yes. The Ollama Docker image runs in Kubernetes with the NVIDIA device plugin for GPU support. Use a StatefulSet with a persistent volume for model storage. Community Helm charts are available.

How do I prevent unauthorized access to my Ollama server?
Bind Ollama to localhost only (`OLLAMA_HOST=127.0.0.1:11434`) and use nginx with authentication for external access. Never expose port 11434 directly to the internet.
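
How much storage do I need for a production model set?
At Q4 quantization, model files run roughly 4–5GB for a 7B–8B model and 16–20GB for a 27B model. Large mixture-of-experts models are far bigger: llama4:scout alone is over 60GB. For a set such as llama4:scout, qwen3.6:27b, deepseek-r1:14b, gemma4:9b, and nomic-embed-text, add up the sizes that `ollama list` reports and leave headroom for updates and temporary download space.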
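
For capacity planning, a weights-only estimate can be derived from parameter count. The sketch assumes an average of about 4.8 bits per weight for a typical Q4 quantization; that figure is an approximation, and real files vary by architecture and quantization variant. Treat the result as a lower bound and add headroom. The function names and the sample parameter counts are illustrative.

```python
def model_size_gb(params_billions: float, bits_per_weight: float = 4.8) -> float:
    """Approximate weights-only file size in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def total_storage_gb(params_list, headroom: float = 0.25) -> float:
    """Sum the estimates and add fractional headroom for updates and temp files."""
    return sum(model_size_gb(p) for p in params_list) * (1 + headroom)


print(round(model_size_gb(7), 1))    # 4.2
print(round(model_size_gb(27), 1))   # 16.2
# Hypothetical set: one 7B, one 14B, and one 27B model.
print(round(total_storage_gb([7, 14, 27]), 1))
```

Comparing these estimates with the sizes reported by `ollama list` after the first pull is a quick way to calibrate the assumed bits-per-weight figure for your models.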
