Running Ollama locally for personal use is straightforward. Running it reliably as a service — for a team, for a production application, or as a shared server — requires more deliberate configuration. Models need to be pre-loaded on startup. GPU resources need to be correctly allocated. The service needs to restart automatically after failures. Access needs to be controlled. Performance needs to be monitored.
This guide covers the production-grade Ollama deployment stack: Docker Compose with GPU passthrough, automated model management, nginx reverse proxy with authentication, health monitoring, and the operational patterns that keep a shared Ollama server running reliably.
🔗 This is Post #12 in the Ollama Unlocked series. For the applications that run on top of this infrastructure, see Building AI Apps With Ollama (Post #11). For team-specific configuration, see Ollama for Business (Post #14).
Docker Compose: The Production Foundation
Basic Ollama Service
# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "127.0.0.1:11434:11434" # Only expose locally (nginx handles external)
    volumes:
      - ollama_models:/root/.ollama # Persist models between restarts
    environment:
      - OLLAMA_KEEP_ALIVE=30m # Keep models loaded 30 minutes
      - OLLAMA_NUM_PARALLEL=2 # Handle 2 concurrent requests
      - OLLAMA_MAX_LOADED_MODELS=3 # Max models in memory simultaneously
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    depends_on:
      - ollama
    ports:
      - "127.0.0.1:3000:8080"
    volumes:
      - open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=your-secret-key-here # Change this!

  # Model initializer: pulls required models on startup
  model-init:
    image: ollama/ollama:latest
    depends_on:
      - ollama
    entrypoint: >
      sh -c "
      sleep 10 &&
      ollama pull llama4:scout &&
      ollama pull nomic-embed-text &&
      ollama pull qwen3:7b &&
      echo 'Models ready.'
      "
    environment:
      - OLLAMA_HOST=http://ollama:11434
    restart: "no" # Run once, then exit

volumes:
  ollama_models:
  open-webui-data:
Start the stack:
docker compose up -d
# Follow startup logs
docker compose logs -f ollama
# Check model init progress
docker compose logs model-init
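The fixed `sleep 10` in the model initializer is a guess about how long Ollama needs to start. A more robust pattern is to poll the API until it answers. The sketch below shows that idea in Python; it assumes it runs somewhere Python is available (for example on the host, against the published `127.0.0.1:11434` port), and the names `api_is_up` and `wait_until_ready` are illustrative, not part of Ollama.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def api_is_up(base_url: str = "http://127.0.0.1:11434") -> bool:
    """Return True if GET /api/tags answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=3) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() until it returns True or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Exit non-zero if Ollama never becomes reachable, so callers can abort
    raise SystemExit(0 if wait_until_ready(api_is_up) else 1)
```

Passing the probe and the sleep function as parameters keeps the loop easy to test without a running server.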
GPU Configuration: NVIDIA
For NVIDIA GPU passthrough in Docker:
# Install NVIDIA Container Toolkit (Ubuntu/Debian)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify GPU passthrough:
docker run --rm --gpus all nvidia/cuda:11.5.2-base-ubuntu20.04 nvidia-smi
Multi-GPU configuration:
# Use specific GPUs (by index)
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ['0', '1'] # Use GPU 0 and 1
          capabilities: [gpu]
AMD GPU Configuration
# AMD ROCm GPU passthrough
services:
  ollama:
    image: ollama/ollama:rocm # Use ROCm image
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
      - render
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.0.0 # Adjust for your GPU
Nginx Reverse Proxy and Authentication
Expose Ollama securely to your network with nginx handling authentication and HTTPS:
# /etc/nginx/conf.d/ollama.conf

# Rate limiting zone
limit_req_zone $binary_remote_addr zone=ollama_limit:10m rate=30r/m;

server {
    listen 443 ssl;
    server_name ollama.yourteam.internal; # Or your domain

    ssl_certificate /etc/ssl/certs/ollama.crt;
    ssl_certificate_key /etc/ssl/private/ollama.key;

    # Basic auth for team access
    auth_basic "Ollama API";
    auth_basic_user_file /etc/nginx/.htpasswd;

    # Rate limiting
    limit_req zone=ollama_limit burst=60 nodelay;

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Required for streaming responses
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 600s;
        proxy_send_timeout 600s;

        # Increase buffer for large model responses
        proxy_buffer_size 128k;
        proxy_buffers 4 256k;
    }
}

# Redirect HTTP to HTTPS
server {
    listen 80;
    server_name ollama.yourteam.internal;
    return 301 https://$server_name$request_uri;
}
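To get a feel for what `rate=30r/m` with `burst=60 nodelay` means for a single client, here is a simplified Python model of nginx's leaky-bucket accounting. It is a sketch based on my reading of how `limit_req` tracks "excess" requests, not a replacement for the nginx documentation, and `simulate_limit_req` is a hypothetical helper name.

```python
def simulate_limit_req(arrivals_s, rate_per_min=30, burst=60):
    """Return one boolean per arrival time: True if the request would pass.

    Simplified model: each accepted request adds 1 to an "excess" counter,
    the counter drains at the configured rate, and a request is rejected
    when accepting it would push the counter above `burst`.
    """
    rate_per_s = rate_per_min / 60.0
    excess = 0.0
    last = None
    verdicts = []
    for t in arrivals_s:
        if last is None:
            # First request from this client starts with an empty bucket
            verdicts.append(True)
            last = t
            continue
        candidate = max(excess - rate_per_s * (t - last) + 1.0, 0.0)
        if candidate > burst:
            verdicts.append(False)  # nginx rejects these (503 by default)
        else:
            excess = candidate
            last = t
            verdicts.append(True)
    return verdicts


# 70 requests fired at the same instant from one IP address
results = simulate_limit_req([0.0] * 70)
print(f"accepted: {sum(results)}, rejected: {len(results) - sum(results)}")
```

Under this model a client can send a burst of roughly 60 requests at once, after which it is held to one request every two seconds. If that is too generous for your GPU, lower `burst` rather than `rate`.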
Create user accounts:
# Install apache2-utils for htpasswd
sudo apt install apache2-utils
# Create password file and add users
sudo htpasswd -c /etc/nginx/.htpasswd user1
sudo htpasswd /etc/nginx/.htpasswd user2
sudo htpasswd /etc/nginx/.htpasswd user3
Self-signed certificate for internal use:
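# -addext needs OpenSSL 1.1.1 or newer; many clients require a subjectAltName and ignore the CN
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout /etc/ssl/private/ollama.key \
  -out /etc/ssl/certs/ollama.crt \
  -subj "/C=US/CN=ollama.yourteam.internal" \
  -addext "subjectAltName=DNS:ollama.yourteam.internal"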
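Once nginx is in front, every client has to send Basic auth credentials and trust the certificate. The following is a minimal client sketch using only the Python standard library. It assumes the hostname from the config above, a user created with `htpasswd`, and that the self-signed certificate has been copied to the client as `ca_file`; `build_request` and `list_models` are illustrative names.

```python
import base64
import json
import ssl
import urllib.request


def build_request(base_url: str, user: str, password: str,
                  path: str = "/api/tags") -> urllib.request.Request:
    """Build a request carrying an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(base_url.rstrip("/") + path)
    req.add_header("Authorization", f"Basic {token}")
    return req


def list_models(base_url: str, user: str, password: str, ca_file: str) -> list:
    """Return the installed model names via the authenticated proxy."""
    # Trust only the given certificate file instead of disabling verification
    ctx = ssl.create_default_context(cafile=ca_file)
    req = build_request(base_url, user, password)
    with urllib.request.urlopen(req, context=ctx, timeout=10) as resp:
        return [m["name"] for m in json.load(resp).get("models", [])]


# Example call (requires the proxy to be running):
# list_models("https://ollama.yourteam.internal", "user1", "password",
#             "/path/to/ollama.crt")
```

Pinning the certificate with `cafile` keeps TLS verification on, which is preferable to switching it off for self-signed setups.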
Systemd Service: Non-Docker Production
For production Linux servers where Docker overhead is undesirable:
# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network-online.target
Wants=network-online.target
[Service]
Type=exec
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
# Security
NoNewPrivileges=yes
PrivateTmp=yes
# Performance settings
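# Bind to loopback so only the nginx proxy can reach the API.
# Use 0.0.0.0 only if a firewall restricts access to port 11434.
Environment="OLLAMA_HOST=127.0.0.1:11434"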
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_MODELS=/var/lib/ollama/models"
# GPU (NVIDIA)
Environment="CUDA_VISIBLE_DEVICES=0"
[Install]
WantedBy=multi-user.target
# Create ollama system user
sudo useradd -r -s /bin/false -m -d /var/lib/ollama ollama
sudo usermod -a -G video ollama # For GPU access
# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
sudo systemctl status ollama
Model Management in Production
Automated Model Sync Script
#!/bin/bash
# /usr/local/bin/ollama-sync-models.sh
# Run via cron to ensure required models are always available
set -e

REQUIRED_MODELS=(
    "llama4:scout"
    "qwen3:7b"
    "nomic-embed-text"
    "gemma4:9b"
)
LOG_FILE="/var/log/ollama-model-sync.log"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

log "Starting model sync..."

# Check Ollama is running
if ! curl -sf http://localhost:11434/api/tags > /dev/null; then
    log "ERROR: Ollama is not running"
    exit 1
fi

# Get list of installed models
INSTALLED=$(curl -s http://localhost:11434/api/tags | \
    python3 -c "import json,sys; data=json.load(sys.stdin); \
print('\n'.join([m['name'] for m in data.get('models',[])]))")

for model in "${REQUIRED_MODELS[@]}"; do
    # /api/tags reports untagged models as "name:latest", so normalize before comparing
    case "$model" in
        *:*) wanted="$model" ;;
        *)   wanted="$model:latest" ;;
    esac
    # -x matches the whole line, -F treats dots in model names literally
    if echo "$INSTALLED" | grep -qxF "$wanted"; then
        log " ✓ $model (already installed)"
    else
        log " ↓ Pulling $model..."
        ollama pull "$model"
        log " ✓ $model (pulled successfully)"
    fi
done

log "Model sync complete."
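# Make executable and add to cron
sudo chmod +x /usr/local/bin/ollama-sync-models.sh
# Append to root's crontab: run daily at 3 AM to restore any required model that is missing.
# (Piping a bare echo into "crontab -" would replace every existing entry.)
(sudo crontab -l 2>/dev/null; echo "0 3 * * * /usr/local/bin/ollama-sync-models.sh") | sudo crontab -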
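The core of the sync script is a set comparison between the models you require and the models `/api/tags` reports. One detail is easy to miss: a model pulled without a tag shows up as `name:latest`. The Python sketch below isolates that comparison; `normalize` and `missing_models` are hypothetical helper names, and the model lists are only examples.

```python
def normalize(name: str) -> str:
    """Append ':latest' when a model name carries no explicit tag."""
    # Only look at the last path segment so a registry port is not mistaken for a tag
    last_segment = name.rsplit("/", 1)[-1]
    return name if ":" in last_segment else f"{name}:latest"


def missing_models(required: list, installed: list) -> list:
    """Return the required models that are not installed, in the original order."""
    have = {normalize(n) for n in installed}
    return [m for m in required if normalize(m) not in have]


installed = ["llama4:scout", "nomic-embed-text:latest"]
required = ["llama4:scout", "qwen3:7b", "nomic-embed-text"]
print(missing_models(required, installed))
```

Here only `qwen3:7b` is reported as missing, because `nomic-embed-text` matches its `:latest` entry.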
Health Monitoring
Simple Health Check Script
#!/usr/bin/env python3
# health_check.py
import json
import subprocess
import sys
import time
from datetime import datetime

import requests

OLLAMA_HOST = "http://localhost:11434"
TEST_MODEL = "qwen3:7b"  # Fast model used for the inference probe


def check_ollama_health() -> dict:
    """Comprehensive Ollama health check."""
    results = {
        "timestamp": datetime.now().isoformat(),
        "status": "unknown",
        "checks": {},
    }

    # Check 1: API reachability
    try:
        resp = requests.get(f"{OLLAMA_HOST}/api/tags", timeout=5)
        resp.raise_for_status()
        models = resp.json().get("models", [])
        results["checks"]["api"] = {
            "status": "pass",
            # /api/tags lists installed models, not the ones currently in memory
            "models_installed": len(models),
        }
    except Exception as e:
        results["checks"]["api"] = {"status": "fail", "error": str(e)}
        results["status"] = "down"
        return results

    # Check 2: Test inference (fast model)
    try:
        start = time.time()
        resp = requests.post(
            f"{OLLAMA_HOST}/api/generate",
            json={"model": TEST_MODEL, "prompt": "Hi", "stream": False},
            timeout=30,
        )
        resp.raise_for_status()  # A missing model returns an HTTP error, not an exception
        duration = time.time() - start
        data = resp.json()
        # eval_duration is in nanoseconds; fall back to wall-clock time if absent
        eval_s = data.get("eval_duration", 0) / 1e9 or duration
        tokens_per_sec = data.get("eval_count", 0) / max(eval_s, 0.001)
        results["checks"]["inference"] = {
            "status": "pass",
            "response_time_s": round(duration, 2),
            "tokens_per_second": round(tokens_per_sec, 1),
        }
    except Exception as e:
        results["checks"]["inference"] = {"status": "fail", "error": str(e)}

    # Check 3: GPU memory (NVIDIA only)
    try:
        gpu_info = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used,memory.free",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=5, check=True,
        )
        gpus = []
        # nvidia-smi prints one line per GPU
        for line in gpu_info.stdout.strip().splitlines():
            used, free = (int(x) for x in line.split(","))
            gpus.append({"used_mb": used, "free_mb": free})
        results["checks"]["gpu_memory"] = {"status": "pass", "gpus": gpus}
    except Exception:
        results["checks"]["gpu_memory"] = {"status": "skip", "reason": "nvidia-smi not available"}

    # Overall status
    failed = [k for k, v in results["checks"].items() if v.get("status") == "fail"]
    results["status"] = "unhealthy" if failed else "healthy"
    return results


if __name__ == "__main__":
    health = check_ollama_health()
    print(json.dumps(health, indent=2))
    if health["status"] != "healthy":
        sys.exit(1)
Add Health Check to Docker Compose
services:
  ollama:
    image: ollama/ollama:latest
    healthcheck:
      # The ollama image does not ship curl, so probe the API with the bundled CLI
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s # Give time for model loading
Performance Tuning for Production
Key Environment Variables
# Concurrent requests (default: 1)
# Higher = more throughput but more VRAM per request
OLLAMA_NUM_PARALLEL=2
# Max models kept in memory (default: 3 × the number of GPUs, or 3 for CPU inference)
OLLAMA_MAX_LOADED_MODELS=3
# How long to keep models loaded after last request
OLLAMA_KEEP_ALIVE=30m
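# Flash Attention (faster, lower memory usage)
OLLAMA_FLASH_ATTENTION=1
# K/V cache quantization (reduces VRAM at slight quality cost; requires Flash Attention)
OLLAMA_KV_CACHE_TYPE=q8_0
# CPU threads (for CPU-based inference) are set per model, not through a server variable:
# use "PARAMETER num_thread 8" in a Modelfile or "options": {"num_thread": 8} in an API request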
# Restrict to specific GPU
CUDA_VISIBLE_DEVICES=0
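To see why `OLLAMA_NUM_PARALLEL` and `OLLAMA_KV_CACHE_TYPE` matter for VRAM, it helps to estimate the KV cache size. The sketch below uses the standard transformer estimate (keys plus values, per layer, per KV head, per token) and scales it by the number of parallel slots. The model dimensions in the example are hypothetical, and the real allocation Ollama makes can differ, so treat the numbers as a rough guide.

```python
# Approximate bytes per cached element for each cache type.
# The quantized types store 32 values per block plus a 2-byte scale.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, num_parallel: int = 1,
                   cache_type: str = "f16") -> int:
    """Estimate KV cache size in bytes for one loaded model."""
    # Factor 2 covers keys and values
    per_token = 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_ELEMENT[cache_type]
    return int(per_token * context_len * num_parallel)


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 8192-token context
for parallel in (1, 2, 4):
    for cache_type in ("f16", "q8_0"):
        size = kv_cache_bytes(32, 8, 128, 8192, parallel, cache_type)
        print(f"parallel={parallel} {cache_type}: {size / 1024**3:.2f} GiB")
```

For this hypothetical model, each extra parallel slot costs about 1 GiB at `f16` and a little over half that at `q8_0`, on top of the model weights.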
Response Queue Management
For shared servers handling multiple users, implement request queuing:
# request_queue.py: simple request queue for multiple users
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException
from ollama import AsyncClient

MAX_QUEUED = 50      # Max 50 queued requests
MAX_CONCURRENT = 2   # Based on your VRAM

client = AsyncClient()  # Async client so inference does not block the event loop
request_queue: asyncio.Queue = asyncio.Queue(maxsize=MAX_QUEUED)


async def worker():
    """Process requests from the queue, one at a time per worker."""
    while True:
        prompt, model, future = await request_queue.get()
        try:
            response = await client.generate(model=model, prompt=prompt)
            if not future.done():
                future.set_result(response["response"])
        except Exception as e:
            if not future.done():
                future.set_exception(e)
        finally:
            request_queue.task_done()


@asynccontextmanager
async def lifespan(app: FastAPI):
    # MAX_CONCURRENT workers means at most that many requests reach Ollama at once
    workers = [asyncio.create_task(worker()) for _ in range(MAX_CONCURRENT)]
    yield
    for w in workers:
        w.cancel()


app = FastAPI(lifespan=lifespan)


@app.post("/generate")
async def generate(prompt: str, model: str = "llama4:scout"):
    if request_queue.full():
        raise HTTPException(status_code=503, detail="Server busy, try again in a moment")
    future = asyncio.get_running_loop().create_future()
    request_queue.put_nowait((prompt, model, future))
    return {"response": await future}
Backup and Disaster Recovery
#!/bin/bash
# backup-ollama.sh: back up model metadata and configurations
# Note: Model weights themselves don't need backup (re-pull from ollama.com)
set -euo pipefail

BACKUP_DIR="/backup/ollama"
DATE=$(date +%Y%m%d)
# Default location for a user install; the systemd setup above uses /var/lib/ollama/models
MODELS_DIR="${OLLAMA_MODELS:-$HOME/.ollama/models}"

mkdir -p "$BACKUP_DIR/$DATE"

# Backup model manifests (fast: just metadata, not weights)
cp -r "$MODELS_DIR/manifests" "$BACKUP_DIR/$DATE/manifests"

# Backup custom Modelfiles
if [ -d ~/modelfiles ]; then
    cp -r ~/modelfiles "$BACKUP_DIR/$DATE/modelfiles"
fi

# Backup Open WebUI data if applicable
# (Compose usually prefixes volume names with the project name; check "docker volume ls")
if docker volume inspect open-webui-data &>/dev/null; then
    docker run --rm \
        -v open-webui-data:/data:ro \
        -v "$BACKUP_DIR/$DATE":/backup \
        alpine tar czf /backup/open-webui-data.tar.gz -C /data .
fi

# Keep last 7 days (-mindepth 1 protects $BACKUP_DIR itself from deletion)
find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +

echo "Backup complete: $BACKUP_DIR/$DATE"
Common Production Issues
Issue: Model loads slowly on first request
Fix: Add a warmup request to the startup script:
# After starting Ollama, pre-load the primary model
curl -s http://localhost:11434/api/generate \
-d '{"model":"llama4:scout","prompt":"warm","stream":false,"keep_alive":-1}' > /dev/null
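After a warmup request it is worth confirming that the model really is resident, and how much of it landed in VRAM. Ollama's `/api/ps` endpoint lists the currently loaded models. The sketch below parses that response; it assumes the `name`, `size`, and `size_vram` fields in the payload, and `loaded_models` and `fetch_ps` are illustrative helper names.

```python
import json
import urllib.request


def loaded_models(ps_payload: dict) -> dict:
    """Map each loaded model name to the fraction of its memory held in VRAM."""
    fractions = {}
    for m in ps_payload.get("models", []):
        size = m.get("size", 0)
        fractions[m["name"]] = (m.get("size_vram", 0) / size) if size else 0.0
    return fractions


def fetch_ps(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of currently loaded models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp)


# Example with a hand-written payload (sizes in bytes, values are made up)
sample = {"models": [{"name": "llama4:scout", "size": 1000, "size_vram": 750}]}
print(loaded_models(sample))
```

A fraction below 1.0 means part of the model was offloaded to system RAM, which usually explains slow generation better than any tuning flag.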
Issue: Out of VRAM with multiple models
Fix: Set OLLAMA_MAX_LOADED_MODELS=1 and increase OLLAMA_KEEP_ALIVE for your primary model. Let secondary models unload and reload.
Issue: Slow responses under concurrent load
Fix: Some slowdown is expected. Ollama processes at most OLLAMA_NUM_PARALLEL requests per model at once and queues the rest. Increase OLLAMA_NUM_PARALLEL only if you have VRAM headroom, because each parallel request needs its own KV cache allocation.
Issue: Ollama crashes after system update
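Fix: On non-Docker installs, re-run the install script after major system updates so that Ollama and the GPU drivers stay compatible: curl -fsSL https://ollama.com/install.sh | sh. On the Docker setup, re-run the GPU passthrough verification command from the NVIDIA section instead.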
Conclusion
Production Ollama is straightforward with Docker Compose as the foundation. The Docker Compose configuration in this guide — with GPU passthrough, persistent storage, health checks, and model initialization — gives you a reliable base in under 30 minutes.
Your next step: Copy the Docker Compose configuration, adjust the model list to match your use case, run docker compose up -d, and verify the health check passes. You have a production Ollama service.
Last updated: May 2026. Docker and NVIDIA Container Toolkit versions update regularly. Verify GPU passthrough compatibility at docs.nvidia.com/datacenter/cloud-native/container-toolkit.