Every time you send a message to ChatGPT, Claude, or Gemini, that message travels to a server you do not control. It is processed by infrastructure you do not own. It may be retained, logged, or used to train future models. And it costs you either subscription fees or API credits.
Ollama changes all of this. Ollama is a tool that lets you download, manage, and run AI models entirely on your own computer — no internet connection required during inference, no API keys, no subscriptions, no data leaving your machine. Your prompts and responses stay local.
The tradeoff is real: local models are generally less capable than the frontier cloud models. GPT-5.5 and Claude Opus 4.5 are significantly more capable than anything you can run locally on a consumer laptop today. But the gap has narrowed dramatically in 2026. The current best local models — Llama 4 Scout, Kimi K2.6, Qwen 3.6, Gemma 4 — handle the majority of professional tasks that most people use AI for, with performance that would have required cloud access just 18 months ago.
If you value privacy, want to work offline, need to process sensitive data, or want to eliminate AI subscription costs — Ollama is worth understanding.
This guide walks you through installing Ollama, running your first model, and setting up a workflow that can replace most cloud AI interactions with private local alternatives.
🔗 Welcome to Ollama Unlocked. This is Post #1 — the foundation. The series covers the model selection guide, hardware requirements, Open WebUI, RAG with Ollama, the Ollama API, building apps, and much more. Read this first; every other post builds on it.
What Ollama Is and How It Works
Ollama is an open-source tool that manages the entire local LLM workflow:
- Downloads model files from the Ollama library (4,500+ models as of May 2026)
- Manages model storage and VRAM allocation automatically
- Serves models through a REST API on your local machine (port 11434 by default)
- Provides a command-line interface for running models interactively
- Handles quantization, GPU/CPU routing, and memory management without manual configuration
The current version is Ollama v0.24.0 (released May 14, 2026), with v0.30.0 in release candidate. The Python library version is 0.6.2 (April 29, 2026).
Ollama uses GGUF format for model files — the current standard for quantized local models — and uses llama.cpp directly for inference (since the v0.24 architecture change). On Apple Silicon, it uses MLX for hardware-accelerated inference.
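The local REST API mentioned in the list above can be probed from any language. The following is a minimal Python sketch, assuming the server is running at the default address and that the `/api/tags` endpoint returns a `models` array with `name` and `size` (bytes) fields; the helper names are illustrative, not part of Ollama.

```python
import json
import urllib.request

# Default address from the text; adjust if you changed the port.
OLLAMA_URL = "http://localhost:11434"


def summarize_tags(payload: dict) -> list[tuple[str, float]]:
    """Turn an /api/tags response into (model name, size in GB) pairs."""
    return [
        (m["name"], round(m.get("size", 0) / 1e9, 1))
        for m in payload.get("models", [])
    ]


def list_local_models(base_url: str = OLLAMA_URL) -> list[tuple[str, float]]:
    """Ask the local server which models are downloaded (makes a network call)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return summarize_tags(json.load(resp))
```

Calling `list_local_models()` while Ollama is running should return roughly the same information as `ollama list`; if the connection is refused, the server is not running.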
Installation: Mac, Windows, and Linux
Mac (Recommended Starting Point)
Ollama on Mac has the best out-of-the-box experience, particularly on Apple Silicon (M1/M2/M3/M4).
Method 1 — Direct download (easiest):
- Go to ollama.com/download
- Download the Mac installer
- Open the .zip and drag Ollama to Applications
- Launch Ollama; it runs as a menu bar app
- Open Terminal and run:
ollama run llama4:scout
Method 2 — Homebrew:
brew install ollama
brew services start ollama
Verify installation:
ollama --version
# ollama version is 0.24.0
Apple Silicon advantage: M-series chips have unified memory, so the GPU and CPU share the same memory pool. A Mac with 24GB unified memory can run larger models than a Windows PC whose GPU has 12GB of dedicated VRAM. An M4 Pro Mac with 48GB can load 4-bit 70B-parameter models, though with little memory to spare.
Windows
- Go to ollama.com/download
- Download OllamaSetup.exe
- Run the installer; it sets Ollama up to run in the background
- Open PowerShell or Windows Terminal
- Run:
ollama run llama4:scout
GPU support: Ollama on Windows detects NVIDIA (CUDA) and AMD (ROCm) GPUs automatically; no manual configuration is needed.
WSL2 note: Ollama runs natively on Windows now. WSL2 is no longer required, though WSL2 with GPU passthrough still works.
Linux
# Official install script (fetches latest stable)
curl -fsSL https://ollama.com/install.sh | sh
# Verify
ollama --version
# Check the system service (the installer creates and starts it)
sudo systemctl status ollama
GPU setup on Linux:
# NVIDIA — ensure drivers are installed
nvidia-smi # Should show your GPU
# AMD — ROCm support
# Ollama detects ROCm automatically if installed
rocm-smi # Verify ROCm installation
Updating:
# Re-run the install script — it updates in place
curl -fsSL https://ollama.com/install.sh | sh
# Or via your package manager if you installed it that way (snap shown here)
sudo snap refresh ollama
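The verification step is the same on all three platforms, so it can be scripted. This is a small Python sketch, assuming only that `ollama --version` prints a dotted version number somewhere in its output; the function names are hypothetical.

```python
import re
import shutil
import subprocess


def parse_version(output: str):
    """Extract (major, minor, patch) from `ollama --version` output, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    return tuple(int(part) for part in match.groups()) if match else None


def installed_version():
    """Return the installed Ollama version tuple, or None if the CLI is missing."""
    if shutil.which("ollama") is None:
        return None  # the ollama binary is not on PATH
    result = subprocess.run(
        ["ollama", "--version"], capture_output=True, text=True, check=False
    )
    return parse_version(result.stdout + result.stderr)
```

Because versions are returned as tuples, a check such as `installed_version() >= (0, 24, 0)` compares them numerically rather than as strings.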
Running Your First Model
With Ollama installed, one command downloads and runs a model:
ollama run llama4:scout
What happens:
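- Ollama checks if llama4:scout is already downloaded
- If not, it downloads the quantized model file (size varies by model and quantization: a 4-bit 7–8B model is roughly 4–5 GB, while Llama 4 Scout's 109B total parameters make its download far larger)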
- Loads the model into memory (GPU VRAM + system RAM as needed)
- Opens an interactive chat session in your terminal
You are now running a state-of-the-art AI model entirely locally.
To exit: Type /bye or press Ctrl+D
The Essential Ollama Commands
# Pull a model without running it (pre-download)
ollama pull llama4:scout
# Run a model interactively
ollama run llama4:scout
# Run with a one-shot prompt (non-interactive)
ollama run llama4:scout "Explain quantum entanglement in plain English"
# List all downloaded models
ollama list
# Show model info (size, parameters, quantization)
ollama show llama4:scout
# Remove a model (frees disk space)
ollama rm llama4:scout
# Check what's currently running
ollama ps
# Pull a specific model size/quantization
ollama pull qwen3.6:27b
# Run with a larger context window (set inside the interactive session)
ollama run llama4:scout
>>> /set parameter num_ctx 32768
# Start the Ollama server manually (if not auto-starting)
ollama serve
The 2026 Model Library: What to Download First
The Ollama library contains 4,500+ models as of May 2026. Most users need 3–5 models to cover all use cases. Here are the current best choices:
For General Use (Best First Model)
ollama pull llama4:scout
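Llama 4 Scout (Meta, April 2025): the current best general-purpose local model if your hardware can hold it. It uses a Mixture-of-Experts (MoE) architecture with 17B parameters active per token out of 109B total. MoE lowers the compute needed per token, but all 109B parameters still have to sit in memory, so the 4-bit download runs to tens of gigabytes; check the size listed at ollama.com/library before pulling. Excellent for conversation, writing, analysis, and reasoning.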
For Coding
ollama pull kimi-k2.6 # Best overall coding (MIT licensed)
ollama pull qwen3.6:27b # Best dense coding model, 77.2% SWE-bench
ollama pull devstral:24b # Best agentic coding
For Reasoning
ollama pull deepseek-r1:7b # Fast local reasoning (7B)
ollama pull deepseek-r1:14b # Better reasoning (14B)
ollama pull deepseek-r1:32b # Best local reasoning (32B)
For Efficiency (Low Hardware Requirements)
ollama pull llama3.2:3b # Runs on any hardware (3B)
ollama pull gemma3:4b # Google's efficient model (Gemma 3 ships in 1B, 4B, 12B, and 27B sizes)
ollama pull phi3:mini # Microsoft's tiny but capable model
For Vision (Image Understanding)
ollama pull gemma4:9b # Vision + tool calling
ollama pull llama3.2-vision:11b # Llama vision model
Quick Reference: Model Sizes vs Hardware
| Model | VRAM Required | Best For |
|---|---|---|
| Any 3B model | 2–4 GB | Older hardware, basic tasks |
| Any 7–8B model | 6–8 GB | Most consumer GPUs, everyday use |
| Any 14B model | 10–12 GB | RTX 3080/4070, M2 Pro 16GB |
| Any 27–32B model | 18–24 GB | RTX 4090, M3 Max 36GB |
| Llama 4 Scout (MoE) | ~65 GB or more (all 109B parameters stay in memory) | High-memory Macs, multi-GPU setups |
| Kimi K2.6 (MoE) | ~20 GB | High-end GPUs |
| Any 70B model | 40–48 GB | M4 Max, multi-GPU setups |
Running without a GPU: All models can run on CPU — it is slow but functional. A 7B model runs at ~5–10 tokens/second on a modern CPU. Fast enough to use; too slow to feel snappy.
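The figures in the table follow from simple arithmetic: weight memory is roughly parameter count times bits per weight divided by 8. Here is a rough Python sketch of that estimate. The 4.5 bits per weight (a typical 4-bit quantization including scaling data) and the 20% overhead for context cache and runtime buffers are assumptions for illustration, not values published by Ollama.

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: float = 4.5,
                       overhead: float = 1.2) -> float:
    """Rough memory needed to load a quantized model, in GB.

    bits_per_weight and overhead are illustrative assumptions.
    """
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return round(weights_gb * overhead, 1)


def fits(params_billion: float, available_gb: float) -> bool:
    """True if the estimated footprint fits in the given VRAM or unified memory."""
    return estimate_memory_gb(params_billion) <= available_gb
```

For example, `fits(14, 12)` returns True while `fits(32, 16)` returns False, which lines up with the 14B and 27–32B rows above. The table leaves somewhat more headroom than this estimate, which is sensible for longer contexts.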
The Ollama API: Your Local OpenAI Replacement
Ollama serves a REST API on http://localhost:11434. Its native endpoints live under /api, and an OpenAI-compatible layer is exposed under /v1, so most applications built for OpenAI can point to Ollama instead with minimal changes.
Direct API call:
curl http://localhost:11434/api/chat \
-d '{
"model": "llama4:scout",
"messages": [
{"role": "user", "content": "Why is the sky blue?"}
]
}'
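The same request can be made from Python with only the standard library. This sketch assumes the native `/api/chat` endpoint streams newline-delimited JSON by default and returns a single JSON object when `"stream": false` is sent; the function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address from the text


def build_chat_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Build the JSON body for /api/chat."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }


def join_stream(lines) -> str:
    """Concatenate the message fragments of a streamed (one JSON per line) reply."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
    return "".join(parts)


def chat(model: str, prompt: str, base_url: str = OLLAMA_URL) -> str:
    """Send one non-streaming chat request and return the reply text (network call)."""
    body = json.dumps(build_chat_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["message"]["content"]
```

With the server running, `chat("llama4:scout", "Why is the sky blue?")` returns the full answer as one string. `join_stream` is only needed if you leave streaming on and read the response line by line.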
Using the OpenAI compatibility endpoint:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama4:scout",
"messages": [
{"role": "user", "content": "Why is the sky blue?"}
]
}'
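This means you can use the OpenAI Python library pointed at your local Ollama server. In most cases the only changes are the base URL (http://localhost:11434/v1), a placeholder API key (the client requires one, but Ollama ignores it), and the model name.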
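As a sketch of what that looks like in practice, the snippet below points the OpenAI Python client at the local server. It assumes the third-party `openai` package is installed (`pip install openai`); the `client_settings` and `ask` helpers are hypothetical names, and the API key is an arbitrary placeholder.

```python
def client_settings(host: str = "localhost", port: int = 11434) -> dict:
    """Connection settings for Ollama's OpenAI-compatible /v1 endpoint."""
    return {
        "base_url": f"http://{host}:{port}/v1",
        # Placeholder: the client insists on a key, Ollama does not check it.
        "api_key": "ollama",
    }


def ask(prompt: str, model: str = "llama4:scout") -> str:
    """Send one chat completion to the local server and return the reply text."""
    from openai import OpenAI  # third-party dependency, imported lazily

    client = OpenAI(**client_settings())
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content
```

Existing OpenAI-based code usually needs only the two values from `client_settings()` and a model name that exists in your local `ollama list`.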
Adding a User Interface: Open WebUI
The command-line interface works but most users prefer a chat interface. Open WebUI is the best option — it is free, open-source, and runs locally.
Install Open WebUI with Docker:
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Then open http://localhost:3000 in your browser. You get a ChatGPT-like interface connected to your local Ollama models.
Without Docker (using pip; the server then listens on http://localhost:8080 by default):
pip install open-webui
open-webui serve
Open WebUI is covered in detail in Open WebUI: The Best Interface for Running Ollama (Post #4 in this series).
Ollama Cloud: When Local Hardware Is Not Enough
Ollama v0.24 introduced cloud model offloading — the ability to run massive models (671B+) via Ollama’s cloud service while keeping your local Ollama workflow.
# Cloud models require signing in to ollama.com
ollama pull deepseek-v3.1:671b-cloud
ollama pull gpt-oss:120b-cloud
ollama pull kimi-k2:1t-cloud
Cloud models use the same Ollama API — your code does not change. You pay per token for cloud inference instead of buying hardware. This gives you access to frontier-scale models through the same workflow as your local models.
This is not the privacy-first local AI experience — cloud models send data to Ollama’s servers. But it gives you one consistent tool for both local and cloud AI.
Integrating Ollama With Claude Desktop and Codex
Ollama v0.24.0 added native support for the Codex App (OpenAI’s desktop coding agent) and Claude Desktop:
# Launch Codex App with Ollama backend
ollama launch codex-app
# Restore previous session
ollama launch codex-app --restore
Claude Desktop can connect to Ollama’s local server for private Claude-workflow-compatible interactions. This means using Claude Desktop’s interface with locally running open models — preserving the Claude UI experience while keeping inference local.
This integration is covered fully in Ollama for Developers: The Complete Local AI Dev Environment (Post #13).
Privacy and Security: What “Local” Actually Means
When running models locally with Ollama:
What stays on your machine:
- Your prompts and questions
- The model’s responses
- Conversation history
- Any documents or data you provide
What Ollama.com receives (when you pull models):
- The model download request (model name + your IP address)
- This is equivalent to a package manager download — not your actual prompts
What Ollama.com does NOT receive:
- Any of your actual conversations
- Any data you process through local models
- Any documents, images, or files you analyze
For maximum privacy: Pull models once on a connected network. After pulling, you can run Ollama completely offline — no network connection required for inference.
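If you script against Ollama, one way to enforce the distinction above is a guard that refuses to send sensitive prompts anywhere but the local machine. This is a hedged sketch: it checks only the endpoint address and the model tag, and it assumes cloud-hosted models carry a `-cloud` suffix as in the examples in the Ollama Cloud section. The function names are illustrative.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_endpoint(url: str) -> bool:
    """True if the URL's host is localhost or a loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # some other DNS name: treat it as remote


def stays_on_machine(url: str, model: str) -> bool:
    """True if both the endpoint and the model look local."""
    # Assumption: cloud models are tagged with a "-cloud" suffix.
    return is_local_endpoint(url) and not model.endswith("-cloud")
```

A script handling confidential documents could call `stays_on_machine(base_url, model)` before each request and abort when it returns False.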
Common Setup Mistakes
Mistake 1: Choosing a model too large for your hardware
A 70B model on a machine with 16GB RAM may technically load, but it will run extremely slowly because most of the model spills to disk. Match your model size to your available RAM/VRAM. A well-matched 7B model is far more useful in practice than a swapping 70B model.
Mistake 2: Not verifying GPU detection
After installing Ollama, run ollama run llama3.2:3b, then run ollama ps in a second terminal while the model is loaded. The PROCESSOR column shows how the model is split between GPU and CPU. If it reports CPU when you expected GPU, your GPU drivers may need updating.
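The same check can be done programmatically. The sketch below assumes the `/api/ps` endpoint returns a `models` array whose entries include `name`, `size` (total bytes loaded), and `size_vram` (bytes resident on the GPU); the helper names are illustrative.

```python
import json
import urllib.request


def gpu_fraction(model_entry: dict) -> float:
    """Share of a loaded model that sits in GPU memory (0.0 to 1.0)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def describe_placement(payload: dict) -> list[str]:
    """Render an /api/ps response as one 'name: X% GPU / Y% CPU' line per model."""
    lines = []
    for m in payload.get("models", []):
        pct = round(gpu_fraction(m) * 100)
        lines.append(f"{m['name']}: {pct}% GPU / {100 - pct}% CPU")
    return lines


def running_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Query the local server for loaded models (network call)."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return describe_placement(json.load(resp))
```

A result of `0% GPU / 100% CPU` on a machine with a supported GPU is the signal to check drivers before judging model speed.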
Mistake 3: Using old model names
The model library updates constantly. llama2 is outdated — use llama3.2 or llama4:scout. mistral refers to the 7B v0.3 — newer Mistral models have specific names. Check ollama.com/library for current model names.
Mistake 4: Not setting up Open WebUI or another interface
The terminal is functional but uncomfortable for regular use. Ten minutes spent setting up Open WebUI transforms the experience.
Mistake 5: Ignoring context length settings
Default context lengths vary by model and are often much shorter than the model's maximum. For document analysis, you often need a longer context, which you can set inside an interactive session:
ollama run llama4:scout
>>> /set parameter num_ctx 32768
Without setting this, long documents get truncated silently.
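When calling the API instead of the CLI, the context window is passed per request through the `options.num_ctx` field. The sketch below picks a window large enough for a document before building the request. The four-characters-per-token ratio is a rough rule of thumb (an assumption, not an Ollama figure), and the candidate sizes and reply budget are illustrative.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token count; real tokenizers vary by model and language."""
    return int(len(text) / chars_per_token) + 1


def pick_num_ctx(text: str, reply_budget: int = 1024,
                 choices=(4096, 8192, 16384, 32768)):
    """Smallest candidate window that fits the text plus room for the reply."""
    needed = estimate_tokens(text) + reply_budget
    for size in choices:
        if needed <= size:
            return size
    return None  # too long for any candidate: split the document instead


def build_request(model: str, text: str, num_ctx: int) -> dict:
    """Body for /api/chat with an explicit context window."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": text}],
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
```

Larger windows use more memory, so choosing the smallest one that fits keeps more of the model on the GPU. A `None` result is a cue to chunk the document rather than let it be cut off.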
Conclusion
Ollama is the most accessible path to private, local AI. Installation takes five minutes. Running your first model takes one command. The Ollama library gives you access to 4,500+ models — including the current best open-source models in the world, running entirely on your hardware.
The experience in May 2026 is qualitatively different from early local AI — models that genuinely handle professional work, not just demonstrate the concept. Llama 4 Scout on a gaming PC is a meaningful AI tool, not a toy.
Your next step: Install Ollama from ollama.com/download. Run ollama run llama4:scout. Have one real conversation with a genuinely capable local model. That experience will tell you whether local AI is worth pursuing for your specific situation.
Last updated: May 2026. Ollama releases updates very frequently, so verify the current version at ollama.com and github.com/ollama/ollama/releases. Current version as of this writing: v0.24.0 stable, with v0.30.0-rc20 as the release candidate.
⚠️ Local model performance depends heavily on your hardware. Always verify your GPU is detected by Ollama before drawing conclusions about model quality. A CPU-only run at 3 tokens/second is not representative of the model’s actual capability.