Ollama Masterclass 2026: Install, Run, and Master Local AI in 30 Minutes

Every time you send a message to ChatGPT, Claude, or Gemini, that message travels to a server you do not control. It is processed by infrastructure you do not own. It may be retained, logged, or used to train future models. And it costs you either subscription fees or API credits.

Ollama changes all of this. Ollama is a tool that lets you download, manage, and run AI models entirely on your own computer — no internet connection required during inference, no API keys, no subscriptions, no data leaving your machine. Your prompts and responses stay local.

The tradeoff is real: local models are generally less capable than the frontier cloud models. GPT-5.5 and Claude Opus 4.5 are significantly more capable than anything you can run locally on a consumer laptop today. But the gap has narrowed dramatically in 2026. The current best local models — Llama 4 Scout, Kimi K2.6, Qwen 3.6, Gemma 4 — handle the majority of professional tasks that most people use AI for, with performance that would have required cloud access just 18 months ago.

If you value privacy, want to work offline, need to process sensitive data, or want to eliminate AI subscription costs, Ollama is worth understanding.

This guide installs Ollama, runs your first model, and gives you the complete workflow to replace most cloud AI interactions with private local alternatives.

🔗 Welcome to Ollama Unlocked. This is Post #1 — the foundation. The series covers the model selection guide, hardware requirements, Open WebUI, RAG with Ollama, the Ollama API, building apps, and much more. Read this first; every other post builds on it.

What Ollama Is and How It Works

Ollama is an open-source tool that manages the entire local LLM workflow:

Downloads model files from the Ollama library (4,500+ models as of May 2026)
Manages model storage and VRAM allocation automatically
Serves models through a REST API on your local machine (port 11434 by default)
Provides a command-line interface for running models interactively
Handles quantization, GPU/CPU routing, and memory management without manual configuration
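
As a rough illustration of the local REST API mentioned in the list above, the Python sketch below asks a running Ollama server which models it has downloaded. It assumes the default address http://localhost:11434 and the /api/tags endpoint; the helper names (model_names, list_local_models) are invented for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address; adjust if yours differs


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from a decoded /api/tags response body."""
    return [m["name"] for m in tags_response.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask a running Ollama server which models are downloaded."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

Calling list_local_models() with the server running should return roughly the same names that ollama list prints; if the server is not running, the call raises a connection error.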

The current version is Ollama v0.24.0 (released May 14, 2026), with v0.30.0 in release candidate. The Python library version is 0.6.2 (April 29, 2026).

Ollama uses GGUF format for model files — the current standard for quantized local models — and uses llama.cpp directly for inference (since the v0.24 architecture change). On Apple Silicon, it uses MLX for hardware-accelerated inference.

Installation: Mac, Windows, and Linux

Mac (Recommended Starting Point)

Ollama on Mac has the best out-of-the-box experience, particularly on Apple Silicon (M1/M2/M3/M4).

Method 1 — Direct download (easiest):

Go to ollama.com/download
Download the Mac installer
Open the .zip and drag Ollama to Applications
Launch Ollama — it runs as a menu bar app
Open Terminal and run: ollama run llama4:scout

Method 2 — Homebrew:

brew install ollama
brew services start ollama

Verify installation:

ollama --version
# ollama version 0.24.0

Apple Silicon advantage: M-series chips have unified memory, so the GPU and CPU share the same memory pool. A Mac with 24GB unified memory can run larger models than a Windows PC with a 12GB dedicated VRAM GPU. An M4 Pro Mac with 48GB can hold a 70B-parameter model at 4-bit quantization (roughly 40GB of weights), though with limited headroom for long contexts.

Windows

Go to ollama.com/download
Download OllamaSetup.exe
Run the installer; Ollama then runs in the background with a system tray icon
Open PowerShell or Windows Terminal
Run: ollama run llama4:scout

GPU support: Ollama on Windows detects supported NVIDIA (CUDA) and AMD (ROCm) GPUs automatically, with no manual configuration needed.

WSL2 note: Ollama runs natively on Windows now. WSL2 is no longer required, though WSL2 with GPU passthrough still works.

Linux

# Official install script (fetches latest stable)
curl -fsSL https://ollama.com/install.sh | sh

# Verify
ollama --version

# Check the system service (the installer creates and starts it)
sudo systemctl status ollama

GPU setup on Linux:

# NVIDIA — ensure drivers are installed
nvidia-smi  # Should show your GPU

# AMD — ROCm support
# Ollama detects ROCm automatically if installed
rocm-smi  # Verify ROCm installation

Updating:

# Re-run the install script — it updates in place
curl -fsSL https://ollama.com/install.sh | sh

# Or via package manager if installed that way
sudo snap refresh ollama

Running Your First Model

With Ollama installed, one command downloads and runs a model:

ollama run llama4:scout

What happens:

Ollama checks if llama4:scout is already downloaded
If not, downloads the quantized model file (tens of gigabytes for the Q4 build of Llama 4 Scout, because all 109B parameters are stored; a small model such as llama3.2:3b is about 2GB)
Loads the model into memory (GPU VRAM + system RAM as needed)
Opens an interactive chat session in your terminal

You are now running a state-of-the-art AI model entirely locally.

To exit: Type /bye or press Ctrl+D

The Essential Ollama Commands

# Pull a model without running it (pre-download)
ollama pull llama4:scout

# Run a model interactively
ollama run llama4:scout

# Run with a one-shot prompt (non-interactive)
ollama run llama4:scout "Explain quantum entanglement in plain English"

# List all downloaded models
ollama list

# Show model info (size, parameters, quantization)
ollama show llama4:scout

# Remove a model (frees disk space)
ollama rm llama4:scout

# Check what's currently running
ollama ps

# Pull a specific model size/quantization
ollama pull qwen3.6:27b

# Run with a larger context window (set at the >>> prompt once the session starts)
ollama run llama4:scout
# >>> /set parameter num_ctx 32768

# Start the Ollama server manually (if not auto-starting)
ollama serve

The 2026 Model Library: What to Download First

The Ollama library contains 4,500+ models as of May 2026. Most users need 3–5 models to cover all use cases. Here are the current best choices:

For General Use (Best First Model)

ollama pull llama4:scout

Llama 4 Scout (Meta, April 2025): the current best general-purpose local model. It uses a Mixture-of-Experts (MoE) architecture with 17B parameters active per token out of 109B total. The MoE design reduces the compute needed per token, but all 109B parameters still have to be held in memory, so the Q4 build needs tens of gigabytes rather than ~10GB. Excellent for conversation, writing, analysis, and reasoning.

For Coding

ollama pull kimi-k2.6       # Best overall coding (MIT licensed)
ollama pull qwen3.6:27b     # Best dense coding model, 77.2% SWE-bench
ollama pull devstral:24b    # Best agentic coding

For Reasoning

ollama pull deepseek-r1:7b   # Fast local reasoning (7B)
ollama pull deepseek-r1:14b  # Better reasoning (14B)
ollama pull deepseek-r1:32b  # Best local reasoning (32B)

For Efficiency (Low Hardware Requirements)

ollama pull llama3.2:3b     # Runs on any hardware (3B)
ollama pull gemma3:4b       # Google's efficient model (Gemma 3 ships in 1B, 4B, 12B, 27B sizes)
ollama pull phi3:mini       # Microsoft's tiny but capable model

For Vision (Image Understanding)

ollama pull gemma4:9b       # Vision + tool calling
ollama pull llama3.2-vision:11b  # Llama vision model

Quick Reference: Model Sizes vs Hardware

Model	VRAM Required	Best For
Any 3B model	2–4 GB	Older hardware, basic tasks
Any 7–8B model	6–8 GB	Most consumer GPUs, everyday use
Any 14B model	10–12 GB	RTX 3080/4070, M2 Pro 16GB
Any 27–32B model	18–24 GB	RTX 4090, M3 Max 36GB
Llama 4 Scout (MoE)	60+ GB at Q4	High-memory Macs, multi-GPU setups (all 109B weights are stored)
Kimi K2.6 (MoE)	~20 GB	High-end GPUs
Any 70B model	40–48 GB	M4 Max, multi-GPU setups
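
The figures in the table can be sanity-checked with simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight. The Python sketch below assumes about 4.5 bits per weight for a typical 4-bit quantization (an approximation chosen for this example, not an Ollama constant) and counts weights only; the context cache and runtime overhead come on top, which is why the table lists somewhat higher numbers.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Rough weight sizes for common model classes (assumed ~4.5 bits per weight)
for name, size in [("3B", 3), ("8B", 8), ("14B", 14), ("32B", 32), ("70B", 70)]:
    print(f"{name}: ~{weight_memory_gb(size):.1f} GB of weights")
```

Under these assumptions a 70B model needs about 39GB for weights alone, which lines up with the 40–48 GB row once overhead is added.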

Running without a GPU: Any model that fits in system RAM can run on CPU; it is slow but functional. A 7B model runs at ~5–10 tokens/second on a modern CPU. Fast enough to use; too slow to feel snappy.
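
To measure speed on your own machine rather than rely on estimates, you can use the timing fields that Ollama includes in a completed API response. This Python sketch assumes the eval_count (tokens generated) and eval_duration (nanoseconds) fields; the sample numbers are made up for illustration.

```python
def tokens_per_second(final_response: dict) -> float:
    """Generation speed from the timing fields of a completed response."""
    eval_count = final_response["eval_count"]            # tokens generated
    eval_duration_ns = final_response["eval_duration"]   # nanoseconds spent generating
    return eval_count / (eval_duration_ns / 1e9)


sample = {"eval_count": 120, "eval_duration": 16_000_000_000}  # made-up numbers
print(f"{tokens_per_second(sample):.1f} tokens/second")  # prints 7.5 tokens/second
```

Running ollama run with the --verbose flag reports similar timing statistics in the terminal.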

The Ollama API: Your Local OpenAI Replacement

Ollama serves a REST API on http://localhost:11434. Alongside its native endpoints (such as /api/chat), it exposes OpenAI-compatible endpoints under /v1, so most applications built for OpenAI can point to Ollama instead with minimal changes.

Direct API call:

curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama4:scout",
    "messages": [
      {"role": "user", "content": "Why is the sky blue?"}
    ]
  }'
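
By default the native /api/chat endpoint streams its reply as one JSON object per line, each carrying a fragment of the text, with a final object marked "done": true. The Python sketch below shows one way to stitch those lines back together; join_stream is a name made up for this example, and adding "stream": false to the request body is the simpler alternative when you just want one JSON object.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the assistant text from a streamed /api/chat reply."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines, if any
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final object closes the stream
    return "".join(parts)
```

Because the response object returned by urllib.request.urlopen iterates line by line, it can be passed to join_stream directly.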

Using the OpenAI compatibility endpoint:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama4:scout",
    "messages": [
      {"role": "user", "content": "Why is the sky blue?"}
    ]
  }'

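This means you can use the OpenAI Python library with your local Ollama server: set the base URL to http://localhost:11434/v1, pass any placeholder API key (the library requires one, Ollama ignores it), and use an Ollama model name.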
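
A minimal sketch of that setup in Python, assuming the openai package is installed (pip install openai) and the server is on its default port. The helper names are hypothetical, and "ollama" is just a placeholder key.

```python
def ollama_client_settings(host: str = "http://localhost:11434") -> dict:
    """Keyword arguments for pointing an OpenAI-style client at Ollama."""
    return {
        "base_url": f"{host.rstrip('/')}/v1",
        "api_key": "ollama",  # placeholder; the client requires a value, Ollama ignores it
    }


def ask(prompt: str, model: str = "llama4:scout") -> str:
    """Send one chat message through the OpenAI Python library."""
    from openai import OpenAI  # imported lazily so the helper above works without it

    client = OpenAI(**ollama_client_settings())
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content
```

With a model already pulled, ask("Why is the sky blue?") should return the reply text; swap the model argument for any name shown by ollama list.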

Adding a User Interface: Open WebUI

The command-line interface works, but most users prefer a chat interface. Open WebUI is the best option: it is free, open-source, and runs locally.

Install Open WebUI with Docker:

docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 in your browser. You get a ChatGPT-like interface connected to your local Ollama models.

Without Docker (using pip):

pip install open-webui
open-webui serve

Open WebUI is covered in detail in Open WebUI: The Best Interface for Running Ollama (Post #4 in this series).

Ollama Cloud: When Local Hardware Is Not Enough

Ollama also supports cloud model offloading: the ability to run massive models (671B+) via Ollama’s cloud service while keeping your local Ollama workflow.

# Cloud models require signing in to ollama.com
ollama pull deepseek-v3.1:671b-cloud
ollama pull gpt-oss:120b-cloud
ollama pull kimi-k2:1t-cloud

Cloud models use the same Ollama API — your code does not change. You pay per token for cloud inference instead of buying hardware. This gives you access to frontier-scale models through the same workflow as your local models.

This is not the privacy-first local AI experience — cloud models send data to Ollama’s servers. But it gives you one consistent tool for both local and cloud AI.

Integrating Ollama With Claude Desktop and Codex

Ollama v0.24.0 added native support for the Codex App (OpenAI’s desktop coding agent) and Claude Desktop:

# Launch Codex App with Ollama backend
ollama launch codex-app

# Restore previous session
ollama launch codex-app --restore

Claude Desktop can also connect to Ollama’s local server. This lets you use Claude Desktop’s interface with locally running open models, preserving the Claude UI experience while keeping inference local.

This integration is covered fully in Ollama for Developers: The Complete Local AI Dev Environment (Post #13).

Privacy and Security: What “Local” Actually Means

When running models locally with Ollama:

What stays on your machine:

Your prompts and questions
The model’s responses
Conversation history
Any documents or data you provide

What Ollama.com receives (when you pull models):

The model download request (model name + your IP address)
This is equivalent to a package manager download — not your actual prompts

What Ollama.com does NOT receive:

Any of your actual conversations
Any data you process through local models
Any documents, images, or files you analyze

For maximum privacy: Pull models once on a connected network. After pulling, you can run Ollama completely offline — no network connection required for inference.

Common Setup Mistakes

Mistake 1: Choosing a model too large for your hardware. Trying to run a 70B model on a machine with 16GB RAM may technically work, but only extremely slowly, with the system swapping to disk. Match your model size to your available RAM/VRAM. A well-matched 7B model outperforms a swapping 70B model in practice.

Mistake 2: Not verifying GPU detection. After installing Ollama, run ollama run llama3.2:3b, then run ollama ps in a second terminal while the model is loaded. The PROCESSOR column shows where the model is running; if it reports CPU when you expected GPU, your GPU drivers may need updating.
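
The same check can be scripted against the API. The Python sketch below assumes the /api/ps endpoint and its size and size_vram fields (total bytes and bytes held in GPU memory for each loaded model); the function names are invented for this example.

```python
import json
import urllib.request


def gpu_fraction(model_entry: dict) -> float:
    """Share of a loaded model that sits in GPU memory (0.0 means CPU only)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def loaded_models(base_url: str = "http://localhost:11434") -> dict:
    """Map each currently loaded model name to its GPU share via /api/ps."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        body = json.load(resp)
    return {m["name"]: gpu_fraction(m) for m in body.get("models", [])}
```

A value of 1.0 means the model is fully in GPU memory, 0.0 means it is running on CPU, and anything in between indicates a split across both.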

Mistake 3: Using old model names. The model library updates constantly. llama2 is outdated; use llama3.2 or llama4:scout. The mistral tag refers to the 7B v0.3 model, and newer Mistral models have their own names. Check ollama.com/library for current model names.

Mistake 4: Not setting up Open WebUI or another interface. The terminal is functional but uncomfortable for regular use. Spending 10 minutes setting up Open WebUI transforms the experience. It is worth the setup.

Mistake 5: Ignoring context length settings. Default context lengths vary by model. For document analysis, you often need a longer context, which you can set at the >>> prompt of an interactive session:

ollama run llama4:scout
>>> /set parameter num_ctx 32768

Without setting this, long documents get truncated silently.
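
When you call the API instead of the CLI, the context size can be requested per call through the options.num_ctx field. The Python sketch below only builds the request so you can inspect it; chat_request is a hypothetical helper, and the prompt text is a placeholder.

```python
import json
import urllib.request


def chat_request(model: str, prompt: str, num_ctx: int,
                 base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build (but do not send) an /api/chat request with an explicit context size."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},  # context window in tokens
        "stream": False,                  # one JSON object instead of a stream
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = chat_request("llama4:scout", "Summarize this report: ...", num_ctx=32768)
print(req.full_url, json.loads(req.data)["options"])
```

Passing req to urllib.request.urlopen sends it. Keep in mind that a larger context window uses more memory, so the hardware sizing earlier in this guide still applies.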

Conclusion

Ollama is the most accessible path to private, local AI. Installation takes five minutes. Running your first model takes one command. The Ollama library gives you access to 4,500+ models — including the current best open-source models in the world, running entirely on your hardware.

The experience in May 2026 is qualitatively different from early local AI — models that genuinely handle professional work, not just demonstrate the concept. Llama 4 Scout on a gaming PC is a meaningful AI tool, not a toy.

Your next step: Install Ollama from ollama.com/download. Run ollama run llama4:scout. Have one real conversation with a genuinely capable local model. That experience will tell you whether local AI is worth pursuing for your specific situation.

📚 Continue the Series:

Next → The Local LLM Model Guide: Which Model Should You Run?

Hardware → What Computer Do You Actually Need for Local AI?

Interface → Open WebUI: The Best GUI for Ollama

API → The Ollama API: OpenAI-Compatible Local Server

Last updated: May 2026. Ollama releases updates very frequently — verify the current version at ollama.com and github.com/ollama/ollama/releases. Current version as of this writing: v0.24.0 stable, v0.30.0-rc20 in candidate.

⚠️ Local model performance depends heavily on your hardware. Always verify your GPU is detected by Ollama before drawing conclusions about model quality. A CPU-only run at 3 tokens/second is not representative of the model’s actual capability.

Frequently Asked Questions (FAQ)

Does Ollama work on Windows with a regular laptop (no dedicated GPU)?

Yes. Any model runs on CPU. Performance is slow but functional — expect 3–8 tokens/second on a modern laptop CPU for 7B models. Intel integrated graphics do not accelerate Ollama. For faster performance on Windows without a dedicated GPU, ARM-based Windows devices with NPU support are improving rapidly.

What is the difference between Ollama v0.24 and previous versions?

v0.24 (May 14, 2026) added Codex App support, Claude Desktop integration, and reworked the MLX sampler for Apple Silicon. The architecture now uses llama.cpp directly rather than building on top of GGML, improving compatibility. v0.23 added Flash Attention v2.7 and M4 Metal 3 optimizations.

Can I run multiple models simultaneously?

Yes, if you have enough VRAM/RAM. Ollama loads a model into memory on first use and keeps it loaded for reuse (by default for about five minutes after the last request). Run ollama ps to see which models are currently loaded. When memory runs short, Ollama unloads idle models automatically to make room.
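
How long a model stays loaded can also be controlled per request through the API's keep_alive field. The Python sketch below assumes the /api/generate endpoint, where a request without a prompt simply loads the model; the function names are invented for this example.

```python
import json
import urllib.request


def keep_alive_payload(model: str, keep_alive) -> dict:
    """Body for POST /api/generate that loads, pins, or unloads a model.

    With no prompt, the server only loads the model; keep_alive then sets how
    long it stays in memory ("10m" for ten minutes, -1 to keep it loaded,
    0 to unload it right away).
    """
    return {"model": model, "keep_alive": keep_alive}


def set_keep_alive(model: str, keep_alive,
                   base_url: str = "http://localhost:11434") -> dict:
    """Send the keep-alive request to a running Ollama server."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(keep_alive_payload(model, keep_alive)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)
```

For example, set_keep_alive("llama3.2:3b", 0) asks the server to unload that model immediately, which frees memory for a larger one.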

How is Ollama different from LM Studio?

Both run local models. Ollama is primarily CLI and API-focused — it is designed for developers and integration into other tools. LM Studio has a richer desktop GUI and is more approachable for non-technical users. Many people use both: LM Studio for model discovery and testing, Ollama for API-compatible integration.

Is Ollama free?

Ollama itself is free and open-source (MIT license). Local model inference is free — you own the compute. Cloud model offloading (for very large models) is usage-billed. Open WebUI and most other Ollama-compatible UIs are also free.