Before installing Ollama, the first question most people ask is: “Will my computer be fast enough?” The honest answer depends entirely on what you are trying to do — and “fast enough” means different things for different use cases.
A 3-year-old gaming PC with 12GB VRAM runs Llama 4 Scout smoothly. A MacBook Air M3 with 16GB unified memory handles most professional tasks. A laptop with integrated graphics runs smaller models at usable speeds. Even a computer with no GPU at all can run local AI — just more slowly.
This guide gives you a clear picture of what to expect from whatever hardware you have, what the upgrade paths look like, and how to squeeze maximum performance from your current setup.
🔗 This is Post #3 in the Ollama Unlocked series. See Ollama Masterclass 2026 (Post #1) for installation and The Local LLM Model Guide (Post #2) for model selection guidance that this hardware guide directly informs.
The Core Hardware Principle for Local AI
Local LLM inference is primarily a memory bandwidth problem, not a compute problem. The model weights need to be loaded into fast memory (VRAM or unified memory) and accessed repeatedly for each token generated.
The two key numbers:
- VRAM/RAM size: Determines which models fit in memory
- Memory bandwidth: Determines how fast tokens generate
A GPU with 24GB VRAM but low bandwidth generates tokens more slowly than a GPU with 16GB VRAM and high bandwidth on any model that fits on both. Apple Silicon’s unified memory architecture is particularly strong here because the bandwidth between GPU and memory is exceptionally high.
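The bandwidth-bound view above implies a rough speed ceiling: if every weight has to be read once per generated token, tokens per second cannot exceed memory bandwidth divided by model size. The Python sketch below illustrates that estimate. The function name is illustrative, and it assumes a dense model held entirely in one memory pool, ignoring KV-cache reads and compute overhead, so real throughput will land below it.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on generation speed for a bandwidth-bound dense model.

    Assumes all weights are read once per token (an approximation).
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# RTX 4090 bandwidth (1,008 GB/s) and a ~5GB 7B-class model.
ceiling = max_tokens_per_second(1008, 5.0)
print(f"Theoretical ceiling: {ceiling:.0f} tokens/second")
```

The result is a ceiling, not a prediction; it is most useful for comparing two cards on the same model.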
Hardware Tiers: What to Expect
Tier 1 — Entry Level (CPU Only / Integrated Graphics)
Examples: Any modern laptop without dedicated GPU, older desktops
What runs well:
- Any 3B model:
ollama pull llama3.2:3b
- Phi-3 Mini:
ollama pull phi3:mini
Performance: 2–6 tokens/second on CPU, depending on processor
Practical use: Slow but functional for occasional use. Not suitable for extended work sessions.
Upgrade recommendation: Even a budget dedicated GPU (RTX 3060 12GB, ~$250 used) transforms the experience.
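To make "slow but functional" concrete, it helps to convert tokens per second into waiting time. The sketch below does that arithmetic in Python. The 500-token answer length is an assumption chosen for illustration, not a figure from this guide.

```python
def seconds_for_response(response_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a response of the given length at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return response_tokens / tokens_per_second


# 500 tokens is an assumed "typical" answer length, not a figure from this guide.
ANSWER_TOKENS = 500
for label, rate in [("CPU, slow end", 2), ("CPU, fast end", 6), ("entry GPU, 7B", 20)]:
    print(f"{label}: {seconds_for_response(ANSWER_TOKENS, rate):.0f} s")
```

At CPU speeds the same answer takes minutes rather than seconds, which is why a budget GPU changes the experience so much.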
Tier 2 — Entry Gaming GPU (8–12GB VRAM)
Examples: NVIDIA RTX 3060 (12GB), RTX 3070 (8GB), RTX 4060 (8GB), AMD RX 6700 XT (12GB)
What runs well:
ollama pull llama4:scout # ~10GB — tight but works on 12GB
ollama pull qwen3:8b # ~5GB — comfortable
ollama pull deepseek-r1:7b # ~6GB — comfortable
ollama pull gemma4:9b # ~7GB — comfortable
ollama pull phi4:14b # ~10GB — works on 12GB
Performance: 20–40 tokens/second on 7B models, 10–20 on 13B
Practical use: Excellent for everyday local AI work with 7–13B models. The RTX 3060 12GB is the best value entry point specifically because 12GB VRAM handles significantly more models than 8GB.
Note on 8GB VRAM: 8GB is the current frustration threshold. Llama 4 Scout requires ~10GB, so on an 8GB card you are limited to models of roughly 8B parameters and below unless you use more aggressive quantization. If upgrading, prioritize 12GB+ over 8GB.
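The 8GB versus 12GB difference is easy to see by filtering the list above against a VRAM budget. The following Python sketch uses a few of the approximate sizes quoted in this tier. The 1.5GB headroom for the KV cache and runtime buffers is an assumption, and the function name is illustrative.

```python
# Approximate sizes in GB for a few entries, copied from the Tier 2 list above.
MODEL_SIZES_GB = {
    "llama4:scout": 10,
    "deepseek-r1:7b": 6,
    "gemma4:9b": 7,
    "phi4:14b": 10,
}


def models_that_fit(vram_gb: float, headroom_gb: float = 1.5) -> list[str]:
    """Return models whose weights fit in VRAM with room left for context.

    headroom_gb is an assumed allowance for the KV cache and runtime buffers.
    """
    budget = vram_gb - headroom_gb
    return sorted(name for name, size in MODEL_SIZES_GB.items() if size <= budget)


print("8GB card: ", models_that_fit(8))
print("12GB card:", models_that_fit(12))
```

With these inputs the 8GB card keeps only the smallest entry, while the 12GB card fits every entry in the table.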
Tier 3 — Mid-Range Gaming GPU (16–24GB VRAM)
Examples: RTX 3090 (24GB), RTX 4090 (24GB), RTX 4080 (16GB), AMD RX 7900 XTX (24GB)
What runs well:
ollama pull llama4:scout # Comfortable
ollama pull llama4:maverick # ~20-24GB — works on 24GB
ollama pull qwen3.6:27b # ~18GB — comfortable on 24GB
ollama pull kimi-k2.6 # ~20GB — works on 24GB
ollama pull deepseek-r1:32b # ~20GB — comfortable
ollama pull devstral:24b # ~15GB — comfortable
Performance: 30–60 tokens/second on 7B, 15–30 on 27B, 8–15 on a quantized 70B (which exceeds 24GB and relies on partial CPU offload)
Practical use: This is the sweet spot for professional local AI. A single RTX 4090 runs essentially every practical local model.
The RTX 4090 recommendation: With 24GB VRAM and exceptionally high memory bandwidth (1,008 GB/s), the RTX 4090 is the best single GPU for local AI in 2026. It costs ~$1,800–2,200 new, ~$1,200–1,500 used.
Tier 4 — Apple Silicon (Unified Memory)
Examples: M1/M2/M3/M4 MacBook Pro, Mac Mini, Mac Studio, Mac Pro
Apple Silicon’s unified memory architecture is uniquely suited to local LLMs. The GPU and CPU share the same memory pool, and memory bandwidth is exceptionally high.
Memory and what it runs:
| Chip + RAM | Effective for LLMs | Best Models |
|---|---|---|
| M3 / M4 8GB | Basic use | Up to 7B models |
| M3 / M4 16GB | Good everyday use | Up to 13B comfortably |
| M3 Pro / M4 Pro 24–36GB | Professional use | Llama 4 Scout, 27B models |
| M3 Max / M4 Max 48–64GB | Excellent | 70B models comfortably |
| M2 Ultra / M3 Ultra 96–192GB | Extraordinary | 70B models at 8-bit precision; very large MoE models only when heavily quantized |
Apple Silicon performance:
# M3 Pro 36GB — tested performance
# llama4:scout: ~35-45 tokens/second
# qwen3.6:27b: ~20-28 tokens/second
# deepseek-r1:32b: ~15-20 tokens/second
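Figures like these are easy to reproduce on your own machine, because Ollama's generate endpoint reports timing fields with each completed response. The sketch below is one way to turn them into tokens per second. It assumes a server on the default local address, and the prompt text and function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def tokens_per_second(response: dict) -> float:
    """Generation speed from the timing fields of a non-streaming response.

    eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    """
    return response["eval_count"] / response["eval_duration"] * 1e9


def benchmark(model: str, prompt: str = "Explain memory bandwidth in one paragraph.") -> float:
    """Send one prompt to a running Ollama server and return tokens/second."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))
```

With the server running, `benchmark("llama3.2:3b")` returns a number you can compare against the figures in this guide. Running `ollama run <model> --verbose` prints a similar eval rate without any code.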
Key Apple Silicon advantage: Unified memory means “VRAM” and system RAM are the same pool. A MacBook Pro with 48GB memory can give most of that 48GB to a model (macOS and other apps still need a share), which is far more effective than a PC with 16GB VRAM + 64GB system RAM, where only the 16GB of VRAM runs at GPU speed.
Ollama optimizations for Apple Silicon:
- Ollama uses MLX for hardware-accelerated inference on Apple Silicon
- Flash Attention v2.7 support added in v0.23 for M-series chips
- Metal 3 optimizations included in v0.24
Tier 5 — Multi-GPU / Workstation
Examples: Dual RTX 3090 (48GB total), Dual RTX 4090 (48GB), NVIDIA RTX 6000 Ada (48GB)
Ollama supports multi-GPU setups for distributing model layers across multiple GPUs:
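# Ollama detects and uses all available NVIDIA GPUs automatically
ollama run llama4:maverick # Splits layers across both GPUs if the model needs it
# Verify GPU detection from a second terminal while the model is loaded:
ollama ps # PROCESSOR column shows the GPU/CPU split
nvidia-smi # Memory usage should rise on both cards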
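To confirm that a model really is spread across both cards, you can read per-GPU memory use from nvidia-smi while the model is loaded. The Python sketch below parses nvidia-smi's CSV query output. The sample text is made up for illustration, and `read_gpus` only works on a machine with NVIDIA drivers installed.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,name,memory.total,memory.used",
         "--format=csv,noheader,nounits"]


def parse_gpu_report(csv_text: str) -> list[dict]:
    """Parse nvidia-smi CSV output into one dict per GPU (memory in MiB)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, name, total, used = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name,
                     "total_mib": int(total), "used_mib": int(used)})
    return gpus


def read_gpus() -> list[dict]:
    """Run nvidia-smi and return the parsed report (requires NVIDIA drivers)."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_report(output)


# Sample output in the shape nvidia-smi produces; the values are made up.
sample = "0, NVIDIA GeForce RTX 3090, 24576, 11800\n1, NVIDIA GeForce RTX 3090, 24576, 11650\n"
for gpu in parse_gpu_report(sample):
    print(f"GPU {gpu['index']}: {gpu['used_mib']} / {gpu['total_mib']} MiB used")
```

If only one GPU shows significant memory use after loading a large model, the second card is not being used.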
What this enables:
- 70B models at 4-bit quantization held entirely in VRAM (requires ~40GB; 8-bit needs roughly 70GB and does not fit in 48GB)
- Kimi K2.6 full MoE without quantization
- Multiple models loaded simultaneously
Performance: 40–80 tokens/second on 70B models with dual RTX 4090
Cost: A dual RTX 4090 setup costs $3,500–5,000 in new hardware
NVIDIA vs AMD vs Intel Arc for Local AI
NVIDIA (Best Overall)
- CUDA ecosystem: Most local AI software is built for CUDA first
- Ollama support: Full support, most tested
- Best options: RTX 4090 (24GB, highest performance), RTX 3090 (24GB, better value), RTX 4080 (16GB)
- Verdict: Best choice if buying new hardware specifically for local AI
AMD
- ROCm: AMD’s compute platform — functional but requires more setup than CUDA
- Ollama support: Supported via ROCm on Linux; Windows support improving but not as mature
- Best options: RX 7900 XTX (24GB), RX 7900 XT (20GB)
- Verdict: Viable on Linux; more friction on Windows; good value if you already own AMD hardware
AMD setup on Linux:
# Verify ROCm is installed and sees the GPU
rocminfo
# Load a small model, then confirm it is running on the GPU
ollama run llama3.2:3b
ollama ps # PROCESSOR column should report GPU, not CPU
Intel Arc
- Ollama support is limited as of May 2026
- Not recommended for primary local AI use
- May improve with future Ollama releases
Maximizing Performance on Your Current Hardware
Setting the Right Context Length
Large context windows require more VRAM. If a model runs out of VRAM, it spills to system RAM — dramatically slowing inference.
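# Default context (often 2048 or 4096): fastest
ollama run llama4:scout
# Larger context: slower, uses more VRAM (type this inside the interactive session)
/set parameter num_ctx 16384
# Or set a server-wide default when starting the server
OLLAMA_CONTEXT_LENGTH=16384 ollama serve
# Check the model's supported context length and any parameters it sets
ollama show llama4:scout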
Rule: Use only the context length you need for the task. Long research tasks need large context. Chat sessions do not.
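The reason context costs VRAM is the KV cache, which grows linearly with the number of tokens held in context. The sketch below estimates that size in Python. The architecture numbers (32 layers, 8 KV heads, head size 128) are illustrative values typical of 8B-class models with grouped-query attention, not the configuration of any model named in this guide, and the fp16 cache is an assumption.

```python
def kv_cache_gib(context_tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size for one sequence, in GiB.

    Each token stores one key and one value vector per layer per KV head.
    bytes_per_value=2 assumes an unquantized fp16 cache.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token / 2**30


# Illustrative architecture: 32 layers, 8 KV heads, head size 128.
for ctx in (4096, 16384, 65536):
    print(f"{ctx:>6} tokens: {kv_cache_gib(ctx, 32, 8, 128):.2f} GiB")
```

Under these assumptions, going from 4,096 to 16,384 tokens quadruples the cache, and that memory comes out of the same VRAM the weights occupy.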
Running Multiple Models Efficiently
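Ollama keeps a loaded model in memory for a keep-alive period after its last request (5 minutes by default), or until it is evicted to make room for another model. To manage this: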
# See what's currently loaded
ollama ps
# Unload a model to free VRAM
ollama stop llama4:scout
# Or run a model with a short keep-alive (unloads after 30 seconds idle)
ollama run qwen3:8b --keepalive 30s
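The same memory management is available to scripts through Ollama's HTTP API, where a request's `keep_alive` field controls how long the model stays loaded and a value of 0 unloads it immediately. The sketch below shows one way to wrap that in Python. It assumes a server on the default local address, and the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # Ollama's default local address


def keep_alive_payload(model: str, keep_alive) -> dict:
    """Request body that loads (or keeps) a model with a given keep-alive.

    keep_alive accepts a duration string such as "30s" or "10m"; 0 unloads
    the model immediately.
    """
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def post(path: str, payload: dict) -> dict:
    """POST JSON to the local Ollama server and return the decoded reply."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def unload(model: str) -> dict:
    """Free the memory held by a model right away."""
    return post("/api/generate", keep_alive_payload(model, 0))
```

Calling `unload("llama4:scout")` from a script frees VRAM before a larger job starts, which is handy when several tools share one Ollama server.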
Using CPU Offloading
When a model does not fully fit in VRAM, Ollama automatically offloads some layers to RAM. This is slower but functional.
# Force a specific GPU layer count (tune for your hardware)
# num_gpu is a model parameter; set it inside the interactive session:
ollama run qwen3.6:27b
/set parameter num_gpu 35
# Higher number = more layers on GPU = faster but more VRAM
# Lower number = more layers on CPU = slower but fits limited VRAM
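A reasonable starting value for the GPU layer count can be estimated from the model size and your VRAM. The Python sketch below assumes all layers are the same size, which is only roughly true, and reserves an assumed 2GB for the KV cache and runtime buffers. The 60-layer count in the example is hypothetical.

```python
def gpu_layers_that_fit(model_size_gb: float, total_layers: int,
                        vram_gb: float, reserve_gb: float = 2.0) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers.

    reserve_gb is an assumed allowance for the KV cache and runtime buffers.
    """
    per_layer_gb = model_size_gb / total_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(total_layers, int(usable_gb // per_layer_gb))


# An ~18GB model (the size this guide lists for a 27B model) with a
# hypothetical 60 layers, on a 12GB card.
print(gpu_layers_that_fit(18, 60, 12))
```

Treat the result as a first guess for the layer count, then lower it if the model fails to load or raise it if VRAM is left unused.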
Environment Variables for Performance
# These variables are read by the server process, so set them on ollama serve
# Thread count for CPU inference is a model parameter, not a server variable;
# set it inside an interactive session with: /set parameter num_thread 8
# Flash attention (enabled by default in v0.24)
OLLAMA_FLASH_ATTENTION=1 ollama serve
# Keep models loaded longer (default is 5 minutes)
OLLAMA_KEEP_ALIVE=30m ollama serve
# Restrict Ollama to specific GPUs (all detected GPUs are used by default)
CUDA_VISIBLE_DEVICES=0,1 ollama serve # For dual GPU
What to Buy: 2026 Recommendations
If you are purchasing hardware specifically for local AI in May 2026:
Budget Build (~$300–500 additional hardware)
RTX 3060 12GB (~$250–300 used)
- Runs: Most 7–13B models at good speed
- Best value entry point; 12GB VRAM is significantly more useful than 8GB
- Skip the RTX 3060 8GB variant specifically
Mid-Range Build (~$800–1,200)
RTX 3090 24GB (~$700–900 used)
- Runs: Essentially all practical local models
- 24GB at high bandwidth is the sweet spot for 2026 models
- Better value than RTX 4090 for local AI specifically (similar VRAM, lower cost)
High Performance Build (~$1,800–2,500)
RTX 4090 24GB (~$1,800–2,200 new)
- Runs: Everything at maximum speed
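- Faster than the RTX 3090, most noticeably in prompt processing; token generation gains are smaller because the two cards have similar memory bandwidth (1,008 vs 936 GB/s)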
- Justified if you use local AI professionally all day
Apple Silicon Alternative
MacBook Pro M4 Pro 48GB (~$3,000)
- Runs: 70B models comfortably
- Best laptop option by a significant margin
- Unified memory eliminates the VRAM ceiling problem entirely
Server / Workstation (Multi-GPU)
2× RTX 3090 (~$1,400–1,800 used)
- 48GB combined VRAM (Ollama splits model layers across both cards; NVLink is not required)
- Runs: Full 70B models at good speed, large MoE models
- Best for those running Ollama as a home server for multiple users
Cloud Offloading: When Your Hardware Is Not Enough
Ollama v0.24 added cloud model offloading for models too large for local hardware:
# Requires ollama.com account
ollama pull kimi-k2:1t-cloud # 1T parameter model via cloud
ollama pull deepseek-v3.1:671b-cloud
This uses your local Ollama workflow but runs inference on Ollama’s servers for oversized models. You get the consistent API experience without the hardware cost — but lose the privacy benefit. Billed per token.
Common Hardware Mistakes
Mistake 1: Buying 8GB VRAM in 2026
8GB was the minimum viable VRAM threshold in 2024. In 2026, with Llama 4 Scout requiring ~10GB and most capable models needing more, 8GB is now the frustration tier. If buying new hardware, 12GB is the new minimum.
Mistake 2: Prioritizing clock speed over memory size
A GPU with higher core clocks but 8GB VRAM will be slower for local LLMs than a GPU with lower clocks but 16–24GB VRAM. Memory size determines which models run; bandwidth determines speed.
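The link between memory size and which models run comes down to simple arithmetic: weight storage is roughly parameter count times bits per weight. The sketch below works that out in Python. It ignores the KV cache and runtime overhead, and it treats every weight as stored at the same precision, which real quantization formats only approximate.

```python
def weights_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of model weights in GB (10^9 bytes).

    Ignores the KV cache and runtime overhead, and treats every weight as
    stored at the same precision.
    """
    return params_billions * bits_per_weight / 8


for name, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"7B at {name}: ~{weights_size_gb(7, bits):.1f} GB")
```

By this estimate a 7B model needs about 14GB at 16-bit but about 3.5GB at 4-bit, which is why VRAM size and quantization level, not clock speed, decide what a card can load.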
Mistake 3: Assuming Apple Silicon is not competitive
Many Windows users dismiss Mac for AI work. An M4 Pro MacBook Pro with 48GB is competitive with a desktop RTX 4090 for local AI and significantly outperforms it on models too large to fit in the 4090’s 24GB of VRAM.
Mistake 4: Not accounting for total system RAM
When model layers spill from VRAM to system RAM, you need fast system RAM. 32GB system RAM is the minimum for comfortable use; 64GB is better if you run large models with CPU offloading.
Conclusion
The hardware you need for useful local AI in 2026 is more accessible than most people assume. Any gaming PC bought in the last three years handles 7B–13B models at usable speeds. An RTX 3090 or 4090 handles everything practical. Apple Silicon with 24GB+ unified memory is the best laptop option.
The honest minimum for professional use: 12GB VRAM GPU or Apple Silicon M-series with 16GB+ unified memory. Below that, you are working with smaller models that are capable but limited.
Your next step: Check your current GPU VRAM. Run nvidia-smi (NVIDIA) or open “About This Mac” and read the Memory line, which shows your unified memory. Then use the tier guide above to see which models will run smoothly. Hardware knowledge is the foundation; everything else in this series builds on knowing your actual constraints.
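Once you know your VRAM figure, mapping it to a tier is a threshold lookup. The Python sketch below covers dedicated NVIDIA or AMD cards only; Apple Silicon follows the unified memory table instead. The tiers in this guide leave gaps (for example between 12GB and 16GB), so assigning in-between values to the lower tier is an assumption of this sketch.

```python
def gpu_tier(vram_gb: float) -> str:
    """Map dedicated VRAM to the tiers used in this guide.

    In-between values fall into the lower tier, which is an assumption.
    """
    if vram_gb >= 48:
        return "Tier 5: multi-GPU / workstation"
    if vram_gb >= 16:
        return "Tier 3: mid-range gaming GPU"
    if vram_gb >= 8:
        return "Tier 2: entry gaming GPU"
    return "Tier 1: entry level"


print(gpu_tier(12))
```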
📚 Continue the Series:
- Previous: The Local LLM Model Guide 2026
- Next: Open WebUI: The Best Interface for Running Ollama
- For specific models: Llama 4 Scout: Meta’s MoE Model That Runs on a Gaming PC
Last updated: May 2026. GPU pricing and availability change frequently. NVIDIA Blackwell (RTX 50 series) is shipping in volume as of mid-2026, so check current pricing before purchasing. Verify Ollama GPU support status at github.com/ollama/ollama.