A 31-billion-parameter model that scores 89.2% on AIME 2026, a competition mathematics benchmark of the kind where GPT-4-class models historically struggled, and that ranks in the top three globally on the Arena AI leaderboard, above models four times its size. That is the headline for Gemma 4, and it is a verified benchmark result, not marketing language.
Google DeepMind released Gemma 4 on April 2, 2026 under Apache 2.0 — one of the most permissive open-source licenses available. No usage restrictions, no commercial limitations, no royalties. The weights are yours to download, run, fine-tune, and deploy in any application for any purpose.
The family spans four model sizes: E2B and E4B designed for phones and edge devices, a 26B Mixture-of-Experts variant optimized for high-throughput inference on a single GPU, and a 31B dense model that sits at the frontier of what runs on consumer hardware. Every model in the family supports vision input, configurable thinking modes, native function calling, and context windows between 128K and 256K tokens.
Since the first Gemma models launched in early 2024, developers have downloaded them over 500 million times and created more than 100,000 custom variants. Gemma 4 is the release most likely to accelerate that ecosystem.
🔗 This is count #148 in the inkeybit blog series. For running Gemma 4 via Ollama, see Ollama Masterclass 2026 (count 128). For vision model workflows, see Vision Models Locally (count 135). For tool calling integration, see The Ollama API (count 136).
The Four Model Variants
Gemma 4’s model family spans three distinct architectures tailored for specific hardware requirements: Small Sizes (E2B and E4B) built for ultra-mobile, edge, and browser deployment; a Dense 31B parameter model that bridges server-grade performance and local execution; and a Mixture-of-Experts 26B model designed for high-throughput advanced reasoning.
E2B — The Edge and Mobile Model
ollama pull gemma4:e2b
- Parameters: ~2 billion effective
- Architecture: Dense, optimized for edge hardware
- VRAM: 4GB or less — runs on integrated graphics
- Context: 128K tokens
- Multimodal: Text, Image, Audio (audio natively supported on small models)
- Best for: On-device deployment, mobile apps, browser-based AI, embedded applications
- Performance: AIME 2026 score of 42.5% — more than double what Gemma 3 27B achieved
The E2B is not a toy model. 42.5% on AIME 2026 from a 2B model is genuinely remarkable — it demonstrates that Gemma 4’s architectural improvements lift all variants, not just the large ones.
E4B — The Consumer Laptop Model
ollama pull gemma4:e4b
- Parameters: ~4 billion effective
- Architecture: Dense, efficiency-focused
- VRAM: 6–8GB — any recent laptop GPU
- Context: 128K tokens
- Multimodal: Text, Image, Audio
- Best for: Daily driver on consumer laptops, fast local assistant, mobile devices with dedicated GPU
The E4B is the model most consumer laptop users should start with. It handles the majority of everyday tasks at speeds that feel interactive, and it runs on hardware that was not purchased with AI in mind.
26B A4B — The MoE Efficiency Champion
ollama pull gemma4:27b # Ollama uses 27b as the tag for the 26B A4B
- Parameters: 26B total, about 3.8B active per token (the "A4B" in the name)
- Architecture: Mixture-of-Experts — 8 experts, 4 active per token
- VRAM: ~18–20GB at Q4 quantization
- Context: 256K tokens
- Multimodal: Text, Image (variable aspect ratio and resolution)
- AIME 2026: 88.3% — with only 3.8B active parameters per token
- LiveCodeBench v6: 77.1%
- Best for: High-quality work on a mid-range GPU, long-context tasks, reasoning
This is the efficiency story of Gemma 4: the MoE architecture delivers near-31B quality (88.3% vs. 89.2% on AIME 2026) while activating only about 3.8B parameters per token, a fraction of the compute of the dense model.
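To make the "active parameters" idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It assumes 8 experts with 4 selected per token, matching the figures quoted above. The router scores, the `route_token` and `moe_forward` helpers, and the toy experts are all hypothetical illustrations, not Gemma 4's actual implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, k=4):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_scores)),
                 key=lambda i: router_scores[i], reverse=True)[:k]
    weights = softmax([router_scores[i] for i in top])
    return list(zip(top, weights))

def moe_forward(x, experts, router_scores, k=4):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in route_token(router_scores, k))

# Toy experts: expert i multiplies its input by (i + 1)
experts = [lambda x, s=i + 1: x * s for i in range(8)]
# Hypothetical router scores for a single token
scores = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 3.0]

selected = route_token(scores)
print("experts used:", [i for i, _ in selected])
print("layer output:", moe_forward(1.0, experts, scores))
```

Only the four selected experts do any work for this token, which is why per-token compute tracks the active parameter count rather than the total.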
31B — The Dense Frontier Model
ollama pull gemma4:31b
- Parameters: 31B (dense — all parameters active per token)
- Architecture: Dense transformer
- VRAM: ~20–24GB at Q4 quantization
- Context: 256K tokens
- Multimodal: Text, Image
- AIME 2026: 89.2%
- Arena AI (text): 1452, top three globally
- LiveCodeBench v6: 80.0%
- Best for: Maximum quality on consumer hardware, complex reasoning, frontier-level performance
The 31B is the quality ceiling of the family. Its Arena AI text score of 1452 places it in the top three globally, alongside 89.2% on AIME 2026 and 80.0% on LiveCodeBench v6.
The Benchmark Story: What the Numbers Actually Mean
AIME 2026 Performance — The Most Significant Signal
AIME 2026 is the clearest signal: Gemma 3 27B scored 20.8%. Gemma 4 31B scores 89.2%. The 26B MoE reaches 88.3% with only 3.8B active parameters. Even the tiny E4B hits 42.5%, more than double what the previous full-size model could do.
This is not incremental improvement. Going from 20.8% to 89.2% on competition-level mathematics represents a qualitative leap in reasoning capability — the kind of improvement that changes what the model is actually useful for, not just how it ranks on leaderboards.
Benchmark Comparison Table
| Model | AIME 2026 | LiveCodeBench v6 | Arena AI Text | VRAM |
|---|---|---|---|---|
| Gemma 3 27B (previous) | 20.8% | 29.1% | 1365 | 18GB |
| Gemma 4 E4B | 42.5% | — | — | 6–8GB |
| Gemma 4 26B MoE | 88.3% | 77.1% | 1441 | 18–20GB |
| Gemma 4 31B Dense | 89.2% | 80.0% | 1452 | 20–24GB |
| Llama 4 Scout (reference) | ~65% | ~60% | ~1380 | ~10GB |
| Qwen 3.6 27B (reference) | ~75% | 77.2% | ~1400 | 18GB |
The benchmark table, based on Google’s official model card, shows the scale of improvement from Gemma 3 27B to the Gemma 4 models. Values marked with ~ are approximate reference points.
What These Benchmarks Mean in Practice
AIME 2026 (89.2%): Competition mathematics requiring multi-step problem solving across algebra, geometry, number theory, and combinatorics. Each answer is a single integer rather than a written proof, so a correct score requires the whole chain of reasoning to hold. This level of performance means Gemma 4 31B handles competition-level problems that most people would struggle with. It is a signal of genuine deep reasoning capability, not a benchmark curiosity.
LiveCodeBench v6 (80.0%): Competition-style programming problems collected from sites such as LeetCode, AtCoder, and Codeforces, refreshed over time to limit training-data contamination. An 80% pass rate is frontier-level coding performance. Because the benchmark measures algorithmic problem solving rather than work inside production codebases, treat it as a strong signal of coding ability and still test the model on your own repository.
Arena AI 1452 (top-3 globally): Arena AI ranks models by human preference across diverse tasks: writing, reasoning, coding, and instruction following. Gemma 4 31B’s top-three placement means users rated its outputs above those of models nearly four times its size.
What’s Architecturally New
Gemma 4 introduces key capability and architectural advancements:
- Reasoning: all models are designed as highly capable reasoners with configurable thinking modes.
- Extended multimodalities: processes text and image with variable aspect ratio and resolution support across all models, with video and audio featured natively on E2B and E4B models.
- Diverse and efficient architectures: offers Dense and MoE variants of different sizes for scalable deployment.
Multi-Token Prediction with Speculative Decoding
All Gemma 4 models — E2B, E4B, 31B, and 26B A4B — include a dedicated draft model for speculative decoding, enabling significantly faster inference with no quality loss.
Speculative decoding is a technique where a small draft model proposes multiple tokens at once, and the main model verifies them in parallel. This produces the same output as standard generation but at 1.5–2× the speed on hardware that can run both models simultaneously. Gemma 4 bakes this in at the architecture level rather than requiring external setup.
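The propose-then-verify loop described above can be sketched with two toy "models". This is a simplified greedy variant under stated assumptions: `target_next` and `draft_next` are hypothetical stand-ins for the main and draft models, verification is shown as a loop rather than one batched forward pass, and production systems typically use a probabilistic acceptance rule instead of exact matching.

```python
def target_next(ctx):
    """Toy 'main model': next token is the sum of the context modulo 7."""
    return sum(ctx) % 7

def draft_next(ctx):
    """Toy 'draft model': agrees with the target except at context lengths divisible by 4."""
    t = sum(ctx) % 7
    return (t + 1) % 7 if len(ctx) % 4 == 0 else t

def greedy_decode(prompt, n_new):
    """Baseline: one target-model call per generated token."""
    ctx = list(prompt)
    for _ in range(n_new):
        ctx.append(target_next(ctx))
    return ctx[len(prompt):]

def speculative_decode(prompt, n_new, k=4):
    """Draft proposes k tokens; target keeps the matching prefix and fixes the first miss."""
    ctx = list(prompt)
    verification_rounds = 0
    while len(ctx) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens autoregressively
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(ctx + proposal))
        # 2. The target checks each proposed position (a single batched pass in practice)
        verification_rounds += 1
        accepted = []
        for tok in proposal:
            expected = target_next(ctx + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # substitute the target's own token and stop
                break
        ctx.extend(accepted)
    return ctx[len(prompt):len(prompt) + n_new], verification_rounds

baseline = greedy_decode([1, 2, 3], 12)
tokens, rounds = speculative_decode([1, 2, 3], 12)
print("same output as baseline:", tokens == baseline)
print("target verification rounds:", rounds, "vs", len(baseline), "baseline calls")
```

Every accepted token is one the target would have chosen anyway, so the output matches plain decoding. The speedup comes from needing fewer sequential passes through the large model.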
Configurable Thinking Modes
Every Gemma 4 model supports three thinking modes, selectable at inference time:
# Thinking mode 1: No thinking (fastest, lowest quality on hard tasks)
# Thinking mode 2: Thinking (balanced — default for most tasks)
# Thinking mode 3: Extended thinking (slowest, highest quality on hard problems)
This is similar in spirit to OpenAI’s effort controls on GPT-5.5: you choose how much reasoning compute to apply based on task difficulty. Simple questions get instant answers; complex proofs get extended thinking.
Native Function Calling (Tool Use)
Enhanced coding and agentic capabilities: Gemma 4 achieves notable improvements in coding benchmarks alongside built-in function-calling support, powering highly capable autonomous agents.
Function calling in Gemma 4 is native to the architecture — trained in, not bolted on. This produces more reliable structured JSON output for tool invocations than models where tool use was added via fine-tuning after initial training.
Native System Prompt Support
Gemma 4 introduces built-in support for the system role, enabling more structured and controllable conversations.
Previous Gemma versions required workarounds for system prompts. Gemma 4 has a proper system role — meaning Modelfiles, Continue.dev system prompts, and API system messages all work as expected without template hacks.
256K Context Window (Medium Models)
The small models feature a 128K context window, while the medium models support 256K.
256K tokens is approximately 190,000 words — or roughly 350 pages of dense text. For practical use: an entire research paper collection, a large codebase, or a long document history fits in a single context.
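The word and page estimates above follow from simple arithmetic. The sketch below assumes roughly 0.75 English words per token and about 550 words per dense page. Both ratios are rules of thumb chosen for illustration, not properties of Gemma 4's tokenizer, and the `context_capacity` helper is hypothetical.

```python
WORDS_PER_TOKEN = 0.75  # assumed average for English prose
WORDS_PER_PAGE = 550    # assumed word count of a dense page

def context_capacity(tokens: int) -> tuple[int, int]:
    """Convert a token budget into approximate (words, pages)."""
    words = tokens * WORDS_PER_TOKEN
    pages = words / WORDS_PER_PAGE
    return round(words), round(pages)

for label, tokens in [("128K", 128 * 1024), ("256K", 256 * 1024)]:
    words, pages = context_capacity(tokens)
    print(f"{label} context: ~{words:,} words, ~{pages} pages")
```

Code and non-English text usually tokenize less efficiently than English prose, so measure actual token counts for anything near the limit.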
Ollama Installation and Setup
Pull Your Preferred Variant
# The right model for your hardware:
# 4–8GB VRAM (laptops, older GPUs)
ollama pull gemma4:e4b
# 18–24GB VRAM (RTX 3090, RTX 4090, M3 Pro+)
ollama pull gemma4:27b # The 26B MoE — best balance of quality and hardware
# 20–24GB VRAM with maximum quality
ollama pull gemma4:31b
# Minimum hardware (2B model for any machine)
ollama pull gemma4:e2b
Verify and Check Model Info
# Confirm the pull completed
ollama list | grep gemma4
# Detailed model info
ollama show gemma4:27b
# Quick sanity test
ollama run gemma4:27b "Briefly explain what makes MoE architectures efficient"
Optimal Runtime Parameters
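# ollama run has no command-line flags for num_ctx, temperature, or repeat_penalty.
# Set them inside the interactive session with /set parameter (or via a Modelfile or API options).
ollama run gemma4:27b
# 64K context (adjust for your VRAM)
>>> /set parameter num_ctx 65536
# Good general-use temperature
>>> /set parameter temperature 0.7
# Reduce repetition in long outputs
>>> /set parameter repeat_penalty 1.1
# For reasoning/math tasks: lower temperature
ollama run gemma4:31b
>>> /set parameter num_ctx 32768
# Deterministic for math/logic
>>> /set parameter temperature 0.0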
Vision Workflows: Gemma 4’s Standout Feature
Vision support in Gemma 4 is native — not a separate vision adapter, but a unified multimodal architecture. Every model variant processes images, with audio additionally supported on E2B and E4B.
Image Analysis via CLI
# Analyze an image from the command line
ollama run gemma4:27b "What is shown in this image? Describe in detail." /path/to/image.jpg
# Document analysis
ollama run gemma4:27b "Extract all text and data from this document image. Preserve structure." /path/to/document_scan.png
# Code screenshot debugging
ollama run gemma4:27b "What error is shown and how do I fix it?" /path/to/error_screenshot.png
Vision API Integration
import ollama
import base64
from pathlib import Path
def gemma4_vision(image_path: str, prompt: str,
                  model: str = "gemma4:27b") -> str:
    """Analyze an image with Gemma 4."""
    image_data = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = ollama.chat(
        model=model,
        messages=[{
            "role": "user",
            "content": prompt,
            "images": [image_data]
        }],
        options={
            "temperature": 0.1,
            "num_ctx": 16384
        }
    )
    return response["message"]["content"]
# Practical use cases
# 1. Diagram analysis
result = gemma4_vision(
"architecture_diagram.png",
"Analyze this system architecture diagram. Identify: "
"components, data flows, potential bottlenecks, and single points of failure."
)
# 2. Form data extraction
result = gemma4_vision(
"invoice_scan.jpg",
"""Extract all data from this invoice as JSON:
{
"invoice_number": "",
"date": "",
"vendor": "",
"line_items": [{"description": "", "quantity": 0, "unit_price": 0, "total": 0}],
"subtotal": 0,
"tax": 0,
"total": 0,
"payment_terms": ""
}
Return only valid JSON."""
)
# 3. UI/UX review
result = gemma4_vision(
"app_screenshot.png",
"Review this UI screenshot. Identify: "
"accessibility issues (contrast, labels), usability problems, "
"visual hierarchy issues, and mobile responsiveness concerns."
)
# 4. Chart interpretation
result = gemma4_vision(
"quarterly_chart.png",
"Analyze this chart. State: chart type, what is measured, "
"time period, key trend, highest and lowest values, "
"and the most important insight for a business decision."
)
Variable Resolution Support
Unlike earlier vision models that resize all images to a fixed resolution, Gemma 4 processes images with variable aspect ratio and resolution support. This means:
- Tall documents (receipts, screenshots) are not squashed into square crops
- Wide panoramic images retain their spatial relationships
- Text-heavy images are processed at resolutions where the text remains readable
- Charts with fine detail are not downsampled into unreadable thumbnails
For practical document processing, this is a significant quality improvement over previous local vision models.
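To see why preserving aspect ratio matters, the sketch below contrasts a fixed square resize with a resize that keeps the original proportions under a patch budget. The 16-pixel patch size, the 1,024-patch budget, and the `fit_to_patch_budget` helper are assumptions made for illustration; Gemma 4's actual preprocessing may differ.

```python
import math

def fit_to_patch_budget(width, height, patch=16, max_patches=1024):
    """Scale an image down, keeping its aspect ratio, so its patch grid fits the budget."""
    cols, rows = math.ceil(width / patch), math.ceil(height / patch)
    if cols * rows <= max_patches:
        return cols * patch, rows * patch
    # Shrink both grid dimensions by the same factor
    scale = math.sqrt(max_patches / (cols * rows))
    cols = max(1, math.floor(cols * scale))
    rows = max(1, math.floor(rows * scale))
    return cols * patch, rows * patch

# A tall receipt scan, 600 x 2400 pixels (aspect ratio 1:4)
w, h = fit_to_patch_budget(600, 2400)
print(f"aspect-preserving resize: {w} x {h} (ratio 1:{h / w:.2f})")
print("fixed square resize:      512 x 512 (ratio 1:1.00, text squashed 4x vertically)")
```

Both options spend a similar number of patches, but the aspect-preserving version keeps characters at their natural proportions, which is what makes tall or wide documents readable.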
Multi-Image Conversations
import ollama
import base64
from pathlib import Path
def compare_images(img1_path: str, img2_path: str, question: str) -> str:
    """Compare two images with Gemma 4."""
    img1 = base64.b64encode(Path(img1_path).read_bytes()).decode()
    img2 = base64.b64encode(Path(img2_path).read_bytes()).decode()
    response = ollama.chat(
        model="gemma4:27b",
        messages=[{
            "role": "user",
            "content": question,
            "images": [img1, img2]
        }]
    )
    return response["message"]["content"]
# A/B design comparison
result = compare_images(
"design_v1.png",
"design_v2.png",
"Compare these two UI designs. Which has better visual hierarchy, "
"readability, and user experience? Be specific about what changed and why it matters."
)
# Before/after code review via screenshots
result = compare_images(
"before_refactor.png",
"after_refactor.png",
"A developer refactored this code. "
"Identify what changed and whether the changes improved quality."
)
Tool Calling (Function Calling)
Gemma 4’s native function calling produces structured JSON tool invocations reliably — the architecture was trained with tool use as a first-class capability.
Tool Calling via Ollama API
import ollama
import json
import requests
from datetime import datetime
# Define tools as JSON Schema
tools = [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "City name, e.g. 'London'"
},
"units": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature units"
}
},
"required": ["city"]
}
}
},
{
"type": "function",
"function": {
"name": "search_knowledge_base",
"description": "Search the company knowledge base for information",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "Search query"
},
"max_results": {
"type": "integer",
"description": "Maximum number of results to return",
"default": 5
}
},
"required": ["query"]
}
}
},
{
"type": "function",
"function": {
"name": "calculate",
"description": "Perform mathematical calculations",
"parameters": {
"type": "object",
"properties": {
"expression": {
"type": "string",
"description": "Mathematical expression to evaluate"
}
},
"required": ["expression"]
}
}
}
]
# Tool implementations
def get_current_weather(city: str, units: str = "celsius") -> str:
    """Look up current conditions for a city via wttr.in."""
    try:
        resp = requests.get(f"https://wttr.in/{city}?format=j1", timeout=5)
        resp.raise_for_status()
        current = resp.json()["current_condition"][0]
        # wttr.in returns temperatures as strings; convert so both units yield numbers
        temp = float(current["temp_C"])
        if units == "fahrenheit":
            temp = temp * 9 / 5 + 32
        desc = current["weatherDesc"][0]["value"]
        return json.dumps({"city": city, "temperature": temp,
                           "units": units, "conditions": desc})
    except Exception as e:
        return json.dumps({"error": str(e)})

def search_knowledge_base(query: str, max_results: int = 5) -> str:
    # Stub: replace with your actual KB search
    return json.dumps({"results": [
        {"title": f"Article about {query}", "snippet": "...relevant content..."}
    ]})

def calculate(expression: str) -> str:
    """Evaluate an arithmetic expression by walking its AST instead of calling eval()."""
    import ast, operator
    ops = {
        ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg
    }
    def eval_node(node):
        # Accept numeric literals only; strings and other constants are rejected
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        elif isinstance(node, ast.BinOp):
            return ops[type(node.op)](eval_node(node.left), eval_node(node.right))
        elif isinstance(node, ast.UnaryOp):
            return ops[type(node.op)](eval_node(node.operand))
        raise ValueError(f"Unsupported: {type(node)}")
    try:
        result = eval_node(ast.parse(expression, mode='eval').body)
        return json.dumps({"expression": expression, "result": result})
    except Exception as e:
        return json.dumps({"error": str(e)})

tool_functions = {
    "get_current_weather": get_current_weather,
    "search_knowledge_base": search_knowledge_base,
    "calculate": calculate
}
def run_gemma4_with_tools(user_message: str, max_iterations: int = 6) -> str:
    """Run Gemma 4 with tool calling in an agent loop."""
    messages = [{"role": "user", "content": user_message}]
    for iteration in range(max_iterations):
        response = ollama.chat(
            model="gemma4:27b",
            messages=messages,
            tools=tools,
            options={"temperature": 0.0}
        )
        message = response["message"]
        messages.append(message)
        # No tool calls: we have a final answer
        if not message.get("tool_calls"):
            return message["content"]
        # Execute each tool call
        for tool_call in message["tool_calls"]:
            fn_name = tool_call["function"]["name"]
            fn_args = tool_call["function"]["arguments"]
            if isinstance(fn_args, str):
                fn_args = json.loads(fn_args)
            if fn_name in tool_functions:
                result = tool_functions[fn_name](**fn_args)
            else:
                result = json.dumps({"error": f"Tool '{fn_name}' not found"})
            messages.append({
                "role": "tool",
                "content": result
            })
    return "Max iterations reached"
# Test the tool-calling agent
print(run_gemma4_with_tools(
"What's the weather in Tokyo and London right now? "
"Also, if one city is 5°C warmer than the other, what is the temperature difference in Fahrenheit?"
))
print(run_gemma4_with_tools(
"What is 15% of 847, and then what is that result squared?"
))
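The agent loop above passes the model's arguments straight into the tool function with `**fn_args`, so a malformed call raises a `TypeError`. One possible safeguard is to check the arguments against the tool's declared `parameters` before dispatching. The `validate_tool_args` helper below is a hypothetical sketch that covers only required keys, unknown keys, basic types, and enums, not full JSON Schema validation.

```python
def validate_tool_args(parameters: dict, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call can be dispatched."""
    props = parameters.get("properties", {})
    type_map = {"string": str, "integer": int, "number": (int, float), "boolean": bool}
    problems = [f"missing required argument '{name}'"
                for name in parameters.get("required", []) if name not in args]
    for name, value in args.items():
        if name not in props:
            problems.append(f"unexpected argument '{name}'")
            continue
        expected = type_map.get(props[name].get("type"))
        if expected is not None and not isinstance(value, expected):
            problems.append(f"argument '{name}' should be of type {props[name]['type']}")
        allowed = props[name].get("enum")
        if allowed is not None and value not in allowed:
            problems.append(f"argument '{name}' must be one of {allowed}")
    return problems

# Same shape as the "parameters" entry of get_current_weather above
weather_parameters = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
}

good = validate_tool_args(weather_parameters, {"city": "Tokyo", "units": "celsius"})
bad = validate_tool_args(weather_parameters, {"units": "kelvin", "country": "JP"})
print(good)
print(bad)
```

Returning the problem list to the model as the tool result, rather than raising, gives it a chance to correct the call on the next iteration.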
Thinking Mode: Using Configurable Reasoning
Gemma 4’s configurable thinking mode controls how much reasoning the model applies before answering. With Ollama, one way to approximate the three levels is through the system prompt, the temperature, and the output token budget:
import ollama
def gemma4_think(problem: str, model: str = "gemma4:31b",
                 thinking_depth: str = "extended") -> str:
    """
    Use Gemma 4's thinking mode for hard problems.
    thinking_depth: "none" | "standard" | "extended"
    """
    if thinking_depth == "extended":
        system = """Think carefully and systematically before answering.
Work through the problem step by step.
Check your reasoning before providing the final answer.
Show your work clearly."""
        temperature = 0.0
        max_tokens = 4000
    elif thinking_depth == "standard":
        system = "Think through this carefully before answering."
        temperature = 0.1
        max_tokens = 2000
    else:  # none
        system = "Answer directly and concisely."
        temperature = 0.3
        max_tokens = 800
    response = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": problem}
        ],
        options={
            "temperature": temperature,
            "num_predict": max_tokens,
            "num_ctx": 16384
        }
    )
    return response["message"]["content"]
# Competition mathematics
print(gemma4_think(
"Find all integer solutions to: x² + y² = z² where x, y, z are positive "
"integers and x < y < z < 20. How many solutions exist?",
thinking_depth="extended"
))
# Complex business logic
print(gemma4_think(
"A company has 3 products. Product A costs $45 to make and sells for $89. "
"Product B costs $12 to make and sells for $29. Product C costs $78 to make "
"and sells for $149. Fixed costs are $50,000/month. They can make a combined "
"maximum of 2,000 units/month. What product mix maximizes profit assuming "
"current demand allows any distribution?",
thinking_depth="extended"
))
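When using prompts like these to evaluate a model, it helps to know the expected answers in advance. The following sketch computes reference answers for the two prompts above by brute force. The variable names are illustrative, and the second part relies on the prompt's assumption that demand allows any distribution, so the only binding limit is the 2,000-unit capacity.

```python
from itertools import combinations

# Prompt 1: positive integers x < y < z < 20 with x^2 + y^2 = z^2.
# combinations() yields each tuple in increasing order, so x < y < z holds.
triples = [(x, y, z) for x, y, z in combinations(range(1, 20), 3)
           if x * x + y * y == z * z]
print(len(triples), "solutions:", triples)

# Prompt 2: with one shared capacity limit and unconstrained demand,
# profit is maximized by giving every unit to the highest-margin product.
products = {"A": (45, 89), "B": (12, 29), "C": (78, 149)}  # (unit cost, unit price)
margins = {name: price - cost for name, (cost, price) in products.items()}
best = max(margins, key=margins.get)
monthly_profit = margins[best] * 2000 - 50_000
print(f"best mix: 2,000 units of {best}, monthly profit ${monthly_profit:,}")
```

A model answer that disagrees with these values is wrong regardless of how convincing its reasoning looks.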
Gemma 4 for Agentic Workflows
Gemma 4 achieves notable improvements in coding benchmarks alongside built-in function-calling support, powering highly capable autonomous agents.
The combination of reliable tool calling, strong reasoning, and native system prompt support makes Gemma 4 well-suited for agentic tasks that require the model to plan, use tools, and complete multi-step objectives.
LangChain Integration
from langchain_ollama import ChatOllama
from langchain_core.tools import tool
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate
# Gemma 4 via LangChain
llm = ChatOllama(
model="gemma4:27b",
temperature=0.0,
num_ctx=32768
)
@tool
def web_search(query: str) -> str:
    """Search the web for current information."""
    from duckduckgo_search import DDGS
    with DDGS() as ddgs:
        results = list(ddgs.text(query, max_results=3))
    return "\n".join([f"{r['title']}: {r['body']}" for r in results])

@tool
def read_local_file(filepath: str) -> str:
    """Read a local file."""
    from pathlib import Path
    try:
        return Path(filepath).read_text(encoding="utf-8")[:5000]
    except Exception as e:
        return f"Error: {e}"
tools = [web_search, read_local_file]
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant with access to web search and local files."),
("human", "{input}"),
("placeholder", "{agent_scratchpad}")
])
agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
result = executor.invoke({
"input": "Research the latest developments in fusion energy from 2026 "
"and summarize the three most significant recent achievements."
})
print(result["output"])
Modelfile: Custom Gemma 4 Assistants
# Professional Research Assistant
cat > Gemma4Research.Modelfile << 'EOF'
FROM gemma4:27b
SYSTEM """You are a research analyst powered by Gemma 4.
ANALYTICAL FRAMEWORK:
For any research question:
1. State your key finding first
2. Provide 3-5 supporting points with evidence
3. Note the strongest counterargument
4. Conclude with a specific actionable insight
INTELLECTUAL STANDARDS:
- Distinguish facts from consensus from your analysis
- State confidence levels explicitly when uncertain
- Never present contested claims as settled
- Note when information may be outdated
THINKING: Apply extended reasoning for complex multi-step problems.
TONE: Think-tank analyst briefing a senior executive."""
PARAMETER temperature 0.2
PARAMETER num_ctx 65536
PARAMETER num_predict 3000
EOF
ollama create Gemma4Research -f Gemma4Research.Modelfile
# Gemma 4 Code Reviewer (uses strong coding benchmarks)
cat > Gemma4Coder.Modelfile << 'EOF'
FROM gemma4:27b
SYSTEM """You are a code review specialist using Gemma 4's advanced coding capabilities.
REVIEW FRAMEWORK:
Grade every issue:
[Critical] — Security vulnerability, data corruption, auth bypass
[High] — Performance under load, missing critical error handling
[Medium] — Maintainability, suboptimal patterns, missing tests
[Low] — Minor improvements
For each issue: location, problem, why it matters, specific fix.
Always identify 2-3 strengths alongside issues.
Overall rating: 1-10 with justification.
THINKING: Use systematic reasoning to trace execution paths and identify subtle bugs."""
PARAMETER temperature 0.0
PARAMETER num_ctx 32768
PARAMETER num_predict 4000
EOF
ollama create Gemma4Coder -f Gemma4Coder.Modelfile
Gemma 4 vs. Comparable Local Models
This comparison is task-focused, not benchmark-focused:
vs. Llama 4 Scout (10GB VRAM MoE)
| Task | Gemma 4 26B MoE | Llama 4 Scout | Winner |
|---|---|---|---|
| Mathematics (hard) | 88.3% AIME | ~65% AIME | Gemma 4 |
| Competitive coding | 77.1% LCB | ~60% LCB | Gemma 4 |
| Context window | 256K | 10M | Llama 4 Scout |
| VRAM required | ~18GB | ~10GB | Llama 4 Scout |
| Vision quality | Excellent | N/A | Gemma 4 |
| Audio input | E2B/E4B only | No | Gemma 4 (edge) |
| Tool calling | Native | Good | Gemma 4 |
Practical guidance: Choose Llama 4 Scout if VRAM is limited or if very long context (>256K) is needed. Choose the Gemma 4 26B MoE (the gemma4:27b tag) for harder reasoning, coding, or vision tasks where VRAM supports it.
vs. Qwen 3.6 27B (18GB VRAM Dense)
| Task | Gemma 4 26B MoE | Qwen 3.6 27B | Winner |
|---|---|---|---|
| Mathematics | 88.3% | ~75% | Gemma 4 |
| SWE-bench (real code) | ~70% | 77.2% | Qwen 3.6 |
| Vision | Yes, native | No | Gemma 4 |
| Tool calling | Native | Strong | Comparable |
| Multilingual | Strong | Excellent | Qwen 3.6 |
| Speed (tokens/s) | Faster (MoE) | Standard | Gemma 4 |
Practical guidance: Use Qwen 3.6 27B for pure coding tasks (higher SWE-bench). Use the Gemma 4 26B MoE (the gemma4:27b tag) for vision, mathematics, and reasoning-heavy tasks.
vs. DeepSeek-R1 32B (20GB VRAM)
Gemma 4 has thinking modes; DeepSeek-R1 has visible chain-of-thought. Different approaches to reasoning:
- DeepSeek-R1 exposes its thinking trace — useful when you want to see the reasoning
- Gemma 4 thinking produces the reasoning internally — cleaner output
- On hard mathematics, both perform at similar high levels
- Gemma 4 has significantly better vision and coding capabilities
- DeepSeek-R1 thinking trace is more auditable for sensitive decisions
Hardware Requirements and Performance
The smallest Gemma 4 model runs in about 4 GB of memory. The 31B dense model needs roughly 20–24 GB at Q4 quantization.
| Model | Min VRAM | Recommended | Tokens/Second (RTX 4090) |
|---|---|---|---|
| E2B | 4GB | 6GB | 80–120 t/s |
| E4B | 6GB | 8GB | 60–90 t/s |
| 26B MoE | 16GB | 20GB | 25–40 t/s |
| 31B Dense | 20GB | 24GB | 18–30 t/s |
Apple Silicon performance (M4 Pro 48GB):
- 26B MoE: ~35–50 t/s (unified memory advantage on MoE)
- 31B Dense: ~25–38 t/s
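The VRAM figures in the table can be sanity-checked with back-of-envelope arithmetic. This sketch assumes about 4.5 bits per weight for a Q4-style quantization, an illustrative figure that varies by quantization format, and it counts weights only. The `weight_memory_gb` helper is hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

# Use TOTAL parameters: an MoE model keeps every expert resident in memory,
# even though only a few experts run for each token.
for name, total_b in [("26B MoE", 26), ("31B Dense", 31)]:
    gb = weight_memory_gb(total_b)
    print(f"{name}: ~{gb:.1f} GB of weights, before KV cache and runtime overhead")
```

This gives roughly 15 GB and 17 GB of weights respectively. The KV cache, which grows with `num_ctx`, and runtime buffers account for the rest of the recommended VRAM. It also shows why the MoE saves compute per token but not much memory.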
Common Gemma 4 Mistakes
Mistake 1: Not using the thinking mode for hard problems
Gemma 4’s reasoning capability shines when you explicitly ask it to think through a problem step by step. On difficult mathematics or complex logic, include “Think carefully step by step” in your prompt.
Mistake 2: Using 31B Dense when 26B MoE is sufficient
The 26B MoE scores 88.3% vs. 89.2% on AIME, a difference of 0.9 percentage points, while requiring similar VRAM and running faster per token because fewer parameters are active. For most tasks, the MoE is the better choice.
Mistake 3: Not setting context length for vision tasks
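Each image adds significant token overhead. Set num_ctx to 16384 or higher when analyzing images (for example with /set parameter num_ctx 16384 in an interactive session, or through the API’s options field) so there is enough context for both the image tokens and your conversation.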
Mistake 4: Using E2B for complex reasoning
The E2B is designed for edge deployment and is excellent for simple tasks on constrained hardware. Do not use it for the complex reasoning, coding, and analysis where Gemma 4’s benchmark scores apply. Those numbers are for the 26B and 31B variants.
Conclusion
Gemma 4 is the most significant open model release of 2026 for local AI users. The jump from Gemma 3 27B’s 20.8% AIME score to the 26B MoE’s 88.3% and 31B’s 89.2% is not incremental — it is a qualitative leap in reasoning capability that changes what these models are genuinely useful for.
The combination of that reasoning depth, native vision across all variants, reliable tool calling, 256K context windows, configurable thinking modes, and an Apache 2.0 license that removes all commercial friction makes Gemma 4 the most well-rounded local model available in June 2026.
For users with 18–24GB VRAM: pull gemma4:27b (the 26B MoE). For the maximum quality available on consumer hardware: gemma4:31b. For laptop or edge deployment: gemma4:e4b.
Your next step: ollama pull gemma4:27b. Run the vision workflow from this guide on a document you regularly work with. Then run the mathematics benchmark prompt on a problem you care about. The quality will show you immediately whether Gemma 4 earns a place in your regular workflow.
📚 Related Posts:
- Vision Models Locally: Gemma 4 and Llama 3.2 Vision — Vision workflows in depth
- The Local LLM Model Guide 2026 — How Gemma 4 fits in the full model landscape
- Ollama Masterclass 2026 — Get Ollama running first
- Local LLMs vs Cloud AI: The Honest 2026 Comparison — Where Gemma 4 fits vs. cloud models
- AI Agents With Ollama — Building agents on Gemma 4’s native tool calling
Last updated: June 2026. Gemma 4 was released April 2, 2026. Model card and full technical specifications at ai.google.dev/gemma/docs/core/model_card_4. Apache 2.0 license terms at apache.org/licenses/LICENSE-2.0.
⚠️ Benchmark scores reflect published evaluation results as of April 2026. Real-world performance on your specific tasks may differ from benchmark scores — always test on representative examples from your own use cases.