When Ollama starts, it does not just provide a command-line chat interface. It runs a full REST API server on http://localhost:11434 that any application can call — the same way applications call OpenAI, Anthropic, or any other AI API.
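The critical feature: Ollama’s API includes an OpenAI-compatible endpoint at /v1/chat/completions. Most code written for the OpenAI chat completions API works with Ollama after two small changes: the base URL and the model name (the client library also expects a placeholder API key). This means you can take an existing application built on GPT-5.5 and run it on Llama 4 Scout locally, usually without touching the rest of the code, as long as it relies only on features the compatibility layer supports.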
For developers, this is the most important Ollama capability. It means local AI is not a separate integration path — it is the same path you already know, pointed at a local server.
🔗 This is Post #9 in the Ollama Unlocked series. For building full applications on top of the API, see Building AI Apps With Ollama and Python (Post #11). For RAG pipeline integration, see RAG with Ollama (Post #10).
The Two API Formats
Ollama exposes two sets of endpoints:
1. Native Ollama API (/api/*)
Ollama’s own format — slightly different from OpenAI, includes Ollama-specific features.
2. OpenAI-Compatible API (/v1/*)
Drop-in replacement for OpenAI endpoints. Use this for existing OpenAI integrations.
Which to use: If you are starting fresh or integrating with the broader Python/JS ecosystem, use the native Ollama client. If you are replacing OpenAI in an existing application or using a framework built on OpenAI’s format, use the /v1/ endpoints.
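As a rough illustration of how the two formats differ, the sketch below builds the same chat request for each endpoint and pulls the reply text out of each response shape. The helper names (`native_chat_request`, `openai_chat_request`, `extract_text`) are made up for this example, and the response shapes are trimmed to the fields used here.

```python
BASE = "http://localhost:11434"  # default Ollama address


def native_chat_request(model: str, prompt: str) -> tuple[str, dict]:
    """URL and JSON body for the native /api/chat endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # the native API streams unless told otherwise
    }
    return f"{BASE}/api/chat", body


def openai_chat_request(model: str, prompt: str) -> tuple[str, dict]:
    """URL and JSON body for the OpenAI-compatible endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return f"{BASE}/v1/chat/completions", body


def extract_text(response: dict) -> str:
    """Return the assistant text from either response shape."""
    if "choices" in response:  # OpenAI-compatible shape
        return response["choices"][0]["message"]["content"]
    return response["message"]["content"]  # native shape
```

The request bodies are nearly identical; the main differences are the path and where the reply text sits in the response.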
The Native Ollama API
Generate Completion (non-streaming)
curl http://localhost:11434/api/generate \
-d '{
"model": "llama4:scout",
"prompt": "Why is the sky blue?",
"stream": false
}'
Response:
{
"model": "llama4:scout",
"response": "The sky appears blue because...",
"done": true,
"total_duration": 3421000000,
"eval_count": 127,
"eval_duration": 2800000000
}
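The duration fields in this response are reported in nanoseconds, so generation speed can be derived directly from `eval_count` and `eval_duration`. A small sketch using the sample values above (the function name `tokens_per_second` is just for illustration):

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an /api/generate or /api/chat response.

    eval_count is the number of generated tokens; eval_duration is the
    time spent generating them, in nanoseconds.
    """
    seconds = response["eval_duration"] / 1_000_000_000
    return response["eval_count"] / seconds


sample = {
    "total_duration": 3421000000,  # whole request, in nanoseconds
    "eval_count": 127,
    "eval_duration": 2800000000,
}
speed = tokens_per_second(sample)
print(f"{speed:.1f} tokens/second")  # prints "45.4 tokens/second"
```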
Chat Completion (multi-turn)
curl http://localhost:11434/api/chat \
-d '{
"model": "llama4:scout",
"messages": [
{"role": "system", "content": "You are a concise technical assistant."},
{"role": "user", "content": "What is a REST API?"}
],
"stream": false
}'
Streaming Response
# stream: true (default) — streams tokens as generated
curl http://localhost:11434/api/generate \
-d '{
"model": "llama4:scout",
"prompt": "Write a haiku about programming",
"stream": true
}'
# Each line is a JSON object with partial response
# {"model":"llama4:scout","response":"Silent","done":false}
# {"model":"llama4:scout","response":" keys","done":false}
# ...
# {"model":"llama4:scout","response":"","done":true}
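Because each streamed line is a standalone JSON object, a client can rebuild the full text by parsing line by line and joining the `response` fields. A minimal sketch, using the sample lines above as input (`collect_stream` is a name invented here):

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Join the partial "response" fields from a streamed /api/generate reply."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:  # ignore empty lines
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # the final object marks the end of the stream
            break
    return "".join(parts)


sample_lines = [
    '{"model":"llama4:scout","response":"Silent","done":false}',
    '{"model":"llama4:scout","response":" keys","done":false}',
    '{"model":"llama4:scout","response":"","done":true}',
]
print(collect_stream(sample_lines))  # prints "Silent keys"
```

When reading from a live connection, the same function can be fed the decoded lines of the HTTP response body.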
Model Management Endpoints
# List all pulled models
curl http://localhost:11434/api/tags
# Get model info
curl http://localhost:11434/api/show \
-d '{"name": "llama4:scout"}'
# Pull a model
curl http://localhost:11434/api/pull \
-d '{"name": "qwen3:8b"}'
# Delete a model (this endpoint requires the DELETE method)
curl -X DELETE http://localhost:11434/api/delete \
-d '{"name": "llama4:scout"}'
# Check currently running models
curl http://localhost:11434/api/ps
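The /api/tags response is a JSON object with a `models` list, where each entry includes a `name` and a `size` in bytes. As a sketch of how a script might summarize it (the sample data below is made up for illustration, and `summarize_models` is a hypothetical helper):

```python
def summarize_models(tags_response: dict) -> list[str]:
    """Format each locally available model as 'name (size in GB)'."""
    lines = []
    for model in tags_response.get("models", []):
        size_gb = model["size"] / 1e9  # size is reported in bytes
        lines.append(f"{model['name']} ({size_gb:.1f} GB)")
    return sorted(lines)


# Illustrative data only; real responses carry more fields per model.
sample = {
    "models": [
        {"name": "nomic-embed-text:latest", "size": 274302450},
        {"name": "llama4:scout", "size": 67000000000},
    ]
}
for line in summarize_models(sample):
    print(line)
```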
The OpenAI-Compatible API
Drop-In OpenAI Replacement
# BEFORE: Using OpenAI
from openai import OpenAI
client = OpenAI(api_key="sk-...")
# AFTER: Using Ollama with the same code
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Required by the library, value doesn't matter
)
# Everything else is identical
response = client.chat.completions.create(
model="llama4:scout", # Use Ollama model name
messages=[
{"role": "user", "content": "Explain REST APIs briefly"}
]
)
print(response.choices[0].message.content)
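One way to keep both backends available is to choose the client arguments from the environment, so the rest of the application never knows which server it is talking to. This is only a sketch: `LLM_BACKEND` and `OLLAMA_BASE_URL` are variable names invented for this example, not settings that Ollama or the OpenAI library define.

```python
import os
from typing import Mapping, Optional


def openai_client_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Keyword arguments for OpenAI(...), chosen from environment variables.

    LLM_BACKEND and OLLAMA_BASE_URL are hypothetical names for this sketch.
    """
    env = os.environ if env is None else env
    if env.get("LLM_BACKEND", "openai") == "ollama":
        return {
            "base_url": env.get("OLLAMA_BASE_URL", "http://localhost:11434/v1"),
            "api_key": "ollama",  # placeholder; the value is not checked
        }
    return {"api_key": env["OPENAI_API_KEY"]}


# Usage (requires the openai package):
#   from openai import OpenAI
#   client = OpenAI(**openai_client_kwargs())
```

The model name still has to change alongside the backend, so in practice it would likely be read from configuration as well.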
Streaming With the OpenAI Client
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama"
)
stream = client.chat.completions.create(
model="llama4:scout",
messages=[{"role": "user", "content": "Write a short story about a robot"}],
stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print() # Final newline
The Native Python Ollama Client
The official ollama Python library provides a cleaner interface than the raw HTTP API:
pip install ollama
Basic Chat
import ollama
response = ollama.chat(
model="llama4:scout",
messages=[
{"role": "user", "content": "What is quantum computing?"}
]
)
print(response["message"]["content"])
Streaming Chat
import ollama
stream = ollama.chat(
model="llama4:scout",
messages=[{"role": "user", "content": "Explain machine learning"}],
stream=True
)
for chunk in stream:
print(chunk["message"]["content"], end="", flush=True)
print()
System Prompts and Multi-Turn Conversations
import ollama
messages = [
{
"role": "system",
"content": "You are a concise technical writer. "
"Responses should be clear, direct, and under 200 words."
}
]
def chat(user_input: str) -> str:
    # Append the user turn, call the model, then store the reply in history
    messages.append({"role": "user", "content": user_input})
    response = ollama.chat(
        model="llama4:scout",
        messages=messages
    )
    assistant_message = response["message"]["content"]
    messages.append({"role": "assistant", "content": assistant_message})
    return assistant_message
# Multi-turn conversation
print(chat("What is Docker?"))
print(chat("How is it different from a virtual machine?"))
print(chat("When should I use one versus the other?"))
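In the pattern above, the message list grows with every turn and is resent in full on each call, so long conversations can eventually exceed the model's context window. One possible mitigation, sketched here with a hypothetical helper named `trim_history`, is to keep the system prompt plus only the most recent messages:

```python
def trim_history(messages: list[dict], max_turn_messages: int = 20) -> list[dict]:
    """Keep system messages plus the most recent user/assistant messages."""
    system = [m for m in messages if m["role"] == "system"]
    if max_turn_messages <= 0:
        return system
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turn_messages:]


history = [{"role": "system", "content": "Be concise."}]
for i in range(3):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turn_messages=4)
print(len(trimmed))  # prints 5
```

Counting messages is a crude proxy for counting tokens; it is used here only to keep the sketch short.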
Vision via API
import ollama
import base64
from pathlib import Path
def analyze_image_api(image_path: str, question: str) -> str:
    # Read the image file and encode it as a base64 string
    image_data = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = ollama.chat(
        model="gemma4:9b",
        messages=[{
            "role": "user",
            "content": question,
            "images": [image_data]  # Pass as list of base64 strings
        }]
    )
    return response["message"]["content"]
Generate Embeddings
import ollama
def get_embedding(text: str, model: str = "nomic-embed-text") -> list[float]:
    response = ollama.embeddings(
        model=model,
        prompt=text
    )
    return response["embedding"]
# Example
embedding = get_embedding("The quick brown fox jumps over the lazy dog")
print(f"Embedding dimensions: {len(embedding)}") # Typically 768 or 1024
JavaScript / TypeScript Client
npm install ollama
import ollama from 'ollama';
// Basic chat
const response = await ollama.chat({
model: 'llama4:scout',
messages: [{ role: 'user', content: 'Why is the ocean salty?' }],
});
console.log(response.message.content);
// Streaming
const stream = await ollama.chat({
model: 'llama4:scout',
messages: [{ role: 'user', content: 'Write a JavaScript function to reverse a string' }],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.message.content);
}
// Embeddings
const embedding = await ollama.embeddings({
model: 'nomic-embed-text',
prompt: 'Text to embed',
});
console.log(embedding.embedding.length);
Advanced API Features
Custom Parameters
Control model behavior via API parameters:
response = ollama.chat(
model="llama4:scout",
messages=[{"role": "user", "content": "Write creative fiction"}],
options={
"temperature": 0.9, # Higher = more creative (0.0-2.0)
"top_p": 0.95, # Nucleus sampling
"top_k": 50, # Top-k sampling
"num_ctx": 32768, # Context window size
"num_predict": 500, # Max tokens to generate
"repeat_penalty": 1.1, # Penalize repetition
"seed": 42, # Reproducible outputs
}
)
Keep-Alive Control
By default, Ollama keeps models loaded for 5 minutes after last use. Control this per-request:
# Keep model loaded indefinitely
response = ollama.chat(
model="llama4:scout",
messages=[...],
keep_alive=-1 # Never unload
)
# Unload immediately after response
response = ollama.chat(
model="llama4:scout",
messages=[...],
keep_alive=0 # Unload now
)
# Keep loaded for specific duration
response = ollama.chat(
model="llama4:scout",
messages=[...],
keep_alive="10m" # Keep loaded 10 minutes
)
Structured JSON Output
import json
import ollama
response = ollama.chat(
model="llama4:scout",
messages=[{
"role": "user",
"content": """Extract the following from this text and return as JSON:
- company_name
- founded_year
- number_of_employees
- headquarters_city
Text: Acme Corp was founded in 1987 in San Francisco.
Today it employs over 5,000 people worldwide."""
}],
format="json" # Enforces JSON output
)
data = json.loads(response["message"]["content"])
print(data)
# {"company_name": "Acme Corp", "founded_year": 1987, ...}
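JSON mode constrains the output to valid JSON, but it does not by itself guarantee that the keys you asked for are present. A defensive parsing step can catch that before the data reaches the rest of the application. A minimal sketch, assuming the four fields from the prompt above (`parse_extraction` is a hypothetical helper):

```python
import json

REQUIRED_KEYS = {
    "company_name",
    "founded_year",
    "number_of_employees",
    "headquarters_city",
}


def parse_extraction(raw: str) -> dict:
    """Parse model output and check that the expected keys are present."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")
    return data
```

In the example above, this would replace the bare `json.loads` call: `data = parse_extraction(response["message"]["content"])`.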
Replacing OpenAI in Existing Frameworks
LangChain
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_core.prompts import ChatPromptTemplate
# Drop-in model replacement
llm = OllamaLLM(model="llama4:scout")
# Or use the ChatOllama for chat models
from langchain_ollama import ChatOllama
chat_model = ChatOllama(model="llama4:scout", temperature=0.1)
# Embeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Use in chains exactly as you would OpenAI
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant."),
("human", "{question}")
])
chain = prompt | chat_model
response = chain.invoke({"question": "What is Ollama?"})
print(response.content)
LlamaIndex
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
# Local LLM
llm = Ollama(model="llama4:scout", request_timeout=120.0)
# Local embeddings
embed_model = OllamaEmbedding(model_name="nomic-embed-text")
# Use in any LlamaIndex pipeline
from llama_index.core import Settings
Settings.llm = llm
Settings.embed_model = embed_model
Vercel AI SDK (JavaScript)
import { createOllama } from 'ollama-ai-provider';
import { generateText } from 'ai';
const ollama = createOllama({
baseURL: 'http://localhost:11434/api',
});
const { text } = await generateText({
model: ollama('llama4:scout'),
prompt: 'Why is the sky blue?',
});
Running Ollama as a Network Server
By default, Ollama listens only on localhost. To make it accessible to other devices:
# Environment variable — set before starting Ollama
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# Or in the systemd service (Linux)
# Run: sudo systemctl edit ollama.service
# Add under [Service]: Environment="OLLAMA_HOST=0.0.0.0:11434"
# Then: sudo systemctl daemon-reload && sudo systemctl restart ollama
Other devices on your network can then reach the API at http://[your-machine-IP]:11434.
Security note: Ollama has no authentication by default. Only expose it on a trusted local network, not the public internet.
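Before pointing an application at a remote Ollama machine, it can help to confirm the server is reachable. The sketch below queries the /api/version endpoint using only the standard library; the helper names are invented for this example, and the IP address in the usage comment is a placeholder.

```python
import json
import urllib.request
from typing import Optional


def ollama_base_url(host: str, port: int = 11434) -> str:
    """Build the base URL for an Ollama server on the given host."""
    return f"http://{host}:{port}"


def server_version(host: str, port: int = 11434, timeout: float = 3.0) -> Optional[str]:
    """Return the server's version string, or None if it cannot be reached."""
    url = ollama_base_url(host, port) + "/api/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (OSError, ValueError):  # connection errors, timeouts, bad JSON
        return None


# Usage (placeholder address):
#   print(server_version("192.168.1.50"))
```

The official Python client can target the same machine with `ollama.Client(host="http://192.168.1.50:11434")`.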
API Rate and Performance Considerations
Concurrent Requests
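Ollama accepts concurrent requests, but inference capacity is finite. Depending on the server's parallelism setting (the OLLAMA_NUM_PARALLEL environment variable) and available memory, simultaneous requests are either processed in parallel or queued and executed in order, so sending many requests at once does not multiply throughput: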
import asyncio
import aiohttp
async def fetch_completion(session, prompt: str) -> str:
    # POST one prompt to the native generate endpoint and return the text
    async with session.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama4:scout", "prompt": prompt, "stream": False}
    ) as response:
        data = await response.json()
        return data["response"]
async def batch_completions(prompts: list[str]) -> list[str]:
    # Share one HTTP session and launch all requests concurrently
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_completion(session, p) for p in prompts]
        return await asyncio.gather(*tasks)
# Run multiple completions (the server may queue them)
results = asyncio.run(batch_completions([
"Summarize REST APIs",
"Summarize GraphQL",
"Summarize gRPC"
]))
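Since extra requests mostly wait on the server, a client may prefer to cap how many are in flight at once rather than send everything together. The sketch below does this with an `asyncio.Semaphore`. `run_bounded` is a hypothetical helper, and `fake_worker` stands in for a real HTTP call so the example runs without a server.

```python
import asyncio
from typing import Awaitable, Callable


async def run_bounded(
    prompts: list[str],
    worker: Callable[[str], Awaitable[str]],
    max_in_flight: int = 2,
) -> list[str]:
    """Run worker(prompt) for every prompt, at most max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(prompt: str) -> str:
        async with semaphore:
            return await worker(prompt)

    # gather returns results in the same order as the input prompts
    return await asyncio.gather(*(guarded(p) for p in prompts))


async def fake_worker(prompt: str) -> str:
    """Stand-in for a real request so the sketch runs without a server."""
    await asyncio.sleep(0.01)
    return prompt.upper()


results = asyncio.run(run_bounded(["rest", "graphql", "grpc"], fake_worker))
print(results)  # prints ['REST', 'GRAPHQL', 'GRPC']
```

To use it with a real server, the worker would wrap a request function such as the `fetch_completion` coroutine above, bound to an open session.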
Monitoring API Performance
import time
import ollama
def timed_completion(prompt: str, model: str = "llama4:scout") -> dict:
    start = time.perf_counter()
    response = ollama.generate(model=model, prompt=prompt)
    # Wall-clock time: includes model loading and prompt processing
    duration = time.perf_counter() - start
    tokens = response["eval_count"]
    return {
        "response": response["response"],
        "tokens_generated": tokens,
        "duration_seconds": round(duration, 2),
        "tokens_per_second": round(tokens / duration, 1)
    }
result = timed_completion("Explain neural networks in 3 sentences")
print(f"Speed: {result['tokens_per_second']} tokens/second")
Conclusion
The Ollama API makes local AI a drop-in replacement for cloud AI in most developer workflows. The OpenAI-compatible endpoint means your existing code, frameworks, and tooling work immediately — just change the base URL.
The practical path for most developers: start by replacing one specific cloud AI call in an existing application with the Ollama equivalent. Pick something non-critical — a summarization task, a classification step, a content generation function. Measure the quality. Measure the performance. Use that experience to decide which parts of your AI infrastructure can move to local, private inference.
Your next step: If you have any Python code calling the OpenAI API, add the two-line change from the “Drop-In OpenAI Replacement” section. Point it at Ollama. Run it. See if the output quality is sufficient for that specific task.
Last updated: May 2026. Verify current Ollama API documentation at github.com/ollama/ollama/blob/main/docs/api.md.