
The Ollama API: Your OpenAI-Compatible Local Server for Developers

When Ollama starts, it does not just provide a command-line chat interface. It runs a full REST API server on http://localhost:11434 that any application can call — the same way applications call OpenAI, Anthropic, or any other AI API.

The critical feature: Ollama’s API includes an OpenAI-compatible endpoint at /v1/chat/completions. Most code written against the OpenAI chat completions API works with Ollama after two small changes: the base URL and the model name. This means you can take an existing application built on GPT-5.5 and run it on Llama 4 Scout locally, with few or no other changes, as long as it sticks to the features Ollama implements.

For developers, this is the most important Ollama capability. It means local AI is not a separate integration path — it is the same path you already know, pointed at a local server.

🔗 This is Post #9 in the Ollama Unlocked series. For building full applications on top of the API, see Building AI Apps With Ollama and Python (Post #11). For RAG pipeline integration, see RAG with Ollama (Post #10).


The Two API Formats

Ollama exposes two sets of endpoints:

1. Native Ollama API (/api/*)

Ollama’s own format. It differs slightly from OpenAI’s and exposes Ollama-specific features such as model management, keep-alive control, and the options block.

2. OpenAI-Compatible API (/v1/*)

Drop-in replacement for OpenAI endpoints. Use this for existing OpenAI integrations.

Which to use: If you are starting fresh or integrating with the broader Python/JS ecosystem, use the native Ollama client. If you are replacing OpenAI in an existing application or using a framework built on OpenAI’s format, use the /v1/ endpoints.
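In practice, the most visible difference between the two families is the shape of the response. The sketch below assumes the native chat reply nests its text under message.content, the native generate reply under response, and the OpenAI-compatible reply under choices[0].message.content, which matches the examples elsewhere in this post. The helper name extract_text is hypothetical and not part of either API.

```python
from typing import Any


def extract_text(payload: dict[str, Any]) -> str:
    """Return the assistant text from either response shape (hypothetical helper)."""
    if "choices" in payload:
        # OpenAI-compatible shape, as returned by /v1/chat/completions
        return payload["choices"][0]["message"]["content"]
    if "message" in payload:
        # Native chat shape, as returned by /api/chat
        return payload["message"]["content"]
    # Native generate shape, as returned by /api/generate
    return payload["response"]


# Hand-written sample payloads, trimmed to the fields the helper reads
native_chat = {"model": "llama4:scout", "message": {"role": "assistant", "content": "Hi"}, "done": True}
native_generate = {"model": "llama4:scout", "response": "Hi", "done": True}
openai_style = {"choices": [{"message": {"role": "assistant", "content": "Hi"}}]}

for payload in (native_chat, native_generate, openai_style):
    print(extract_text(payload))
```

A helper like this is mainly useful while migrating, when some code paths still talk to /v1/ and others have moved to /api/.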


The Native Ollama API

Generate Completion (non-streaming)

curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama4:scout",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'

Response:

{
  "model": "llama4:scout",
  "response": "The sky appears blue because...",
  "done": true,
  "total_duration": 3421000000,
  "eval_count": 127,
  "eval_duration": 2800000000
}
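The duration fields in this response are reported in nanoseconds, so generation speed can be derived directly from eval_count and eval_duration. A minimal sketch using the sample values above; generation_speed is a hypothetical helper name:

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def generation_speed(response: dict) -> float:
    """Tokens generated per second, based on the server-reported eval time."""
    eval_seconds = response["eval_duration"] / NANOSECONDS_PER_SECOND
    return response["eval_count"] / eval_seconds


# Values copied from the sample response above
sample = {
    "total_duration": 3421000000,
    "eval_count": 127,
    "eval_duration": 2800000000,
}

print(f"{generation_speed(sample):.1f} tokens/second")  # 45.4 tokens/second
```

The gap between total_duration and eval_duration (about 0.6 seconds here) covers work other than token generation, such as loading the model and processing the prompt.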

Chat Completion (multi-turn)

curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama4:scout",
    "messages": [
      {"role": "system", "content": "You are a concise technical assistant."},
      {"role": "user", "content": "What is a REST API?"}
    ],
    "stream": false
  }'

Streaming Response

# stream: true (default) — streams tokens as generated
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama4:scout",
    "prompt": "Write a haiku about programming",
    "stream": true
  }'

# Each line is a JSON object with partial response
# {"model":"llama4:scout","response":"Silent","done":false}
# {"model":"llama4:scout","response":" keys","done":false}
# ...
# {"model":"llama4:scout","response":"","done":true}
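Because each streamed line is a standalone JSON object, a client can rebuild the full text by parsing line by line and concatenating the response fields until done is true. A minimal sketch that works on any iterable of lines, so it can be fed from an HTTP client's line iterator or, as here, from sample data; collect_stream is a hypothetical helper:

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Join the partial 'response' fields of a newline-delimited JSON stream."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank keep-alive lines, if any
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # The final object carries done=true and an empty response
    return "".join(parts)


sample_lines = [
    '{"model":"llama4:scout","response":"Silent","done":false}',
    '{"model":"llama4:scout","response":" keys","done":false}',
    '{"model":"llama4:scout","response":"","done":true}',
]

print(collect_stream(sample_lines))  # Silent keys
```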

Model Management Endpoints

# List all pulled models
curl http://localhost:11434/api/tags

# Get model info
curl http://localhost:11434/api/show \
  -d '{"model": "llama4:scout"}'

# Pull a model
curl http://localhost:11434/api/pull \
  -d '{"model": "qwen3:8b"}'

# Delete a model (this endpoint requires the DELETE method)
curl -X DELETE http://localhost:11434/api/delete \
  -d '{"model": "llama4:scout"}'

# Check currently running models
curl http://localhost:11434/api/ps
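A common use of the list endpoint is checking whether a model is already pulled before sending a request. The sketch below assumes the /api/tags response is a JSON object with a top-level models array whose entries carry a name field, and that a model requested without a tag resolves to the latest tag. has_model is a hypothetical helper, and the payload shown is hand-written sample data.

```python
def has_model(tags_payload: dict, wanted: str) -> bool:
    """Check a parsed /api/tags payload for a model name (hypothetical helper)."""
    if ":" not in wanted:
        # Assumption: an untagged name refers to the "latest" tag
        wanted = f"{wanted}:latest"
    names = {entry.get("name") for entry in tags_payload.get("models", [])}
    return wanted in names


# Hand-written sample payload, trimmed to the field the helper reads
tags = {
    "models": [
        {"name": "llama4:scout"},
        {"name": "nomic-embed-text:latest"},
    ]
}

print(has_model(tags, "llama4:scout"))
print(has_model(tags, "nomic-embed-text"))
```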

The OpenAI-Compatible API

Drop-In OpenAI Replacement

# BEFORE: Using OpenAI
from openai import OpenAI
client = OpenAI(api_key="sk-...")

# AFTER: Using Ollama with the same code
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required by the library, value doesn't matter
)

# Everything else is identical
response = client.chat.completions.create(
    model="llama4:scout",     # Use Ollama model name
    messages=[
        {"role": "user", "content": "Explain REST APIs briefly"}
    ]
)
print(response.choices[0].message.content)
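To switch between the two backends without editing code, the client arguments can be derived from environment variables. This is a sketch under stated assumptions: the variable names LLM_BACKEND, LLM_MODEL, and OLLAMA_BASE_URL are made up for this example, and client_settings is a hypothetical helper.

```python
import os
from typing import Mapping


def client_settings(env: Mapping[str, str] = os.environ) -> tuple[dict, str]:
    """Return (keyword arguments for OpenAI(...), model name) for the chosen backend."""
    backend = env.get("LLM_BACKEND", "ollama")  # Hypothetical variable name
    if backend == "ollama":
        kwargs = {
            "base_url": env.get("OLLAMA_BASE_URL", "http://localhost:11434/v1"),
            "api_key": "ollama",  # Required by the library, value doesn't matter
        }
        return kwargs, env.get("LLM_MODEL", "llama4:scout")
    # Hosted OpenAI: the library's default base URL applies, so only the key is set
    return {"api_key": env["OPENAI_API_KEY"]}, env["LLM_MODEL"]


kwargs, model = client_settings({})  # Empty environment falls back to local Ollama
print(kwargs["base_url"], model)

# Usage with the real client (not executed here):
#   from openai import OpenAI
#   client = OpenAI(**kwargs)
#   client.chat.completions.create(model=model, messages=[...])
```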

Streaming With the OpenAI Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

stream = client.chat.completions.create(
    model="llama4:scout",
    messages=[{"role": "user", "content": "Write a short story about a robot"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()  # Final newline

The Native Python Ollama Client

The official ollama Python library provides a cleaner interface than the raw HTTP API:

pip install ollama

Basic Chat

import ollama

response = ollama.chat(
    model="llama4:scout",
    messages=[
        {"role": "user", "content": "What is quantum computing?"}
    ]
)
print(response["message"]["content"])

Streaming Chat

import ollama

stream = ollama.chat(
    model="llama4:scout",
    messages=[{"role": "user", "content": "Explain machine learning"}],
    stream=True
)

for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
print()

System Prompts and Multi-Turn Conversations

import ollama

messages = [
    {
        "role": "system",
        "content": "You are a concise technical writer. "
                   "Responses should be clear, direct, and under 200 words."
    }
]

def chat(user_input: str) -> str:
    messages.append({"role": "user", "content": user_input})
    
    response = ollama.chat(
        model="llama4:scout",
        messages=messages
    )
    
    assistant_message = response["message"]["content"]
    messages.append({"role": "assistant", "content": assistant_message})
    return assistant_message

# Multi-turn conversation
print(chat("What is Docker?"))
print(chat("How is it different from a virtual machine?"))
print(chat("When should I use one versus the other?"))
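The messages list in this pattern grows with every turn, and the whole list is sent on each request, so long conversations eventually exceed the context window. One simple mitigation, sketched here as an assumption rather than a recommendation from the Ollama docs, is to keep the system prompt plus only the most recent exchanges. trim_history is a hypothetical helper.

```python
def trim_history(messages: list[dict], max_turns: int) -> list[dict]:
    """Keep system messages plus the last max_turns user/assistant exchanges."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    keep = max_turns * 2  # One exchange = one user message + one assistant reply
    recent = dialogue[-keep:] if keep > 0 else []
    return system + recent


history = [
    {"role": "system", "content": "You are a concise technical writer."},
    {"role": "user", "content": "What is Docker?"},
    {"role": "assistant", "content": "A container platform."},
    {"role": "user", "content": "How is it different from a VM?"},
    {"role": "assistant", "content": "It shares the host kernel."},
]

trimmed = trim_history(history, max_turns=1)
print([m["role"] for m in trimmed])  # ['system', 'user', 'assistant']
```

Calling it after each completed exchange keeps the user and assistant messages paired. Dropping old turns does mean the model forgets them, so summarizing older turns is an alternative when earlier context matters.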

Vision via API

import ollama
import base64
from pathlib import Path

def analyze_image_api(image_path: str, question: str) -> str:
    image_data = base64.b64encode(Path(image_path).read_bytes()).decode()
    
    response = ollama.chat(
        model="gemma4:9b",
        messages=[{
            "role": "user",
            "content": question,
            "images": [image_data]  # Pass as list of base64 strings
        }]
    )
    return response["message"]["content"]

Generate Embeddings

import ollama

def get_embedding(text: str, model: str = "nomic-embed-text") -> list[float]:
    response = ollama.embeddings(
        model=model,
        prompt=text
    )
    return response["embedding"]

# Example
embedding = get_embedding("The quick brown fox jumps over the lazy dog")
print(f"Embedding dimensions: {len(embedding)}")  # Typically 768 or 1024
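Embeddings are usually compared with cosine similarity: vectors pointing in similar directions represent semantically similar text. A minimal pure-Python sketch, using short hand-written vectors in place of real embeddings; cosine_similarity and rank_by_similarity are hypothetical helper names:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: list[float], candidates: dict[str, list[float]]) -> list[str]:
    """Return candidate keys ordered from most to least similar to the query."""
    return sorted(
        candidates,
        key=lambda key: cosine_similarity(query, candidates[key]),
        reverse=True,
    )


# Toy 2-dimensional vectors standing in for real embeddings
docs = {"a": [0.0, 1.0], "b": [1.0, 0.1]}
print(rank_by_similarity([1.0, 0.0], docs))  # ['b', 'a']
```

For more than a few thousand vectors, a vector database or NumPy would be the more practical choice than this loop-based version.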

JavaScript / TypeScript Client

npm install ollama

import ollama from 'ollama';

// Basic chat
const response = await ollama.chat({
  model: 'llama4:scout',
  messages: [{ role: 'user', content: 'Why is the ocean salty?' }],
});
console.log(response.message.content);

// Streaming
const stream = await ollama.chat({
  model: 'llama4:scout',
  messages: [{ role: 'user', content: 'Write a JavaScript function to reverse a string' }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.message.content);
}

// Embeddings
const embedding = await ollama.embeddings({
  model: 'nomic-embed-text',
  prompt: 'Text to embed',
});
console.log(embedding.embedding.length);

Advanced API Features

Custom Parameters

Control model behavior via API parameters:

response = ollama.chat(
    model="llama4:scout",
    messages=[{"role": "user", "content": "Write creative fiction"}],
    options={
        "temperature": 0.9,      # Higher = more creative (0.0-2.0)
        "top_p": 0.95,           # Nucleus sampling
        "top_k": 50,             # Top-k sampling
        "num_ctx": 32768,        # Context window size
        "num_predict": 500,      # Max tokens to generate
        "repeat_penalty": 1.1,  # Penalize repetition
        "seed": 42,              # Reproducible outputs
    }
)
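When calling the native HTTP API directly instead of going through the Python client, the same parameters travel in an options object inside the request body. A small sketch that builds such a body for /api/chat; build_chat_payload is a hypothetical helper, and the option names are the ones shown above:

```python
import json


def build_chat_payload(model: str, messages: list[dict], *, stream: bool = False, **options) -> str:
    """Serialize a native /api/chat request body, nesting sampling settings under 'options'."""
    body = {"model": model, "messages": messages, "stream": stream}
    # Leave out unset values so the server-side defaults apply
    set_options = {key: value for key, value in options.items() if value is not None}
    if set_options:
        body["options"] = set_options
    return json.dumps(body)


payload = build_chat_payload(
    "llama4:scout",
    [{"role": "user", "content": "Write creative fiction"}],
    temperature=0.9,
    seed=42,
    top_k=None,  # Dropped from the payload because it is unset
)
print(payload)
```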

Keep-Alive Control

By default, Ollama keeps models loaded for 5 minutes after last use. Control this per-request:

# Keep model loaded indefinitely
response = ollama.chat(
    model="llama4:scout",
    messages=[...],
    keep_alive=-1  # Never unload
)

# Unload immediately after response
response = ollama.chat(
    model="llama4:scout",
    messages=[...],
    keep_alive=0  # Unload now
)

# Keep loaded for specific duration
response = ollama.chat(
    model="llama4:scout",
    messages=[...],
    keep_alive="10m"  # Keep loaded 10 minutes
)

Structured JSON Output

import json
import ollama

response = ollama.chat(
    model="llama4:scout",
    messages=[{
        "role": "user",
        "content": """Extract the following from this text and return as JSON:
- company_name
- founded_year  
- number_of_employees
- headquarters_city

Text: Acme Corp was founded in 1987 in San Francisco. 
Today it employs over 5,000 people worldwide."""
    }],
    format="json"  # Enforces JSON output
)

data = json.loads(response["message"]["content"])
print(data)
# {"company_name": "Acme Corp", "founded_year": 1987, ...}
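JSON mode yields syntactically valid JSON, but it does not by itself promise that every requested field is present. A defensive sketch that parses the reply and checks for the four fields named in the prompt above; parse_extraction and REQUIRED_FIELDS are hypothetical names introduced for this example:

```python
import json

REQUIRED_FIELDS = ("company_name", "founded_year", "number_of_employees", "headquarters_city")


def parse_extraction(raw: str, required: tuple[str, ...] = REQUIRED_FIELDS) -> dict:
    """Parse a model reply and verify that the expected keys exist."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [field for field in required if field not in data]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return data


# Hand-written sample reply standing in for response["message"]["content"]
reply = (
    '{"company_name": "Acme Corp", "founded_year": 1987, '
    '"number_of_employees": 5000, "headquarters_city": "San Francisco"}'
)
print(parse_extraction(reply)["company_name"])  # Acme Corp
```

Raising ValueError on a bad reply gives the caller one exception type to catch, for example to retry the request with a stricter prompt.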

Replacing OpenAI in Existing Frameworks

LangChain

from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_core.prompts import ChatPromptTemplate

# Drop-in model replacement
llm = OllamaLLM(model="llama4:scout")

# Or use the ChatOllama for chat models
from langchain_ollama import ChatOllama
chat_model = ChatOllama(model="llama4:scout", temperature=0.1)

# Embeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Use in chains exactly as you would OpenAI
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{question}")
])

chain = prompt | chat_model
response = chain.invoke({"question": "What is Ollama?"})
print(response.content)

LlamaIndex

from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

# Local LLM
llm = Ollama(model="llama4:scout", request_timeout=120.0)

# Local embeddings
embed_model = OllamaEmbedding(model_name="nomic-embed-text")

# Use in any LlamaIndex pipeline
from llama_index.core import Settings
Settings.llm = llm
Settings.embed_model = embed_model

Vercel AI SDK (JavaScript)

import { createOllama } from 'ollama-ai-provider';
import { generateText } from 'ai';

const ollama = createOllama({
  baseURL: 'http://localhost:11434/api',
});

const { text } = await generateText({
  model: ollama('llama4:scout'),
  prompt: 'Why is the sky blue?',
});

Running Ollama as a Network Server

By default, Ollama listens only on localhost. To make it accessible to other devices:

# Environment variable — set before starting Ollama
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# Or in systemd service
# Edit /etc/systemd/system/ollama.service
# Add: Environment="OLLAMA_HOST=0.0.0.0:11434"
# Then: sudo systemctl daemon-reload && sudo systemctl restart ollama

Other devices on your network can then reach the API at http://[your-machine-IP]:11434.

Security note: Ollama has no authentication by default. Only expose it on a trusted local network, not the public internet.


API Rate and Performance Considerations

Concurrent Requests

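Ollama accepts concurrent requests, but throughput is bounded by the GPU. How many requests a loaded model serves in parallel is controlled by the OLLAMA_NUM_PARALLEL setting, and requests beyond that limit wait in a queue. Firing many requests at once therefore does not make each one faster: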

import asyncio
import aiohttp

async def fetch_completion(session, prompt: str) -> str:
    async with session.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama4:scout", "prompt": prompt, "stream": False}
    ) as response:
        data = await response.json()
        return data["response"]

async def batch_completions(prompts: list[str]) -> list[str]:
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_completion(session, p) for p in prompts]
        return await asyncio.gather(*tasks)

# Run multiple completions (requests beyond the server's parallel limit will queue)
results = asyncio.run(batch_completions([
    "Summarize REST APIs",
    "Summarize GraphQL",
    "Summarize gRPC"
]))
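Since the server works through requests at a limited rate anyway, it can help to cap concurrency on the client side so that pending requests do not pile up and time out. A sketch using asyncio.Semaphore; run_bounded is a hypothetical helper, and fake_completion stands in for a real HTTP call so the example runs without a server:

```python
import asyncio
from typing import Awaitable, Callable


async def run_bounded(jobs: list[Callable[[], Awaitable[str]]], limit: int) -> list[str]:
    """Run the jobs with at most `limit` in flight, preserving input order."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(job: Callable[[], Awaitable[str]]) -> str:
        async with semaphore:  # Waits here while `limit` jobs are already running
            return await job()

    return await asyncio.gather(*(run_one(job) for job in jobs))


async def fake_completion(prompt: str) -> str:
    """Stand-in for a real request to /api/generate."""
    await asyncio.sleep(0.01)
    return f"summary of {prompt}"


prompts = ["REST", "GraphQL", "gRPC"]
results = asyncio.run(
    run_bounded([lambda p=p: fake_completion(p) for p in prompts], limit=2)
)
print(results)  # ['summary of REST', 'summary of GraphQL', 'summary of gRPC']
```

A reasonable starting value for the limit is whatever parallelism the server is configured for; anything higher only moves the queue from the server to the client's open connections.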

Monitoring API Performance

import time
import ollama

def timed_completion(prompt: str, model: str = "llama4:scout") -> dict:
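    # perf_counter() is monotonic, so it is safer for timing than time.time()
    start = time.perf_counter()

    response = ollama.generate(model=model, prompt=prompt)

    # Wall-clock time includes model loading and prompt processing, so the
    # figure below is end-to-end throughput rather than pure generation speed
    duration = time.perf_counter() - start
    tokens = response["eval_count"]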
    
    return {
        "response": response["response"],
        "tokens_generated": tokens,
        "duration_seconds": round(duration, 2),
        "tokens_per_second": round(tokens / duration, 1)
    }

result = timed_completion("Explain neural networks in 3 sentences")
print(f"Speed: {result['tokens_per_second']} tokens/second")

Conclusion

The Ollama API makes local AI a drop-in replacement for cloud AI in most developer workflows. The OpenAI-compatible endpoint means most existing code, frameworks, and tooling work once you change the base URL and the model name.

The practical path for most developers: start by replacing one specific cloud AI call in an existing application with the Ollama equivalent. Pick something non-critical — a summarization task, a classification step, a content generation function. Measure the quality. Measure the performance. Use that experience to decide which parts of your AI infrastructure can move to local, private inference.

Your next step: If you have any Python code calling the OpenAI API, add the two-line change from the “Drop-In OpenAI Replacement” section. Point it at Ollama. Run it. See if the output quality is sufficient for that specific task.



Last updated: May 2026. Verify current Ollama API documentation at github.com/ollama/ollama/blob/main/docs/api.md.

Frequently Asked Questions (FAQ)

Does the OpenAI-compatible endpoint support all OpenAI features?
Core chat features (chat completions, streaming, system prompts, temperature) are supported. OpenAI platform features such as fine-tuning, assistants, file uploads, and DALL-E image generation have no Ollama equivalent.

Can I use the Ollama API from a browser directly?
Yes, with CORS configured. By default Ollama allows requests from localhost origins. For browser access from a different origin, set `OLLAMA_ORIGINS` to the specific origins you trust before starting Ollama. Setting `OLLAMA_ORIGINS=*` lets any website call your server, which is a security risk on anything other than an isolated machine.

What is the difference between `/api/generate` and `/api/chat`?
`/api/generate` is for single-turn completion: one prompt, one response. `/api/chat` is for multi-turn conversations with a messages array. Use `/api/chat` for chatbots and conversational applications, and `/api/generate` for batch text generation.

How do I handle API errors in production?
Common errors include 404 (model not found; pull it first), 500 (often GPU out of memory; use a smaller model or reduce the context size), and timeouts (model still loading; retry after a few seconds). Build retry logic with exponential backoff for production robustness.
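
The retry advice in the last answer can be sketched as a small wrapper. This is a generic pattern, not an Ollama feature: with_retries is a hypothetical helper, and the built-in TimeoutError and ConnectionError are placeholders for whatever exceptions your HTTP client raises. A 404 for a missing model should not be retried, since pulling the model is the fix.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    *,
    attempts: int = 4,
    base_delay: float = 0.5,
    retry_on: tuple = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, retrying on transient errors with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error to the caller
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise AssertionError("unreachable")


# Demonstration with a stand-in that fails twice before succeeding
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("model still loading")
    return "ok"


delays = []
print(with_retries(flaky, sleep=delays.append), delays)  # ok [0.5, 1.0]
```

Passing sleep as a parameter keeps the wrapper testable without real waiting. Adding random jitter to the delay is a common refinement when many clients retry at once.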
