Fine-Tuning With Ollama: Customizing Models on Your Own Hardware

A Modelfile changes how a model behaves through instructions. Fine-tuning changes the model's weights through training, which shifts its default behavior and, to a more limited degree, what it knows. These are fundamentally different interventions, and understanding the difference determines whether fine-tuning is the right tool for your situation.

Fine-tuning is worth doing when:

You have a large amount of domain-specific text that contains terminology, formats, or knowledge the base model lacks
You need consistent style or format that system prompts alone do not reliably produce
You are running the model in production at high volume where better accuracy has compounding value
You want to remove capabilities or add specific knowledge permanently

Fine-tuning is not worth doing when:

A well-written Modelfile system prompt achieves the behavior you need
You have fewer than a few hundred training examples
Your use case changes frequently (fine-tuned models are static)
You want to add current information (use RAG for this)

This guide covers the full fine-tuning pipeline: data preparation, training with Unsloth on consumer hardware, converting to GGUF for Ollama, and evaluation.

🔗 This is Post #19 in the Ollama Unlocked series. For prompt-based customization without training, see The Modelfile (Post #17). For adding document knowledge without training, see RAG with Ollama (Post #10).

Fine-Tuning vs. RAG vs. Modelfile: Choosing Correctly

QUESTION: Does the model need to know domain-specific facts
          that are not in its training data?
→ YES: Use RAG (add documents to context) — faster, cheaper, updateable
→ NO: Continue...

QUESTION: Does the model need to consistently produce a specific
          format, style, or tone?
→ YES, and system prompt works: Use Modelfile — done
→ YES, but system prompt is unreliable or inconsistent: Fine-tune
→ NO: Continue...

QUESTION: Do you have 500+ high-quality examples of 
          exactly the input-output behavior you want?
→ YES: Fine-tuning is viable
→ NO: Build more examples first, or use Modelfile

QUESTION: Will this model be used at high volume in production?
→ YES: Fine-tuning's quality improvement compounds — worth the investment
→ NO: Modelfile is probably sufficient
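
The questions above can also be read as a short function. The sketch below is one possible encoding, not a definitive rule: the function name, its parameters, and the returned labels are assumptions made for illustration, and the 500-example threshold is taken from the third question.

```python
def recommend_approach(
    needs_new_facts: bool,
    needs_consistent_format: bool,
    system_prompt_reliable: bool,
    num_examples: int,
    high_volume: bool,
) -> str:
    """Return 'rag', 'modelfile', 'collect-data', or 'fine-tune' (hypothetical labels)."""
    # Question 1: missing facts are a retrieval problem, not a training problem
    if needs_new_facts:
        return "rag"
    # Question 2: a system prompt that already works ends the search
    if needs_consistent_format and system_prompt_reliable:
        return "modelfile"
    # Question 3: fine-tuning needs enough high-quality examples
    if num_examples < 500:
        return "collect-data"
    # Question 4: the training cost pays off mainly at high volume
    return "fine-tune" if high_volume else "modelfile"


print(recommend_approach(False, True, False, 800, True))
```

The example call describes a format problem that a system prompt does not solve reliably, with 800 examples and a high-volume deployment, so every question points toward fine-tuning.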

Hardware Requirements for Fine-Tuning

Fine-tuning requires more memory than inference. Parameter-efficient methods (LoRA, QLoRA) make it feasible on consumer hardware:

Method	Model Size	Min VRAM	Notes
QLoRA	7B	8 GB	Consumer GPU viable
QLoRA	13B	12 GB	RTX 3060 12GB
QLoRA	27B	20 GB	RTX 3090/4090
LoRA (16-bit)	7B	16 GB	Higher quality
LoRA (16-bit)	13B	24 GB	RTX 4090
Full fine-tuning	7B	80+ GB	Multi-GPU workstation

Recommendation for most users: QLoRA on a 7B–13B model on a gaming GPU with 8–12 GB of VRAM. Of the options in the table, it is the one most likely to fit in memory on consumer hardware.
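
A rough way to see where the table's numbers come from is to compute the memory the base weights alone occupy at each precision. The sketch below is back-of-the-envelope arithmetic, assuming 4 bits per weight for QLoRA and 16 bits per weight for unquantized LoRA. It ignores activations, adapter weights, optimizer state, and quantization overhead, which is why the table's minimums sit above these figures. The function name is illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory taken by the frozen base weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


for size in (7, 13, 27):
    print(f"{size}B: 4-bit ≈ {weight_memory_gb(size, 4):.1f} GB, "
          f"16-bit ≈ {weight_memory_gb(size, 16):.1f} GB")
```

For a 7B model this gives about 3.5 GB at 4-bit and 14 GB at 16-bit, which lines up with the 8 GB and 16 GB rows once training overhead is added.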

Apple Silicon: Unsloth has limited MLX support. For fine-tuning on Mac, use the CPU path (slower) or cloud GPU instances for training, then deploy the result locally.

The Fine-Tuning Pipeline

Your Data → Format Dataset → Train with Unsloth → 
Export to GGUF → Import to Ollama → Evaluate → Deploy

Step 1: Prepare Your Dataset

Dataset Format

Fine-tuning requires question-answer pairs (instruction format) or input-output examples:

dataset.jsonl (one example per line; JSONL has no comment syntax, so the file contains only the records):
{"instruction": "Classify this customer email as: billing, technical, general", "input": "My invoice shows the wrong amount for last month", "output": "billing"}
{"instruction": "Classify this customer email as: billing, technical, general", "input": "The app crashes when I try to upload files larger than 100MB", "output": "technical"}
{"instruction": "Classify this customer email as: billing, technical, general", "input": "What are your business hours?", "output": "general"}

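Or conversational format (wrapped here for readability; in a .jsonl file each record must sit on a single line). Note that the training script in Step 2 reads the instruction format shown first: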

{"messages": [
    {"role": "system", "content": "You are a helpful assistant for TechCorp."},
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "To reset your password, go to Settings → Security → Change Password. You will need your current password or access to your registered email address."}
]}
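
Malformed records tend to surface only after training has started, so it can help to check the file first. The following is a minimal sketch using only the standard library. The function names and the exact rules (required keys, allowed roles) are assumptions based on the two formats shown above, not a schema required by any particular library.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def classify_record(line: str) -> str:
    """Return 'instruction' or 'messages' for a valid JSONL line, else raise ValueError."""
    record = json.loads(line)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    if "messages" in record:
        messages = record["messages"]
        if not isinstance(messages, list) or not messages:
            raise ValueError("'messages' must be a non-empty list")
        for msg in messages:
            role = msg.get("role") if isinstance(msg, dict) else None
            if not isinstance(role, str) or role not in ALLOWED_ROLES or not msg.get("content"):
                raise ValueError(f"bad message: {msg!r}")
        return "messages"
    if record.get("instruction") and record.get("output"):
        return "instruction"
    raise ValueError("record matches neither format")


def validate_file(path: str) -> dict:
    """Count valid records per format and collect (line number, error) pairs."""
    counts = {"instruction": 0, "messages": 0}
    errors = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                counts[classify_record(line)] += 1
            except ValueError as exc:
                errors.append((number, str(exc)))
    return {"counts": counts, "errors": errors}
```

Calling validate_file("dataset.jsonl") before training returns the per-format counts along with the line numbers of any records that need fixing.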

Dataset Size Guidelines

Use Case	Minimum Examples	Target
Style/format consistency	100–200	500+
Domain terminology	300–500	1,000+
Task-specific fine-tuning	500–1,000	2,000+
Significant behavior change	1,000+	5,000+

Creating Datasets Efficiently

Use your existing local models to help generate training data:

# dataset_generator.py
import json
import re
from pathlib import Path

import ollama


def extract_json_array(text: str) -> list[dict]:
    """Find the outermost [...] span in a model reply and parse it as JSON."""
    json_match = re.search(r'\[.*\]', text, re.DOTALL)
    if not json_match:
        return []
    try:
        parsed = json.loads(json_match.group())
    except json.JSONDecodeError:
        return []
    # Keep only JSON objects; models sometimes emit stray strings or numbers
    return [item for item in parsed if isinstance(item, dict)]


def generate_training_pairs(
    seed_examples: list[dict],
    num_to_generate: int,
    model: str = "llama4:scout"
) -> list[dict]:
    """Generate more training examples from seed examples."""

    # Only the first five seeds are shown to the model to keep the prompt short
    seed_text = json.dumps(seed_examples[:5], indent=2)

    response = ollama.generate(
        model=model,
        prompt=f"""Generate {num_to_generate} more training examples in the same format and style as these examples.

IMPORTANT:
- Match the exact JSON format
- Vary the inputs significantly
- Ensure outputs follow the same patterns
- Do not repeat the seed examples

Seed examples:
{seed_text}

Generate {num_to_generate} new examples as a JSON array:""",
        options={"temperature": 0.8, "num_ctx": 8192, "num_predict": 4000}
    )

    return extract_json_array(response["response"])


def create_dataset(source_folder: str, output_file: str):
    """Create a fine-tuning dataset from documents in a folder."""

    docs = []
    for f in Path(source_folder).rglob("*.txt"):
        docs.append(f.read_text(encoding="utf-8", errors="ignore"))

    examples = []
    for doc in docs[:50]:  # Process first 50 documents
        # Generate QA pairs from the first 3,000 characters of each document
        response = ollama.generate(
            model="llama4:scout",
            prompt=f"""Create 5 question-answer pairs from this document.
Format as JSON array: [{{"instruction": "question", "output": "answer"}}]
Document: {doc[:3000]}""",
            options={"temperature": 0.3, "num_predict": 2000}
        )
        # Replies that cannot be parsed contribute no examples
        examples.extend(extract_json_array(response["response"]))

    # Save as JSONL
    with open(output_file, 'w', encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex) + '\n')

    print(f"Created {len(examples)} training examples in {output_file}")
    return examples
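
Model-generated examples often include near-verbatim repeats and incomplete records, and the script above writes whatever it manages to parse. One possible cleanup pass is sketched below. The function name, the normalization rule (lowercase, collapsed whitespace), and the choice to key on instruction plus input are assumptions; a real project may want stricter checks, including manual review.

```python
def clean_examples(examples: list[dict]) -> list[dict]:
    """Drop records without instruction/output and de-duplicate on normalized prompt text."""
    seen = set()
    cleaned = []
    for ex in examples:
        if not isinstance(ex, dict):
            continue
        instruction = str(ex.get("instruction", "")).strip()
        output = str(ex.get("output", "")).strip()
        if not instruction or not output:
            continue
        # Normalize case and whitespace so trivial variations count as duplicates
        key = " ".join(f"{instruction} {ex.get('input', '')}".lower().split())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(ex)
    return cleaned
```

Running the generated list through clean_examples before writing the JSONL file keeps the first occurrence of each prompt and discards the rest.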

Step 2: Fine-Tune With Unsloth

Unsloth is a fine-tuning library built around memory efficiency. The project reports 60–70% lower memory usage than standard implementations, which is what makes training on consumer GPUs practical.

Install Unsloth

# NVIDIA GPU (CUDA)
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes

# Verify CUDA is available
python -c "import torch; print(torch.cuda.is_available())"

Full Fine-Tuning Script

# fine_tune.py
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
import torch

# ============================================================
# CONFIGURATION — edit these
# ============================================================
BASE_MODEL = "unsloth/Llama-3.1-8B-Instruct"  # Or Qwen/Qwen2.5-7B-Instruct
OUTPUT_DIR = "./fine_tuned_model"
DATASET_FILE = "./dataset.jsonl"
MAX_SEQ_LENGTH = 4096
LOAD_IN_4BIT = True          # QLoRA — enables 8GB VRAM training
LORA_RANK = 16               # Higher = more parameters trained, more VRAM
NUM_TRAIN_EPOCHS = 3
BATCH_SIZE = 2               # Reduce if OOM
GRAD_ACCUMULATION = 4        # Effective batch = BATCH_SIZE × GRAD_ACCUMULATION
LEARNING_RATE = 2e-4
# ============================================================

print(f"Training on: {BASE_MODEL}")
print(f"Dataset: {DATASET_FILE}")
print(f"Output: {OUTPUT_DIR}")

# Load the base model with Unsloth optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=BASE_MODEL,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # Auto-detect
    load_in_4bit=LOAD_IN_4BIT,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_RANK,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=LORA_RANK,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
    use_rslora=False,
)

# Load dataset
dataset = load_dataset("json", data_files=DATASET_FILE, split="train")
print(f"Loaded {len(dataset)} training examples")

def format_instruction(example):
    """Format example as instruction-following prompt."""
    instruction = example.get("instruction", "")
    input_text = example.get("input", "")
    output = example.get("output", "")
    
    if input_text:
        text = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n{output}"
    else:
        text = f"### Instruction:\n{instruction}\n\n### Response:\n{output}"
    
    return {"text": text + tokenizer.eos_token}

dataset = dataset.map(format_instruction)

# Training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_TRAIN_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUMULATION,
    warmup_steps=5,
    learning_rate=LEARNING_RATE,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=10,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=42,
    save_strategy="epoch",
)

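# Train
# Note: tokenizer, dataset_text_field, and max_seq_length are passed directly here.
# Newer TRL releases move or rename some of these (for example into SFTConfig),
# so check the argument names against the TRL version you have installed.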
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    args=training_args,
)

print("Starting training...")
trainer.train()

# Save the trained LoRA adapters and tokenizer (the base weights are not copied here)
print("Saving model...")
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"Model saved to {OUTPUT_DIR}")
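
The configuration values interact: batch size and gradient accumulation set the effective batch, which together with dataset size and epochs determines how many optimizer steps the run performs. The sketch below estimates that count with plain arithmetic. The helper name is illustrative, and the trainer's own rounding may differ by a step or so when the dataset does not divide evenly.

```python
import math


def estimate_optimizer_steps(num_examples: int, batch_size: int,
                             grad_accumulation: int, epochs: int) -> tuple[int, int]:
    """Return (steps per epoch, total steps) for a single-GPU run."""
    effective_batch = batch_size * grad_accumulation
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch, steps_per_epoch * epochs


per_epoch, total = estimate_optimizer_steps(1000, batch_size=2, grad_accumulation=4, epochs=3)
print(f"{per_epoch} steps per epoch, {total} total")  # 125 steps per epoch, 375 total
```

With 375 total steps, warmup_steps=5 covers a little over 1% of training, and logging_steps=10 produces a loss reading about 37 times over the run.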

Monitor Training

# Watch GPU usage during training
watch -n 1 nvidia-smi

# Expected training time on RTX 4090:
# 1000 examples × 3 epochs: ~30-60 minutes
# 5000 examples × 3 epochs: ~3-6 hours

Step 3: Export to GGUF for Ollama

After training, convert the model to GGUF format for use with Ollama:

# export_gguf.py
from pathlib import Path

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./fine_tuned_model",
    max_seq_length=4096,
    dtype=None,
    load_in_4bit=True,
)

# Export to GGUF: choose a quantization
# q4_k_m: best balance (recommended)
# q8_0: higher quality, larger file
# f16: unquantized 16-bit weights, very large file
model.save_pretrained_gguf(
    "./gguf_model",
    tokenizer,
    quantization_method="q4_k_m"
)

# The output file name can vary between Unsloth versions, so list what was written
for gguf_file in Path("./gguf_model").glob("*.gguf"):
    print(f"GGUF export complete: {gguf_file}")

Step 4: Import Into Ollama

# Create a Modelfile for the fine-tuned model
# (adjust the FROM path if the export step wrote a different .gguf file name)
cat > FineTuned.Modelfile << 'EOF'
FROM ./gguf_model/model-q4_k_m.gguf

SYSTEM """[Your fine-tuned model's system prompt — 
match what was used during training]"""

PARAMETER temperature 0.3
PARAMETER num_ctx 4096
EOF

# Create the Ollama model
ollama create MyFineTunedModel -f FineTuned.Modelfile

# Test it
ollama run MyFineTunedModel "Test prompt here"
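
The training script formats every example with "### Instruction", "### Input", and "### Response" headers, so the model has most likely learned to answer prompts laid out that way. A small helper that rebuilds the same layout at inference time might look like the sketch below. The helper name is hypothetical, and whether Ollama wraps the prompt in an additional chat template depends on the TEMPLATE in effect for the imported model, so treat this as a starting point to test rather than a guaranteed match.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Mirror format_instruction() from fine_tune.py, leaving the response empty."""
    if input_text:
        return (f"### Instruction:\n{instruction}\n\n"
                f"### Input:\n{input_text}\n\n### Response:\n")
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


prompt = build_prompt(
    "Classify this customer email as: billing, technical, general",
    "My invoice shows the wrong amount for last month",
)
print(prompt)
```

The resulting string can be passed as the prompt to ollama run or to ollama.generate in place of a bare question.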

Step 5: Evaluate the Fine-Tuned Model

Never assume fine-tuning improved things — measure it.

# evaluate.py
import ollama

# Must match the instruction text used in the training data
INSTRUCTION = "Classify this customer email as: billing, technical, general"
CATEGORIES = ["billing", "technical", "general"]

# Your test cases: inputs + expected outputs (held out from training)
test_cases = [
    {
        "input": "My payment was charged twice",
        "expected_category": "billing",
    },
    {
        "input": "The login button doesn't work on Chrome",
        "expected_category": "technical",
    },
    # Add 20-50 test cases
]

def build_prompt(text: str) -> str:
    """Use the same layout as format_instruction() in fine_tune.py."""
    return f"### Instruction:\n{INSTRUCTION}\n\n### Input:\n{text}\n\n### Response:\n"

def evaluate_model(model_name: str, test_cases: list) -> dict:
    """Evaluate model accuracy on test cases."""
    correct = 0
    results = []
    
    for case in test_cases:
        response = ollama.generate(
            model=model_name,
            prompt=build_prompt(case["input"]),
            options={"temperature": 0, "num_predict": 50}
        )
        
        actual = response["response"].strip().lower()
        expected = case["expected_category"].lower()
        # Correct only if the expected label is the single category mentioned
        mentioned = [c for c in CATEGORIES if c in actual]
        is_correct = mentioned == [expected]
        
        if is_correct:
            correct += 1
        
        results.append({
            "input": case["input"],
            "expected": expected,
            "actual": actual,
            "correct": is_correct
        })
    
    # Guard against an empty test set
    accuracy = correct / len(test_cases) if test_cases else 0.0
    return {"accuracy": accuracy, "results": results}

# Compare the untuned base against the fine-tuned model.
# "llama3.1:8b" is assumed to be the Ollama tag matching BASE_MODEL in fine_tune.py.
print("Evaluating base model...")
base_results = evaluate_model("llama3.1:8b", test_cases)

print("Evaluating fine-tuned model...")
ft_results = evaluate_model("MyFineTunedModel", test_cases)

print(f"\nBase model accuracy: {base_results['accuracy']:.1%}")
print(f"Fine-tuned accuracy: {ft_results['accuracy']:.1%}")
print(f"Improvement: {(ft_results['accuracy'] - base_results['accuracy']):.1%}")

# Show failures for the fine-tuned model
failures = [r for r in ft_results["results"] if not r["correct"]]
if failures:
    print(f"\nFailed cases ({len(failures)}):")
    for f in failures[:5]:
        print(f"  Input: {f['input'][:60]}...")
        print(f"  Expected: {f['expected']} | Got: {f['actual'][:30]}")

Cloud GPU Options for Training

If you do not have a suitable GPU, cloud options are practical:

Provider	GPU	Hourly Cost	1000 examples (3 epochs)
RunPod	RTX 4090	~$0.74/hr	~$0.37
Lambda Labs	A10	~$0.76/hr	~$0.50
Google Colab Pro	T4	~$0.45/hr	~$0.45
Vast.ai	RTX 3090	~$0.30/hr	~$0.20

For a small fine-tuning job, cloud GPU costs are minimal — often under $1 for a basic training run.
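
The last column of the table is simply the hourly rate multiplied by the run time. The sketch below reproduces that arithmetic. The function name is illustrative, and the 30-minute duration is an assumption inferred from the RunPod row rather than a measured figure; setup time and minimum billing increments are not modeled.

```python
def training_cost(hourly_rate_usd: float, minutes: float) -> float:
    """Cost of a training run billed pro rata by the hour."""
    return hourly_rate_usd * minutes / 60


print(f"${training_cost(0.74, 30):.2f}")  # $0.37
```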

Common Fine-Tuning Mistakes

Mistake 1: Fine-tuning when a Modelfile would work. Fine-tuning is expensive and the result is static. If a system prompt produces the desired behavior 90% of the time, the marginal improvement from fine-tuning rarely justifies the cost.

Mistake 2: Low-quality training data. Fine-tuning learns exactly what is in your dataset, including errors and inconsistencies. In practice, 100 high-quality, consistent examples often outperform 1,000 noisy ones.

Mistake 3: Catastrophic forgetting. Fine-tuning on a narrow task can degrade performance on everything else. Test the fine-tuned model on tasks outside your training domain. If general capability drops significantly, reduce the number of epochs, the learning rate, or the LoRA rank.

Mistake 4: No evaluation baseline. Always measure the base model’s performance before fine-tuning. Without a baseline, you cannot know if fine-tuning helped.

Mistake 5: Training on your test set. Keep 10–20% of examples out of training for evaluation. If you test on training data, accuracy scores are meaningless.
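
For Mistake 5, a reproducible split made before training keeps the test set clean. The sketch below shuffles with a fixed seed and writes JSONL files using only the standard library. The function names, the file names in the usage note, and the 15% default are assumptions, with the default chosen to fall inside the 10–20% range mentioned above.

```python
import json
import random


def split_dataset(examples: list[dict], test_fraction: float = 0.15,
                  seed: int = 42) -> tuple[list[dict], list[dict]]:
    """Shuffle a copy of the examples and return (train, test)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    test_size = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


def write_jsonl(path: str, examples: list[dict]) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```

A typical use would be to call split_dataset on the full example list, write the two halves to train.jsonl and test.jsonl, point DATASET_FILE at the training file, and build the evaluation test cases only from the test file.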

Conclusion

Fine-tuning with Unsloth on consumer hardware is practical and accessible in 2026. A QLoRA run on an RTX 4090 with 1,000 examples typically trains in about an hour or less and, given clean data, can produce a model noticeably better at your specific task than the generic base.

The discipline is knowing when fine-tuning is the right tool. Modelfiles first. RAG for knowledge. Fine-tuning only when you have good data, a clear quality gap, and a high-volume use case where the improvement compounds.

Your next step: Identify one task where your current local model is inconsistent despite a well-crafted system prompt. Collect at least 200 good input-output pairs, holding 10–20% back as a test set. Run the fine-tuning script. Measure accuracy before and after on the held-out examples. That comparison will tell you whether fine-tuning earns its place in your workflow.

📚 Continue the Series:

← Previous AI Agents With Ollama

Next → The Future of Local AI: Where Ollama and Open Models Are Heading

Last updated: June 2026. Unsloth, transformers, and related libraries release updates frequently. Verify current installation instructions at github.com/unslothai/unsloth.

⚠️ LoRA training leaves the base weights untouched, but once the adapters are merged and exported to GGUF, the changes are baked into that file. Always keep a reference to the base model. Test thoroughly before deploying to production.

Frequently Asked Questions (FAQ)

Will fine-tuning a 7B model make it as good as a 70B model?

No. Fine-tuning improves task-specific performance but does not raise the model's general reasoning ability or breadth of knowledge. A fine-tuned 7B model outperforms an untuned 7B model on your specific task; it cannot match a 70B model's breadth.

Can I fine-tune on proprietary data?

Yes — local fine-tuning with Unsloth means your training data never leaves your machine. This is one of the strongest arguments for local fine-tuning versus cloud-based fine-tuning services.

How often should I retrain?

Retrain when your task requirements change significantly or when new base models become available that offer better starting points. For knowledge updates, use RAG instead — it is much cheaper to update a document store than to retrain a model.

What is the difference between LoRA rank 8 and rank 64?

Higher rank = more parameters trained = more capacity to adapt = more VRAM required. For simple style/format tasks, rank 8–16 is sufficient. For complex knowledge adaptation, rank 32–64 gives more capacity. Start with 16 and increase only if quality is insufficient.
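
To make the rank trade-off concrete: a LoRA adapter on a weight matrix with d_in inputs and d_out outputs adds two small matrices holding r × (d_in + d_out) parameters in total. The sketch below applies that formula to the seven target modules listed in fine_tune.py, assuming the published Llama 3.1 8B dimensions (hidden size 4096, MLP size 14336, 1024-wide key/value projections, 32 layers). Those dimensions and the helper name are assumptions supplied for illustration; other base models will give different totals.

```python
# (d_in, d_out) per target module, assuming Llama 3.1 8B dimensions
MODULE_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32


def lora_trainable_params(rank: int) -> int:
    """LoRA adds A (d_in x r) and B (r x d_out) for each adapted matrix."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in MODULE_SHAPES.values())
    return per_layer * NUM_LAYERS


for rank in (8, 16, 64):
    print(f"rank {rank}: {lora_trainable_params(rank):,} trainable parameters")
```

Under these assumptions rank 16 trains about 42 million parameters, roughly 0.5% of the 8B base, and the count scales linearly with rank, so rank 64 trains eight times as many parameters as rank 8.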