
Fine-Tuning With Ollama: Customizing Models on Your Own Hardware


A Modelfile changes how a model behaves through instructions. Fine-tuning changes what a model knows through training. These are fundamentally different interventions — and understanding the difference determines whether fine-tuning is the right tool for your situation.

Fine-tuning is worth doing when:

  • You have a large amount of domain-specific text that contains terminology, formats, or knowledge the base model lacks
  • You need consistent style or format that system prompts alone do not reliably produce
  • You are running the model in production at high volume where better accuracy has compounding value
  • You want to remove capabilities or add specific knowledge permanently

Fine-tuning is not worth doing when:

  • A well-written Modelfile system prompt achieves the behavior you need
  • You have fewer than a few hundred training examples
  • Your use case changes frequently (fine-tuned models are static)
  • You want to add current information (use RAG for this)

This guide covers the full fine-tuning pipeline: data preparation, training with Unsloth on consumer hardware, converting to GGUF for Ollama, and evaluation.

🔗 This is Post #19 in the Ollama Unlocked series. For prompt-based customization without training, see The Modelfile (Post #17). For adding document knowledge without training, see RAG with Ollama (Post #10).


Fine-Tuning vs. RAG vs. Modelfile: Choosing Correctly

QUESTION: Does the model need to know domain-specific facts
          that are not in its training data?
→ YES: Use RAG (add documents to context) — faster, cheaper, updateable
→ NO: Continue...

QUESTION: Does the model need to consistently produce a specific
          format, style, or tone?
→ YES, and system prompt works: Use Modelfile — done
→ YES, but system prompt is unreliable or inconsistent: Fine-tune
→ NO: Continue...

QUESTION: Do you have 500+ high-quality examples of 
          exactly the input-output behavior you want?
→ YES: Fine-tuning is viable
→ NO: Build more examples first, or use Modelfile

QUESTION: Will this model be used at high volume in production?
→ YES: Fine-tuning's quality improvement compounds — worth the investment
→ NO: Modelfile is probably sufficient
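
The four questions above can be read as an ordered series of checks. The sketch below is one possible encoding of them, not a rule the tooling enforces. The function name, the argument names, and the returned labels are all hypothetical, and it assumes that an unreliable system prompt still sends you on to the data and volume questions before committing to fine-tuning.

```python
def choose_approach(
    needs_new_facts: bool,
    needs_fixed_format: bool,
    system_prompt_reliable: bool,
    example_count: int,
    high_volume: bool,
) -> str:
    """Walk the four questions in order and return the first recommendation."""
    # Question 1: missing facts are a retrieval problem, not a training problem
    if needs_new_facts:
        return "rag"
    # Question 2: a system prompt that already works ends the search
    if needs_fixed_format and system_prompt_reliable:
        return "modelfile"
    # Question 3: without roughly 500 good examples, training is premature
    if example_count < 500:
        return "more-examples-or-modelfile"
    # Question 4: the investment pays off mainly at high volume
    return "fine-tune" if high_volume else "modelfile"


print(choose_approach(False, True, False, 800, True))
```

The order matters: RAG and the Modelfile are checked first because they are cheaper to try and easier to undo than a training run.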

Hardware Requirements for Fine-Tuning

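Fine-tuning requires more memory than inference. Parameter-efficient methods (LoRA, QLoRA) make it feasible on consumer hardware. The VRAM figures below are approximate minimums; actual usage also depends on sequence length, batch size, and LoRA rank:

| Method | Model Size | Min VRAM | Notes |
|---|---|---|---|
| QLoRA | 7B | 8 GB | Consumer GPU viable |
| QLoRA | 13B | 12 GB | RTX 3060 12GB |
| QLoRA | 27B | 20 GB | RTX 3090/4090 |
| LoRA (full precision) | 7B | 16 GB | Higher quality |
| LoRA (full precision) | 13B | 24 GB | RTX 4090 |
| Full fine-tuning | 7B | 80+ GB | Multi-GPU workstation |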

Recommendation for most users: run QLoRA on a 7B–13B model on a gaming GPU. It has the lowest VRAM requirement of the methods listed (8–12 GB), so it fits hardware many readers already own.
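
To see why 4-bit loading is what makes these sizes fit, it helps to estimate the memory taken by the frozen base weights alone. The sketch below is back-of-the-envelope arithmetic only: the helper name is hypothetical, and it ignores activations, LoRA adapters, optimizer state, and framework overhead, which account for the gap between these numbers and the minimums quoted above.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    """Memory for the base weights alone, in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# QLoRA keeps the base model in 4-bit, so each weight costs half a byte
for size in (7, 13, 27):
    print(f"{size}B at 4-bit: about {weight_memory_gb(size):.1f} GB of weights")
```

A 7B model therefore needs about 3.5 GB for its weights, which leaves headroom on an 8 GB card for everything the estimate leaves out.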

Apple Silicon: Unsloth has limited MLX support. For fine-tuning on Mac, use the CPU path (slower) or cloud GPU instances for training, then deploy the result locally.


The Fine-Tuning Pipeline

Your Data → Format Dataset → Train with Unsloth → 
Export to GGUF → Import to Ollama → Evaluate → Deploy

Step 1: Prepare Your Dataset

Dataset Format

Fine-tuning requires question-answer pairs (instruction format) or input-output examples:

dataset.jsonl (one example per line; JSONL has no comment syntax, so the file contains only the JSON objects):
{"instruction": "Classify this customer email as: billing, technical, general", "input": "My invoice shows the wrong amount for last month", "output": "billing"}
{"instruction": "Classify this customer email as: billing, technical, general", "input": "The app crashes when I try to upload files larger than 100MB", "output": "technical"}
{"instruction": "Classify this customer email as: billing, technical, general", "input": "What are your business hours?", "output": "general"}

Or conversational format (wrapped here for readability; in a .jsonl file each conversation must sit on a single line):

{"messages": [
    {"role": "system", "content": "You are a helpful assistant for TechCorp."},
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "To reset your password, go to Settings → Security → Change Password. You will need your current password or access to your registered email address."}
]}
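
Malformed lines are easier to catch before training than after a failed run. The following is a minimal sketch of a pre-flight check, assuming the two layouts shown above. The function names, the required-key set, and the error messages are illustrative choices, not part of Unsloth or Ollama.

```python
import json
from typing import Optional

REQUIRED_KEYS = {"instruction", "output"}  # "input" is optional


def check_line(line: str) -> Optional[str]:
    """Return None if the line is a usable example, else a short error message."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    if not isinstance(record, dict):
        return "not a JSON object"
    if "messages" in record:
        messages = record["messages"]
        if not isinstance(messages, list):
            return "messages is not a list"
        roles = {m.get("role") for m in messages if isinstance(m, dict)}
        if not {"user", "assistant"} <= roles:
            return "conversation needs a user turn and an assistant turn"
        return None
    missing = REQUIRED_KEYS - set(record)
    if missing:
        return f"missing keys: {sorted(missing)}"
    return None


def check_file(path: str) -> list[tuple[int, str]]:
    """Collect (line_number, error) pairs for every bad line in a JSONL file."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if line.strip():
                error = check_line(line)
                if error:
                    problems.append((number, error))
    return problems
```

Calling check_file("dataset.jsonl") returns an empty list when every line parses and has the expected fields.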

Dataset Size Guidelines

| Use Case | Minimum Examples | Target |
|---|---|---|
| Style/format consistency | 100–200 | 500+ |
| Domain terminology | 300–500 | 1,000+ |
| Task-specific fine-tuning | 500–1,000 | 2,000+ |
| Significant behavior change | 1,000+ | 5,000+ |

Creating Datasets Efficiently

Use your existing local models to help generate training data:

# dataset_generator.py
import json
import re
from pathlib import Path

import ollama

DEFAULT_MODEL = "llama4:scout"


def extract_json_array(text: str) -> list[dict]:
    """Parse the outermost [...] span of a model reply; return [] on failure."""
    json_match = re.search(r'\[.*\]', text, re.DOTALL)
    if not json_match:
        return []
    try:
        parsed = json.loads(json_match.group())
    except json.JSONDecodeError:
        return []
    # Keep only JSON objects; models occasionally emit stray strings or numbers
    return [item for item in parsed if isinstance(item, dict)]


def generate_training_pairs(
    seed_examples: list[dict],
    num_to_generate: int,
    model: str = DEFAULT_MODEL,
) -> list[dict]:
    """Generate more training examples from seed examples."""

    # Only the first five seeds are shown to the model to keep the prompt short
    seed_text = json.dumps(seed_examples[:5], indent=2)

    response = ollama.generate(
        model=model,
        prompt=f"""Generate {num_to_generate} more training examples in the same format and style as these examples.

IMPORTANT:
- Match the exact JSON format
- Vary the inputs significantly
- Ensure outputs follow the same patterns
- Do not repeat the seed examples

Seed examples:
{seed_text}

Generate {num_to_generate} new examples as a JSON array:""",
        options={"temperature": 0.8, "num_ctx": 8192, "num_predict": 4000}
    )

    return extract_json_array(response["response"])


def create_dataset(source_folder: str, output_file: str, model: str = DEFAULT_MODEL):
    """Create a fine-tuning dataset from documents in a folder."""

    docs = []
    for f in Path(source_folder).rglob("*.txt"):
        docs.append(f.read_text(encoding="utf-8", errors="ignore"))

    examples = []
    for doc in docs[:50]:  # Process first 50 documents
        # Generate QA pairs from the first 3,000 characters of each document
        response = ollama.generate(
            model=model,
            prompt=f"""Create 5 question-answer pairs from this document.
Format as JSON array: [{{"instruction": "question", "output": "answer"}}]
Document: {doc[:3000]}""",
            options={"temperature": 0.3, "num_predict": 2000}
        )
        # Replies that do not parse contribute nothing and are skipped
        examples.extend(extract_json_array(response["response"]))

    # Save as JSONL, one JSON object per line
    with open(output_file, 'w', encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + '\n')

    print(f"Created {len(examples)} training examples in {output_file}")
    return examples

Step 2: Fine-Tune With Unsloth

Unsloth is among the most memory-efficient libraries for local fine-tuning. It reduces memory usage by roughly 60–70% compared to standard implementations (the exact saving depends on the model and settings), which is what makes training on consumer GPUs practical.

Install Unsloth

# NVIDIA GPU (CUDA)
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes

# Verify CUDA is available
python -c "import torch; print(torch.cuda.is_available())"

Full Fine-Tuning Script

# fine_tune.py
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
import torch

# ============================================================
# CONFIGURATION — edit these
# ============================================================
BASE_MODEL = "unsloth/Llama-3.1-8B-Instruct"  # Or Qwen/Qwen2.5-7B-Instruct
OUTPUT_DIR = "./fine_tuned_model"
DATASET_FILE = "./dataset.jsonl"
MAX_SEQ_LENGTH = 4096
LOAD_IN_4BIT = True          # QLoRA — enables 8GB VRAM training
LORA_RANK = 16               # Higher = more parameters trained, more VRAM
NUM_TRAIN_EPOCHS = 3
BATCH_SIZE = 2               # Reduce if OOM
GRAD_ACCUMULATION = 4        # Effective batch = BATCH_SIZE × GRAD_ACCUMULATION
LEARNING_RATE = 2e-4
# ============================================================

print(f"Training on: {BASE_MODEL}")
print(f"Dataset: {DATASET_FILE}")
print(f"Output: {OUTPUT_DIR}")

# Load the base model with Unsloth optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=BASE_MODEL,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # Auto-detect
    load_in_4bit=LOAD_IN_4BIT,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_RANK,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=LORA_RANK,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
    use_rslora=False,
)

# Load dataset
dataset = load_dataset("json", data_files=DATASET_FILE, split="train")
print(f"Loaded {len(dataset)} training examples")

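def format_instruction(example):
    """Format one example as an Alpaca-style instruction prompt.

    Use the same layout at inference time; otherwise the model sees prompts
    that look different from its training data.
    """
    # "or" also covers fields that the loader filled with None
    instruction = example.get("instruction") or ""
    input_text = example.get("input") or ""
    output = example.get("output") or ""
    
    if input_text:
        text = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n{output}"
    else:
        text = f"### Instruction:\n{instruction}\n\n### Response:\n{output}"
    
    # The EOS token teaches the model where a response ends
    return {"text": text + tokenizer.eos_token}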

dataset = dataset.map(format_instruction)

# Training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_TRAIN_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUMULATION,
    warmup_steps=5,
    learning_rate=LEARNING_RATE,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=10,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=42,
    save_strategy="epoch",
)

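# Train
# Note: these keyword arguments follow the older TRL API. Recent TRL releases
# expect dataset_text_field and the sequence-length setting on SFTConfig and
# rename tokenizer to processing_class, so adjust if your version rejects them.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    args=training_args,
)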

print("Starting training...")
trainer.train()

# Save the fine-tuned model
print("Saving model...")
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"Model saved to {OUTPUT_DIR}")
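
The batch settings in the configuration block determine how many optimizer updates a run performs, which is useful for judging whether warmup_steps=5 and logging_steps=10 make sense for your dataset. The sketch below works through that arithmetic. The helper name is hypothetical, and the count is approximate because the trainer's handling of a final partial batch can differ by version.

```python
import math


def optimizer_steps(num_examples: int, batch_size: int,
                    grad_accumulation: int, epochs: int) -> tuple[int, int]:
    """Return (effective_batch_size, approximate_total_optimizer_steps)."""
    effective_batch = batch_size * grad_accumulation
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# Script defaults: BATCH_SIZE=2, GRAD_ACCUMULATION=4, NUM_TRAIN_EPOCHS=3
effective, total = optimizer_steps(1000, 2, 4, 3)
print(f"Effective batch: {effective}, total steps: {total}")
```

With 1,000 examples this gives an effective batch of 8 and 375 updates, so the 5 warmup steps cover a little over 1% of training. With only 100 examples the run has about 39 updates, and logging every 10 steps yields just a handful of loss readings.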

Monitor Training

# Watch GPU usage during training
watch -n 1 nvidia-smi

# Expected training time on RTX 4090:
# 1000 examples × 3 epochs: ~30-60 minutes
# 5000 examples × 3 epochs: ~3-6 hours

Step 3: Export to GGUF for Ollama

After training, convert the model to GGUF format for use with Ollama:

# export_gguf.py
from pathlib import Path

from unsloth import FastLanguageModel

GGUF_DIR = "./gguf_model"

# Reload the saved adapter together with its base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./fine_tuned_model",
    max_seq_length=4096,
    dtype=None,
    load_in_4bit=True,
)

# Export to GGUF: choose a quantization
# q4_k_m: best balance (recommended)
# q8_0: higher quality, larger file
# f16: full precision, very large file
model.save_pretrained_gguf(
    GGUF_DIR,
    tokenizer,
    quantization_method="q4_k_m"
)

# The output filename can differ between Unsloth versions, so list what was written
for gguf_file in Path(GGUF_DIR).glob("*.gguf"):
    print(f"GGUF export complete: {gguf_file}")

Step 4: Import Into Ollama

# Create a Modelfile for the fine-tuned model
# (point FROM at the .gguf file the export step actually wrote; the name below is an example)
cat > FineTuned.Modelfile << 'EOF'
FROM ./gguf_model/model-q4_k_m.gguf

SYSTEM """[Your fine-tuned model's system prompt — 
match what was used during training]"""

PARAMETER temperature 0.3
PARAMETER num_ctx 4096
EOF

# Create the Ollama model
ollama create MyFineTunedModel -f FineTuned.Modelfile

# Test it
ollama run MyFineTunedModel "Test prompt here"
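
The training script formats every example with "### Instruction:" and "### Response:" headers, so the fine-tuned model is most likely to behave as trained when it receives prompts in that same layout. The sketch below rebuilds the layout and sends it through the Ollama Python client with raw=True, which bypasses the model's built-in prompt template. The helper names and the stop sequence are assumptions for illustration; the model name is the one created above.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Rebuild the training-time layout, stopping where the model should answer."""
    if input_text:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


def ask_finetuned(instruction: str, input_text: str = "",
                  model: str = "MyFineTunedModel") -> str:
    """Query the fine-tuned model with a prompt shaped like its training data."""
    # Imported here so build_prompt stays usable without the ollama package
    import ollama

    response = ollama.generate(
        model=model,
        prompt=build_prompt(instruction, input_text),
        raw=True,  # send the prompt as-is, without applying a chat template
        options={
            "temperature": 0,
            "num_predict": 50,
            "stop": ["### Instruction:"],  # assumed guard against run-on output
        },
    )
    return response["response"].strip()
```

For the classification dataset, ask_finetuned("Classify this customer email as: billing, technical, general", "My invoice shows the wrong amount") would mirror one training example minus its answer.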

Step 5: Evaluate the Fine-Tuned Model

Never assume fine-tuning improved things — measure it.

# evaluate.py
import ollama
import json

# Your test cases — inputs + expected outputs
test_cases = [
    {
        "input": "My payment was charged twice",
        "expected_category": "billing",
    },
    {
        "input": "The login button doesn't work on Chrome",
        "expected_category": "technical",
    },
    # Add 20-50 test cases
]

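# Must match the instruction used in the training data
INSTRUCTION = "Classify this customer email as: billing, technical, general"

def evaluate_model(model_name: str, test_cases: list) -> dict:
    """Evaluate model accuracy on test cases."""
    if not test_cases:
        raise ValueError("test_cases is empty; nothing to evaluate")

    correct = 0
    results = []
    
    for case in test_cases:
        response = ollama.generate(
            model=model_name,
            # Send the task instruction with the email, as in the training data
            prompt=f"{INSTRUCTION}\n\n{case['input']}",
            options={"temperature": 0, "num_predict": 50}
        )
        
        actual = response["response"].strip().lower()
        expected = case["expected_category"].lower()
        # Lenient check: the expected label appears anywhere in the reply
        is_correct = expected in actual
        
        if is_correct:
            correct += 1
        
        results.append({
            "input": case["input"],
            "expected": expected,
            "actual": actual,
            "correct": is_correct
        })
    
    accuracy = correct / len(test_cases)
    return {"accuracy": accuracy, "results": results}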

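# Compare base model vs fine-tuned.
# Use the Ollama build of the model you fine-tuned from (llama3.1:8b matches
# BASE_MODEL in fine_tune.py); a different baseline makes the comparison unfair.
print("Evaluating base model...")
base_results = evaluate_model("llama3.1:8b", test_cases)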

print("Evaluating fine-tuned model...")
ft_results = evaluate_model("MyFineTunedModel", test_cases)

print(f"\nBase model accuracy: {base_results['accuracy']:.1%}")
print(f"Fine-tuned accuracy: {ft_results['accuracy']:.1%}")
print(f"Improvement: {(ft_results['accuracy'] - base_results['accuracy']):.1%}")

# Show failures for the fine-tuned model
failures = [r for r in ft_results["results"] if not r["correct"]]
if failures:
    print(f"\nFailed cases ({len(failures)}):")
    for f in failures[:5]:
        print(f"  Input: {f['input'][:60]}...")
        print(f"  Expected: {f['expected']} | Got: {f['actual'][:30]}")
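
The evaluation script counts a case as correct when the expected label appears anywhere in the reply. That is lenient: a reply that simply repeats all three category names would pass every test. A stricter scorer, sketched below, accepts a reply only when exactly one known label appears as a whole word. The function name and the label tuple are assumptions based on the classification example used in this post.

```python
import re

LABELS = ("billing", "technical", "general")


def extract_label(reply: str, labels: tuple = LABELS):
    """Return the single label found in the reply, or None if zero or several appear."""
    text = reply.lower()
    found = {
        label for label in labels
        if re.search(rf"\b{re.escape(label)}\b", text)
    }
    return found.pop() if len(found) == 1 else None


print(extract_label("Billing."))
print(extract_label("This could be billing, technical, or general."))
```

Inside evaluate_model, comparing extract_label(actual) to the expected category would replace the substring test. Ambiguous replies then count as failures, which usually gives a more honest accuracy figure for both models.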

Cloud GPU Options for Training

If you do not have a suitable GPU, cloud options are practical:

| Provider | GPU | Hourly Cost | Est. cost for 1,000 examples (3 epochs) |
|---|---|---|---|
| RunPod | RTX 4090 | ~$0.74/hr | ~$0.37 |
| Lambda Labs | A10 | ~$0.76/hr | ~$0.50 |
| Google Colab Pro | T4 | ~$0.45/hr | ~$0.45 |
| Vast.ai | RTX 3090 | ~$0.30/hr | ~$0.20 |

For a small fine-tuning job, cloud GPU costs are minimal — often under $1 for a basic training run.
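
The per-run costs above are simply the hourly rate multiplied by the training time. As a rough sanity check, the sketch below backs out the run time each figure implies. The helper name is hypothetical, the rates and costs are copied from the table, and the result excludes setup time, dataset upload, and the GGUF export, all of which are billed too.

```python
def implied_hours(hourly_rate_usd: float, run_cost_usd: float) -> float:
    """Training time implied by a quoted per-run cost at a given hourly rate."""
    return run_cost_usd / hourly_rate_usd


# (provider, hourly rate, quoted cost for 1,000 examples over 3 epochs)
quotes = [
    ("RunPod RTX 4090", 0.74, 0.37),
    ("Lambda Labs A10", 0.76, 0.50),
    ("Google Colab Pro T4", 0.45, 0.45),
    ("Vast.ai RTX 3090", 0.30, 0.20),
]

for provider, rate, cost in quotes:
    print(f"{provider}: about {implied_hours(rate, cost):.2f} h")
```

The implied times fall between half an hour and one hour, so budgeting for a full billed hour per run is a reasonable starting point.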


Common Fine-Tuning Mistakes

Mistake 1: Fine-tuning when a Modelfile would work. Fine-tuning is expensive and the result is static. If a system prompt produces the desired behavior 90% of the time, the marginal improvement from fine-tuning rarely justifies the cost.

Mistake 2: Low-quality training data. Fine-tuning learns exactly what is in your dataset, including errors and inconsistencies. 100 high-quality, consistent examples often outperform 1,000 noisy ones.

Mistake 3: Catastrophic forgetting. Fine-tuning on a narrow task can degrade performance on everything else. Test the fine-tuned model on tasks outside your training domain. If general capability drops significantly, reduce the number of epochs or lower the LoRA rank.

Mistake 4: No evaluation baseline. Always measure the base model’s performance before fine-tuning. Without a baseline, you cannot know if fine-tuning helped.

Mistake 5: Training on your test set. Keep 10–20% of examples out of training for evaluation. If you test on training data, accuracy scores are meaningless.
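
One way to avoid Mistake 5 is to split the dataset once, before any training, and keep the held-out file away from the training script. The sketch below does this with a seeded shuffle so the split is reproducible. The function names, the 15% default, and the file paths are illustrative assumptions.

```python
import random


def split_examples(examples, test_fraction: float = 0.15, seed: int = 42):
    """Shuffle deterministically and return (train, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Always hold out at least one example
    test_size = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


def split_jsonl(source: str, train_path: str, test_path: str,
                test_fraction: float = 0.15, seed: int = 42):
    """Split a JSONL file into train and test files; return their sizes."""
    with open(source, encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
    train, test = split_examples(lines, test_fraction, seed)
    for path, subset in ((train_path, train), (test_path, test)):
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(subset) + "\n")
    return len(train), len(test)
```

After running split_jsonl("dataset.jsonl", "train.jsonl", "test.jsonl"), point DATASET_FILE at train.jsonl and build the evaluation test cases from test.jsonl only.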


Conclusion

Fine-tuning with Unsloth on consumer hardware is practical and accessible in 2026. A QLoRA run on an RTX 4090 with 1,000 examples trains in under an hour and produces a model noticeably better at your specific task than the generic base.

The discipline is knowing when fine-tuning is the right tool. Modelfiles first. RAG for knowledge. Fine-tuning only when you have good data, a clear quality gap, and a high-volume use case where the improvement compounds.

Your next step: Identify one task where your current local model is inconsistent despite a well-crafted system prompt. Collect 200 examples of good input-output pairs. Run the fine-tuning script. Measure accuracy before and after. That comparison will tell you whether fine-tuning earns its place in your workflow.




Last updated: June 2026. Unsloth, transformers, and related libraries release updates frequently. Verify current installation instructions at github.com/unslothai/unsloth.

⚠️ Fine-tuning modifies model weights permanently. Always keep a reference to the base model. Test thoroughly before deploying to production.

Frequently Asked Questions (FAQ)

Will fine-tuning a 7B model make it as good as a 70B model?
No. Fine-tuning improves task-specific performance but does not increase the model's fundamental capability or knowledge. A fine-tuned 7B model outperforms an untuned 7B model on your specific task; it cannot match a 70B model's breadth.
Can I fine-tune on proprietary data?
Yes — local fine-tuning with Unsloth means your training data never leaves your machine. This is one of the strongest arguments for local fine-tuning versus cloud-based fine-tuning services.
How often should I retrain?
Retrain when your task requirements change significantly or when new base models become available that offer better starting points. For knowledge updates, use RAG instead — it is much cheaper to update a document store than to retrain a model.
What is the difference between LoRA rank 8 and rank 64?
Higher rank = more parameters trained = more capacity to adapt = more VRAM required. For simple style/format tasks, rank 8–16 is sufficient. For complex knowledge adaptation, rank 32–64 gives more capacity. Start with 16 and increase only if quality is insufficient.
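
To make the rank trade-off concrete: each adapted projection gains two small matrices, which adds rank × (input width + output width) trainable parameters. The sketch below counts them for the seven target modules used in the training script. The layer shapes and layer count are those of Llama-3.1-8B and are an assumption here; other base models need their own values, and the helper name is hypothetical.

```python
# (in_features, out_features) of each adapted projection in Llama-3.1-8B
LLAMA_31_8B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32


def lora_trainable_params(rank: int, shapes: dict = LLAMA_31_8B_SHAPES,
                          num_layers: int = NUM_LAYERS) -> int:
    """Count LoRA parameters: one (d_in x r) and one (r x d_out) matrix per module."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


for rank in (8, 16, 64):
    print(f"rank {rank}: {lora_trainable_params(rank):,} trainable parameters")
```

Under these assumptions rank 16 trains about 42 million parameters, roughly half a percent of the 8B base, and the count scales linearly with rank, so rank 64 trains four times as many.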

Disclaimer: The information contained on this blog is for academic and educational purposes only. Unauthorized use and/or duplication of this material without express and written permission from this site's author and/or owner is strictly prohibited. The materials (images, logos, content) contained in this web site are protected by applicable copyright and trademark law.