A Modelfile changes how a model behaves through instructions. Fine-tuning changes what a model knows through training. These are fundamentally different interventions — and understanding the difference determines whether fine-tuning is the right tool for your situation.
Fine-tuning is worth doing when:
- You have a large amount of domain-specific text that contains terminology, formats, or knowledge the base model lacks
- You need consistent style or format that system prompts alone do not reliably produce
- You are running the model in production at high volume where better accuracy has compounding value
- You want to remove capabilities or add specific knowledge permanently
Fine-tuning is not worth doing when:
- A well-written Modelfile system prompt achieves the behavior you need
- You have fewer than a few hundred training examples
- Your use case changes frequently (fine-tuned models are static)
- You want to add current information (use RAG for this)
This guide covers the full fine-tuning pipeline: data preparation, training with Unsloth on consumer hardware, converting to GGUF for Ollama, and evaluation.
🔗 This is Post #19 in the Ollama Unlocked series. For prompt-based customization without training, see The Modelfile (Post #17). For adding document knowledge without training, see RAG with Ollama (Post #10).
Fine-Tuning vs. RAG vs. Modelfile: Choosing Correctly
QUESTION: Does the model need to know domain-specific facts
that are not in its training data?
→ YES: Use RAG (add documents to context) — faster, cheaper, updateable
→ NO: Continue...
QUESTION: Does the model need to consistently produce a specific
format, style, or tone?
→ YES, and system prompt works: Use Modelfile — done
→ YES, but system prompt is unreliable or inconsistent: Fine-tune
→ NO: Continue...
QUESTION: Do you have 500+ high-quality examples of
exactly the input-output behavior you want?
→ YES: Fine-tuning is viable
→ NO: Build more examples first, or use Modelfile
QUESTION: Will this model be used at high volume in production?
→ YES: Fine-tuning's quality improvement compounds — worth the investment
→ NO: Modelfile is probably sufficient
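The four questions above can be read as a small function. The sketch below is one interpretation, not a definitive rule: it treats the example count and production volume as gates that a fine-tuning candidate must still pass, and the function and parameter names are hypothetical.

```python
def choose_approach(
    needs_new_facts: bool,
    needs_consistent_format: bool,
    system_prompt_reliable: bool,
    num_examples: int,
    high_volume: bool,
) -> str:
    """Walk the four questions in order and return a recommendation."""
    # Question 1: missing facts are a retrieval problem, not a training problem.
    if needs_new_facts:
        return "rag"
    # Question 2: a system prompt that already works ends the search.
    if needs_consistent_format and system_prompt_reliable:
        return "modelfile"
    # Question 3: below roughly 500 good examples, fine-tuning is not yet viable.
    if num_examples < 500:
        return "modelfile (or collect more examples first)"
    # Question 4: the training investment pays off mainly at high volume.
    if high_volume:
        return "fine-tune"
    return "modelfile"


# An unreliable system prompt, 1,000 examples, and production traffic.
print(choose_approach(False, True, False, 1000, True))
```

Writing the tree down this way makes the ordering explicit: RAG and the Modelfile are checked before fine-tuning is even considered.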
Hardware Requirements for Fine-Tuning
Fine-tuning requires more memory than inference. Parameter-efficient methods (LoRA, QLoRA) make it feasible on consumer hardware:
| Method | Model Size | Min VRAM | Notes |
|---|---|---|---|
| QLoRA | 7B | 8 GB | Consumer GPU viable |
| QLoRA | 13B | 12 GB | RTX 3060 12GB |
| QLoRA | 27B | 20 GB | RTX 3090/4090 |
| LoRA (full precision) | 7B | 16 GB | Higher quality |
| LoRA (full precision) | 13B | 24 GB | RTX 4090 |
| Full fine-tuning | 7B | 80+ GB | Multi-GPU workstation |
Recommendation for most users: QLoRA on a 7B–13B model on a gaming GPU with 8–12 GB of VRAM. This is the most practical fine-tuning setup on consumer hardware.
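A rough back-of-the-envelope calculation shows where the gap between the rows comes from. The sketch below counts only the memory needed to hold the weights, and uses the common rule of thumb of about 16 bytes per parameter for full fine-tuning with an Adam-style optimizer in mixed precision. Both function names are hypothetical, and activations, LoRA adapters, and framework overhead are deliberately ignored, so real usage will be higher.

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Memory for the model weights alone, in decimal gigabytes."""
    return params_billions * bits_per_param / 8


def full_finetune_memory_gb(params_billions: float, bytes_per_param: float = 16.0) -> float:
    """Rule-of-thumb estimate: weights, gradients, and optimizer state together."""
    return params_billions * bytes_per_param


qlora_7b = weight_memory_gb(7, 4)        # 4-bit quantized base model
lora_7b = weight_memory_gb(7, 16)        # 16-bit base model
full_7b = full_finetune_memory_gb(7)     # every parameter is trained

print(f"7B weights at 4-bit:  {qlora_7b:.1f} GB")
print(f"7B weights at 16-bit: {lora_7b:.1f} GB")
print(f"7B full fine-tuning:  {full_7b:.0f} GB (rule of thumb)")
```

For a 7B model this gives 3.5 GB, 14.0 GB, and 112 GB, which lines up with the 8 GB, 16 GB, and 80+ GB rows once the remaining overhead is taken into account.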
Apple Silicon: Unsloth has limited MLX support. For fine-tuning on Mac, use the CPU path (slower) or cloud GPU instances for training, then deploy the result locally.
The Fine-Tuning Pipeline
Your Data → Format Dataset → Train with Unsloth →
Export to GGUF → Import to Ollama → Evaluate → Deploy
Step 1: Prepare Your Dataset
Dataset Format
Fine-tuning requires question-answer pairs (instruction format) or input-output examples:
dataset.jsonl (one JSON object per line; JSONL has no comment syntax, so the file contains only the examples):
{"instruction": "Classify this customer email as: billing, technical, general", "input": "My invoice shows the wrong amount for last month", "output": "billing"}
{"instruction": "Classify this customer email as: billing, technical, general", "input": "The app crashes when I try to upload files larger than 100MB", "output": "technical"}
{"instruction": "Classify this customer email as: billing, technical, general", "input": "What are your business hours?", "output": "general"}
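Or conversational format (shown across several lines for readability; in a JSONL file each example must sit on a single line):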
{"messages": [
{"role": "system", "content": "You are a helpful assistant for TechCorp."},
{"role": "user", "content": "How do I reset my password?"},
{"role": "assistant", "content": "To reset your password, go to Settings → Security → Change Password. You will need your current password or access to your registered email address."}
]}
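Because a single malformed line can derail a training run, it may help to validate the file before training. The following is a minimal sketch, assuming the two layouts shown above; `check_example` and `validate_jsonl` are hypothetical helper names, and the rules (non-empty `instruction` and `output`, last message from the assistant) are reasonable defaults rather than requirements of any library.

```python
import json
from typing import Optional


def check_example(record: dict) -> Optional[str]:
    """Return a problem description, or None if the record looks usable."""
    if "messages" in record:
        messages = record["messages"]
        if not isinstance(messages, list) or not messages:
            return "'messages' must be a non-empty list"
        for msg in messages:
            if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
                return "each message needs 'role' and 'content'"
        if messages[-1]["role"] != "assistant":
            return "last message should be the assistant reply"
        return None
    # Instruction format: 'input' is optional, the other two fields are not.
    for key in ("instruction", "output"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            return f"missing or empty '{key}'"
    return None


def validate_jsonl(lines: list[str]) -> list[tuple[int, str]]:
    """Check every non-blank line; return (line_number, problem) pairs."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "expected a JSON object"))
            continue
        problem = check_example(record)
        if problem:
            problems.append((number, problem))
    return problems
```

To check a real file, pass `Path("dataset.jsonl").read_text(encoding="utf-8").splitlines()` to `validate_jsonl` and fix every reported line before training.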
Dataset Size Guidelines
| Use Case | Minimum Examples | Target |
|---|---|---|
| Style/format consistency | 100–200 | 500+ |
| Domain terminology | 300–500 | 1,000+ |
| Task-specific fine-tuning | 500–1,000 | 2,000+ |
| Significant behavior change | 1,000+ | 5,000+ |
Creating Datasets Efficiently
Use your existing local models to help generate training data:
# dataset_generator.py
import json
import re
from pathlib import Path

import ollama


def generate_training_pairs(
    seed_examples: list[dict],
    num_to_generate: int,
    model: str = "llama4:scout"
) -> list[dict]:
    """Generate more training examples from seed examples."""
    seed_text = json.dumps(seed_examples[:5], indent=2)
    response = ollama.generate(
        model=model,
        prompt=f"""Generate {num_to_generate} more training examples in the same format and style as these examples.
IMPORTANT:
- Match the exact JSON format
- Vary the inputs significantly
- Ensure outputs follow the same patterns
- Do not repeat the seed examples
Seed examples:
{seed_text}
Generate {num_to_generate} new examples as a JSON array:""",
        options={"temperature": 0.8, "num_ctx": 8192, "num_predict": 4000}
    )
    # Models often wrap JSON in prose, so extract the outermost [...] span
    json_match = re.search(r'\[.*\]', response["response"], re.DOTALL)
    if json_match:
        try:
            return json.loads(json_match.group())
        except json.JSONDecodeError:
            return []
    return []


def create_dataset(source_folder: str, output_file: str):
    """Create a fine-tuning dataset from documents in a folder."""
    docs = []
    for f in Path(source_folder).rglob("*.txt"):
        docs.append(f.read_text(encoding="utf-8", errors="ignore"))
    examples = []
    for doc in docs[:50]:  # Process first 50 documents
        # Generate QA pairs from each document (first 3,000 characters only)
        response = ollama.generate(
            model="llama4:scout",
            prompt=f"""Create 5 question-answer pairs from this document.
Format as JSON array: [{{"instruction": "question", "output": "answer"}}]
Document: {doc[:3000]}""",
            options={"temperature": 0.3, "num_predict": 2000}
        )
        json_match = re.search(r'\[.*\]', response["response"], re.DOTALL)
        if json_match:
            try:
                pairs = json.loads(json_match.group())
                examples.extend(pairs)
            except json.JSONDecodeError:
                continue  # Skip documents whose reply was not valid JSON
    # Save as JSONL: one JSON object per line
    with open(output_file, 'w', encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex) + '\n')
    print(f"Created {len(examples)} training examples in {output_file}")
    return examples
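Generated examples tend to include near-duplicates, copies of the seeds, and occasional empty outputs, so a cleanup pass before training is usually worthwhile. The sketch below is one simple approach, assuming instruction-format records; `normalize` and `dedupe_examples` are hypothetical helper names, and matching on lowercased, whitespace-collapsed text is a deliberately crude notion of "duplicate".

```python
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(str(text).lower().split())


def example_key(example: dict) -> tuple[str, str]:
    """Identify an example by its instruction and input, ignoring the output."""
    return (normalize(example.get("instruction", "")),
            normalize(example.get("input", "")))


def dedupe_examples(examples: list, seeds: Optional[list[dict]] = None) -> list[dict]:
    """Drop malformed items, empty outputs, repeats, and copies of the seeds."""
    seen = {example_key(seed) for seed in (seeds or [])}
    kept = []
    for example in examples:
        if not isinstance(example, dict):
            continue  # the model returned something other than an object
        if not str(example.get("output", "")).strip():
            continue  # nothing to learn from an empty answer
        key = example_key(example)
        if key in seen:
            continue
        seen.add(key)
        kept.append(example)
    return kept


generated = [
    {"instruction": "Classify", "input": "My invoice is wrong", "output": "billing"},
    {"instruction": "classify", "input": "my  invoice is wrong ", "output": "billing"},
    {"instruction": "Classify", "input": "App crashes", "output": ""},
    {"instruction": "Classify", "input": "Opening hours?", "output": "general"},
    "not a dict",
]
seeds = [{"instruction": "Classify", "input": "Opening hours?", "output": "general"}]
cleaned = dedupe_examples(generated, seeds)
print(len(cleaned))
```

Automated filtering does not replace reading a sample of the generated data by hand; it only removes the most obvious problems.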
Step 2: Fine-Tune With Unsloth
Unsloth is a fine-tuning library built for efficiency on local hardware. It reduces memory usage by 60–70% compared to standard implementations, which makes training on consumer GPUs practical.
Install Unsloth
# NVIDIA GPU (CUDA)
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
# Verify CUDA is available
python -c "import torch; print(torch.cuda.is_available())"
Full Fine-Tuning Script
# fine_tune.py
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
import torch

# ============================================================
# CONFIGURATION: edit these
# ============================================================
BASE_MODEL = "unsloth/Llama-3.1-8B-Instruct"  # Or Qwen/Qwen2.5-7B-Instruct
OUTPUT_DIR = "./fine_tuned_model"
DATASET_FILE = "./dataset.jsonl"
MAX_SEQ_LENGTH = 4096
LOAD_IN_4BIT = True      # QLoRA: enables 8GB VRAM training
LORA_RANK = 16           # Higher = more parameters trained, more VRAM
NUM_TRAIN_EPOCHS = 3
BATCH_SIZE = 2           # Reduce if OOM
GRAD_ACCUMULATION = 4    # Effective batch = BATCH_SIZE × GRAD_ACCUMULATION
LEARNING_RATE = 2e-4
# ============================================================

print(f"Training on: {BASE_MODEL}")
print(f"Dataset: {DATASET_FILE}")
print(f"Output: {OUTPUT_DIR}")

# Load the base model with Unsloth optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=BASE_MODEL,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # Auto-detect
    load_in_4bit=LOAD_IN_4BIT,
)

# Add LoRA adapters to the attention and MLP projection layers
model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_RANK,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=LORA_RANK,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
    use_rslora=False,
)

# Load dataset
dataset = load_dataset("json", data_files=DATASET_FILE, split="train")
print(f"Loaded {len(dataset)} training examples")


def format_instruction(example):
    """Format example as instruction-following prompt."""
    instruction = example.get("instruction", "")
    input_text = example.get("input", "")
    output = example.get("output", "")
    if input_text:
        text = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n{output}"
    else:
        text = f"### Instruction:\n{instruction}\n\n### Response:\n{output}"
    # The EOS token teaches the model where a response ends
    return {"text": text + tokenizer.eos_token}


dataset = dataset.map(format_instruction)

# Training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_TRAIN_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUMULATION,
    warmup_steps=5,
    learning_rate=LEARNING_RATE,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=10,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=42,
    save_strategy="epoch",
)

# Train
# Note: these argument names follow older TRL releases. Newer releases move
# dataset_text_field and max_seq_length into SFTConfig, so check the version
# you have installed.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    args=training_args,
)
print("Starting training...")
trainer.train()

# Save the fine-tuned model
print("Saving model...")
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"Model saved to {OUTPUT_DIR}")
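Before starting a run, it can be useful to know how many optimizer steps the configuration implies, since `warmup_steps` and `logging_steps` are expressed in steps rather than examples. The sketch below works through the arithmetic for the script's defaults and a 1,000-example dataset. `training_steps` is a hypothetical helper, and trainers differ slightly in how they round a final partial batch, so treat the result as approximate.

```python
import math


def training_steps(num_examples: int, batch_size: int,
                   grad_accumulation: int, epochs: int) -> dict:
    """Estimate optimizer steps from dataset size and batch settings."""
    # One optimizer step consumes batch_size × grad_accumulation examples.
    effective_batch = batch_size * grad_accumulation
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }


# BATCH_SIZE = 2, GRAD_ACCUMULATION = 4, NUM_TRAIN_EPOCHS = 3
plan = training_steps(num_examples=1000, batch_size=2, grad_accumulation=4, epochs=3)
print(plan)  # {'effective_batch': 8, 'steps_per_epoch': 125, 'total_steps': 375}
```

With 375 total steps, `logging_steps=10` yields about 37 loss readings, and `warmup_steps=5` covers a little over 1% of the run.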
Monitor Training
# Watch GPU usage during training
watch -n 1 nvidia-smi
# Expected training time on RTX 4090:
# 1000 examples × 3 epochs: ~30-60 minutes
# 5000 examples × 3 epochs: ~3-6 hours
Step 3: Export to GGUF for Ollama
After training, convert the model to GGUF format for use with Ollama:
# export_gguf.py
from unsloth import FastLanguageModel

# Reload the fine-tuned model saved by fine_tune.py
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./fine_tuned_model",
    max_seq_length=4096,
    dtype=None,
    load_in_4bit=True,
)

# Export to GGUF: choose quantization
# q4_k_m: best balance (recommended)
# q8_0: higher quality, larger file
# f16: full precision, very large file
model.save_pretrained_gguf(
    "./gguf_model",
    tokenizer,
    quantization_method="q4_k_m"
)
# The exact .gguf file name can vary by Unsloth version. List ./gguf_model/
# and use the file you find there in the Modelfile's FROM line.
print("GGUF export complete. Check ./gguf_model/ for the .gguf file.")
Step 4: Import Into Ollama
# Create a Modelfile for the fine-tuned model
cat > FineTuned.Modelfile << 'EOF'
FROM ./gguf_model/model-q4_k_m.gguf
SYSTEM """[Your fine-tuned model's system prompt —
match what was used during training]"""
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
EOF
# Create the Ollama model
ollama create MyFineTunedModel -f FineTuned.Modelfile
# Test it
ollama run MyFineTunedModel "Test prompt here"
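The training script wrapped every example in a `### Instruction:` / `### Response:` layout. Depending on which prompt template Ollama applies to the imported GGUF, the text the model sees at inference time may not match that layout, which can reduce the benefit of fine-tuning. One way to check is to rebuild the training-style prompt yourself. The sketch below mirrors the strings used in `format_instruction`; `build_prompt` is a hypothetical helper name.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Rebuild the prompt half of a training example.

    The string stops right after '### Response:' so the model's
    continuation is the answer itself.
    """
    if input_text:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            f"### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


prompt = build_prompt(
    "Classify this customer email as: billing, technical, general",
    "My payment was charged twice",
)
print(prompt)
```

Ollama's generate endpoint has a `raw` option that sends the prompt without applying a template. Passing this string with `raw=True` through the Python client is an assumption to verify against your installed version. If raw prompts give noticeably better results than `ollama run`, the Modelfile probably needs a matching `TEMPLATE`.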
Step 5: Evaluate the Fine-Tuned Model
Never assume fine-tuning improved things — measure it.
# evaluate.py
import ollama
import json

# The task instruction used in the training examples. Without it, neither
# model is told what to do with the email text.
INSTRUCTION = "Classify this customer email as: billing, technical, general"

# Your test cases: inputs + expected outputs
test_cases = [
    {
        "input": "My payment was charged twice",
        "expected_category": "billing",
    },
    {
        "input": "The login button doesn't work on Chrome",
        "expected_category": "technical",
    },
    # Add 20-50 test cases
]


def evaluate_model(model_name: str, test_cases: list) -> dict:
    """Evaluate model accuracy on test cases."""
    correct = 0
    results = []
    for case in test_cases:
        response = ollama.generate(
            model=model_name,
            prompt=f"{INSTRUCTION}\n\n{case['input']}",
            options={"temperature": 0, "num_predict": 50}
        )
        actual = response["response"].strip().lower()
        expected = case["expected_category"].lower()
        # Substring match: counts as correct if the label appears anywhere
        is_correct = expected in actual
        if is_correct:
            correct += 1
        results.append({
            "input": case["input"],
            "expected": expected,
            "actual": actual,
            "correct": is_correct
        })
    accuracy = correct / len(test_cases)
    return {"accuracy": accuracy, "results": results}


# Compare base model vs fine-tuned. The baseline should be the model you
# fine-tuned from (Llama 3.1 8B in fine_tune.py); adjust the tag to match
# what you have pulled in Ollama.
print("Evaluating base model...")
base_results = evaluate_model("llama3.1:8b", test_cases)
print("Evaluating fine-tuned model...")
ft_results = evaluate_model("MyFineTunedModel", test_cases)
print(f"\nBase model accuracy: {base_results['accuracy']:.1%}")
print(f"Fine-tuned accuracy: {ft_results['accuracy']:.1%}")
print(f"Improvement: {(ft_results['accuracy'] - base_results['accuracy']):.1%}")

# Show failures for the fine-tuned model
failures = [r for r in ft_results["results"] if not r["correct"]]
if failures:
    print(f"\nFailed cases ({len(failures)}):")
    for f in failures[:5]:
        print(f"  Input: {f['input'][:60]}...")
        print(f"  Expected: {f['expected']} | Got: {f['actual'][:30]}")
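A single accuracy number can hide a category the model consistently gets wrong. The sketch below breaks accuracy down by expected label, assuming result records shaped like the ones `evaluate_model` returns (`expected` and `correct` keys). `accuracy_by_category` is a hypothetical helper, and the sample records are made up purely to illustrate the output.

```python
from collections import defaultdict


def accuracy_by_category(results: list[dict]) -> dict[str, float]:
    """Group results by expected label and compute accuracy for each group."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for result in results:
        totals[result["expected"]] += 1
        hits[result["expected"]] += int(result["correct"])
    return {label: hits[label] / totals[label] for label in totals}


# Illustrative records only, not real evaluation output.
sample_results = [
    {"expected": "billing", "correct": True},
    {"expected": "billing", "correct": True},
    {"expected": "technical", "correct": True},
    {"expected": "technical", "correct": False},
    {"expected": "general", "correct": False},
]
for label, accuracy in accuracy_by_category(sample_results).items():
    print(f"{label}: {accuracy:.0%}")
```

If one category lags well behind the others, adding training examples for that category is usually a better next step than training for more epochs.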
Cloud GPU Options for Training
If you do not have a suitable GPU, cloud options are practical:
| Provider | GPU | Hourly Cost | 1000 examples (3 epochs) |
|---|---|---|---|
| RunPod | RTX 4090 | ~$0.74/hr | ~$0.37 |
| Lambda Labs | A10 | ~$0.76/hr | ~$0.50 |
| Google Colab Pro | T4 | ~$0.45/hr | ~$0.45 |
| Vast.ai | RTX 3090 | ~$0.30/hr | ~$0.20 |
For a small fine-tuning job, cloud GPU costs are minimal — often under $1 for a basic training run.
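The per-run figures follow from multiplying the hourly rate by the training time. As a small worked example, the sketch below applies the 30–60 minute range quoted earlier for 1,000 examples on an RTX 4090 to the RunPod rate from the table. `training_cost` is a hypothetical helper, and it ignores setup time, storage, and data transfer, which providers may bill separately.

```python
def training_cost(hourly_rate_usd: float, hours: float) -> float:
    """Cost of a run in US dollars, rounded to the cent."""
    return round(hourly_rate_usd * hours, 2)


low = training_cost(0.74, 0.5)    # 30-minute run
high = training_cost(0.74, 1.0)   # 60-minute run
print(f"RTX 4090 at $0.74/hr: ${low:.2f} to ${high:.2f}")
```

The low end matches the table's ~$0.37 estimate, and even the high end stays under a dollar.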
Common Fine-Tuning Mistakes
Mistake 1: Fine-tuning when a Modelfile would work. Fine-tuning is expensive and the result is static. If a system prompt produces the desired behavior 90% of the time, the marginal improvement from fine-tuning rarely justifies the cost.
Mistake 2: Low-quality training data. Fine-tuning learns exactly what is in your dataset, including errors and inconsistencies. 100 high-quality, consistent examples outperform 1,000 noisy ones.
Mistake 3: Catastrophic forgetting. Fine-tuning on a narrow task can degrade performance on everything else. Test the fine-tuned model on tasks outside your training domain. If general capability drops significantly, reduce epochs or adjust LoRA rank.
Mistake 4: No evaluation baseline. Always measure the base model’s performance before fine-tuning. Without a baseline, you cannot know if fine-tuning helped.
Mistake 5: Training on your test set. Keep 10–20% of examples out of training for evaluation. If you test on training data, accuracy scores are meaningless.
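Holding out a test set (Mistake 5) takes only a few lines. The sketch below is one straightforward way to do it, assuming the examples are already loaded as a list. `split_dataset` is a hypothetical helper, and the 15% default simply sits in the middle of the 10–20% range suggested above.

```python
import random


def split_dataset(examples: list, test_fraction: float = 0.15, seed: int = 42):
    """Shuffle a copy of the examples and split it into (train, test)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


examples = [{"id": i} for i in range(100)]
train, test = split_dataset(examples)
print(len(train), len(test))  # 85 15
```

Write the two halves to separate files and point the training script only at the training file, so the held-out examples are never seen during training.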
Conclusion
Fine-tuning with Unsloth on consumer hardware is practical and accessible in 2026. A QLoRA run on an RTX 4090 with 1,000 examples trains in under an hour and produces a model noticeably better at your specific task than the generic base.
The discipline is knowing when fine-tuning is the right tool. Modelfiles first. RAG for knowledge. Fine-tuning only when you have good data, a clear quality gap, and a high-volume use case where the improvement compounds.
Your next step: Identify one task where your current local model is inconsistent despite a well-crafted system prompt. Collect 200 examples of good input-output pairs. Run the fine-tuning script. Measure accuracy before and after. That comparison will tell you whether fine-tuning earns its place in your workflow.
📚 Continue the Series:
- ← Previous: AI Agents With Ollama
- Next →: The Future of Local AI: Where Ollama and Open Models Are Heading
- For prompt-based customization: The Modelfile
- For knowledge bases: RAG with Ollama
Last updated: June 2026. Unsloth, transformers, and related libraries release updates frequently. Verify current installation instructions at github.com/unslothai/unsloth.
⚠️ Fine-tuning changes are baked into the exported model's weights and cannot be undone by editing a prompt. Always keep a reference to the base model. Test thoroughly before deploying to production.