How to Fine-Tune an LLM Locally: Complete Guide (2026)
Fine-tuning is the nuclear option. It's powerful, time-consuming, and — in 2026 — often unnecessary.
Base models like Qwen 3.5, Llama 4, and Gemma 3 handle tasks out of the box that required fine-tuning 18 months ago. But when you genuinely need a model to speak your domain's language, match a specific output format, or squeeze big-model quality into something that runs on your hardware — nothing else comes close.
This guide covers the full pipeline: deciding whether you actually need fine-tuning, picking the right tool, preparing your data, running QLoRA training on a consumer GPU, and exporting the result to Ollama. No cloud required.
Do You Actually Need Fine-Tuning?
Before you burn GPU hours, walk through this:
Use Prompt Engineering When:
- Your task can be described in natural language
- You have fewer than 50 examples
- You're still figuring out what you want the model to do
- Latency isn't critical
Use RAG When:
- The model needs to reference specific, changing documents
- Factual accuracy on proprietary data matters more than style
- Your knowledge base updates frequently
- You need attribution and source traceability
Use Fine-Tuning When:
- You need a specific output format the model consistently botches
- You're optimizing for latency and want a smaller model that punches above its weight
- You have domain-specific language (medical, legal, internal company jargon)
- You have 1,000+ high-quality training examples
- You want to distill a large model's capabilities into something you can self-host cheaply
The honest rule: if you can solve it with a better prompt or RAG, do that first. Fine-tuning is for the problems that remain after you've exhausted those options.
The Tooling Landscape: 4 Frameworks Compared
Four open-source frameworks dominate local fine-tuning in March 2026:
| Framework | GitHub Stars | Best For | Learning Curve |
|---|---|---|---|
| LLaMA-Factory | 68.4K | GUI-first, broadest model support | Low |
| Unsloth | 53.9K | Speed & VRAM optimization | Low-Medium |
| TRL | 17.6K | RLHF/GRPO, Hugging Face ecosystem | Medium-High |
| Axolotl | 11.4K | Config-driven production pipelines | Medium |
Our Pick: Unsloth
For local fine-tuning on consumer GPUs, Unsloth wins on the metric that matters most: it uses 60-80% less VRAM than standard training and runs 2-5x faster. The same 8B model fine-tuning that takes 9.4 hours on raw PyTorch takes 0.8 hours on Unsloth.
Why this matters for local users: Unsloth is the reason you can fine-tune a 14B model on a single 24GB GPU. Without it, you'd need 48GB+ or cloud GPUs.
When to pick something else:
- LLaMA-Factory → You want a web UI and don't care about squeezing every last MB of VRAM
- TRL → You need RLHF/DPO/GRPO alignment training (Unsloth supports these too, but TRL has deeper integration)
- Axolotl → You're running production fine-tuning pipelines with YAML configs and need reproducibility
For this guide, we'll use Unsloth.
Hardware Requirements
Fine-tuning is more GPU-intensive than inference. Here's what you need:
| Model Size | QLoRA VRAM | LoRA VRAM | Full Fine-Tune | Recommended GPU |
|---|---|---|---|---|
| 3-4B | ~4GB | ~8GB | ~16GB | RTX 4060 Ti 16GB |
| 7-8B | ~6GB | ~14GB | ~32GB | RTX 3090 / 4090 |
| 13-14B | ~10GB | ~24GB | ~56GB | RTX 4090 |
| 30-34B | ~20GB | ~48GB | ~120GB | 2x RTX 4090 or A100 |
| 70B | ~40GB | ~80GB | ~280GB | Multi-A100 (cloud) |
Key insight: QLoRA uses roughly 4x less VRAM than full fine-tuning. A $1,599 RTX 4090 with 24GB can fine-tune anything up to 14B parameters — which covers the most useful local models.
If you're serious about regular fine-tuning, the RTX 5090 with 32GB gives you headroom for 30B+ models with QLoRA.
Disclosure: GPU links are Amazon affiliate links. We earn a commission at no extra cost to you.
Step-by-Step: QLoRA Fine-Tuning with Unsloth
Step 1: Install Unsloth
# Create a fresh environment
conda create -n finetune python=3.11 -y
conda activate finetune
pip install unsloth
Unsloth requires an NVIDIA GPU with CUDA support. If you're on Apple Silicon, use MLX instead (different workflow, not covered here).
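Before a long run, it's worth confirming that the environment actually sees your GPU. A minimal check with plain PyTorch (nothing Unsloth-specific) looks like this:
import torch

# Confirm CUDA is visible and report the GPU name and total VRAM.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU detected; Unsloth needs an NVIDIA GPU.")

props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")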
Step 2: Prepare Your Dataset
Fine-tuning quality depends entirely on data quality. The format matters:
Chat/Instruction format (most common):
[
  {
    "conversations": [
      {"role": "system", "content": "You are a helpful medical assistant."},
      {"role": "user", "content": "What are the symptoms of type 2 diabetes?"},
      {"role": "assistant", "content": "The key symptoms of type 2 diabetes include..."}
    ]
  }
]
Guidelines:
- Minimum 1,000 examples for meaningful improvement (500 can work for narrow tasks)
- Quality over quantity — 1,000 perfect examples beat 10,000 sloppy ones
- Match your use case — if you want the model to write medical reports, train on medical reports
- Include edge cases — 10-20% of your data should cover unusual inputs
- No contamination — never include test data in training
Save your dataset as dataset.json in the project directory.
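Before spending GPU time, it pays to validate the file. A minimal sanity check, assuming the exact layout shown above (a JSON list of objects, each with a "conversations" list of role/content turns):
import json

# Load the dataset and verify every record matches the expected structure.
with open("dataset.json") as f:
    data = json.load(f)

assert isinstance(data, list), "Top level should be a JSON list"
for i, record in enumerate(data):
    for turn in record["conversations"]:
        assert turn["role"] in {"system", "user", "assistant"}, f"Bad role in record {i}"
        assert turn["content"].strip(), f"Empty content in record {i}"

print(f"{len(data)} examples look structurally valid")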
Step 3: Configure and Run Training
from unsloth import FastLanguageModel
import torch
# Load base model with 4-bit quantization (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen3-14B", # Or any HF model
max_seq_length=4096,
dtype=None, # Auto-detect
load_in_4bit=True, # QLoRA
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank (8-64, higher = more capacity)
target_modules=["q_proj", "k_proj", "v_proj",
"o_proj", "gate_proj", "up_proj", "down_proj"],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth", # 30% less VRAM
)
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
# Load your dataset
dataset = load_dataset("json", data_files="dataset.json", split="train")
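# The dataset above stores each sample under a "conversations" key, while
# SFTTrainer trains on a plain text column. One way to bridge the two (an
# addition to this recipe, assuming the role/content layout from Step 2) is
# to render each conversation into a "text" column with the tokenizer's chat
# template. Recent TRL versions pick up a "text" column by default; older
# ones may need dataset_text_field="text" passed explicitly.
def to_text(example):
    return {"text": tokenizer.apply_chat_template(
        example["conversations"], tokenize=False, add_generation_prompt=False
    )}

dataset = dataset.map(to_text)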
# Training config
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=10,
max_steps=200, # Adjust based on dataset size
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
output_dir="outputs",
logging_steps=10,
optim="adamw_8bit",
seed=42,
),
)
# Start training
trainer.train()
What these parameters mean:
- r=16 — LoRA rank. Higher values learn more but use more VRAM. 16 is the sweet spot for most tasks.
- per_device_train_batch_size=2 — Lower this to 1 if you run out of VRAM.
- max_steps=200 — With an effective batch size of 8 (batch 2 × 4 gradient accumulation steps), 200 steps covers about 1,600 samples, roughly 1.6 epochs of a 1,000-example dataset; raise it toward ~375 for a full 3 epochs (see the sketch after this list). Watch the loss curve and stop if it plateaus.
- learning_rate=2e-4 — Standard for QLoRA. Lower it (1e-4) if the model is "forgetting" its general capabilities.
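If you would rather set max_steps from a target epoch count than guess, the arithmetic is simple. A back-of-the-envelope sketch using the numbers from the config above:
# How many optimizer steps cover a target number of epochs?
examples = 1_000
per_device_batch = 2
grad_accum = 4
target_epochs = 3

effective_batch = per_device_batch * grad_accum        # 8 samples per optimizer step
max_steps = examples * target_epochs // effective_batch
print(max_steps)  # 375 steps for ~3 epochs over 1,000 examples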
Training time estimates (RTX 4090, QLoRA):
| Model | Dataset Size | Approximate Time |
|---|---|---|
| Qwen 3 8B | 1,000 examples | ~15 minutes |
| Qwen 3 14B | 1,000 examples | ~30 minutes |
| Qwen 3 14B | 5,000 examples | ~2 hours |
| Llama 4 Scout 17B | 1,000 examples | ~45 minutes |
Step 4: Export to GGUF for Ollama
The whole point of local fine-tuning is running the result locally. Unsloth has built-in GGUF export:
# Save as GGUF for Ollama
model.save_pretrained_gguf(
"my-finetuned-model",
tokenizer,
quantization_method="q4_k_m", # Best quality/size balance
)
Quantization options:
- q4_k_m — Recommended. Best balance of quality and file size.
- q5_k_m — Slightly better quality, ~25% larger.
- q8_0 — Near-lossless, but large files.
- f16 — Full precision, no quality loss, very large.
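If you also want to keep the raw LoRA adapter (useful for versioning just the delta, or merging it onto a different checkpoint later), the standard save call works alongside the GGUF export. A short sketch; the directory name here is an arbitrary choice:
# Save only the LoRA adapter weights, separate from the merged GGUF above.
model.save_pretrained("my-lora-adapter")
tokenizer.save_pretrained("my-lora-adapter")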
Step 5: Run in Ollama
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./my-finetuned-model/unsloth.Q4_K_M.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>"""
PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7
EOF
# Create the Ollama model
ollama create my-model -f Modelfile
# Run it
ollama run my-model
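If you'd rather sanity-check the fine-tuned model from a script than from the interactive prompt, Ollama also serves a local REST API on port 11434. A minimal sketch using only the standard library; the model name matches the ollama create step above:
import json
import urllib.request

# Send one chat turn to the local Ollama server and print the reply.
payload = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "What are the symptoms of type 2 diabetes?"}],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])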
That's the full pipeline: raw data → QLoRA training → GGUF export → Ollama. No cloud, no API keys, no subscription.
Common Mistakes (and How to Avoid Them)
1. Training on too little data. Under 500 examples, you're likely overfitting. The model memorizes your examples instead of learning the pattern.
2. Training too long. Watch the training loss. If it drops below 0.5 and keeps going, you're overfitting. Save checkpoints and test each one.
3. Bad data quality. "Garbage in, garbage out" is literally true for fine-tuning. One incorrect example can poison hundreds of correct ones. Curate aggressively.
4. Wrong base model. Don't fine-tune a 70B model to do what a fine-tuned 8B can do. Start with the smallest model that gets close to your needs, then fine-tune that.
5. Forgetting evaluation. Always hold out 10-20% of your data for evaluation. A model that scores well on training data but fails on held-out data has learned nothing useful.
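For the evaluation split in point 5, the datasets library makes the holdout a one-liner. A minimal sketch reusing the dataset object from Step 3:
# Hold out 10% of the examples for evaluation before training.
split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
eval_dataset = split["test"]  # pass as eval_dataset=... to SFTTrainer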
Which Base Model to Fine-Tune?
| Use Case | Recommended Base | Why |
|---|---|---|
| General assistant | Qwen 3 14B | Best all-round quality at this size |
| Coding | Qwen 3 Coder 14B | Built for code, strong baseline |
| Multilingual | Qwen 3 14B | Native multilingual support |
| Reasoning-heavy | DeepSeek R1 14B | Explicit chain-of-thought |
| Lightweight/edge | Gemma 3 4B | Excellent quality-per-parameter |
See our Open Source LLM Leaderboard for the full ranking, or our Best Ollama Models guide for models ready to run immediately.
The Bottom Line
Fine-tuning locally in 2026 is genuinely accessible. A single RTX 4090, Unsloth, and a weekend of data preparation is all you need to create a model that speaks your domain's language and runs entirely on your hardware.
The pipeline:
1. Collect 1,000+ quality examples
2. Install Unsloth
3. Run QLoRA training (~30 min for a 14B model)
4. Export to GGUF
5. Load the GGUF into Ollama and run it
Start small. Fine-tune an 8B model on 1,000 examples. If the results justify it, scale up to 14B or more data. Most teams never need to go beyond that.
Related: Best Ollama Models 2026 | Open Source LLM Leaderboard 2026 | Best GPU for AI 2026 | vLLM vs Ollama vs TGI
FAQ
Can you fine-tune an LLM on a consumer GPU?
Yes — QLoRA (Quantized Low-Rank Adaptation) enables fine-tuning 7B models on 8GB VRAM, 13B on 16GB, and 30B+ on 24GB. A full fine-tune is impractical locally, but LoRA and QLoRA adapters achieve near-full-tune quality at 5-10% of the memory cost.
What is LoRA fine-tuning?
LoRA (Low-Rank Adaptation) adds small trainable adapter layers to a frozen base model. Instead of updating all billions of parameters, you train ~0.1-1% of them. The resulting adapter file is tiny (10-500MB) and can be loaded on top of the base model at inference time.
What dataset size do I need for fine-tuning?
For LoRA on a specific style or domain: 100-500 examples produce noticeable results. For reliable behavior changes: 1,000-5,000 examples. For a complete domain expert: 10,000+ quality examples. More data helps, but quality matters more than quantity — 500 excellent examples beat 5,000 mediocre ones.
What tools should I use to fine-tune a local LLM?
LLaMA Factory is the most user-friendly — GUI and CLI support, works with most model architectures. Axolotl is more flexible for custom training setups. Unsloth (2× faster training, 60% less VRAM) is excellent for Llama/Qwen/Mistral families on consumer hardware. All are open-source and free.
How long does fine-tuning take on an RTX 4090?
LoRA fine-tune on 7B model: 1,000 examples × 3 epochs = ~30-60 minutes on RTX 4090. 13B: 1-2 hours. Training time scales linearly with dataset size and epochs. QLoRA is ~2× slower than LoRA due to quantization overhead. Use gradient checkpointing to reduce VRAM at the cost of ~20% more training time.