How to Fine-Tune an LLM Locally: Complete Guide (2026)
Fine-tuning is the nuclear option. It's powerful, time-consuming, and — in 2026 — often unnecessary.
Base models like Qwen 3.5, Llama 4, and Gemma 3 handle tasks out of the box that required fine-tuning 18 months ago. But when you genuinely need a model to speak your domain's language, match a specific output format, or squeeze big-model quality into something that runs on your hardware — nothing else comes close.
This guide covers the full pipeline: deciding whether you actually need fine-tuning, picking the right tool, preparing your data, running QLoRA training on a consumer GPU, and exporting the result to Ollama. No cloud required.
Do You Actually Need Fine-Tuning?
Before you burn GPU hours, walk through this:
Use Prompt Engineering When:
- Your task can be described in natural language
- You have fewer than 50 examples
- You're still figuring out what you want the model to do
- Latency isn't critical
Use RAG When:
- The model needs to reference specific, changing documents
- Factual accuracy on proprietary data matters more than style
- Your knowledge base updates frequently
- You need attribution and source traceability
Use Fine-Tuning When:
- You need a specific output format the model consistently botches
- You're optimizing for latency and want a smaller model that punches above its weight
- You have domain-specific language (medical, legal, internal company jargon)
- You have 1,000+ high-quality training examples
- You want to distill a large model's capabilities into something you can self-host cheaply
The honest rule: if you can solve it with a better prompt or RAG, do that first. Fine-tuning is for the problems that remain after you've exhausted those options.
The Tooling Landscape: 4 Frameworks Compared
Four open-source frameworks dominate local fine-tuning in March 2026:
| Framework | GitHub Stars | Best For | Learning Curve |
|---|---|---|---|
| LLaMA-Factory | 68.4K | GUI-first, broadest model support | Low |
| Unsloth | 53.9K | Speed & VRAM optimization | Low-Medium |
| TRL | 17.6K | RLHF/GRPO, Hugging Face ecosystem | Medium-High |
| Axolotl | 11.4K | Config-driven production pipelines | Medium |
Our Pick: Unsloth
For local fine-tuning on consumer GPUs, Unsloth wins on the metric that matters most: it uses 60-80% less VRAM than standard training and runs 2-5x faster. The same 8B model fine-tuning that takes 9.4 hours on raw PyTorch takes 0.8 hours on Unsloth.
Why this matters for local users: Unsloth is the reason you can fine-tune a 14B model on a single 24GB GPU. Without it, you'd need 48GB+ or cloud GPUs.
When to pick something else:
- LLaMA-Factory → You want a web UI and don't care about squeezing every last MB of VRAM
- TRL → You need RLHF/DPO/GRPO alignment training (Unsloth supports these too, but TRL has deeper integration)
- Axolotl → You're running production fine-tuning pipelines with YAML configs and need reproducibility
For this guide, we'll use Unsloth.
Hardware Requirements
Fine-tuning is more GPU-intensive than inference. Here's what you need:
| Model Size | QLoRA VRAM | LoRA VRAM | Full Fine-Tune | Recommended GPU |
|---|---|---|---|---|
| 3-4B | ~4GB | ~8GB | ~16GB | RTX 4060 Ti 16GB |
| 7-8B | ~6GB | ~14GB | ~32GB | RTX 3090 / 4090 |
| 13-14B | ~10GB | ~24GB | ~56GB | RTX 4090 |
| 30-34B | ~20GB | ~48GB | ~120GB | 2x RTX 4090 or A100 |
| 70B | ~40GB | ~80GB | ~280GB | Multi-A100 (cloud) |
Key insight: QLoRA uses roughly 4x less VRAM than full fine-tuning. A $1,599 RTX 4090 with 24GB can fine-tune anything up to 14B parameters — which covers the most useful local models.
If you're serious about regular fine-tuning, the RTX 5090 with 32GB gives you headroom for 30B+ models with QLoRA.
Disclosure: GPU links are Amazon affiliate links. We earn a commission at no extra cost to you.
Step-by-Step: QLoRA Fine-Tuning with Unsloth
Step 1: Install Unsloth
# Create a fresh environment
conda create -n finetune python=3.11 -y
conda activate finetune
pip install unsloth
Unsloth requires an NVIDIA GPU with CUDA support. If you're on Apple Silicon, use MLX instead (different workflow, not covered here).
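Before a long run, it's worth confirming that the environment actually sees your GPU. A minimal check with plain PyTorch (nothing Unsloth-specific) looks like this:
import torch

# Confirm CUDA is visible and report the GPU name and total VRAM.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU detected; Unsloth needs an NVIDIA GPU.")

props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")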
Step 2: Prepare Your Dataset
Fine-tuning quality depends entirely on data quality. The format matters:
Chat/Instruction format (most common):
[
  {
    "conversations": [
      {"role": "system", "content": "You are a helpful medical assistant."},
      {"role": "user", "content": "What are the symptoms of type 2 diabetes?"},
      {"role": "assistant", "content": "The key symptoms of type 2 diabetes include..."}
    ]
  }
]
Guidelines:
- Minimum 1,000 examples for meaningful improvement (500 can work for narrow tasks)
- Quality over quantity — 1,000 perfect examples beat 10,000 sloppy ones
- Match your use case — if you want the model to write medical reports, train on medical reports
- Include edge cases — 10-20% of your data should cover unusual inputs
- No contamination — never include test data in training
Save your dataset as dataset.json in the project directory.
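Before spending GPU time, it pays to validate the file. A minimal sanity check, assuming the exact layout shown above (a JSON list of objects, each with a "conversations" list of role/content turns):
import json

# Load the dataset and verify every record matches the expected structure.
with open("dataset.json") as f:
    data = json.load(f)

assert isinstance(data, list), "Top level should be a JSON list"
for i, record in enumerate(data):
    for turn in record["conversations"]:
        assert turn["role"] in {"system", "user", "assistant"}, f"Bad role in record {i}"
        assert turn["content"].strip(), f"Empty content in record {i}"

print(f"{len(data)} examples look structurally valid")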
Step 3: Configure and Run Training
from unsloth import FastLanguageModel
import torch
# Load base model with 4-bit quantization (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen3-14B", # Or any HF model
max_seq_length=4096,
dtype=None, # Auto-detect
load_in_4bit=True, # QLoRA
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank (8-64, higher = more capacity)
target_modules=["q_proj", "k_proj", "v_proj",
"o_proj", "gate_proj", "up_proj", "down_proj"],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth", # 30% less VRAM
)
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
# Load your dataset
dataset = load_dataset("json", data_files="dataset.json", split="train")
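# The dataset above stores each sample under a "conversations" key, while
# SFTTrainer trains on a plain text column. One way to bridge the two (an
# addition to this recipe, assuming the role/content layout from Step 2) is
# to render each conversation into a "text" column with the tokenizer's chat
# template. Recent TRL versions pick up a "text" column by default; older
# ones may need dataset_text_field="text" passed explicitly.
def to_text(example):
    return {"text": tokenizer.apply_chat_template(
        example["conversations"], tokenize=False, add_generation_prompt=False
    )}

dataset = dataset.map(to_text)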
# Training config
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=10,
max_steps=200, # Adjust based on dataset size
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
output_dir="outputs",
logging_steps=10,
optim="adamw_8bit",
seed=42,
),
)
# Start training
trainer.train()
What these parameters mean:
- r=16 — LoRA rank. Higher values learn more but use more VRAM. 16 is the sweet spot for most tasks.
- per_device_train_batch_size=2 — Lower this to 1 if you run out of VRAM.
- max_steps=200 — With an effective batch size of 8 (batch 2 × 4 gradient accumulation steps), 200 steps covers about 1,600 samples, roughly 1.6 epochs of a 1,000-example dataset; raise it toward ~375 for a full 3 epochs (see the sketch after this list). Watch the loss curve and stop if it plateaus.
- learning_rate=2e-4 — Standard for QLoRA. Lower it (1e-4) if the model is "forgetting" its general capabilities.
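If you would rather set max_steps from a target epoch count than guess, the arithmetic is simple. A back-of-the-envelope sketch using the numbers from the config above:
# How many optimizer steps cover a target number of epochs?
examples = 1_000
per_device_batch = 2
grad_accum = 4
target_epochs = 3

effective_batch = per_device_batch * grad_accum        # 8 samples per optimizer step
max_steps = examples * target_epochs // effective_batch
print(max_steps)  # 375 steps for ~3 epochs over 1,000 examples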
Training time estimates (RTX 4090, QLoRA):
| Model | Dataset Size | Approximate Time |
|---|---|---|
| Qwen 3 8B | 1,000 examples | ~15 minutes |
| Qwen 3 14B | 1,000 examples | ~30 minutes |
| Qwen 3 14B | 5,000 examples | ~2 hours |
| Llama 4 Scout 17B | 1,000 examples | ~45 minutes |
Step 4: Export to GGUF for Ollama
The whole point of local fine-tuning is running the result locally. Unsloth has built-in GGUF export:
# Save as GGUF for Ollama
model.save_pretrained_gguf(
"my-finetuned-model",
tokenizer,
quantization_method="q4_k_m", # Best quality/size balance
)
Quantization options:
- q4_k_m — Recommended. Best balance of quality and file size.
- q5_k_m — Slightly better quality, ~25% larger.
- q8_0 — Near-lossless, but large files.
- f16 — Full precision, no quality loss, very large.
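If you also want to keep the raw LoRA adapter (useful for versioning just the delta, or merging it onto a different checkpoint later), the standard save call works alongside the GGUF export. A short sketch; the directory name here is an arbitrary choice:
# Save only the LoRA adapter weights, separate from the merged GGUF above.
model.save_pretrained("my-lora-adapter")
tokenizer.save_pretrained("my-lora-adapter")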
Step 5: Run in Ollama
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./my-finetuned-model/unsloth.Q4_K_M.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>"""
PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7
EOF
# Create the Ollama model
ollama create my-model -f Modelfile
# Run it
ollama run my-model
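If you'd rather sanity-check the fine-tuned model from a script than from the interactive prompt, Ollama also serves a local REST API on port 11434. A minimal sketch using only the standard library; the model name matches the ollama create step above:
import json
import urllib.request

# Send one chat turn to the local Ollama server and print the reply.
payload = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "What are the symptoms of type 2 diabetes?"}],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])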
That's the full pipeline: raw data → QLoRA training → GGUF export → Ollama. No cloud, no API keys, no subscription.
Common Mistakes (and How to Avoid Them)
1. Training on too little data. Under 500 examples, you're likely overfitting. The model memorizes your examples instead of learning the pattern.
2. Training too long. Watch the training loss. If it drops below 0.5 and keeps going, you're overfitting. Save checkpoints and test each one.
3. Bad data quality. "Garbage in, garbage out" is literally true for fine-tuning. One incorrect example can poison hundreds of correct ones. Curate aggressively.
4. Wrong base model. Don't fine-tune a 70B model to do what a fine-tuned 8B can do. Start with the smallest model that gets close to your needs, then fine-tune that.
5. Forgetting evaluation. Always hold out 10-20% of your data for evaluation. A model that scores well on training data but fails on held-out data has learned nothing useful.
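For the evaluation split in point 5, the datasets library makes the holdout a one-liner. A minimal sketch reusing the dataset object from Step 3:
# Hold out 10% of the examples for evaluation before training.
split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
eval_dataset = split["test"]  # pass as eval_dataset=... to SFTTrainer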
Which Base Model to Fine-Tune?
| Use Case | Recommended Base | Why |
|---|---|---|
| General assistant | Qwen 3 14B | Best all-round quality at this size |
| Coding | Qwen 3 Coder 14B | Built for code, strong baseline |
| Multilingual | Qwen 3 14B | Native multilingual support |
| Reasoning-heavy | DeepSeek R1 14B | Explicit chain-of-thought |
| Lightweight/edge | Gemma 3 4B | Excellent quality-per-parameter |
See our Open Source LLM Leaderboard for the full ranking, or our Best Ollama Models guide for models ready to run immediately.
The Bottom Line
Fine-tuning locally in 2026 is genuinely accessible. A single RTX 4090, Unsloth, and a weekend of data preparation is all you need to create a model that speaks your domain's language and runs entirely on your hardware.
The pipeline:
1. Collect 1,000+ quality examples
2. Install Unsloth
3. Run QLoRA training (~30 min for a 14B model)
4. Export to GGUF
5. Load the GGUF into Ollama and run it
Start small. Fine-tune an 8B model on 1,000 examples. If the results justify it, scale up to 14B or more data. Most teams never need to go beyond that.
Related: Best Ollama Models 2026 | Open Source LLM Leaderboard 2026 | Best GPU for AI 2026 | vLLM vs Ollama vs TGI
FAQ
Can you fine-tune an LLM on a consumer GPU?
Yes — QLoRA (Quantized Low-Rank Adaptation) enables fine-tuning 7B models on 8GB VRAM, 13B on 16GB, and 30B+ on 24GB. A full fine-tune is impractical locally, but LoRA and QLoRA adapters achieve near-full-tune quality at 5-10% of the memory cost.
What is LoRA fine-tuning?
LoRA (Low-Rank Adaptation) adds small trainable adapter layers to a frozen base model. Instead of updating all billions of parameters, you train ~0.1-1% of them. The resulting adapter file is tiny (10-500MB) and can be loaded on top of the base model at inference time.
What dataset size do I need for fine-tuning?
For LoRA on a specific style or domain: 100-500 examples produce noticeable results. For reliable behavior changes: 1,000-5,000 examples. For a complete domain expert: 10,000+ quality examples. More data helps, but quality matters more than quantity — 500 excellent examples beat 5,000 mediocre ones.
What tools should I use to fine-tune a local LLM?
LLaMA Factory is the most user-friendly — GUI and CLI support, works with most model architectures. Axolotl is more flexible for custom training setups. Unsloth (2× faster training, 60% less VRAM) is excellent for Llama/Qwen/Mistral families on consumer hardware. All are open-source and free.
How long does fine-tuning take on an RTX 4090?
LoRA fine-tune on 7B model: 1,000 examples × 3 epochs = ~30-60 minutes on RTX 4090. 13B: 1-2 hours. Training time scales linearly with dataset size and epochs. QLoRA is ~2× slower than LoRA due to quantization overhead. Use gradient checkpointing to reduce VRAM at the cost of ~20% more training time.