How to Fine-Tune an LLM Locally: Complete Guide (2026)

March 16, 2026 · 9 min read · 1,804 words

Fine-tuning is the nuclear option. It's powerful, time-consuming, and — in 2026 — often unnecessary.

Base models like Qwen 3.5, Llama 4, and Gemma 3 handle tasks out of the box that required fine-tuning 18 months ago. But when you genuinely need a model to speak your domain's language, match a specific output format, or squeeze big-model quality into something that runs on your hardware — nothing else comes close.

This guide covers the full pipeline: deciding whether you actually need fine-tuning, picking the right tool, preparing your data, running QLoRA training on a consumer GPU, and exporting the result to Ollama. No cloud required.

Do You Actually Need Fine-Tuning?

Before you burn GPU hours, walk through this:

Use Prompt Engineering When:

  • Your task can be described in natural language
  • You have fewer than 50 examples
  • You're still figuring out what you want the model to do
  • Latency isn't critical

Use RAG When:

  • The model needs to reference specific, changing documents
  • Factual accuracy on proprietary data matters more than style
  • Your knowledge base updates frequently
  • You need attribution and source traceability

Use Fine-Tuning When:

  • You need a specific output format the model consistently botches
  • You're optimizing for latency and want a smaller model that punches above its weight
  • You have domain-specific language (medical, legal, internal company jargon)
  • You have 1,000+ high-quality training examples
  • You want to distill a large model's capabilities into something you can self-host cheaply

The honest rule: if you can solve it with a better prompt or RAG, do that first. Fine-tuning is for the problems that remain after you've exhausted those options.

The Tooling Landscape: 4 Frameworks Compared

Four open-source frameworks dominate local fine-tuning in March 2026:

| Framework | GitHub Stars | Best For | Learning Curve |
|---|---|---|---|
| LLaMA-Factory | 68.4K | GUI-first, broadest model support | Low |
| Unsloth | 53.9K | Speed & VRAM optimization | Low-Medium |
| TRL | 17.6K | RLHF/GRPO, Hugging Face ecosystem | Medium-High |
| Axolotl | 11.4K | Config-driven production pipelines | Medium |

Our Pick: Unsloth

For local fine-tuning on consumer GPUs, Unsloth wins on the metric that matters most: it uses 60-80% less VRAM than standard training and runs 2-5x faster. The same 8B model fine-tuning that takes 9.4 hours on raw PyTorch takes 0.8 hours on Unsloth.

Why this matters for local users: Unsloth is the reason you can fine-tune a 14B model on a single 24GB GPU. Without it, you'd need 48GB+ or cloud GPUs.

When to pick something else:

  • LLaMA-Factory → You want a web UI and don't care about squeezing every last MB of VRAM
  • TRL → You need RLHF/DPO/GRPO alignment training (Unsloth supports these too, but TRL has deeper integration)
  • Axolotl → You're running production fine-tuning pipelines with YAML configs and need reproducibility

For this guide, we'll use Unsloth.

Hardware Requirements

Fine-tuning is more GPU-intensive than inference. Here's what you need:

| Model Size | QLoRA VRAM | LoRA VRAM | Full Fine-Tune | Recommended GPU |
|---|---|---|---|---|
| 3-4B | ~4GB | ~8GB | ~16GB | RTX 4060 Ti 16GB |
| 7-8B | ~6GB | ~14GB | ~32GB | RTX 3090 / 4090 |
| 13-14B | ~10GB | ~24GB | ~56GB | RTX 4090 |
| 30-34B | ~20GB | ~48GB | ~120GB | 2x RTX 4090 or A100 |
| 70B | ~40GB | ~80GB | ~280GB | Multi-A100 (cloud) |

Key insight: QLoRA uses roughly 4x less VRAM than full fine-tuning. A $1,599 RTX 4090 with 24GB can fine-tune anything up to 14B parameters — which covers the most useful local models.

If you're serious about regular fine-tuning, the RTX 5090 with 32GB gives you headroom for 30B+ models with QLoRA.
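The QLoRA column in the table tracks a simple linear rule of thumb. Here is a minimal sketch; the 0.55 GB-per-billion slope and 2 GB fixed overhead are our own rough fit to the table above, not a published constant, and real usage varies with sequence length and batch size:

```python
def qlora_vram_gb(params_b: float) -> float:
    """Rough QLoRA VRAM estimate: 4-bit weights (~0.5 GB per billion
    parameters) plus optimizer/activation overhead. A curve fit to the
    table above, accurate to within a couple of GB at best."""
    return round(0.55 * params_b + 2.0, 1)

for size in (8, 14, 34, 70):
    print(f"{size}B -> ~{qlora_vram_gb(size)} GB")
```

If the estimate lands within a gigabyte or two of your card's VRAM, assume you will need to lower the batch size or sequence length.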

Disclosure: GPU links are Amazon affiliate links. We earn a commission at no extra cost to you.

Step-by-Step: QLoRA Fine-Tuning with Unsloth

Step 1: Install Unsloth

```bash
# Create a fresh environment
conda create -n finetune python=3.11 -y
conda activate finetune

pip install unsloth
```

Unsloth requires an NVIDIA GPU with CUDA support. If you're on Apple Silicon, use MLX instead (different workflow, not covered here).

Step 2: Prepare Your Dataset

Fine-tuning quality depends entirely on data quality. The format matters:

Chat/Instruction format (most common):

```json
[
  {
    "conversations": [
      {"role": "system", "content": "You are a helpful medical assistant."},
      {"role": "user", "content": "What are the symptoms of type 2 diabetes?"},
      {"role": "assistant", "content": "The key symptoms of type 2 diabetes include..."}
    ]
  }
]
```

Guidelines:

  • Minimum 1,000 examples for meaningful improvement (500 can work for narrow tasks)
  • Quality over quantity — 1,000 perfect examples beat 10,000 sloppy ones
  • Match your use case — if you want the model to write medical reports, train on medical reports
  • Include edge cases — 10-20% of your data should cover unusual inputs
  • No contamination — never include test data in training

Save your dataset as dataset.json in the project directory.
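Before you spend GPU time, it pays to sanity-check every record against the guidelines above. A minimal validation sketch — `validate_example`, its role set, and its messages are our own naming, not part of Unsloth or any framework:

```python
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_example(example: dict) -> list[str]:
    """Return a list of problems with one training record (empty = OK)."""
    turns = example.get("conversations")
    if not isinstance(turns, list) or not turns:
        return ["missing or empty 'conversations' list"]
    problems = []
    for i, turn in enumerate(turns):
        if turn.get("role") not in ALLOWED_ROLES:
            problems.append(f"turn {i}: unknown role {turn.get('role')!r}")
        if not str(turn.get("content", "")).strip():
            problems.append(f"turn {i}: empty content")
    # Training targets the assistant's reply, so each record should end there.
    if turns[-1].get("role") != "assistant":
        problems.append("last turn should be an assistant answer")
    return problems

# Usage: load dataset.json and report any bad records before training.
# import json
# data = json.load(open("dataset.json"))
# for i, record in enumerate(data):
#     for problem in validate_example(record):
#         print(f"record {i}: {problem}")
```

Dropping a handful of malformed records here is far cheaper than discovering them through a bad loss curve.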

Step 3: Configure and Run Training

```python
from unsloth import FastLanguageModel
import torch

# Load base model with 4-bit quantization (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-14B",  # Or any HF model
    max_seq_length=4096,
    dtype=None,         # Auto-detect
    load_in_4bit=True,  # QLoRA
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (8-64, higher = more capacity)
    target_modules=["q_proj", "k_proj", "v_proj",
                    "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # 30% less VRAM
)

from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load your dataset
dataset = load_dataset("json", data_files="dataset.json", split="train")

# Training config
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        max_steps=200,  # Adjust based on dataset size
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        output_dir="outputs",
        logging_steps=10,
        optim="adamw_8bit",
        seed=42,
    ),
)

# Start training
trainer.train()
```

What these parameters mean:

  • r=16 — LoRA rank. Higher values learn more but use more VRAM; 16 is a good default for most tasks.
  • per_device_train_batch_size=2 — Lower this to 1 if you run out of VRAM.
  • max_steps=200 — With an effective batch size of 8 (batch size 2 × 4 gradient accumulation steps), 200 steps covers 1,600 examples, or about 1.6 epochs on a 1,000-example set. Watch the loss curve and stop if it plateaus.
  • learning_rate=2e-4 — Standard for QLoRA. Drop to 1e-4 if the model starts "forgetting" its general capabilities.
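The step arithmetic generalizes to any dataset. A small convenience helper (the function name is ours; the defaults mirror the batch settings in the training script above) for converting a target epoch count into a `max_steps` value:

```python
def max_steps_for_epochs(n_examples: int, epochs: float,
                         batch_size: int = 2, grad_accum: int = 4) -> int:
    """Convert a target epoch count into a max_steps value.

    Effective batch size = per-device batch size * gradient accumulation.
    """
    steps_per_epoch = n_examples / (batch_size * grad_accum)
    return round(steps_per_epoch * epochs)

print(max_steps_for_epochs(1_000, 3))  # 375
print(max_steps_for_epochs(5_000, 2))  # 1250
```

Treat the result as a ceiling, not a target: if evaluation quality stops improving earlier, stop earlier.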

Training time estimates (RTX 4090, QLoRA):

| Model | Dataset Size | Approximate Time |
|---|---|---|
| Qwen 3 8B | 1,000 examples | ~15 minutes |
| Qwen 3 14B | 1,000 examples | ~30 minutes |
| Qwen 3 14B | 5,000 examples | ~2 hours |
| Llama 4 Scout 17B | 1,000 examples | ~45 minutes |

Step 4: Export to GGUF for Ollama

The whole point of local fine-tuning is running the result locally. Unsloth has built-in GGUF export:

```python
# Save as GGUF for Ollama
model.save_pretrained_gguf(
    "my-finetuned-model",
    tokenizer,
    quantization_method="q4_k_m",  # Best quality/size balance
)
```

Quantization options:

  • q4_k_m — Recommended. Best balance of quality and file size.
  • q5_k_m — Slightly better quality, ~25% larger.
  • q8_0 — Near-lossless, but large files.
  • f16 — Full precision, no quality loss, very large.
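You can estimate file sizes from bits per weight before exporting. A back-of-envelope sketch — the bits-per-weight figures below are approximate rules of thumb, and real GGUF files vary with architecture and metadata:

```python
# Approximate bits per weight for common GGUF quantizations (rough
# community rules of thumb, not exact format constants).
BITS_PER_WEIGHT = {"q4_k_m": 4.85, "q5_k_m": 5.69, "q8_0": 8.5, "f16": 16.0}

def gguf_size_gb(params_b: float, quant: str) -> float:
    """Estimated GGUF file size in (decimal) GB for a given quantization."""
    return round(params_b * BITS_PER_WEIGHT[quant] / 8, 1)

for q in BITS_PER_WEIGHT:
    print(f"14B @ {q}: ~{gguf_size_gb(14, q)} GB")
```

For a 14B model this works out to roughly 8-10 GB at 4-5 bits, which is why q4_k_m is the default choice for 24GB-class GPUs.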

Step 5: Run in Ollama

```bash
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./my-finetuned-model/unsloth.Q4_K_M.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>"""
PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7
EOF

# Create the Ollama model
ollama create my-model -f Modelfile

# Run it
ollama run my-model
```

That's the full pipeline: raw data → QLoRA training → GGUF export → Ollama. No cloud, no API keys, no subscription.
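Beyond the interactive CLI, you can also call the model from code through Ollama's REST API. A minimal standard-library sketch, assuming the default Ollama endpoint on localhost:11434 and the `my-model` name created above (`build_payload` and `ask` are our own helpers):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_payload(model: str, prompt: str) -> dict:
    """Non-streaming generate request for Ollama's REST API."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    """Send one prompt to the local Ollama server and return the reply text."""
    data = json.dumps(build_payload(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires `ollama serve` running with the model created above:
# print(ask("my-model", "What are the symptoms of type 2 diabetes?"))
```

This is also the easiest way to run your held-out evaluation set against the fine-tuned model in a loop.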

Common Mistakes (and How to Avoid Them)

1. Training on too little data. Under 500 examples, you're likely overfitting. The model memorizes your examples instead of learning the pattern.

2. Training too long. Watch the training loss. If it drops below 0.5 and keeps going, you're overfitting. Save checkpoints and test each one.

3. Bad data quality. "Garbage in, garbage out" is literally true for fine-tuning. One incorrect example can poison hundreds of correct ones. Curate aggressively.

4. Wrong base model. Don't fine-tune a 70B model to do what a fine-tuned 8B can do. Start with the smallest model that gets close to your needs, then fine-tune that.

5. Forgetting evaluation. Always hold out 10-20% of your data for evaluation. A model that scores well on training data but fails on held-out data has learned nothing useful.
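Mistake 5 is the easiest to automate. A minimal holdout-split sketch (the function name is ours; the seed is fixed so the split is reproducible across runs):

```python
import random

def train_eval_split(examples: list, eval_fraction: float = 0.1,
                     seed: int = 42) -> tuple[list, list]:
    """Shuffle and split examples into (train, eval) with a fixed seed."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

train, held_out = train_eval_split(list(range(1_000)))
print(len(train), len(held_out))  # 900 100
```

Run the split once, save both files, and never let the eval portion touch the trainer.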

Which Base Model to Fine-Tune?

| Use Case | Recommended Base | Why |
|---|---|---|
| General assistant | Qwen 3 14B | Best all-round quality at this size |
| Coding | Qwen 3 Coder 14B | Built for code, strong baseline |
| Multilingual | Qwen 3 14B | Native multilingual support |
| Reasoning-heavy | DeepSeek R1 14B | Explicit chain-of-thought |
| Lightweight/edge | Gemma 3 4B | Excellent quality-per-parameter |

See our Open Source LLM Leaderboard for the full ranking, or our Best Ollama Models guide for models ready to run immediately.

The Bottom Line

Fine-tuning locally in 2026 is genuinely accessible. A single RTX 4090, Unsloth, and a weekend of data preparation is all you need to create a model that speaks your domain's language and runs entirely on your hardware.

The pipeline:

1. Collect 1,000+ quality examples

2. Install Unsloth

3. Run QLoRA training (~30 min for a 14B model)

4. Export to GGUF

5. Load in Ollama or vLLM

Start small. Fine-tune an 8B model on 1,000 examples. If the results justify it, scale up to 14B or more data. Most teams never need to go beyond that.


Related: Best Ollama Models 2026 | Open Source LLM Leaderboard 2026 | Best GPU for AI 2026 | vLLM vs Ollama vs TGI

FAQ

Can you fine-tune an LLM on a consumer GPU?

Yes — QLoRA (Quantized Low-Rank Adaptation) enables fine-tuning 7B models on 8GB VRAM, 13B on 16GB, and 30B+ on 24GB. A full fine-tune is impractical locally, but LoRA and QLoRA adapters achieve near-full-tune quality at 5-10% of the memory cost.

What is LoRA fine-tuning?

LoRA (Low-Rank Adaptation) adds small trainable adapter layers to a frozen base model. Instead of updating all billions of parameters, you train ~0.1-1% of them. The resulting adapter file is tiny (10-500MB) and can be loaded on top of the base model at inference time.
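To see where that ~0.1-1% figure comes from, count the adapter parameters directly. A back-of-envelope sketch that treats every target matrix as square d_model × d_model — this understates the MLP projections, so real counts run somewhat higher; the shape numbers below are illustrative, not from any specific checkpoint:

```python
def lora_trainable_params(d_model: int, n_layers: int,
                          n_target_matrices: int, r: int) -> int:
    """Trainable parameters when each targeted d×d weight matrix gets a
    pair of low-rank adapters A (d×r) and B (r×d)."""
    per_matrix = 2 * d_model * r  # A and B together
    return n_layers * n_target_matrices * per_matrix

# An 8B-class shape: hidden 4096, 32 layers, 7 target matrices, rank 16
n = lora_trainable_params(4096, 32, 7, 16)
print(f"{n:,} trainable params")  # ~29M, well under 1% of 8B
```

Doubling the rank doubles the adapter size, which is why r=16 adapters stay in the tens of megabytes.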

What dataset size do I need for fine-tuning?

For LoRA on a specific style or domain: 100-500 examples produce noticeable results. For reliable behavior changes: 1,000-5,000 examples. For a complete domain expert: 10,000+ quality examples. More data helps, but quality matters more than quantity — 500 excellent examples beat 5,000 mediocre ones.

What tools should I use to fine-tune a local LLM?

LLaMA Factory is the most user-friendly — GUI and CLI support, works with most model architectures. Axolotl is more flexible for custom training setups. Unsloth (2× faster training, 60% less VRAM) is excellent for Llama/Qwen/Mistral families on consumer hardware. All are open-source and free.

How long does fine-tuning take on an RTX 4090?

LoRA fine-tune on 7B model: 1,000 examples × 3 epochs = ~30-60 minutes on RTX 4090. 13B: 1-2 hours. Training time scales linearly with dataset size and epochs. QLoRA is ~2× slower than LoRA due to quantization overhead. Use gradient checkpointing to reduce VRAM at the cost of ~20% more training time.

