What is Quantization? A Practical Guide for Local LLMs (2026)
Quantization is crucial for running large language models locally without memory issues. Understand it to choose the right model and format for your GPU.
Quantization is the key to running large language models (LLMs) locally without crashing due to memory constraints. Without it, even a 12B parameter model at full precision can exceed the VRAM of most consumer GPUs. Understanding quantization can help you choose the right model and format for your hardware.
What is Quantization, Really?
Quantization reduces the precision of a model's weights, similar to how audio compression shrinks file sizes. A language model consists of billions of numbers (weights). In their original form, each weight is a 32-bit floating-point number (FP32). Quantization lowers this precision, making the model smaller and faster but at the cost of some quality.
Here's the hierarchy:
| Precision | Bits per weight | Relative size (Q8 = 1x) | Quality |
|---|---|---|---|
| FP32 | 32 bits | 4x baseline | Reference (overkill for inference) |
| FP16 / BF16 | 16 bits | 2x baseline | Essentially identical to FP32 |
| INT8 / Q8 | 8 bits | 1x baseline | Near-perfect, tiny quality loss |
| INT4 / Q4 | 4 bits | 0.5x baseline | Sweet spot — surprisingly good |
| INT2 / Q2 | 2 bits | 0.25x baseline | Noticeable degradation |
A 70-billion-parameter model at FP16 requires about 140 GB of memory. At Q4 quantization, that drops to roughly 40 GB; at Q8, it's around 70 GB. The fewer bits per weight, the less VRAM you need — but push too low and quality suffers.
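If you want to sanity-check these numbers yourself, the back-of-envelope math is just parameters × bits per weight ÷ 8. Here's a minimal sketch — the bits-per-weight figures are approximate effective values for common GGUF quant types, and it ignores the KV cache and runtime overhead:

```python
# Back-of-envelope size of the model weights alone, in GB.
# Ignores KV cache, activations, and runtime overhead.
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Approximate effective bits per weight for common precisions/quants
for label, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"70B @ {label}: ~{weight_size_gb(70, bpw):.0f} GB")
```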
Why Quantization Matters for Local LLMs
Running models locally requires managing VRAM efficiently. Here's what you're working with in 2026:
- Budget GPU (RTX 4060): 8 GB VRAM
- Mid-range (RTX 4070 Ti Super): 16 GB VRAM
- Enthusiast (RTX 3090 / 4090): 24 GB VRAM
- Pro (dual GPUs or Mac Studio): 48-192 GB
Without quantization, even the enthusiast tier tops out around a 12B parameter model at full precision. With Q4 quantization, a 24 GB card comfortably runs 32B models — roughly GPT-3.5-class capability or better — and can even stretch to a 70B model with partial CPU offloading.
Quantization Formats Explained
Different tools use various quantization approaches, affecting compatibility, speed, and quality.
GGUF (llama.cpp) — The People's Choice
Best for: Single GPU, CPU+GPU hybrid, Ollama users, beginners
GGUF is the dominant format for local LLM inference in 2026. Developed alongside llama.cpp, GGUF files work on Windows, Mac, and Linux. They support CPU+GPU split inference, meaning if your model is too big for your GPU alone, it spills over to system RAM. Slower, but it works.
If you're using Ollama, you're already using GGUF. When you run ollama pull qwen2.5:32b, Ollama downloads a pre-quantized GGUF file and handles everything.
GGUF files come in multiple quality tiers, allowing you to pick the right trade-off for your hardware.
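If you'd rather script against a GGUF file directly instead of going through Ollama, the llama-cpp-python bindings expose the same CPU+GPU split. A minimal sketch — the file path and layer count below are placeholders for your own setup:

```python
# Minimal sketch: load a local GGUF with partial GPU offload (llama-cpp-python).
# The file path and n_gpu_layers value are placeholders — tune them to your VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen2.5-32b-instruct-q4_k_m.gguf",  # any GGUF you've downloaded
    n_gpu_layers=40,  # layers kept in VRAM; -1 offloads every layer to the GPU
    n_ctx=8192,       # context window — larger values need more memory
)

out = llm("Explain GGUF in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```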
GPTQ — The GPU Purist
Best for: GPU-only inference, batch processing, vLLM/TGI servers
GPTQ is optimized for pure GPU inference. It's faster than GGUF when running entirely on GPU, especially for serving multiple users simultaneously. The downside? No CPU fallback. If it doesn't fit in VRAM, it doesn't run.
Use GPTQ if you're setting up a local API server with tools like vLLM or text-generation-inference and your model fits entirely in VRAM.
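As a rough illustration, a minimal vLLM setup might look like the sketch below — the model ID is just an example GPTQ repo, not a specific recommendation:

```python
# Illustrative sketch: serve a GPTQ-quantized model entirely on GPU with vLLM.
# The model ID is an example; pick any GPTQ repo that fits in your VRAM.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4", quantization="gptq")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize what GPTQ quantization does."], params)
print(outputs[0].outputs[0].text)
```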
AWQ — The Quality King
Best for: GPU-only inference where quality matters most
AWQ (Activation-Aware Weight Quantization) is smarter about which weights to compress. It identifies the most important weights and preserves them at higher precision, while aggressively compressing less important ones. The result: better quality than GPTQ at the same file size.
Ecosystem support has caught up in 2026, with vLLM, TGI, and transformers all supporting AWQ natively.
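In vLLM, loading an AWQ model looks the same as the GPTQ sketch above — just pass quantization="awq" when constructing the LLM.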
EXL2 — The Enthusiast's Pick
Best for: Maximum quality per bit, ExLlamaV2 users
EXL2 uses variable bit-width quantization. Instead of quantizing every layer to the same precision, EXL2 assigns different bit depths to different parts of the model based on their importance. The result is the best quality-per-byte of any format.
The catch? It only works with ExLlamaV2, and quantizing a model yourself requires significant time and compute. Most users download pre-made EXL2 quants from HuggingFace.
BitsAndBytes (NF4/INT8) — The Easy Button
Best for: HuggingFace users, quick experiments, fine-tuning
BitsAndBytes integrates directly into the HuggingFace transformers library. Load any model in 4-bit or 8-bit with a single flag: load_in_4bit=True. It's the easiest way to get started, but inference is slower than GGUF or GPTQ, and the quality at 4-bit (NF4) is slightly worse.
Its real strength is QLoRA fine-tuning — training a model at reduced precision to save memory.
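Here's a minimal sketch with the current transformers API — the model ID is just an example, and note that the load_in_4bit flag now lives on a BitsAndBytesConfig object rather than being passed directly:

```python
# Minimal sketch: load an example model in 4-bit NF4 with bitsandbytes + transformers.
# The model ID is illustrative; any causal LM on the Hub works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 is the 4-bit data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # math runs in BF16 for speed/quality
)

model_id = "Qwen/Qwen2.5-7B-Instruct"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

prompt = "In one sentence, what does 4-bit quantization trade away?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```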
Which Format Should You Choose?
For most users: GGUF via Ollama. It's the simplest, most compatible, and has the best tooling. Only consider alternatives if you're running a multi-user server (GPTQ/AWQ) or chasing maximum quality (EXL2).
GGUF Quality Levels: The Complete Breakdown
Most local LLM users end up with GGUF files. Here are the quality tiers.
Q2_K — Don't Bother
Size: ~20% of FP16 | Quality loss: 10-20% perplexity increase
This is the lowest quality level. The model runs, but responses are noticeably worse — more hallucinations, worse reasoning, garbled outputs on complex tasks. Use Q2_K only if you need to squeeze a model into limited VRAM and there's no smaller model available.
We removed Q2_K recommendations from the ToolHalla LLM Finder for this reason.
Q3_K_M — Testing Only
Size: ~25% of FP16 | Quality loss: 5-10% perplexity increase
Usable for quick testing and experimentation, but you'll notice degradation on harder tasks. Think of it as a preview quality level. If Q3_K_M is the best your hardware can handle, consider dropping down a model size.
Q4_K_M — The Sweet Spot ⭐
Size: ~30% of FP16 | Quality loss: 0.5-2% perplexity increase
Q4_K_M delivers 95-99% of the original model's quality at less than half the size. For most tasks — coding, writing, analysis, conversation — you genuinely cannot tell the difference between Q4_K_M and FP16. Benchmarks confirm this: on standard evaluations, the gap is typically under 2 percentage points.
If you're only going to remember one quantization level, make it this one. Q4_K_M is the default recommendation for almost everyone.
Q5_K_M — Premium Quality
Size: ~35% of FP16 | Quality loss: 0.2-0.5% perplexity increase
If you have the VRAM headroom, Q5_K_M is worth the upgrade. The quality gap between Q4_K_M and Q5_K_M is small but measurable, especially on tasks requiring precise reasoning or factual recall. It's the FLAC to Q4_K_M's high-quality MP3.
Q6_K — Near-Lossless
Size: ~40% of FP16 | Quality loss: <0.2% perplexity increase
At this point, you need benchmarks to detect any quality difference. Q6_K is for perfectionists who have VRAM to spare.
Q8_0 — Virtually Perfect
Size: ~55% of FP16 | Quality loss: Negligible
Q8_0 is practically identical to running the model at full FP16 precision. The file is significantly smaller, but the quality is so close to original that even automated benchmarks struggle to find differences. Use this if you have a 48GB+ card and want the best possible quality without the full FP16 memory cost.
Choose the Right Quantization for Your GPU
Find your VRAM, pick your model.
| Your VRAM | Best model tier | Recommended quant | Example setup |
|---|---|---|---|
| 6 GB | 7-8B | Q4_K_M | Qwen 2.5 7B, Llama 3.1 8B, Gemma 3 4B (Q6) |
| 8 GB | 8-12B | Q4_K_M / Q5_K_M | Llama 3.1 8B (Q5), Qwen 2.5 7B (Q6), Gemma 3 12B (Q4, tight) |
| 12 GB | 12-14B | Q4_K_M / Q5_K_M | Qwen 2.5 14B (Q4/Q5), Phi-4 14B (Q5), Gemma 3 12B (Q6) |
| 16 GB | 14-24B | Q4_K_M / Q5_K_M | Mistral Small 24B (Q4), Qwen 2.5 14B (Q6), Phi-4 14B (Q6) |
| 24 GB | 27-32B | Q4_K_M / Q5_K_M | Qwen 2.5 32B (Q4), QwQ 32B (Q4), DeepSeek R1 32B (Q4), Llama 3.3 70B (Q4, partial CPU offload) |
| 48 GB | 70B | Q4_K_M (Q8 for 32B models) | Llama 3.3 70B (Q4), DeepSeek R1 70B (Q4), Qwen 2.5 32B (Q8) |
Important caveat: These estimates assume the model weights are the only thing in VRAM. In practice, you also need VRAM for the KV cache (which grows with context length) and any other applications using the GPU. A safe rule: leave 2-3 GB of headroom. If a model "fits" in exactly 24 GB, it will likely crash during longer conversations.
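To get a feel for how quickly the KV cache grows, here's a rough estimate sketch. The layer, head, and dimension values are illustrative numbers for an 8B-class model with grouped-query attention, not exact figures for any specific release, and many runtimes can quantize the cache to shrink it further:

```python
# Rough KV-cache size: 2 (K and V) x layers x KV heads x head dim
# x context length x bytes per element. Architecture numbers are illustrative.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Example: an 8B-class model (assumed: 32 layers, 8 KV heads, head dim 128)
print(f"{kv_cache_gb(32, 8, 128, 32_768):.1f} GB at a 32K context")
print(f"{kv_cache_gb(32, 8, 128, 8_192):.1f} GB at an 8K context")
```

Bigger models with more layers scale this up proportionally, which is why the 2-3 GB headroom rule above is a floor, not a ceiling.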
Use the ToolHalla LLM Finder to filter models by your exact VRAM and use case — it shows quantization options and estimated memory usage for each model.
How to Download the Right File
HuggingFace model names follow a pattern. Let's decode one:
```
Qwen2.5-32B-Instruct-Q4_K_M.gguf
│       │   │        │
│       │   │        └─ Quantization level
│       │   └─ Variant (Instruct = chat-tuned)
│       └─ Parameter count
└─ Model family
```
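If you're scripting downloads, the same pattern is easy to pull apart programmatically. A toy sketch — it assumes the common <Family>-<Size>B-<Variant>-<Quant>.gguf naming, which not every uploader follows:

```python
# Toy helper: split a GGUF filename into family, size, variant, and quant level.
# Assumes the common "<Family>-<Size>B-<Variant>-<Quant>.gguf" pattern.
import re

def parse_gguf_name(filename: str) -> dict:
    pattern = r"(?P<family>.+?)-(?P<size>\d+(?:\.\d+)?)B-(?P<variant>.+?)-(?P<quant>Q\d.*)\.gguf$"
    m = re.match(pattern, filename)
    return m.groupdict() if m else {}

print(parse_gguf_name("Qwen2.5-32B-Instruct-Q4_K_M.gguf"))
# -> {'family': 'Qwen2.5', 'size': '32', 'variant': 'Instruct', 'quant': 'Q4_K_M'}
```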
Trusted Quantizers
Not everyone quantizing models does a good job. These uploaders are consistently reliable:
- bartowski — Wide selection, quality GGUF quants, detailed benchmarks
- unsloth — Fast quantization, good for latest models
- mradermacher — Huge catalog, systematic approach
- TheBloke — The OG quantizer (less active in 2026 but archives remain useful)
The Easy Way: Ollama
If this all feels like too much, just use Ollama:
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (automatically picks the right quantization)
ollama pull qwen2.5:32b

# Start chatting
ollama run qwen2.5:32b
```
Ollama picks a sensible default quantization (usually Q4_K_M) and handles everything. No hunting through HuggingFace, no manual file management.
Common Mistakes to Avoid
1. Squeezing a too-large model at Q2 instead of using a smaller model at Q5
A 70B model at Q2_K is almost always worse than a 32B model at Q5_K_M. The larger model has more parameters, but extreme compression destroys its advantage. Model quality × quantization quality is what matters, not parameter count alone.
2. Forgetting about context length
Loading a model is only half the VRAM equation. A 32K context window can easily add several gigabytes of VRAM for the KV cache, depending on the model architecture. If you load a model that uses 22 GB of your 24 GB card, you'll run out of memory during longer conversations.
Reduce the context length if you need headroom — inside an Ollama session, run /set parameter num_ctx 8192, or set PARAMETER num_ctx 8192 in a Modelfile.
3. Using GPTQ when GGUF would work better
GPTQ was the go-to format in 2023-2024, but GGUF has caught up and surpassed it for most single-user scenarios. GGUF offers better tooling (Ollama, LM Studio), CPU fallback, and comparable speed. Only use GPTQ if you're specifically running vLLM or need batched inference.
4. Not checking which quantization Ollama actually downloaded
When you ollama pull llama3.3:70b, Ollama picks a default quant — usually Q4_K_M. But sometimes you want Q5 or Q8 for better quality. Check what you actually downloaded, and browse the model's Tags page on ollama.com to see which other quant levels are published:

```bash
# Show details for the model you pulled, including its quantization level
ollama show llama3.3:70b
```
5. Ignoring CPU offloading as an option
If your model is slightly too large for VRAM, GGUF can offload some layers to system RAM. It's slower — the offloaded layers run at CPU and system-RAM speed — but it works. A 70B model at Q4 on a 24 GB card, with roughly half its layers on the GPU and the rest in system RAM, still generates at usable speeds (a few tokens per second on typical hardware). Better than not running it at all.
The Bottom Line
Quantization is essential for practical local LLM use. Without it, you'd need a $10,000+ server to run anything interesting. With it, a $300 used RTX 3090 can run models that rival cloud-based AI.
Here's everything you need to remember:
1. Q4_K_M is the default answer. When in doubt, use it.
2. GGUF via Ollama is the easiest path. It just works.
3. Match your model size to your VRAM, not the other way around. A smaller model at higher quality beats a larger model at Q2.
4. Leave 2-3 GB of VRAM headroom for context length and KV cache.
5. Use the LLM Finder to find exactly which models fit your hardware.
Not sure which GPU to buy? Check our complete hardware guide for local LLMs.
New to AI terminology? Browse the ToolHalla Glossary for plain-language definitions of every term in this article.
*Last updated: February 2026. Know something we missed? Let us know.*
Related Articles
- LangChain vs LlamaIndex vs Haystack in 2026: Best RAG Framework?
- Qwen 3.5 vs Qwen 2.5: Local LLM Comparison 2026
FAQ
What is quantization in LLMs?
Quantization reduces the precision of model weights from 16-bit (FP16) to lower formats like 4-bit (Q4). This shrinks file sizes by 2-4× and reduces VRAM usage, enabling larger models on consumer hardware with minimal quality loss.
What is the difference between Q4, Q5, and Q8?
Q4: 4 bits per weight (4× compression vs FP16). Q5: 5 bits. Q8: 8 bits — nearly identical to FP16 quality. Q4 is the sweet spot: minimal quality loss with significant memory savings. Q2 causes noticeable degradation.
Does quantization reduce LLM quality?
Yes, but minimally for Q4 and above. Perplexity increases ~1-3% with Q4 vs FP16. In practice, Q4 Llama 3 8B is virtually indistinguishable from FP16 for most everyday tasks. Q2 quantization is generally not recommended.
What VRAM do I need for a 70B quantized model?
70B at Q4: ~40GB VRAM. 70B at Q3: ~30GB. 70B at Q2: ~20GB (but quality suffers). A single RTX 4090 (24GB) can't fit 70B Q4 — you need two 24GB GPUs or 40GB+ unified memory.
What is GGUF format?
GGUF (GPT-Generated Unified Format) is the standard file format for quantized LLMs used by llama.cpp, Ollama, and LM Studio. It stores model weights, tokenizer, and metadata in a single file, replacing the older GGML format.
Recommended Hardware
- NVIDIA RTX 4090 GPU — Perfect for running large language models locally with its 24 GB VRAM, making it suitable for Q4 quantization of 70B parameter models.
- ASUS ROG Strix X670E-E Gaming WiFi — A high-performance motherboard that supports the latest GPUs, ideal for building a powerful local LLM setup.
- Corsair RM1000x 1000W 80+ Platinum Fully Modular ATX Power Supply — Provides reliable power for high-end GPUs and other components in your local LLM workstation.
Related Guides
- Best Local LLMs for RTX 5080 in 2026
- Best LLMs for 24GB GPUs: RTX 3090 & 4090 Guide (2026)
- How to Build a Home AI Server in 2026: The Complete Guide