What is Quantization? A Practical Guide for Local LLMs (2026)
Quantization is crucial for running large language models locally without memory issues. Understand it to choose the right model and format for your GPU.
Quantization is the key to running large language models (LLMs) locally without crashing due to memory constraints. Without it, even a 12B parameter model at full precision can exceed the VRAM of most consumer GPUs. Understanding quantization can help you choose the right model and format for your hardware.
What is Quantization, Really?
Quantization reduces the precision of a model's weights, similar to how audio compression shrinks file sizes. A language model consists of billions of numbers (weights). In their original form, each weight is a 32-bit floating-point number (FP32). Quantization lowers this precision, making the model smaller and faster but at the cost of some quality.
Here's the hierarchy:
| Precision | Bits per weight | Relative size (Q8 = 1x) | Quality |
|---|---|---|---|
| FP32 | 32 bits | 4x baseline | Reference (overkill for inference) |
| FP16 / BF16 | 16 bits | 2x baseline | Essentially identical to FP32 |
| INT8 / Q8 | 8 bits | 1x baseline | Near-perfect, tiny quality loss |
| INT4 / Q4 | 4 bits | 0.5x baseline | Sweet spot — surprisingly good |
| INT2 / Q2 | 2 bits | 0.25x baseline | Noticeable degradation |
A 70-billion-parameter model at FP16 requires about 140 GB of memory. At Q4 quantization, that drops to roughly 40 GB; at Q8, it's around 70 GB. The fewer bits per weight, the less VRAM you need — but push too low and quality suffers.
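If you want to sanity-check these numbers yourself, the back-of-envelope math is just parameters × bits per weight ÷ 8. Here's a minimal sketch — the bits-per-weight figures are approximate effective values for common GGUF quant types, and it ignores the KV cache and runtime overhead:

```python
# Back-of-envelope size of the model weights alone, in GB.
# Ignores KV cache, activations, and runtime overhead.
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Approximate effective bits per weight for common precisions/quants
for label, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"70B @ {label}: ~{weight_size_gb(70, bpw):.0f} GB")
```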
Why Quantization Matters for Local LLMs
Running models locally requires managing VRAM efficiently. Here's what you're working with in 2026:
- Budget GPU (RTX 4060): 8 GB VRAM
- Mid-range (RTX 4070 Ti Super): 16 GB VRAM
- Enthusiast (RTX 3090 / 4090): 24 GB VRAM
- Pro (dual GPUs or Mac Studio): 48-192 GB
Without quantization, even the enthusiast tier tops out around a 12B parameter model at full precision. With Q4 quantization, a 24 GB card comfortably runs 32B models — roughly GPT-3.5-class capability or better — and can even stretch to a 70B model with partial CPU offloading.
Quantization Formats Explained
Different tools use various quantization approaches, affecting compatibility, speed, and quality.
GGUF (llama.cpp) — The People's Choice
Best for: Single GPU, CPU+GPU hybrid, Ollama users, beginners
GGUF is the dominant format for local LLM inference in 2026. Developed alongside llama.cpp, GGUF files work on Windows, Mac, and Linux. They support CPU+GPU split inference, meaning if your model is too big for your GPU alone, it spills over to system RAM. Slower, but it works.
If you're using Ollama, you're already using GGUF. When you run ollama pull qwen2.5:32b, Ollama downloads a pre-quantized GGUF file and handles everything.
GGUF files come in multiple quality tiers, allowing you to pick the right trade-off for your hardware.
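If you'd rather script against a GGUF file directly instead of going through Ollama, the llama-cpp-python bindings expose the same CPU+GPU split. A minimal sketch — the file path and layer count below are placeholders for your own setup:

```python
# Minimal sketch: load a local GGUF with partial GPU offload (llama-cpp-python).
# The file path and n_gpu_layers value are placeholders — tune them to your VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen2.5-32b-instruct-q4_k_m.gguf",  # any GGUF you've downloaded
    n_gpu_layers=40,  # layers kept in VRAM; -1 offloads every layer to the GPU
    n_ctx=8192,       # context window — larger values need more memory
)

out = llm("Explain GGUF in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```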
GPTQ — The GPU Purist
Best for: GPU-only inference, batch processing, vLLM/TGI servers
GPTQ is optimized for pure GPU inference. It's faster than GGUF when running entirely on GPU, especially for serving multiple users simultaneously. The downside? No CPU fallback. If it doesn't fit in VRAM, it doesn't run.
Use GPTQ if you're setting up a local API server with tools like vLLM or text-generation-inference and your model fits entirely in VRAM.
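As a rough illustration, a minimal vLLM setup might look like the sketch below — the model ID is just an example GPTQ repo, not a specific recommendation:

```python
# Illustrative sketch: serve a GPTQ-quantized model entirely on GPU with vLLM.
# The model ID is an example; pick any GPTQ repo that fits in your VRAM.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4", quantization="gptq")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize what GPTQ quantization does."], params)
print(outputs[0].outputs[0].text)
```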
AWQ — The Quality King
Best for: GPU-only inference where quality matters most
AWQ (Activation-Aware Weight Quantization) is smarter about which weights to compress. It identifies the most important weights and preserves them at higher precision, while aggressively compressing less important ones. The result: better quality than GPTQ at the same file size.
Ecosystem support has caught up in 2026, with vLLM, TGI, and transformers all supporting AWQ natively.
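In vLLM, loading an AWQ model looks the same as the GPTQ sketch above — just pass quantization="awq" when constructing the LLM.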
EXL2 — The Enthusiast's Pick
Best for: Maximum quality per bit, ExLlamaV2 users
EXL2 uses variable bit-width quantization. Instead of quantizing every layer to the same precision, EXL2 assigns different bit depths to different parts of the model based on their importance. The result is the best quality-per-byte of any format.
The catch? It only works with ExLlamaV2, and quantizing a model yourself requires significant time and compute. Most users download pre-made EXL2 quants from HuggingFace.
BitsAndBytes (NF4/INT8) — The Easy Button
Best for: HuggingFace users, quick experiments, fine-tuning
BitsAndBytes integrates directly into the HuggingFace transformers library. Load any model in 4-bit or 8-bit with a single flag: load_in_4bit=True. It's the easiest way to get started, but inference is slower than GGUF or GPTQ, and the quality at 4-bit (NF4) is slightly worse.
Its real strength is QLoRA fine-tuning — training a model at reduced precision to save memory.
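Here's a minimal sketch with the current transformers API — the model ID is just an example, and note that the load_in_4bit flag now lives on a BitsAndBytesConfig object rather than being passed directly:

```python
# Minimal sketch: load an example model in 4-bit NF4 with bitsandbytes + transformers.
# The model ID is illustrative; any causal LM on the Hub works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 is the 4-bit data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # math runs in BF16 for speed/quality
)

model_id = "Qwen/Qwen2.5-7B-Instruct"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

prompt = "In one sentence, what does 4-bit quantization trade away?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```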
Which Format Should You Choose?
For most users: GGUF via Ollama. It's the simplest, most compatible, and has the best tooling. Only consider alternatives if you're running a multi-user server (GPTQ/AWQ) or chasing maximum quality (EXL2).
GGUF Quality Levels: The Complete Breakdown
Most local LLM users end up with GGUF files. Here are the quality tiers.
Q2_K — Don't Bother
Size: ~20% of FP16 | Quality loss: 10-20% perplexity increase
This is the lowest quality level. The model runs, but responses are noticeably worse — more hallucinations, worse reasoning, garbled outputs on complex tasks. Use Q2_K only if you need to squeeze a model into limited VRAM and there's no smaller model available.
We removed Q2_K recommendations from the ToolHalla LLM Finder for this reason.
Q3_K_M — Testing Only
Size: ~25% of FP16 | Quality loss: 5-10% perplexity increase
Usable for quick testing and experimentation, but you'll notice degradation on harder tasks. Think of it as a preview quality level. If Q3_K_M is the best your hardware can handle, consider dropping down a model size.
Q4_K_M — The Sweet Spot ⭐
Size: ~30% of FP16 | Quality loss: 0.5-2% perplexity increase
Q4_K_M delivers 95-99% of the original model's quality at less than half the size. For most tasks — coding, writing, analysis, conversation — you genuinely cannot tell the difference between Q4_K_M and FP16. Benchmarks confirm this: on standard evaluations, the gap is typically under 2 percentage points.
If you're only going to remember one quantization level, make it this one. Q4_K_M is the default recommendation for almost everyone.
Q5_K_M — Premium Quality
Size: ~35% of FP16 | Quality loss: 0.2-0.5% perplexity increase
If you have the VRAM headroom, Q5_K_M is worth the upgrade. The quality gap between Q4_K_M and Q5_K_M is small but measurable, especially on tasks requiring precise reasoning or factual recall. It's the FLAC to Q4_K_M's high-quality MP3.
Q6_K — Near-Lossless
Size: ~40% of FP16 | Quality loss: <0.2% perplexity increase
At this point, you need benchmarks to detect any quality difference. Q6_K is for perfectionists who have VRAM to spare.
Q8_0 — Virtually Perfect
Size: ~55% of FP16 | Quality loss: Negligible
Q8_0 is practically identical to running the model at full FP16 precision. The file is significantly smaller, but the quality is so close to original that even automated benchmarks struggle to find differences. Use this if you have a 48GB+ card and want the best possible quality without the full FP16 memory cost.
Choose the Right Quantization for Your GPU
Find your VRAM, pick your model.
| Your VRAM | Best model tier | Recommended quant | Example setup |
|---|---|---|---|
| 6 GB | 7-8B | Q4_K_M | Qwen 2.5 7B, Llama 3.1 8B, Gemma 3 4B (Q6) |
| 8 GB | 8-12B | Q4_K_M / Q5_K_M | Llama 3.1 8B (Q5), Qwen 2.5 7B (Q6), Gemma 3 12B (Q4, tight) |
| 12 GB | 12-14B | Q4_K_M / Q5_K_M | Qwen 2.5 14B (Q4/Q5), Phi-4 14B (Q5), Gemma 3 12B (Q6) |
| 16 GB | 14-24B | Q4_K_M / Q5_K_M | Mistral Small 24B (Q4), Qwen 2.5 14B (Q6), Phi-4 14B (Q6) |
| 24 GB | 27-32B | Q4_K_M / Q5_K_M | Qwen 2.5 32B (Q4), QwQ 32B (Q4), DeepSeek R1 32B (Q4), Llama 3.3 70B (Q4, partial CPU offload) |
| 48 GB | 70B | Q4_K_M (Q8 for 32B models) | Llama 3.3 70B (Q4), DeepSeek R1 70B (Q4), Qwen 2.5 32B (Q8) |
Important caveat: These estimates assume the model weights are the only thing in VRAM. In practice, you also need VRAM for the KV cache (which grows with context length) and any other applications using the GPU. A safe rule: leave 2-3 GB of headroom. If a model "fits" in exactly 24 GB, it will likely crash during longer conversations.
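To get a feel for how quickly the KV cache grows, here's a rough estimate sketch. The layer, head, and dimension values are illustrative numbers for an 8B-class model with grouped-query attention, not exact figures for any specific release, and many runtimes can quantize the cache to shrink it further:

```python
# Rough KV-cache size: 2 (K and V) x layers x KV heads x head dim
# x context length x bytes per element. Architecture numbers are illustrative.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Example: an 8B-class model (assumed: 32 layers, 8 KV heads, head dim 128)
print(f"{kv_cache_gb(32, 8, 128, 32_768):.1f} GB at a 32K context")
print(f"{kv_cache_gb(32, 8, 128, 8_192):.1f} GB at an 8K context")
```

Bigger models with more layers scale this up proportionally, which is why the 2-3 GB headroom rule above is a floor, not a ceiling.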
Use the ToolHalla LLM Finder to filter models by your exact VRAM and use case — it shows quantization options and estimated memory usage for each model.
How to Download the Right File
HuggingFace model names follow a pattern. Let's decode one:
```
Qwen2.5-32B-Instruct-Q4_K_M.gguf
│       │   │        │
│       │   │        └─ Quantization level
│       │   └─ Variant (Instruct = chat-tuned)
│       └─ Parameter count
└─ Model family
```
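If you're scripting downloads, the same pattern is easy to pull apart programmatically. A toy sketch — it assumes the common <Family>-<Size>B-<Variant>-<Quant>.gguf naming, which not every uploader follows:

```python
# Toy helper: split a GGUF filename into family, size, variant, and quant level.
# Assumes the common "<Family>-<Size>B-<Variant>-<Quant>.gguf" pattern.
import re

def parse_gguf_name(filename: str) -> dict:
    pattern = r"(?P<family>.+?)-(?P<size>\d+(?:\.\d+)?)B-(?P<variant>.+?)-(?P<quant>Q\d.*)\.gguf$"
    m = re.match(pattern, filename)
    return m.groupdict() if m else {}

print(parse_gguf_name("Qwen2.5-32B-Instruct-Q4_K_M.gguf"))
# -> {'family': 'Qwen2.5', 'size': '32', 'variant': 'Instruct', 'quant': 'Q4_K_M'}
```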
Trusted Quantizers
Not everyone quantizing models does a good job. These uploaders are consistently reliable:
- bartowski — Wide selection, quality GGUF quants, detailed benchmarks
- unsloth — Fast quantization, good for latest models
- mradermacher — Huge catalog, systematic approach
- TheBloke — The OG quantizer (less active in 2026 but archives remain useful)
The Easy Way: Ollama
If this all feels like too much, just use Ollama:
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (automatically picks the right quantization)
ollama pull qwen2.5:32b

# Start chatting
ollama run qwen2.5:32b
```
Ollama picks a sensible default quantization (usually Q4_K_M) and handles everything. No hunting through HuggingFace, no manual file management.
Common Mistakes to Avoid
1. Squeezing a too-large model at Q2 instead of using a smaller model at Q5
A 70B model at Q2_K is almost always worse than a 32B model at Q5_K_M. The larger model has more parameters, but extreme compression destroys its advantage. Model quality × quantization quality is what matters, not parameter count alone.
2. Forgetting about context length
Loading a model is only half the VRAM equation. A 32K context window can easily add several gigabytes of VRAM for the KV cache, depending on the model architecture. If you load a model that uses 22 GB of your 24 GB card, you'll run out of memory during longer conversations.
Reduce the context length if you need headroom — inside an Ollama session, run /set parameter num_ctx 8192, or set PARAMETER num_ctx 8192 in a Modelfile.
3. Using GPTQ when GGUF would work better
GPTQ was the go-to format in 2023-2024, but GGUF has caught up and surpassed it for most single-user scenarios. GGUF offers better tooling (Ollama, LM Studio), CPU fallback, and comparable speed. Only use GPTQ if you're specifically running vLLM or need batched inference.
4. Not checking which quantization Ollama actually downloaded
When you ollama pull llama3.3:70b, Ollama picks a default quant — usually Q4_K_M. But sometimes you want Q5 or Q8 for better quality. Check what you actually downloaded, and browse the model's Tags page on ollama.com to see which other quant levels are published:

```bash
# Show details for the model you pulled, including its quantization level
ollama show llama3.3:70b
```
5. Ignoring CPU offloading as an option
If your model is slightly too large for VRAM, GGUF can offload some layers to system RAM. It's slower — the offloaded layers run at CPU and system-RAM speed — but it works. A 70B model at Q4 on a 24 GB card, with roughly half its layers on the GPU and the rest in system RAM, still generates at usable speeds (a few tokens per second on typical hardware). Better than not running it at all.
The Bottom Line
Quantization is essential for practical local LLM use. Without it, you'd need a $10,000+ server to run anything interesting. With it, a $300 used RTX 3090 can run models that rival cloud-based AI.
Here's everything you need to remember:
1. Q4_K_M is the default answer. When in doubt, use it.
2. GGUF via Ollama is the easiest path. It just works.
3. Match your model size to your VRAM, not the other way around. A smaller model at higher quality beats a larger model at Q2.
4. Leave 2-3 GB of VRAM headroom for context length and KV cache.
5. Use the LLM Finder to find exactly which models fit your hardware.
Not sure which GPU to buy? Check our complete hardware guide for local LLMs.
New to AI terminology? Browse the ToolHalla Glossary for plain-language definitions of every term in this article.
*Last updated: February 2026. Know something we missed? Let us know.*
Related Articles
- LangChain vs LlamaIndex vs Haystack in 2026: Best RAG Framework?
- Qwen 3.5 vs Qwen 2.5: Local LLM Comparison 2026
FAQ
What is quantization in LLMs?
Quantization reduces the precision of model weights from 16-bit (FP16) to lower formats like 4-bit (Q4). This shrinks file sizes by 2-4× and reduces VRAM usage, enabling larger models on consumer hardware with minimal quality loss.
What is the difference between Q4, Q5, and Q8?
Q4: 4 bits per weight (4× compression vs FP16). Q5: 5 bits. Q8: 8 bits — nearly identical to FP16 quality. Q4 is the sweet spot: minimal quality loss with significant memory savings. Q2 causes noticeable degradation.
Does quantization reduce LLM quality?
Yes, but minimally for Q4 and above. Perplexity increases ~1-3% with Q4 vs FP16. In practice, Q4 Llama 3 8B is virtually indistinguishable from FP16 for most everyday tasks. Q2 quantization is generally not recommended.
What VRAM do I need for a 70B quantized model?
70B at Q4: ~40GB VRAM. 70B at Q3: ~30GB. 70B at Q2: ~20GB (but quality suffers). A single RTX 4090 (24GB) can't fit 70B Q4 — you need two 24GB GPUs or 40GB+ unified memory.
What is GGUF format?
GGUF (GPT-Generated Unified Format) is the standard file format for quantized LLMs used by llama.cpp, Ollama, and LM Studio. It stores model weights, tokenizer, and metadata in a single file, replacing the older GGML format.
Recommended Hardware
- NVIDIA RTX 4090 GPU — Perfect for running large language models locally with its 24 GB VRAM, making it suitable for Q4 quantization of 70B parameter models.
- ASUS ROG Strix X670E-E Gaming WiFi — A high-performance motherboard that supports the latest GPUs, ideal for building a powerful local LLM setup.
- Corsair RM1000x 1000W 80+ Platinum Fully Modular ATX Power Supply — Provides reliable power for high-end GPUs and other components in your local LLM workstation.
Related Guides
- Best Local LLMs for RTX 5080 in 2026
- Best LLMs for 24GB GPUs: RTX 3090 & 4090 Guide (2026)
- How to Build a Home AI Server in 2026: The Complete Guide