TurboQuant: 6x KV-cache Compression for Local Inference
KV-cache is the silent budget breaker in local LLM inference. Not the weights—they can be aggressively quantized with GGUF, AWQ, or GPTQ. It is the KV-cache that explodes the VRAM budget when you try to run a 128K-context model locally. For instance, if you're looking to run 70B+ LLMs on a budget, the AMD Strix Halo can be a game-changer, but even then, managing the KV-cache efficiently is crucial.
Google Research presented TurboQuant at ICLR 2026, and the numbers are concrete: 6x reduction in KV-cache size, 3-bit quantization without fine-tuning, and up to 8x faster attention computation on H100 (4-bit configuration).
What is KV-cache, and why is it a problem?
When an LLM processes a long context, it stores intermediate values (keys and values from the attention mechanism) in the KV-cache. The cache grows linearly with context length. For a 7B model with a 128K context, the KV-cache can easily consume 20–30 GB of VRAM—meaning the entire context window won’t fit on a 24 GB RTX 4090 alone. The weights are not the problem. The KV-cache is. If you're considering hardware upgrades, the Intel Arc Pro B70 offers a compelling 32GB GPU option for local AI tasks.
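Where do numbers like that come from? The cache size follows directly from the attention configuration: two tensors (K and V) per layer, one row per cached token. Here is a back-of-the-envelope sketch; the layer count, KV-head count, head dimension, and precision are assumed Llama-style values for illustration, not figures from the paper.

```python
def kv_cache_bytes(context_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    # K and V per layer, each of shape [context_len, num_kv_heads * head_dim]
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_value

# Assumed 8B-class config with grouped-query attention (GQA).
fp16 = kv_cache_bytes(128_000, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"FP16 KV-cache: ~{fp16 / 1e9:.0f} GB")    # ~17 GB; models without GQA
                                                 # land several times higher
print(f"6x compressed: ~{fp16 / 6 / 1e9:.1f} GB")
```

The exact figure depends heavily on whether the model uses grouped-query attention, which is why quoted numbers for "7B at 128K" vary so much.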
How TurboQuant works
TurboQuant combines two techniques in sequence:
PolarQuant (Stage 1): Converts vector data from Cartesian coordinates to polar coordinates—radius (magnitude) and angle (direction). Data maps to a predictable circular grid instead of a variable square grid, eliminating costly normalization and drastically cutting memory requirements.
QJL — Quantized Johnson-Lindenstrauss (Stage 2): A mathematical transformation that reduces remaining errors to simple sign bits (+1 or −1). A specialized estimator balances high-precision queries against low-entropy data to preserve accuracy in attention score computations.
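The published algorithm has more machinery (and the speedup comes from custom kernels), but the core of the two stages can be sketched in a few lines of NumPy. Everything below, including the function names, the sketch dimension, and the sign-based inner-product estimator, is an illustrative toy under my own simplifying assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def polar_split(x):
    """Stage 1 idea (PolarQuant-flavoured): store a key as magnitude plus unit
    direction, so the direction lives on a fixed sphere and the magnitude can
    be kept separately at low precision."""
    norm = np.linalg.norm(x)
    return norm, x / (norm + 1e-12)

def qjl_signs(direction, proj):
    """Stage 2 idea (QJL-flavoured): random Gaussian projection followed by
    1-bit quantization, keeping only the sign of each projected coordinate."""
    return np.sign(proj @ direction)

def estimate_dot(query, key_norm, key_signs, proj):
    """Estimate <query, key> from the stored sign bits: the query stays in
    full precision, only the key side is quantized."""
    m = proj.shape[0]
    return key_norm * np.sqrt(np.pi / 2) * (proj @ query) @ key_signs / m

# Toy demo; dimensions are illustrative, and the estimate tightens as the
# sketch dimension m grows.
d, m = 128, 8192
proj = rng.standard_normal((m, d))
key, query = rng.standard_normal(d), rng.standard_normal(d)

key_norm, key_dir = polar_split(key)
key_signs = qjl_signs(key_dir, proj)

print("exact attention logit:    ", float(query @ key))
print("estimate from 1-bit codes:", float(estimate_dot(query, key_norm, key_signs, proj)))
```

The appeal of the sign trick is that each stored coordinate costs a single bit, yet attention scores can still be estimated without first reconstructing full-precision keys.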
The result is 3-bit quantization of the KV-cache—without the model needing fine-tuning. This kind of optimization is particularly useful when choosing an inference server; for example, comparing vLLM vs Ollama vs TGI can help you understand which setup might benefit most from TurboQuant.
The numbers
Tested on Llama-3.1-8B-Instruct, Gemma, and Mistral against benchmarks including LongBench, Needle In A Haystack, RULER, and ZeroSCROLLS:
- 6x reduction in KV-cache size with preserved accuracy
- 8x faster attention on H100 GPU (4-bit config)
- Outperforms KIVI, PQ, and RaBitQ on vector search recall (GloVe dataset)
- No training requirements—drop-in for existing models

By leveraging TurboQuant, you can push the boundaries of what's possible with local inference, making it feasible to run larger models on more accessible hardware like the RTX 50-series GPUs, as detailed in the Best Local LLMs for Every RTX 50-Series GPU guide.
TurboQuant solves a fundamentally different problem than GPTQ, AWQ, and GGUF. They compress *model weights*. TurboQuant compresses *KV-cache during runtime*—two separate bottlenecks with two separate solutions.
What this means for you
Do you have an RTX 5090 with 32 GB VRAM? With TurboQuant, 128K-context models that normally need 24+ GB for the KV-cache alone can suddenly run with room to spare. See best LLMs for RTX 5090 for which models are relevant.
Do you have 24 GB VRAM (RTX 4090 / RTX 3090 Ti)? Models that today are practically limited to 32K–64K context can potentially handle 128K–192K with TurboQuant. It’s the difference between analyzing a chapter and a whole book in one run.
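The arithmetic behind that claim is straightforward: for a fixed slice of VRAM reserved for the KV-cache, the context you can hold grows in proportion to how few bytes each cached token needs. In the rough sketch below, the per-token size assumes an older 7B layout without grouped-query attention, and the 16 GB budget is an assumption for a 24 GB card, not a measured figure.

```python
kv_budget_gb      = 16.0                    # assumed VRAM left for KV-cache on a 24 GB card
bytes_per_token   = 2 * 32 * 32 * 128 * 2   # K+V, 32 layers, 32 KV heads, head_dim 128, FP16
compression_ratio = 6.0                     # TurboQuant's reported KV-cache reduction

tokens_fp16 = kv_budget_gb * 1e9 / bytes_per_token
print(f"FP16 cache:      ~{tokens_fp16 / 1e3:.0f}K tokens of context")
print(f"With TurboQuant: ~{tokens_fp16 * compression_ratio / 1e3:.0f}K tokens of context")
```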
No fine-tuning means the technique is theoretically compatible with existing GGUF-quantized models from llama.cpp, Ollama, and LM Studio—assuming the framework implements it. If you want hardware to test on locally, an RTX 5090 on Amazon is the most direct route.
When will this be in llama.cpp and vLLM?
As of March 2026, TurboQuant has been presented at ICLR but not yet implemented in any of the major inference frameworks. The paper is available on arXiv with formal theoretical proofs, and the QJL component was previously published at AAAI, so the underlying theory has already been through peer review.
Historically, it takes 3–9 months from a technique being published until it appears as a flag in llama.cpp or vLLM. Keep an eye on the llama.cpp GitHub and vLLM release notes over summer 2026.
Want to run experiments on H100 where TurboQuant shows 8x speedup already? Vast.ai is the fastest way to access the hardware.
Conclusion
TurboQuant is not an incremental improvement. A 6x reduction in KV-cache size with 3-bit quantization without fine-tuning addresses a real bottleneck that standard weight quantization does not solve.
The question is not *if* this lands in local inference frameworks—it is *when*.
*Source: Google Research — TurboQuant, presented at ICLR 2026.*
Practical Examples
Let’s look at a concrete scenario where TurboQuant can make a difference. Suppose you want to run a 7B model with a 128K context locally on an RTX 4090 with 24 GB VRAM. Without compression, the KV-cache alone would consume roughly 20–30 GB of VRAM, which exceeds the GPU’s capacity. With TurboQuant, that shrinks to roughly 3–5 GB, making it possible to run the model at full context length.
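Putting the same scenario in numbers, here is the VRAM budget check; the weight footprint is an assumed value for a 4-bit-quantized 7B model, not a measurement:

```python
vram_gb          = 24.0    # RTX 4090
weights_gb       = 4.5     # assumed: ~7B parameters at ~4-bit GGUF plus overhead
kv_cache_fp16_gb = 25.0    # midpoint of the 20-30 GB range above
kv_compressed_gb = kv_cache_fp16_gb / 6.0   # TurboQuant's reported reduction

total_gb = weights_gb + kv_compressed_gb
print(f"KV-cache after compression: {kv_compressed_gb:.1f} GB")
print(f"Weights + KV-cache: {total_gb:.1f} GB (fits in {vram_gb:.0f} GB: {total_gb < vram_gb})")
```

Even with generous allowances for activations and framework overhead, the total stays well under the 24 GB ceiling.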
Benchmark Tests
TurboQuant was evaluated on multiple models and long-context benchmarks. Here are some specific results:
- Llama-3.1-8B-Instruct: The model was tested against LongBench, Needle In A Haystack, RULER, and ZeroSCROLLS. TurboQuant demonstrated a 6x reduction in KV-cache size without compromising on accuracy.
- Gemma and Mistral: These models were also included in the benchmark tests, and TurboQuant proved effective in reducing memory usage while maintaining high accuracy.
How to Implement TurboQuant
Implementing TurboQuant can be technically challenging, but here are some steps to get you started:
1. Check for TurboQuant support: As of this writing, TurboQuant is not yet built into the major inference frameworks (see the section above on llama.cpp and vLLM), so start from the paper and any reference code Google Research publishes alongside it.
2. Install necessary libraries: You need to install libraries that support quantization and compression. For example, you can use torch for PyTorch models.
3. Configure TurboQuant: Follow the instructions from Google Research to set up the PolarQuant and QJL transformations in your model; a rough sketch of where the quantize/dequantize hooks would sit follows this list.
4. Test and adjust: After implementation, test the model and make any necessary adjustments to ensure optimal performance.
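To make steps 3 and 4 a bit more concrete, the sketch below shows where the compression hooks would sit in an inference loop: quantize on the KV write path, decode on the read path. This is a hypothetical integration outline with placeholder codec functions, not a real TurboQuant API; an actual kernel would also compute attention directly on the compressed codes, which is where the reported 8x speedup comes from.

```python
import torch

class QuantizedKVCache:
    """Hypothetical integration point (not a real TurboQuant API): compress K/V
    on the write path, decompress on the read path, so full-precision tensors
    never sit in VRAM for the whole context."""

    def __init__(self, quantize_kv, dequantize_kv):
        self.quantize_kv = quantize_kv      # e.g. PolarQuant + QJL encode
        self.dequantize_kv = dequantize_kv  # decode back to FP16 for attention
        self.slots = []                     # one compressed entry per token

    def append(self, k: torch.Tensor, v: torch.Tensor):
        self.slots.append(self.quantize_kv(k, v))

    def materialize(self):
        ks, vs = zip(*(self.dequantize_kv(s) for s in self.slots))
        return torch.stack(ks), torch.stack(vs)

# Smoke test with identity "codecs" standing in for the real kernels.
cache = QuantizedKVCache(lambda k, v: (k, v), lambda s: s)
for _ in range(4):
    cache.append(torch.randn(8, 128), torch.randn(8, 128))
K, V = cache.materialize()
print(K.shape, V.shape)   # torch.Size([4, 8, 128]) torch.Size([4, 8, 128])
```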