
Best Local LLMs for RTX 5090 in 2026

A guide to running local LLMs on the RTX 5090 (32GB GDDR7), the only consumer GPU that runs 32B models at Q5_K_M quality. Covers Qwen 2.5, DeepSeek R1, Phi-4, and the 70B stretch pick.

February 23, 2026 · 8 min read · 1,666 words

The NVIDIA RTX 5090 is the new consumer king for local AI. With 32GB of GDDR7 memory and Blackwell's upgraded tensor cores, it sits in a sweet spot that no other consumer GPU touches — powerful enough to run 32B parameter models at high quantization, while offering blazing-fast inference speeds.

Why RTX 5090 for Local AI?

  • 32GB GDDR7 — The only consumer GPU with 32GB. Unlocks 32B models at Q5_K_M+ quality, which 24GB cards can only run at Q4.
  • Blackwell tensor cores — Massive speedup for quantized inference compared to Ada Lovelace (RTX 40-series).
  • Memory bandwidth — GDDR7 pushes significantly more data than GDDR6X, directly translating to faster tok/s.
  • CUDA ecosystem — Full compatibility with Ollama, llama.cpp, vLLM, TensorRT-LLM, and everything else.
  • Single card simplicity — No multi-GPU hassle. One card, one slot, done.

The 32GB sweet spot means you run 32B models the way 24GB cards run 14B models — comfortably, with headroom for long context.
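At batch size 1, decode speed is roughly memory-bandwidth-bound: every generated token reads the full set of weights once. A back-of-the-envelope sketch of the ceiling this implies (the 0.6 efficiency factor is an assumption for real-world overhead, not a measured value):

```python
def decode_tokens_per_sec(bandwidth_gbps: float, model_gb: float,
                          efficiency: float = 0.6) -> float:
    """Rough upper bound on single-stream decode speed.

    Each token reads all model weights once, so tok/s is about
    (effective bandwidth) / (model size). `efficiency` is a fudge
    factor for kernel overhead, KV-cache reads, etc.
    """
    return bandwidth_gbps * efficiency / model_gb

# RTX 5090: ~1800 GB/s GDDR7; a 32B model at Q5_K_M is ~27 GB
print(round(decode_tokens_per_sec(1800, 27)))  # → 40
```

A ~40 tok/s ceiling is consistent with the ~20-30 tok/s real-world figures quoted below for 32B models.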

Quick Start


curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:32b
ollama run qwen2.5:32b
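Beyond the interactive CLI, Ollama also exposes a local REST API on port 11434 by default. A minimal sketch of calling it from Python with only the standard library (assumes the server is running and `qwen2.5:32b` has been pulled):

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "qwen2.5:32b") -> bytes:
    """JSON body for Ollama's /api/generate endpoint (non-streaming)."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask_ollama(prompt: str, host: str = "http://localhost:11434") -> str:
    """Send one generate request to a local Ollama server, return the text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask_ollama("Summarize GDDR7 in one sentence.")` returns the completion text as a string.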

The 32B Sweet Spot

The RTX 5090's killer advantage is running 32B parameter models at Q5_K_M or Q8_0 — quantizations where quality loss is minimal. On 24GB cards, these models are limited to Q4_K_M or Q3. That difference matters.

| Quantization | 32B VRAM | Quality | RTX 5090 Fit |
| --- | --- | --- | --- |
| Q8_0 | ~38GB | Near-perfect | ⚠️ Tight with offloading |
| Q5_K_M | ~27GB | Excellent | ✅ Comfortable |
| Q4_K_M | ~22GB | Very good | ✅ Lots of headroom |
| Q3_K_M | ~18GB | Good | ✅ Room for large context |
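These footprints follow a simple rule of thumb: parameters × effective bits per weight / 8, plus overhead; the table's figures also budget room for context, so they run a few GB above the bare-weights number. A rough estimator (the bits-per-weight values are approximate averages for each GGUF quant, not exact):

```python
# Approximate effective bits per weight for common GGUF quants
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9}

def model_vram_gb(params_billions: float, quant: str,
                  overhead_gb: float = 1.5) -> float:
    """Estimate VRAM for model weights plus a small fixed overhead.

    The KV cache grows with context length and is NOT included here.
    """
    weights_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return round(weights_gb + overhead_gb, 1)

for quant in BITS_PER_WEIGHT:
    print(quant, model_vram_gb(32, quant), "GB")
```

Plug in any parameter count to sanity-check whether a model/quant combination will fit in 32GB before downloading it.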

Top Models for RTX 5090 (32GB VRAM)

🏆 1. Qwen 2.5 32B — The Daily Driver

| Spec | Value |
| --- | --- |
| Parameters | 32B |
| Best Quant | Q5_K_M (27GB) — sweet spot for 32GB |
| Context Window | 32K |
| License | Apache 2.0 |
| Speed (5090) | ~20-30 tok/s |

On 24GB cards, you run this at Q4_K_M. On the 5090, you get Q5_K_M — a noticeable quality bump, especially on nuanced reasoning and creative tasks. This is the model that justifies the 32GB premium.


ollama pull qwen2.5:32b

💻 2. Qwen 2.5 Coder 32B — The Coding Powerhouse

| Spec | Value |
| --- | --- |
| Parameters | 32B |
| Best Quant | Q5_K_M (27GB) |
| Context Window | 32K |
| License | Apache 2.0 |
| Speed (5090) | ~20-30 tok/s |

At Q5_K_M, the Coder variant produces cleaner, more accurate code than it does at Q4 on lesser cards. For professional development work, this quality difference compounds across a full coding session.


ollama pull qwen2.5-coder:32b

🧮 3. DeepSeek R1 Distill 14B — Reasoning at Full Precision

| Spec | Value |
| --- | --- |
| Parameters | 14B |
| Best Quant | FP16 (28GB) — full precision! |
| Context Window | 128K |
| License | MIT |
| Speed (5090) | ~25-35 tok/s |

With 32GB, you can run the 14B DeepSeek R1 at full FP16 precision — zero quantization loss. Chain-of-thought reasoning at its absolute best. This is the setup for math competitions and research problems.


ollama pull deepseek-r1:14b
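R1-style models emit their chain-of-thought inside `<think>...</think>` tags before the final answer, and when scripting against the raw output you usually want to separate the two. A minimal splitter (assumes at most one tag pair, as in typical R1 output):

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split R1-style output into (reasoning, answer).

    Returns an empty reasoning string if no <think> block is present.
    """
    start, end = text.find("<think>"), text.find("</think>")
    if start == -1 or end == -1:
        return "", text.strip()
    reasoning = text[start + len("<think>"):end].strip()
    answer = text[end + len("</think>"):].strip()
    return reasoning, answer

sample = "<think>2+2 is basic arithmetic.</think>The answer is 4."
print(split_reasoning(sample))  # ('2+2 is basic arithmetic.', 'The answer is 4.')
```

Useful for logging the reasoning trace separately or showing only the final answer in a UI.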

⚡ 4. Phi-4 14B — Full Precision + 16K Context

| Spec | Value |
| --- | --- |
| Parameters | 14B |
| Best Quant | FP16 (28GB) |
| Context Window | 16K |
| License | MIT |
| Speed (5090) | ~25-35 tok/s |

Phi-4 at FP16 with its 16K context window is a strong document-processing setup. Load long reports, papers, or source files at full model precision, a combination no 24GB card can offer for a 14B model.


ollama pull phi4:14b
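Note that Ollama serves models with a default context window shorter than the model's maximum. To actually use Phi-4's full 16K for long documents, create a variant with `num_ctx` raised via a Modelfile (a sketch; the name `phi4-16k` is arbitrary):

```shell
# Modelfile: derive a long-context variant of phi4
cat > Modelfile <<'EOF'
FROM phi4:14b
PARAMETER num_ctx 16384
EOF

ollama create phi4-16k -f Modelfile
ollama run phi4-16k
```

A larger `num_ctx` costs extra VRAM for the KV cache, which the 5090's 32GB absorbs easily at FP16 14B.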

🎨 5. Gemma 2 27B — Creative Writing Champion

| Spec | Value |
| --- | --- |
| Parameters | 27B |
| Best Quant | Q8_0 (~32GB — tight fit) or Q5_K_M (23.5GB) |
| Context Window | 8K |
| License | Gemma Terms of Use |
| Speed (5090) | ~18-25 tok/s |

Google's Gemma 2 at Q5_K_M (23.5GB) runs beautifully with 8.5GB to spare. For creative writing and natural conversation, Gemma's output quality is arguably the most "human" of any open-source model.


ollama pull gemma2:27b

🧠 6. Yi 1.5 34B — Multilingual Powerhouse

| Spec | Value |
| --- | --- |
| Parameters | 34B |
| Best Quant | Q5_K_M (~24GB) |
| Context Window | 32K |
| License | Apache 2.0 |
| Speed (5090) | ~18-25 tok/s |

Yi 34B at Q5_K_M fits perfectly in 32GB. Excellent for bilingual (English/Chinese) work and general-purpose tasks. An underrated model that benefits enormously from the 32GB headroom.


ollama pull yi:34b

🏋️ 7. Llama 3.3 70B — Stretch Pick

| Spec | Value |
| --- | --- |
| Parameters | 70B |
| Best Quant | Q3_K_M (~32GB — very tight) |
| Context Window | 128K |
| License | Llama 3.3 Community |
| Speed (5090) | ~6-10 tok/s |

The RTX 5090 can technically fit Llama 3.3 70B at Q3_K_M. It's slow and quality is noticeably reduced at this quantization, but it works for batch processing or non-interactive tasks where you need maximum intelligence.


ollama pull llama3.3:70b

Caveat: Q3 on 70B is usable but not ideal. If you frequently need 70B models, consider the Mac Studio with 128GB+ unified memory.


RTX 5090 vs Other GPUs

| Model | RTX 5080 (16GB) | RTX 3090/4090 (24GB) | RTX 5090 (32GB) |
| --- | --- | --- | --- |
| 14B Q5_K_M | ✅ 30-40 tok/s | ✅ 25-45 tok/s | ✅ 25-35 tok/s (FP16!) |
| 32B Q5_K_M | ❌ Won't fit | ❌ Won't fit | ✅ 20-30 tok/s |
| 32B Q4_K_M | ❌ Won't fit | ✅ 12-28 tok/s | ✅ 22-32 tok/s |
| 70B Q3_K_M | ❌ Won't fit | ❌ Won't fit | ⚠️ 6-10 tok/s |

The 5090's unique value: 32B models at Q5_K_M. No other consumer card can do this.


# The RTX 5090 toolkit
ollama pull qwen2.5:32b          # General purpose (Q5_K_M)
ollama pull qwen2.5-coder:32b    # Coding (Q5_K_M)
ollama pull phi4:14b              # Long docs (FP16, 16K context)
ollama pull deepseek-r1:14b       # Math/reasoning (FP16)
ollama pull mistral-nemo:12b      # Quick Q&A (fastest)

Conclusion

The RTX 5090 carves out a unique position in the local AI landscape. Its 32GB GDDR7 lets you run 32B models at quality levels that 24GB cards simply can't match, while Blackwell's architecture delivers best-in-class inference speed.

Is it worth the premium over a used RTX 3090? If you primarily run 14B models — probably not. But if you want 32B models at Q5_K_M quality, the RTX 5090 is currently the only consumer GPU that can deliver. That's a compelling niche.


*Match your GPU to the perfect model at ToolHalla.ai/models — filter by VRAM and use case.*


FAQ

What is the best LLM for an RTX 5090?

The RTX 5090's 32GB enables Qwen 2.5 32B at Q5_K_M (~20-30 tok/s), DeepSeek R1 14B at full FP16, or Llama 3.3 70B at Q3 (~6-10 tok/s) as a stretch pick. The extra 8GB over the 4090 mainly benefits 30-40B models.

Is RTX 5090 worth it over RTX 4090 for local AI?

If you primarily use models under 24B, the 4090 is still excellent value. The 5090 makes sense for 30B+ models: the extra 8GB enables Q5_K_M instead of Q4_K_M on 32B models. It is also roughly 30% faster across the board thanks to higher memory bandwidth.

What VRAM does RTX 5090 have?

32GB GDDR7 with ~1.8TB/s memory bandwidth — 80% more bandwidth than the 4090 (1.0TB/s). This bandwidth advantage translates directly to tokens/second for LLM inference.

Can RTX 5090 run a 70B model?

70B at Q4 needs ~40GB, which exceeds the 5090's 32GB. But 70B at Q3 (~30GB) just fits. Q3 quality is noticeably better than Q2 (what 4090 users need for 70B), so the 5090 hits a usable middle ground.

What is the RTX 5090 price?

MSRP $1,999, but street prices have been $2,200-2,500 due to limited availability. Used RTX 4090s are $1,600-1,800. The 5090 premium over 4090 is roughly 20-40%.

🏆 8. LLaMA 2 70B — Pushing the Boundaries

| Spec | Value |
| --- | --- |
| Parameters | 70B |
| Best Quant | Q5_K_M (~49GB) — requires heavy CPU offloading |
| Context Window | 4096 |
| License | Meta License |
| Speed (5090) | ~2-4 tok/s (CPU-bound) |

While the RTX 5090 is primarily suited to 32B models, it can still run LLaMA 2 70B at Q5_K_M by keeping roughly half of the layers in system RAM. The result is slow but workable for users who occasionally need a larger model without upgrading hardware.

How to Run LLaMA 2 70B with Offloading

1. Install llama.cpp (CUDA build):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

2. Download a Pre-Quantized GGUF:

pip install huggingface_hub
huggingface-cli download TheBloke/Llama-2-70B-Chat-GGUF llama-2-70b-chat.Q5_K_M.gguf --local-dir models

3. Run with Partial Offloading:

./build/bin/llama-cli -m models/llama-2-70b-chat.Q5_K_M.gguf -ngl 45 -c 4096

The -ngl flag controls how many layers live on the GPU; the remainder run on the CPU, which is what limits speed.
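Picking the -ngl value is just arithmetic: divide the VRAM you can spare by the per-layer weight size. A rough helper (the layer count and file size for 70B at Q5_K_M are approximate):

```python
def layers_on_gpu(model_file_gb: float, n_layers: int,
                  vram_gb: float = 32.0, reserve_gb: float = 4.0) -> int:
    """How many transformer layers fit on the GPU.

    `reserve_gb` is VRAM kept free for the KV cache, CUDA buffers,
    and the desktop; raise it for longer contexts.
    """
    per_layer_gb = model_file_gb / n_layers
    return min(n_layers, int((vram_gb - reserve_gb) / per_layer_gb))

# Llama 2 70B has 80 transformer layers; the Q5_K_M GGUF is ~49 GB
print(layers_on_gpu(49.0, 80))  # → 45
```

Start near this estimate and step -ngl down if you hit out-of-memory errors at your chosen context size.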

🏆 9. GPT-J 6B — Balanced Performance

| Spec | Value |
| --- | --- |
| Parameters | 6B |
| Best Quant | FP16 (~12GB) — ample headroom |
| Context Window | 2048 |
| License | Apache 2.0 |
| Speed (5090) | ~50-60 tok/s |

GPT-J 6B is a solid choice for users who want a balance between capability and resource usage. It runs at full precision on the RTX 5090 with plenty of VRAM to spare, making it well suited to real-time applications.

Practical Example: Setting Up GPT-J 6B

1. Install vLLM:

pip install vllm

2. Serve the Model (vLLM downloads the weights from Hugging Face automatically):

vllm serve EleutherAI/gpt-j-6b

This exposes an OpenAI-compatible API at http://localhost:8000.

🏆 10. Falcon 7B — Efficient and Fast

| Spec | Value |
| --- | --- |
| Parameters | 7B |
| Best Quant | FP16 (~14GB) — fits comfortably |
| Context Window | 2048 |
| License | Apache 2.0 |
| Speed (5090) | ~45-55 tok/s |

Falcon 7B is another solid choice for local AI applications. Its efficient architecture and modest memory footprint make it a top pick for fast inference on the RTX 5090.

Benchmarking Falcon 7B

To benchmark Falcon 7B, serve it with vLLM and measure throughput and latency against its OpenAI-compatible endpoint:

1. Install vLLM:

pip install vllm

2. Serve the Model (weights are downloaded from Hugging Face automatically):

vllm serve tiiuae/falcon-7b-instruct

Then send completion requests to http://localhost:8000/v1/completions and measure throughput and response times under load.
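The measurement can be sketched as a small stdlib-only latency probe (assumes a vLLM server is running on the default port 8000 and serving `tiiuae/falcon-7b-instruct`):

```python
import json
import time
import urllib.request

ENDPOINT = "http://localhost:8000/v1/completions"

def one_request(prompt: str, max_tokens: int = 64) -> float:
    """Return wall-clock seconds for a single completion request."""
    body = json.dumps({"model": "tiiuae/falcon-7b-instruct",
                       "prompt": prompt, "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(ENDPOINT, data=body,
                                 headers={"Content-Type": "application/json"})
    t0 = time.perf_counter()
    urllib.request.urlopen(req).read()
    return time.perf_counter() - t0

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile of a list of latencies."""
    ranked = sorted(values)
    idx = min(len(ranked) - 1, int(round(p / 100 * (len(ranked) - 1))))
    return ranked[idx]

def run_benchmark(n: int = 20) -> None:
    latencies = [one_request("Hello") for _ in range(n)]
    print(f"p50={percentile(latencies, 50):.2f}s  p95={percentile(latencies, 95):.2f}s")
```

Call `run_benchmark()` while the server is up; p50/p95 latency is usually more informative than a single average.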

Key Takeaways

  • The RTX 5090 excels in running 32B parameter models with minimal quality loss at Q5_K_M quantization.
  • For larger models like LLaMA 2 70B, offloading is necessary to fit within the 32GB VRAM limit.
  • Smaller models such as GPT-J 6B and Falcon 7B offer balanced performance and are easy to set up on the RTX 5090.
  • Always consider the quantization level and VRAM requirements when selecting a model for your RTX 5090.

For more detailed guides on setting up and optimizing local LLMs, check out our comprehensive guide on local AI setups.


By leveraging the RTX 5090's capabilities, users can achieve high-quality AI performance with a single, powerful GPU. Whether you're a developer, researcher, or enthusiast, the RTX 5090 provides the perfect balance of power and efficiency for local AI applications in 2026.

  • NVIDIA RTX 5090 GPU — The ideal GPU for running 32B parameter models with high quantization, offering unmatched performance and compatibility with local AI tools.
  • Corsair RMx Series 1000W Power Supply — A high-capacity power supply that ensures stable and efficient power delivery, crucial for the demanding workload of the RTX 5090.
  • Fractal Design Meshify C Mid-Tower ATX Gaming Case — A sleek and spacious case that provides excellent airflow and ample space for building a powerful local AI workstation around the RTX 5090.

