
Dual GPU Setup Guide for Local LLMs (2026): Double Your VRAM

Two RTX 3090s give you 48 GB of VRAM for the price of one RTX 4090. Here is everything you need to know about running local LLMs on dual GPUs — hardware, software, models, and troubleshooting.

February 24, 2026 · 10 min read · 2,079 words

You've been running local LLMs on a single GPU and hitting the wall. Your 24 GB card handles 32B models beautifully, but that 70B model you really want? It needs Q3 quantization to barely fit, and the quality drop is noticeable. The solution isn't a more expensive GPU — it's a second one.

Two RTX 3090s give you 48 GB of VRAM for roughly the same price as a single RTX 4090, which only has 24 GB. That's double the VRAM, which means bigger models at higher quantization levels — and that translates directly to smarter AI on your desk.

Here's everything you need to know about running local LLMs on dual GPUs.


Why Dual GPU?

The math is compelling:

Setup           | VRAM  | Approx. Cost (2026) | Best 70B Quant
----------------|-------|---------------------|---------------
Single RTX 3090 | 24 GB | $800 used           | Q4_K_M (tight)
Single RTX 4090 | 24 GB | $1,800              | Q4_K_M (tight)
2x RTX 3090     | 48 GB | $1,600 used         | Q5_K_M or Q6_K
2x RTX 4090     | 48 GB | $3,600              | Q5_K_M or Q6_K

Two 3090s cost the same as one 4090 but give you double the VRAM. The 4090 is faster per card, but for LLM inference, VRAM capacity matters more than raw compute. A 70B model at Q6 on 48 GB will produce better outputs than the same model at Q4 on 24 GB — every time.

The sweet spot in 2026 is clear: a pair of used RTX 3090s is the best value per gigabyte of VRAM you can buy.


What You Need

Motherboard

You need two PCIe x16 slots, ideally running at x8/x8 when both are populated. Most mid-range and high-end ATX boards from the last five years support this, but check the specs before you buy. Here's the thing most people don't realize: PCIe bandwidth barely matters for LLM inference.

LLM inference is VRAM-bound, not bandwidth-bound. The model weights sit in GPU memory and stay there. After the one-time model load, the only traffic crossing the PCIe bus is the small activation tensors handed between GPUs at the split point, plus the tiny token output. Running at x8 instead of x16 has virtually zero impact on tokens per second.
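You can check the negotiated link yourself; nvidia-smi exposes it as standard query fields:

# Current PCIe generation and lane width per GPU (e.g. "4, 8" = Gen4 x8).
# GPUs downclock the link at idle, so check while a model is loading or inferencing.
nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.width.current --format=csv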

What does matter:

  • Both slots must be CPU-connected (not chipset). Check your motherboard manual.
  • Sufficient physical spacing for two 3-slot cards. Some boards put the slots too close together.
  • BIOS support for "Above 4G Decoding" and "Resizable BAR" (both should be enabled).

Power Supply

Two RTX 3090s can each draw 350W under load. Combined with the rest of your system, you're looking at:

Component      | Power Draw
---------------|-----------
2x RTX 3090    | 600-700 W
CPU + RAM      | 65-150 W
Storage + fans | 30-50 W
Total          | 700-900 W

Minimum: 1000W PSU. Recommended: 1200W. Don't cheap out here — transient power spikes from modern GPUs can trip overcurrent protection on undersized PSUs, causing random shutdowns during inference.

You'll need two separate 8-pin (or 12-pin) power cables per GPU. Never use a single cable with a daisy-chain adapter for a 3090.
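If you're worried about spikes, it's easy to watch real-time draw while a model loads. These are standard nvidia-smi query fields:

# Log per-GPU draw vs. limit once per second
nvidia-smi --query-gpu=index,power.draw,power.limit --format=csv -l 1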

Case and Cooling

Two 3090s generate serious heat. You need:

  • A case that fits two 3-slot GPUs with space between them (Lian Li O11D, Fractal Torrent, Corsair 7000D are all popular choices)
  • Strong front-to-back airflow
  • Consider undervolting both cards (reduces heat by 20-30% with minimal performance loss for inference)

A realistic thermal setup: both GPUs will sit at 65-75°C during inference. That's fine for 24/7 operation, but you'll hear the fans.
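To keep an eye on temperatures during long sessions, a simple polling query works (again, standard nvidia-smi fields):

# Print per-GPU temperature and fan speed every 5 seconds
nvidia-smi --query-gpu=index,temperature.gpu,fan.speed --format=csv -l 5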

Do You Need NVLink?

No. NVLink bridges (which physically connect two GPUs for direct memory access) are technically supported on the RTX 3090, but llama.cpp and Ollama don't use NVLink. They split the model across GPUs over PCIe, which works perfectly well for inference.

NVLink matters for training and some CUDA-specific workloads. For running local LLMs, save your $40-80 and skip it.


Best Dual GPU Combinations

🏆 2x RTX 3090 — Best Value

  • VRAM: 48 GB (24 + 24)
  • Cost: ~$1,600 total (used)
  • Memory bandwidth: 936 GB/s × 2
  • Why: Unbeatable VRAM per dollar. The 24 GB per card is the magic number.

2x RTX 3090 Ti — Slightly Faster

  • VRAM: 48 GB (24 + 24)
  • Cost: ~$1,800 total (used)
  • Same VRAM but ~10% faster compute. Not worth the premium for most users.

RTX 3090 + RTX 4090 — Asymmetric

  • VRAM: 48 GB (24 + 24)
  • Cost: ~$2,600
  • Works with --tensor-split to balance load. The 4090 is faster, so you'd give it more layers. Functional but unnecessary complexity.
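A sketch of what that asymmetric weighting might look like, using a model that leaves headroom on both cards. The 14:20 ratio here is illustrative, not a tuned value:

# ~34 GB model: give the faster 4090 (GPU 1) a larger share of the layers
llama-server -m model.gguf -ngl 999 --tensor-split 14,20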

2x RTX 4090 — Overkill

  • VRAM: 48 GB (24 + 24)
  • Cost: ~$3,600
  • Same VRAM as 2x 3090 at more than double the price. Faster inference, but the 3090 pair is already fast enough for interactive use. Only justified if you also game or do creative work on these cards.

2x Tesla P40 — Budget King

  • VRAM: 48 GB (24 + 24)
  • Cost: ~$300-400 total
  • Catch: Much slower (Pascal architecture, no FP16 tensor cores). Good for experimentation and batch processing where tokens per second doesn't matter. No video output. Needs an aftermarket blower-style cooling solution.

Honorable Mention: Mac Studio M4 Ultra

  • Unified memory: 128-192 GB
  • Cost: $4,000-8,000
  • Not dual GPU, but worth mentioning. Apple Silicon's unified memory means you can run truly massive models (120B+ at high quantization) on a single machine with zero hassle. The trade-off: lower tokens per second compared to NVIDIA GPUs.

Software Setup

The good news: multi-GPU support in 2026 is mature. Most tools handle it automatically or with minimal configuration.

Ollama — Automatic

Ollama detects multiple GPUs and splits models across them automatically. No configuration needed.


# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a 70B model — Ollama will use both GPUs
ollama pull llama3.3:70b

# Start chatting
ollama run llama3.3:70b

To verify both GPUs are being used:


nvidia-smi  # Should show VRAM usage on both GPUs
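If you prefer a scriptable check, the same information is available as query fields:

# Per-GPU memory usage — both rows should show several GB used once a model is loaded
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv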

llama.cpp — Manual Control

llama.cpp gives you fine-grained control over how the model is split across GPUs:


# Equal split (24GB + 24GB)
llama-server -m model.gguf -ngl 999 --tensor-split 24,24

# Asymmetric split (if GPUs have different VRAM)
llama-server -m model.gguf -ngl 999 --tensor-split 14,24

# With flash attention and optimized KV cache
llama-server -m model.gguf -ngl 999 --tensor-split 24,24 \
  --flash-attn --ctx-size 32768 \
  --cache-type-k q8_0 --cache-type-v q8_0

The --tensor-split values are proportional weights, not absolute sizes. 24,24 means "split evenly." 14,24 means "give 37% to GPU 0 and 63% to GPU 1."
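In other words, only the ratio matters. These two commands produce the same even split:

# Equivalent even splits — the values are ratios, not gigabytes
llama-server -m model.gguf -ngl 999 --tensor-split 1,1
llama-server -m model.gguf -ngl 999 --tensor-split 50,50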

vLLM — Production Serving

For serving models to multiple users:


# Tensor parallel across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95
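vLLM exposes an OpenAI-compatible API (on port 8000 unless you pass --port), so any OpenAI client works. A quick smoke test with curl:

# Assumes the server above is running on the default port
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Say hello."}]
  }'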

LM Studio — GUI

In LM Studio settings, select both GPUs under the GPU configuration panel. LM Studio handles the split automatically when you load a model that needs more VRAM than a single GPU provides.


What Can You Run on 48 GB?

Here's the upgrade from 24 GB to 48 GB — the models that were impossible or compromised before are now comfortable:

Model           | Quant   | VRAM Used | Quality | Notes
----------------|---------|-----------|---------|------
Llama 3.3 70B   | Q5_K_M  | ~47 GB    | ★★★★★   | Near-lossless. The gold standard.
Qwen 2.5 72B    | Q4_K_M  | ~49 GB    | ★★★★☆   | Sweet spot. Excellent for coding.
DeepSeek R1 70B | Q5_K_M  | ~47 GB    | ★★★★★   | Best reasoning model you can run locally.
MiniMax M2.5    | Q3_K_XL | ~45 GB    | ★★★★☆   | MoE (230B/10B active). Frontier quality.
Qwen 2.5 32B    | Q8_0    | ~34 GB    | ★★★★★   | Virtually perfect quality. Headroom for huge context.
Command-R+ 104B | Q3_K_M  | ~44 GB    | ★★★☆☆   | Massive model, but Q3 limits quality.
Llama 4 Scout   | Q4_K_M  | ~45 GB    | ★★★★☆   | MoE (109B/17B active). 512K context.

The jump from 24 GB to 48 GB isn't just "slightly bigger models." It's the difference between running a 70B model at Q3 (noticeable quality loss) and Q5 (near-perfect). That's a meaningful upgrade in output quality for coding, reasoning, and creative tasks.

Use the ToolHalla LLM Finder to see all 60+ models that fit 48 GB — filter by use case to find exactly what you need.


Troubleshooting Common Issues

"Only one GPU is being used"

  • Check NVIDIA drivers: Run nvidia-smi — both GPUs should appear.
  • Check CUDA visibility: echo $CUDA_VISIBLE_DEVICES — if set, it might be limiting to one GPU. Unset it: unset CUDA_VISIBLE_DEVICES
  • Ollama: Restart the service after installing the second GPU: sudo systemctl restart ollama
  • llama.cpp: Make sure you're using --tensor-split or -ngl 999 (offload all layers to GPU)
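A quick two-liner that covers the first two checks above:

nvidia-smi -L                                                  # should list both GPUs
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"   # "<unset>" is what you want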

"Model loads but inference is very slow"

If one GPU is significantly slower than the other (mixed generations), use asymmetric tensor split:


# GPU 0 is slower, GPU 1 is faster
llama-server -m model.gguf -ngl 999 --tensor-split 10,24

"System crashes or resets during model loading"

Power supply issue. Model loading briefly spikes power draw on both GPUs simultaneously. Solutions:

  • Upgrade PSU to 1200W+
  • Lower both GPUs' power limits: nvidia-smi -i 0 -pl 300 and nvidia-smi -i 1 -pl 300 (caps peak draw at 300W instead of 350W per card)

"BIOS doesn't see the second GPU"

  • Enable "Above 4G Decoding" in BIOS (required for >4 GB GPU memory mapping)
  • Enable "Resizable BAR" if available
  • Try the second GPU in a different PCIe slot
  • Update BIOS to the latest version

"Out of memory despite having 48 GB"

Don't forget the KV cache. A 70B model at Q5 uses ~47 GB for weights, leaving only 1 GB for context. Solutions:

  • Use quantized KV cache: --cache-type-k q8_0 --cache-type-v q8_0 (halves KV cache memory)
  • Reduce context length: --ctx-size 8192 instead of the default 32K
  • Drop to Q4_K_M to free up 5-7 GB of headroom
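To see where the memory goes, here's the back-of-envelope KV cache math, assuming Llama 3.3 70B's published architecture (80 layers, 8 KV heads under GQA, head dimension 128):

# KV cache bytes/token = 2 (K and V) × layers × kv_heads × head_dim × bytes/value
# At f16 (2 bytes) with a 32K context:
echo $(( 2 * 80 * 8 * 128 * 2 * 32768 / 1024**3 ))   # → 10 GiB (q8_0 halves it to ~5)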

Power, Noise, and Running 24/7

If you're running a dual GPU setup as a home AI server (and you should — it's always available), here are some practical considerations:

Idle power draw: ~50-60W total (both GPUs idle). Modern NVIDIA GPUs drop to very low power states when not inferencing.

Active inference: 400-600W total, depending on model size and batch. At average US electricity rates ($0.15/kWh), an always-on box doing about 4 hours of inference a day costs roughly $10-15/month, idle draw included.
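The arithmetic behind that estimate, assuming ~55 W idle for the rest of the day and ~500 W while inferencing:

# (idle kW × 20 h + active kW × 4 h) × 30 days × $/kWh
echo "(0.055*20 + 0.5*4) * 30 * 0.15" | bc   # ≈ $14/month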

Undervolting and power limiting: Highly recommended. True undervolting (tuning the voltage-frequency curve) needs a tool like GreenWithEnvy; the simpler route is capping the power limit with nvidia-smi:


# Reduce power limit to 280W per card (from 350W default)
sudo nvidia-smi -i 0 -pl 280
sudo nvidia-smi -i 1 -pl 280

This typically reduces temperatures by 10-15°C and noise significantly, with only a 5-10% reduction in tokens per second. For inference, this is a great trade-off.


The Bottom Line

Dual GPU is the most cost-effective way to run serious local AI in 2026. Two used RTX 3090s for ~$1,600 give you 48 GB of VRAM — enough to run 70B models at near-perfect quality, frontier MoE models like MiniMax M2.5, and multiple smaller models simultaneously.

The setup is straightforward: plug in both cards, enable Above 4G Decoding in BIOS, and Ollama handles the rest automatically. No NVLink needed, no special drivers, no complicated configuration.

Quick start checklist:

1. ✅ ATX motherboard with 2 CPU-connected x16 slots
2. ✅ 1000W+ power supply (1200W recommended)
3. ✅ Case with room for two 3-slot GPUs
4. ✅ Enable Above 4G Decoding + Resizable BAR in BIOS
5. ✅ Install both GPUs, run nvidia-smi to verify
6. ✅ ollama pull llama3.3:70b and enjoy

Find the perfect model for your dual GPU setup at ToolHalla LLM Finder. Read our quantization guide to understand the quality levels, or check the hardware buyer's guide if you're still deciding on cards.


*Last updated: February 2026. Running a multi-GPU setup? We'd love to hear about it — get in touch.*


FAQ

What are the benefits of a dual GPU setup for local LLMs?

Dual GPUs double your total VRAM. Two RTX 3090s give 48GB total, enabling 70B Q4 models that a single 24GB card can't run. llama.cpp and Ollama can split a model across multiple GPUs for near-linear memory scaling.

Do both GPUs need to be the same model?

No — mixed GPU setups work with llama.cpp and Ollama. You can pair an RTX 4090 (24GB) with a 3090 (24GB) for 48GB total. Performance is limited by the slower GPU on shared layers, but VRAM is additive. Matching models is simpler but not required.

How do I set up dual GPUs in Ollama?

Ollama auto-detects multiple GPUs and distributes layers automatically. For fine-grained control, use llama.cpp directly with --tensor-split to specify layer distribution between GPUs.

Do I need NVLink for dual GPU inference?

No — NVLink is not required for LLM inference. llama.cpp uses PCIe to transfer activations between GPUs, adding ~5-10% latency overhead vs NVLink. The VRAM benefit outweighs the latency cost for inference workloads.

What is the cheapest dual GPU setup for 70B models?

Two used RTX 3090s (~$1,600 total) give 48GB VRAM — the cheapest way to run 70B Q4 locally at interactive speeds (a pair of Tesla P40s is cheaper but far slower). Requires a motherboard with two PCIe x16 slots and a 1000W+ PSU. Alternative: a used A6000 (48GB) at ~$2,000 is simpler to manage as a single card.

