AMD Strix Halo: Run 70B+ LLMs on 128GB Unified Memory
The AMD Ryzen AI Max+ 395 — codenamed "Strix Halo" — does something no discrete GPU under $2,000 can do: it gives you up to 128GB of memory accessible to the GPU. That's enough to load a 70B parameter model at Q4 quantization with room to spare.
The trick is unified memory. Instead of separate CPU RAM and GPU VRAM, Strix Halo shares a single pool of LPDDR5X-8000 between its 16 Zen 5 CPU cores and its 40-CU RDNA 3.5 integrated GPU. AMD's Variable Graphics Memory (VGM) lets you allocate up to 96GB of that pool as VRAM.
No other consumer-class hardware offers this combination of memory capacity and GPU compute at this price point. Available in mini PCs and laptops, Strix Halo is the most practical way to run truly large models locally without a server rack.
Key Specs
| Spec | Ryzen AI Max+ 395 |
|---|---|
| CPU | 16 Zen 5 cores, up to 5.1 GHz |
| GPU | Radeon 8060S, 40 RDNA 3.5 CUs |
| GPU Compute | ~59.4 TFLOPS (FP16/BF16) |
| NPU | XDNA 2, 50 TOPS |
| Max System RAM | 128GB LPDDR5X-8000 |
| Max GPU VRAM | 96GB (via Variable Graphics Memory) |
| Memory Bandwidth | ~256 GB/s (theoretical), ~212 GB/s (measured) |
| TDP | 120-140W under sustained load |
| Form Factors | Mini PCs (4-5L), laptops |
The 128GB Advantage: What Fits
The core selling point of Strix Halo for AI is simple: model capacity. Here's what 96GB of allocatable VRAM unlocks compared to common discrete GPUs:
| Model | Quantization | VRAM Needed | RTX 4090 (24GB) | Arc Pro B70 (32GB) | Strix Halo (96GB) |
|---|---|---|---|---|---|
| Qwen 3 14B | Q8_0 | ~15GB | ✅ | ✅ | ✅ |
| Qwen 3 32B | Q5_K_M | ~27GB | ❌ | ✅ | ✅ |
| DeepSeek R1 70B | Q4_K_M | ~42GB | ❌ | ❌ | ✅ |
| Llama 3.1 70B | Q4_K_M | ~42GB | ❌ | ❌ | ✅ |
| Qwen 3 72B | Q4_K_M | ~44GB | ❌ | ❌ | ✅ |
| Llama 3.1 70B | Q5_K_M | ~52GB | ❌ | ❌ | ✅ |
| DeepSeek R1 (MoE) | Q3_K_M | ~90GB | ❌ | ❌ | ✅ |
Strix Halo runs 70B models that would otherwise require two RTX 4090s ($3,200+) or specialized server GPUs. A 128GB Strix Halo mini PC costs roughly $2,000-2,500 and fits on your desk. If you only need 32GB, the Intel Arc Pro B70 at $949 is the best-value discrete GPU option.
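As a sanity check on the VRAM column above, a quantized GGUF's weight footprint can be estimated as parameter count times average bits per weight. A minimal sketch in shell; the ~4.85 bits/weight average for Q4_K_M is a rule of thumb, not an exact figure, and KV cache plus compute buffers need extra room on top:

```bash
# Rough GGUF weight size: params (billions) x avg bits/weight / 8 bits per byte.
# 4.85 bits/weight for Q4_K_M is a rule-of-thumb average (assumption).
params_b=70
bpw=4.85
echo "scale=1; $params_b * $bpw / 8" | bc   # -> 42.4 (GB), matching the ~42GB above
```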
Real-World LLM Benchmarks
Token generation speed depends on memory bandwidth. The Strix Halo's ~212 GB/s measured bandwidth is lower than discrete GPUs (RTX 4090 does 1,008 GB/s), which means slower per-token speeds. But for models that don't fit in discrete GPU memory, there is no comparison — the Strix Halo actually runs them.
Benchmarks from community testing on 128GB Strix Halo systems:
| Model | Quantization | Prompt (pp512) | Generation (tg128) |
|---|---|---|---|
| Qwen 3 30B-A3B (MoE) | Q4_K_M | Fast | ~86 tok/s |
| Qwen 3 14B | Q4_K_M | ~250 tok/s | ~25-30 tok/s |
| DeepSeek R1 7B | Q4_K_M | ~450 tok/s | ~40-50 tok/s |
| Llama 3.1 70B | Q4_K_M | ~80 tok/s | ~8-12 tok/s |
| MiniMax M2.5 (228B) | Q3_K_M | Slow | ~33 tok/s |
Key observations:
- Small models (7-14B): Perfectly usable at 25-50 tok/s generation. Fast enough for interactive chat.
- MoE models: The sweet spot for Strix Halo. Models like Qwen 3 30B-A3B only activate ~3B parameters per token, so they're fast despite large total size. 86 tok/s on a 30B model is outstanding.
- 70B dense models: Functional at 8-12 tok/s. Not fast, but faster than reading. Good for batch processing, code generation, and tasks where you don't need instant responses.
- 200B+ models: These actually run on Strix Halo. Slowly, but they run. No consumer GPU can touch this.
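The pp512/tg128 labels follow llama.cpp's llama-bench convention: prompt processing measured over a 512-token prompt, generation over 128 tokens. To reproduce these numbers on your own hardware with a Vulkan build of llama.cpp (the model path is a placeholder), something like:

```bash
# Benchmark prompt processing (pp512) and token generation (tg128);
# 512 and 128 tokens are llama-bench's defaults.
./build/bin/llama-bench -m /path/to/model.gguf -ngl 99
```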
For comparison with what 24GB discrete GPUs handle, check our RTX 4090 local LLM guide.
Setting Up VRAM Allocation
By default, AMD allocates a conservative amount of system memory to the GPU. For LLM inference, you want to maximize it. Here's how:
Method 1: BIOS Setting (Recommended)
Most Strix Halo systems expose a UMA (Unified Memory Architecture) setting in BIOS:
1. Enter BIOS (usually F2 or DEL at boot)
2. Find Advanced > AMD CBS > NBIO > GFX Configuration
3. Set UMA Frame Buffer Size to maximum (typically 96GB on 128GB systems)
4. Save and reboot
Method 2: AMD Variable Graphics Memory
On Windows, AMD's Adrenalin drivers let you adjust VGM allocation dynamically. On Linux, the allocation is typically set at boot via kernel parameters or BIOS.
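On Linux, one commonly used route is the amdgpu and ttm kernel module parameters, which raise the GPU-accessible (GTT) memory limit. A sketch targeting a ~96GB allocation on a 128GB system, assuming GRUB; parameter behavior varies by kernel version, so treat these values as a starting point, not a guarantee:

```bash
# In /etc/default/grub, append to GRUB_CMDLINE_LINUX_DEFAULT:
#   amdgpu.gttsize=98304       # GTT (GPU-accessible) size in MiB, ~96GB
#   ttm.pages_limit=25165824   # TTM allocation limit in 4KiB pages, ~96GB
# Then apply the change and reboot:
sudo update-grub
sudo reboot
```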
Verify Allocation
```bash
# Check GPU memory allocation on Linux
rocm-smi --showmeminfo vram

# Or via Vulkan
vulkaninfo | grep "memoryHeaps"
```
You want to see close to 96GB available to the GPU. If you see only 16-32GB, your BIOS or driver settings need adjustment.
Best Backend: Vulkan Beats ROCm
This is important and not obvious: the Strix Halo's integrated GPU (gfx1151) performs better with the Vulkan backend than with ROCm/HIP in llama.cpp and Ollama.
Community benchmarks confirm: RADV Vulkan is now faster on both prompt processing and token generation, including MoE models. It's also far easier to set up than ROCm.
Ollama Setup (Vulkan)
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Ollama auto-detects Vulkan on AMD GPUs;
# verify the GPU is used by running a model
ollama run qwen3:14b
```
Ollama should automatically use the Vulkan backend on Strix Halo. If it falls back to CPU, check that Mesa RADV drivers are installed:
```bash
# Ubuntu/Debian
sudo apt install mesa-vulkan-drivers

# Verify Vulkan works
vulkaninfo | grep "deviceName"
# Should show: AMD Radeon 8060S or similar
```
llama.cpp Setup (Vulkan)
```bash
# Build llama.cpp with Vulkan support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# Run inference, offloading all layers to the GPU
./build/bin/llama-cli -m model.gguf -ngl 99 --device vulkan0
```
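If you want an OpenAI-compatible HTTP endpoint instead of an interactive CLI session, the same build ships llama-server; a minimal sketch, with the port and context size as arbitrary choices:

```bash
# Serve the model over an OpenAI-compatible API on localhost:8080
./build/bin/llama-server -m model.gguf -ngl 99 --ctx-size 8192 --port 8080
```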
For a comparison of inference frameworks and which works best on AMD hardware, see our vLLM vs Ollama vs TGI guide.
Available Systems
Strix Halo is shipping in mini PCs and workstation laptops. Key options with 128GB configurations:
Mini PCs
- Beelink GTR9 Pro — Ryzen AI Max+ 395, 128GB LPDDR5X, compact 4-5L chassis. Around $2,000-2,500.
- ASUS NUC 14 Pro AI — Similar specs, ASUS build quality.
- Framework Desktop — Modular design, community-tested for LLM workloads.
Laptops
- ASUS ROG Flow Z13 — Portable Strix Halo with 128GB. Tested at up to 12.2x faster than Intel Core Ultra 258V on 14B models.
- Lenovo ThinkPad P-series — Workstation laptop with Strix Halo options.
Prices vary by retailer and configuration. Check Amazon for Strix Halo mini PCs — the 128GB RAM option typically adds $200-400 over the 64GB version, but it's worth it for running 70B models.
Strix Halo vs Discrete GPUs: When to Choose What
| Use Case | Best Hardware | Why |
|---|---|---|
| 7-14B daily driver | RTX 4090 or Arc Pro B70 | Faster bandwidth = faster tokens |
| 32B at high quality | Arc Pro B70 or Strix Halo | Need 27GB+ VRAM |
| 70B models | Strix Halo | Only option under $2,500 |
| 200B+ / full DeepSeek R1 | Strix Halo (128GB) | Nothing else fits at consumer prices |
| Multi-user serving | Dual Arc Pro B70 or RTX 4090 | Need bandwidth for concurrent requests |
| Portable AI | Strix Halo laptop | Only laptop with 128GB GPU memory |
| Maximum tok/s | RTX 4090 | 1,008 GB/s bandwidth wins |
The decision comes down to: do you need speed (discrete GPU) or capacity (Strix Halo)? If the model fits in 24-32GB, a discrete GPU will generate tokens 3-4x faster. If it doesn't fit, Strix Halo is your only real consumer option.
Practical Tips
Start with MoE models. Models like Qwen 3 30B-A3B and DeepSeek V3-0324 are tailor-made for Strix Halo. They have large total parameters but only activate a fraction per token, giving you big-model quality at small-model speeds. See our model comparison guide for which models punch above their weight.
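A quick way to try this, assuming the Qwen 3 30B MoE tag in the Ollama library (tags change, so check ollama.com/library if the pull fails):

```bash
# Pull and chat with the Qwen 3 30B-A3B MoE model
ollama run qwen3:30b
```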
Use Q4_K_M or Q5_K_M. On Strix Halo you have enough VRAM to skip aggressive quantization. Don't use Q2 or Q3 unless you absolutely must — the quality drop is steep.
Monitor thermal throttling. Strix Halo systems pull 120-140W under sustained AI load. Mini PCs can thermal throttle if airflow is restricted. Keep ventilation clear and consider ambient temperature.
Set context length deliberately. More context = more VRAM for KV cache. If you're running a 70B model at Q4_K_M (~42GB), you still have ~50GB for KV cache — enough for very long context. But a 200B+ model leaves less room. Set --ctx-size explicitly based on your needs.
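To put numbers on the KV cache: its size per token is 2 (K and V) times layers times KV heads times head dimension times bytes per element. A sketch using Llama 3.1 70B's published architecture (80 layers, 8 KV heads via GQA, head dim 128) and an FP16 cache:

```bash
# KV cache per token = 2 (K+V) x layers x kv_heads x head_dim x bytes/element
layers=80; kv_heads=8; head_dim=128; bytes=2   # Llama 3.1 70B, FP16 cache
per_token=$((2 * layers * kv_heads * head_dim * bytes))
echo "$per_token bytes/token"                        # 327680, ~320 KB/token
echo "$((per_token * 32768 / 1024**3)) GB at 32k context"   # 10 GB
```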
Update Mesa drivers. Vulkan performance on Strix Halo improves significantly with newer Mesa versions. Mesa 26.0+ has critical fixes for MoE model performance. Stay current.
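To confirm which Mesa version RADV is actually using (requires vulkan-tools; the driverInfo field is where RADV reports it):

```bash
# RADV reports its Mesa version in the Vulkan driverInfo field
vulkaninfo --summary | grep -i driverinfo
```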
Bottom Line
The AMD Strix Halo with Ryzen AI Max+ 395 is the only consumer hardware that runs 70B parameter models in a box that fits on your desk. The 128GB unified memory pool — with up to 96GB allocatable as VRAM — unlocks models that would otherwise require multi-GPU server setups costing $3,000+.
The trade-off is speed. At ~212 GB/s bandwidth, token generation on large models is 3-4x slower than an RTX 4090. For small models (7-14B), a discrete GPU is the better choice. But for anyone who needs to run 70B+ models locally — for privacy, for experimentation, or because cloud API costs add up — Strix Halo is the most practical path available.
Pair it with MoE models and you get the best of both worlds: large-model intelligence at usable speeds.
FAQ
How much VRAM does Strix Halo actually have?
The Ryzen AI Max+ 395 uses unified memory — it shares system RAM between CPU and GPU. With 128GB of LPDDR5X, you can allocate up to 96GB as GPU-accessible VRAM through AMD Variable Graphics Memory. The remaining 32GB stays reserved for the OS and CPU tasks.
Can Strix Halo run DeepSeek R1 (full 671B MoE)?
The full DeepSeek R1 at Q2_K quantization needs upwards of 200GB of memory. Strix Halo with 128GB can't load the full model, but it can run the DeepSeek R1 distilled variants (7B, 14B, 32B, 70B) comfortably. The 70B distill at Q4_K_M fits easily in 96GB VRAM.
Should I use ROCm or Vulkan on Strix Halo?
Vulkan (RADV). Community benchmarks consistently show RADV Vulkan outperforming ROCm HIP on the Strix Halo's gfx1151 integrated GPU. Vulkan is also easier to set up — just install Mesa drivers and go. For the details, check our Ollama models guide.
How does Strix Halo compare to Apple M4 Max?
The M4 Max with 128GB shares the unified memory concept but has higher memory bandwidth (~546 GB/s vs ~212 GB/s), resulting in 2-2.5x faster token generation. The trade-off: M4 Max systems start at $3,500+ while Strix Halo mini PCs are $2,000-2,500. Strix Halo also runs standard Linux, giving access to the full open-source ML stack.
What's the best model to start with on Strix Halo?
Qwen 3 30B-A3B (MoE). It delivers 86 tok/s on Strix Halo — faster than most dense 7B models on this hardware — while offering 30B-class reasoning quality. It's the model that best demonstrates what unified memory + MoE architecture can do.