
EXO Framework: Run 70B+ Models Across Multiple GPUs

March 30, 2026·9 min read·1,910 words

Most people who want to run a 70B parameter model locally hit the same wall: a single consumer GPU doesn't have enough VRAM. The RTX 4090 tops out at 24GB and the newer RTX 5090 at 32GB, while Llama 3.1 70B in 4-bit quantization still needs around 40GB. One card isn't going to cut it.
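The "around 40GB" figure can be sanity-checked with a rough calculation. The sketch below is only an estimate: the 15% overhead for quantization scales, embeddings, and runtime buffers is an assumed value, not something EXO reports.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float, overhead: float = 0.0) -> float:
    """Approximate memory needed for model weights, in GB (10^9 bytes)."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total * (1 + overhead) / 1e9

# 70 billion parameters at 4 bits each, before any overhead
raw_gb = weight_memory_gb(70e9, 4)

# Same model with an assumed 15% overhead (hypothetical figure for illustration)
padded_gb = weight_memory_gb(70e9, 4, overhead=0.15)

print(f"raw weights: {raw_gb:.1f} GB, with overhead: {padded_gb:.1f} GB")
```

The raw weights alone come to 35GB, which already exceeds any single 24GB or 32GB consumer card.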

EXO is an open-source framework that solves this by treating multiple devices as a single unified inference cluster. It shards the model across all your hardware — whether that's two RTX 3090s, a desktop GPU combined with an Apple Silicon laptop, or an entire rack of mixed machines. The result: you run models that would otherwise require enterprise hardware, using the machines you already own.

This guide covers how EXO works, how to set it up, what performance to expect, and which hardware combinations make the most sense.

What Is EXO?

EXO (available at github.com/exo-explore/exo) is an open-source distributed inference framework for large language models. Its core idea is simple: instead of buying a single expensive GPU with enough VRAM for a 70B model, you connect multiple smaller GPUs — or even heterogeneous devices — into a cluster that collectively has enough memory.

Key characteristics:

  • Heterogeneous device support — Mix NVIDIA (CUDA), AMD (ROCm), and Apple Silicon (Metal via MLX) in the same cluster
  • Automatic model sharding — EXO partitions the model across devices automatically; no manual layer assignment needed
  • ChatGPT-compatible API — Exposes a local OpenAI-compatible endpoint, so any tool that works with the OpenAI API works with EXO
  • Zero-config peer discovery — Nodes on the same local network find each other via mDNS/Bonjour without manual IP configuration
  • Supports popular models — Llama 3 / 3.1 (8B, 70B, 405B), Mistral, Mixtral, Qwen 2.5, Gemma 2, and more

EXO is not a training framework. It's inference-only, aimed at running already-quantized GGUF or MLX-format models across a fleet of consumer devices.

How EXO Works

Model Sharding

EXO splits transformer layers across devices using a ring topology. Each device holds a contiguous slice of the model's layers. During a forward pass, activations travel around the ring from device to device, with each machine processing its layers and passing the result to the next.
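To make the data flow concrete, here is a toy sketch of a sharded forward pass. It is not EXO's code: the "layers" are stand-in functions, and `split_layers` and `ring_forward` are hypothetical names chosen for illustration.

```python
from typing import Callable, List, Tuple

Layer = Callable[[float], float]

def split_layers(layers: List[Layer], n_devices: int) -> List[List[Layer]]:
    """Cut the layer list into n_devices contiguous, near-equal slices."""
    size, extra = divmod(len(layers), n_devices)
    shards, start = [], 0
    for i in range(n_devices):
        end = start + size + (1 if i < extra else 0)
        shards.append(layers[start:end])
        start = end
    return shards

def ring_forward(shards: List[List[Layer]], x: float) -> Tuple[float, int]:
    """Run one forward pass; return the output and the number of device handoffs."""
    handoffs = 0
    for i, shard in enumerate(shards):
        for layer in shard:          # each device runs only its own slice
            x = layer(x)
        if i < len(shards) - 1:      # activation is passed to the next device
            handoffs += 1
    return x, handoffs

# Eight stand-in "layers" that each add 1 to the activation
layers: List[Layer] = [lambda x: x + 1.0 for _ in range(8)]

output, handoffs = ring_forward(split_layers(layers, 2), 0.0)
print(output, handoffs)
```

The output is identical however many devices the layers are split across; only the number of handoffs changes.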

This approach means:

  • Each device only needs to store its portion of the model weights in VRAM
  • Activations cross the network once per device boundary (not once per layer), keeping bandwidth requirements manageable
  • Adding more devices proportionally reduces per-device memory load

For a 70B model at Q4 quantization (~40GB total), two devices each holding 20GB worth of layers is enough to run the full model.
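When devices have unequal memory, the split can be weighted so that each device receives layers in proportion to its share of total memory. The following is a minimal sketch of that idea, not EXO's actual partitioner; `partition_layers` is a hypothetical helper.

```python
from typing import List, Tuple

def partition_layers(n_layers: int, memory_gb: List[float]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) layer ranges, one per device, sized by memory share."""
    total = sum(memory_gb)
    ranges = []
    start = 0
    cumulative = 0.0
    for i, mem in enumerate(memory_gb):
        cumulative += mem
        if i == len(memory_gb) - 1:
            end = n_layers  # last device takes whatever remains
        else:
            end = round(n_layers * cumulative / total)
        ranges.append((start, end - 1))
        start = end
    return ranges

# Llama 3.1 70B has 80 transformer layers
print(partition_layers(80, [24, 24]))    # two 24GB cards -> [(0, 39), (40, 79)]
print(partition_layers(80, [24, 128]))   # 24GB card + 128GB Mac -> [(0, 12), (13, 79)]
```

This sketch does not guard against a very small device ending up with an empty range, which a real partitioner would need to handle.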

Inference Engines

EXO supports multiple backend engines depending on your hardware:

| Hardware | Backend |
| --- | --- |
| Apple Silicon (M-series) | MLX |
| NVIDIA CUDA | tinygrad or llama.cpp |
| AMD ROCm | tinygrad |
| CPU fallback | llama.cpp |

The MLX backend on Apple Silicon is the most mature and delivers the best per-device performance. NVIDIA support via tinygrad is functional but generally slower than running a dedicated framework like vLLM directly — EXO's value proposition on NVIDIA hardware is the pooling, not raw single-GPU throughput.

Device Discovery and Communication

EXO uses mDNS (the same protocol behind AirDrop and AirPlay) for zero-configuration peer discovery on local networks. Start EXO on two machines on the same LAN and they will find each other automatically within seconds.

Inter-node communication uses gRPC over TCP. For latency-sensitive workloads, a wired Gigabit Ethernet connection between nodes is strongly recommended over Wi-Fi. A 10GbE switch makes a meaningful difference for larger clusters.
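A rough estimate shows why wired Ethernet matters more for its latency than for its bandwidth. The sketch assumes each hop transfers one hidden-state vector per token (8192 values for Llama 3.1 70B) in 16-bit precision, and it ignores gRPC and TCP overhead. The latency values are illustrative assumptions, not measurements.

```python
def hop_time_ms(n_tokens: int, hidden_size: int, bytes_per_value: int,
                link_gbit_s: float, latency_ms: float) -> float:
    """Estimated time to ship activations for n_tokens across one node boundary."""
    payload_bits = n_tokens * hidden_size * bytes_per_value * 8
    transfer_ms = payload_bits / (link_gbit_s * 1e9) * 1e3
    return latency_ms + transfer_ms

HIDDEN = 8192  # Llama 3.1 70B hidden size

# One generated token over Gigabit Ethernet, assumed 0.3 ms link latency
wired = hop_time_ms(1, HIDDEN, 2, link_gbit_s=1.0, latency_ms=0.3)

# Same token over Wi-Fi, assumed 0.3 Gbit/s effective and 5 ms latency
wifi = hop_time_ms(1, HIDDEN, 2, link_gbit_s=0.3, latency_ms=5.0)

# A 2,000-token prompt sent in one batch over Gigabit Ethernet
prompt = hop_time_ms(2000, HIDDEN, 2, link_gbit_s=1.0, latency_ms=0.3)

print(f"wired: {wired:.2f} ms, wifi: {wifi:.2f} ms, prompt: {prompt:.0f} ms")
```

Under these assumptions the per-token payload is only 16KB, so link latency dominates during generation, while bandwidth (and therefore 10GbE) matters mainly when long prompts are processed.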

Setup Guide

Requirements

  • Python 3.12+
  • CUDA 12.x (for NVIDIA nodes) or macOS 14.3+ (for Apple Silicon nodes)
  • All devices on the same local network (wired recommended)

Step 1: Install EXO on Each Node

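On every machine that will be part of the cluster, install EXO from source, which is the approach the project README has recommended (check it for current instructions):


git clone https://github.com/exo-explore/exo.git
cd exo
pip install -e .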

For NVIDIA nodes, verify that the driver and GPUs are visible:


nvidia-smi

For Apple Silicon, MLX installs automatically as a dependency.

Step 2: Start the EXO Server

On each machine, run:


exo

EXO will auto-discover other nodes on the network. You should see something like:


Discovered peer: MacBook-Pro-M3-Max.local (2 devices, 128GB)
Discovered peer: desktop-3090x2.local (2 devices, 48GB)
Cluster VRAM: 176GB total

Step 3: Run a 70B Model

Once the cluster is formed, point any OpenAI-compatible client at http://localhost:52415/v1. From the command line:


curl http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-70b",
    "messages": [{"role": "user", "content": "Explain model sharding in one paragraph."}]
  }'

EXO handles model download (via Hugging Face) on first run and distributes layers automatically based on each device's available VRAM.
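The same request can be made from Python using only the standard library. This sketch mirrors the curl call above. It assumes the response follows the usual OpenAI chat-completion shape (`choices[0].message.content`), and `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "llama-3.1-70b",
                       base_url: str = "http://localhost:52415/v1") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt to the local cluster and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumes the standard OpenAI response layout
    return body["choices"][0]["message"]["content"]
```

With the cluster running, `print(chat("Explain model sharding in one paragraph."))` should return the model's reply. The generous timeout allows for the model download and load on the first request.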

Step 4: Verify Shard Distribution

Check the logs on each node to confirm layer distribution:


[Node 1 - RTX 3090] Layers 0-39 (20GB)
[Node 2 - RTX 3090] Layers 40-79 (20GB)

If one node is taking all layers, check that both nodes are running the same EXO version and are reachable via ping.
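If you collect those log lines from every node, a short script can confirm that the shards cover every layer with no gaps or overlaps. The log format here is simply the illustrative one shown above, so the regular expression is an assumption to adjust to whatever your EXO version actually prints.

```python
import re
from typing import List, Tuple

LAYER_RANGE = re.compile(r"Layers (\d+)-(\d+)")

def parse_ranges(log_lines: List[str]) -> List[Tuple[int, int]]:
    """Extract (start, end) layer ranges from log lines, sorted by start layer."""
    ranges = []
    for line in log_lines:
        match = LAYER_RANGE.search(line)
        if match:
            ranges.append((int(match.group(1)), int(match.group(2))))
    return sorted(ranges)

def covers_all_layers(ranges: List[Tuple[int, int]], n_layers: int) -> bool:
    """True if the ranges tile layers 0..n_layers-1 with no gaps or overlaps."""
    expected_start = 0
    for start, end in ranges:
        if start != expected_start or end < start:
            return False
        expected_start = end + 1
    return expected_start == n_layers

sample = [
    "[Node 1 - RTX 3090] Layers 0-39",
    "[Node 2 - RTX 3090] Layers 40-79",
]
ranges = parse_ranges(sample)
print(ranges, covers_all_layers(ranges, 80), len(ranges) > 1)
```

A single range covering all 80 layers also passes the coverage check, so the final `len(ranges) > 1` test is what flags a node that took every layer.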

Performance Benchmarks

The following numbers are from EXO's official GitHub README and community-reported results. Performance varies significantly by hardware configuration, network speed, and model quantization.

Apple Silicon Clusters (MLX backend)

| Configuration | Model | Quantization | Tokens/sec |
| --- | --- | --- | --- |
| 2× M3 Max (128GB each) | Llama 3.1 70B | 4-bit | ~12 |
| M3 Ultra + M3 Max | Llama 3.1 70B | 4-bit | ~22 |
| 4× M2 Pro (32GB each) | Llama 3.1 70B | 4-bit | ~8 |

NVIDIA Clusters (tinygrad/llama.cpp backend)

| Configuration | Model | Quantization | Tokens/sec |
| --- | --- | --- | --- |
| 2× RTX 3090 (24GB each) | Llama 3.1 70B | Q4_K_M | ~6–9 |
| RTX 4090 + RTX 3090 | Llama 3.1 70B | Q4_K_M | ~8–12 |
Important caveat: EXO's NVIDIA throughput is lower than running a model on a single GPU with sufficient VRAM, because tinygrad is less optimized than native CUDA kernels in vLLM or llama.cpp. If you have a single GPU with enough VRAM for your target model, use vLLM or Ollama instead. EXO's advantage is enabling models that simply won't fit on a single card.

Mixed Hardware (NVIDIA + Apple Silicon)

Mixing NVIDIA and Apple Silicon in the same cluster is supported but introduces additional latency from backend switching. Use this for maximum memory capacity, not for latency-sensitive applications.

When to Use EXO vs. a Single Large GPU

| Scenario | Recommendation |
| --- | --- |
| Single GPU fits the model in VRAM | Use vLLM, Ollama, or llama.cpp (faster and simpler) |
| 70B+ model, 2+ GPUs on same network | EXO is the right tool |
| You own multiple Macs with large RAM | EXO + MLX is excellent |
| Low-latency production API serving | Single large GPU (A100/H100) or GPU cloud |
| Occasional 70B inference, budget hardware | EXO across 2× RTX 3090 is cost-effective |

The main trade-offs with EXO:

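Latency: Network hops between nodes add roughly 20–100ms to a request depending on your LAN speed. Because each generated token's forward pass crosses every node boundary, a slow link (Wi-Fi especially) also lowers tokens/sec, not just time to first token. For interactive chat over wired Ethernet this is acceptable.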

Throughput: Multi-node inference doesn't scale linearly. Two nodes won't give you 2× throughput — you'll see roughly 1.3–1.7× compared to a single node running the same model at lower context.

Complexity: Setting up EXO across multiple machines is harder than installing Ollama on one. If anything on your network changes (IP, hostname), the cluster needs to be restarted.

Reliability: If one node goes down mid-generation, the whole request fails. There's no fault tolerance yet.
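The latency and throughput trade-offs can be explored with a simple back-of-the-envelope model. It assumes single-request decoding is memory-bandwidth bound: each node reads its whole shard once per token, and nodes work one after another, with one network hop per node. Treat the result as an optimistic upper bound rather than a prediction. The function name and the hop times are illustrative assumptions, while 936 GB/s is the RTX 3090's published memory bandwidth.

```python
from typing import List, Tuple

def decode_tokens_per_sec(shards: List[Tuple[float, float]], hop_ms: float) -> float:
    """Upper-bound tokens/sec for one request.

    shards: (shard_size_gb, memory_bandwidth_gb_per_s) for each node, in ring order.
    hop_ms: assumed network time per node-to-node handoff.
    """
    # Time for each node to stream its weights from memory once, summed over nodes
    compute_ms = sum(size / bandwidth * 1e3 for size, bandwidth in shards)
    # One hop per node, counting the return from the last node to the first
    network_ms = hop_ms * len(shards)
    return 1e3 / (compute_ms + network_ms)

two_3090s = [(20.0, 936.0), (20.0, 936.0)]

wired = decode_tokens_per_sec(two_3090s, hop_ms=0.5)    # assumed wired hop time
wifi = decode_tokens_per_sec(two_3090s, hop_ms=10.0)    # assumed Wi-Fi hop time

print(f"wired bound: {wired:.1f} tok/s, wifi bound: {wifi:.1f} tok/s")
```

The wired bound works out to about 23 tokens/sec, well above the 6–9 tokens/sec reported for this setup, which is consistent with software overhead, rather than the network, being the main limit on NVIDIA clusters.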

Recommended Hardware

Budget: 2× RTX 3090 (~$600–800 used)

The RTX 3090's 24GB of VRAM makes it ideal for EXO clusters. Two cards give you 48GB of pooled memory, enough for Llama 3.1 70B at Q4_K_M with headroom left over for context.

  • NVIDIA RTX 3090: buy two for a dual-node desktop setup, or install both in the same machine (EXO works with multi-GPU single machines too)

Expected performance: 6–9 tokens/sec on 70B, which is conversational speed. Not fast, but functional.
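How much context that headroom buys can be estimated from the size of the KV cache. The sketch below uses Llama 3.1 70B's published configuration (80 layers, 8 key/value heads, head dimension 128) and assumes a 16-bit cache. It ignores activation buffers and runtime overhead, so real usable context will be lower.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache stored per token: keys and values for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

per_token = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)

# 48GB pooled minus ~40GB of weights leaves roughly 8GB across the cluster
headroom_bytes = (48 - 40) * 1e9
max_tokens = int(headroom_bytes // per_token)

print(f"{per_token / 1024:.0f} KiB per token, about {max_tokens} tokens of context")
```

That is 320 KiB per token, or room for roughly 24,000 tokens in 8GB. Because the cache is split along with the layers, each card holds only the cache for its own slice.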

Mid-Range: RTX 4090 + RTX 3090

The 4090's 24GB plus the 3090's 24GB gives 48GB total, the same as two 3090s, but the 4090 processes its share of the layers faster. Expect 8–12 tokens/sec on 70B.

Apple Silicon Premium: M3 Max + M3 Max or M3 Ultra

If you're in the Apple ecosystem, two M3 Max MacBook Pros (or a Mac Studio + MacBook) is arguably the best EXO setup for its price point. The MLX backend is mature, memory is unified (no VRAM/RAM split), and 128GB per machine gives you enormous headroom.

Two M3 Max machines yield ~12 tokens/sec on Llama 3.1 70B — similar to the 4090+3090 NVIDIA cluster but with less setup friction.

Cloud Testing: Vast.ai Multi-GPU Instances

If you want to test EXO configurations before buying hardware, Vast.ai lets you rent multi-GPU instances by the hour. Spin up a 2× A6000 (48GB each) instance for a few dollars to benchmark the configuration you're planning before committing.

EXO vs. Other Distributed Inference Approaches

EXO vs. Petals: Petals is another distributed inference project, but it uses a public network of volunteers. EXO runs on your own hardware, giving you privacy, consistent latency, and no dependence on third-party nodes.

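EXO vs. vLLM tensor parallelism: vLLM supports multi-GPU inference via tensor parallelism, which works best when all GPUs sit in the same machine connected via NVLink or PCIe. Multi-node vLLM deployments are possible (via Ray) but assume homogeneous GPUs and a fast interconnect. EXO works across separate, mixed machines over a regular LAN.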

EXO vs. llama.cpp with split layers: llama.cpp's --split-mode can split a model across multiple GPUs within a single system. EXO extends this to separate machines and heterogeneous hardware.

For a deeper comparison of single-node inference tools, see our guide to vLLM vs Ollama vs TGI.

Verdict

EXO fills a genuine gap: it lets you run models that would otherwise require 40–80GB of VRAM using consumer hardware you can actually afford. The Apple Silicon + MLX path is the smoothest experience today. NVIDIA support works but is slower per-device than native alternatives.

If you already own multiple GPUs or Macs and want to run 70B models locally, EXO is worth setting up. The zero-config discovery and OpenAI-compatible API mean integration with existing tools (Open WebUI, Continue, LangChain) is straightforward.

If you're shopping for hardware specifically to run 70B models, also consider whether a single used A6000 (48GB) might be a cleaner solution. An RTX 5090 (32GB) falls short of the ~40GB a 4-bit 70B model needs, so it only works with more aggressive quantization or partial CPU offload. See our Best Local LLMs for Every RTX 50-Series GPU guide for single-GPU sizing. Multi-node adds complexity; only reach for it when one card genuinely isn't enough.


EXO GitHub: github.com/exo-explore/exo. Check the README for the latest supported models and benchmark updates.

Frequently Asked Questions

How does EXO handle different types of GPUs in a single cluster?

EXO supports heterogeneous clusters: you can mix NVIDIA (CUDA), AMD (ROCm), and Apple Silicon (Metal via MLX) devices in the same cluster. Each node runs the backend suited to its hardware, and layers are distributed according to each device's available memory.

What is the expected performance when running 70B+ models with EXO?

Performance varies with hardware, network speed, and quantization. For Llama 3.1 70B at 4-bit, the figures cited above range from roughly 6–12 tokens/sec on two-GPU NVIDIA clusters to about 8–22 tokens/sec on Apple Silicon clusters.

Can I use EXO with consumer-grade GPUs, or do I need enterprise hardware?

You can use EXO with consumer-grade GPUs. It is designed to leverage the combined VRAM of multiple consumer GPUs, such as RTX 3090s, to run models that typically require much more memory than a single GPU can provide.

What are the costs associated with using EXO?

Using EXO itself is cost-free as it is open-source. However, costs will arise from the hardware you choose to use in your cluster, such as multiple GPUs, which can vary widely in price depending on the models and quantities.

Are there any alternatives to EXO for running large models on multiple GPUs?

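Yes. For multiple GPUs in one machine, vLLM (tensor parallelism) and llama.cpp (--split-mode) are usually faster and simpler. Petals distributes inference across a public volunteer network. Frameworks such as DeepSpeed and Hugging Face Transformers with model parallelism can also split a model across GPUs. PyTorch's distributed data parallel (DDP) is not an alternative here: it replicates the full model on every GPU, so it does not help a model that won't fit on one card.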
