EXO Framework Guide: Distributed Local AI in 2026

EXO turns multiple Apple Silicon Macs into a local AI cluster. This 2026 guide covers setup, hardware limits, benchmarks, and when to use vLLM instead.

March 30, 2026·12 min read·2,477 words

In short: EXO is an open-source local AI cluster framework. Its official README describes automatic device discovery, topology-aware model placement, MLX support, OpenAI/Claude/Responses/Ollama-compatible APIs, and custom Hugging Face model loading. The strongest supported path in the current docs is Apple Silicon; the same README says macOS uses the GPU while Linux currently runs on CPU, with more accelerator support still under development. Sources: EXO README, EXO hardware accelerator support, Apple MLX.

Disclosure: Some hardware and cloud GPU links in this guide are affiliate or referral links. ToolHalla may earn a commission or referral credit at no extra cost to you. We do not quote live prices here because hardware and rental pricing changes constantly.

The reason EXO exists is simple: large local models often need more memory than one consumer machine can comfortably provide. Instead of treating every Mac or workstation as an island, EXO tries to turn nearby machines into one local inference cluster. If you are deciding whether to buy one larger GPU first, start with our best GPUs for running AI locally and RTX 50-series local LLM sizing guide. If you already own multiple machines, EXO is the more interesting question.

This refresh updates the setup path, supported-hardware caveats, benchmark sourcing, internal links, and affiliate disclosure without changing the article slug, publish date, or overall structure.

What Is EXO?

EXO (available at github.com/exo-explore/exo) is an open-source framework for running AI models across a local cluster. The current project README summarizes the core feature set as automatic device discovery, topology-aware auto parallel placement, tensor parallelism, MLX support, multiple API-compatible interfaces, and custom model loading from Hugging Face. Source: EXO features.

Key characteristics from the official README:

  • Automatic device discovery — devices running EXO discover each other without manual cluster configuration. Source: EXO Quick Start.
  • Topology-aware placement — EXO chooses how to split a model based on available devices, resources, and link characteristics. Source: EXO features.
  • Tensor parallelism — the README says EXO supports model sharding and reports up to 1.8× speedup on 2 devices and 3.2× on 4 devices. Source: EXO features.
  • MLX backend on Apple Silicon — EXO uses MLX and MLX distributed for Apple Silicon inference and distributed communication. Sources: EXO features, MLX distributed docs.
  • Multiple API formats — the README lists OpenAI Chat Completions, Claude Messages, OpenAI Responses, and Ollama-compatible APIs. Source: EXO API section.
  • Custom Hugging Face models — EXO can add custom model IDs from Hugging Face through its local API, with a security note for models that require trust_remote_code. Source: EXO custom model loading.

The important correction for 2026 readers: do not assume EXO is currently a polished NVIDIA/AMD GPU pooling tool on Linux. The official hardware support section says: "On macOS, exo uses the GPU. On Linux, exo currently runs on CPU." Treat Linux GPU support as a roadmap item unless the EXO README has changed by the time you read this. Source: EXO hardware accelerator support.

How EXO Works

Model Sharding

EXO's current README describes topology-aware auto parallelism: it looks at the devices in your local cluster and decides how to split model work across them. It also describes tensor parallelism and provides speedup figures for 2-device and 4-device setups. Source: EXO features.
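To put those speedup figures in context, it helps to convert them into parallel efficiency: the share of ideal linear scaling you actually get. The sketch below uses only the two "up to" numbers quoted from the README; the function name is ours, not part of EXO.

```python
def parallel_efficiency(speedup: float, devices: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    if devices < 1:
        raise ValueError("devices must be at least 1")
    return speedup / devices


# "Up to" speedups quoted from the EXO README; these are best cases, not guarantees.
for devices, speedup in [(2, 1.8), (4, 3.2)]:
    efficiency = parallel_efficiency(speedup, devices)
    print(f"{devices} devices: {speedup}x speedup, {efficiency:.0%} of linear scaling")
```

That works out to 90% of linear scaling on 2 devices and 80% on 4, so even in the README's best case each added machine contributes a little less than the one before it.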

In practice, that means EXO is best understood as a local distributed inference layer. You do not assign transformer layers to machines by hand; you ask EXO to find a valid placement for the model on the machines it can see. The API can also preview valid placements for a model before you create an instance. Source: EXO API placement preview.

This is different from the classic "buy one bigger GPU" route. A single large GPU is still simpler when it fits the model. EXO becomes useful when the memory you already own is spread across several Apple Silicon machines or when you want to experiment with a local cluster before committing to dedicated server hardware.
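A quick back-of-the-envelope check can tell you which side of that line you are on. The sketch below estimates weight size from parameter count and quantization, then compares it with per-machine and pooled memory. The function names, the example machines, and the 0.75 usable-memory fraction are all assumptions for illustration; the estimate ignores KV cache and activations, and EXO's own placement preview is the real authority on what fits.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 weights) * (bits / 8 bytes each) / 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


def fit_report(params_billion, bits_per_weight, machine_memory_gb, usable_fraction=0.75):
    """Return (fits_on_one_machine, fits_in_pooled_memory).

    usable_fraction is an assumed share of each machine's memory left for
    weights after the OS, KV cache, and activations; tune it for your setup.
    """
    needed = weight_size_gb(params_billion, bits_per_weight)
    budgets = [memory * usable_fraction for memory in machine_memory_gb]
    return any(b >= needed for b in budgets), sum(budgets) >= needed


# Hypothetical example: a 235B-parameter model at 4 bits per weight
# on two Macs with 64 GB and 96 GB of unified memory.
print(weight_size_gb(235, 4))        # 117.5
print(fit_report(235, 4, [64, 96]))  # (False, True)
```

The pooled figure is optimistic because real sharding is never perfectly even, so treat a narrow "fits in the pool" result as a reason to test, not a guarantee.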

Inference Engines

The supported-backend picture has narrowed since the older version of this article. The official README now emphasizes MLX on Apple Silicon and says Linux currently runs on CPU while accelerator support is being extended. Sources: EXO features, EXO hardware accelerator support.

| Hardware path | What the current EXO docs say | Practical takeaway |
|---|---|---|
| Apple Silicon Macs | macOS uses the GPU; EXO uses MLX and MLX distributed | Best-supported EXO path today |
| Linux workstations | Linux currently runs on CPU | Not the right primary choice for fast NVIDIA/AMD GPU serving yet |
| Single NVIDIA GPU that fits the model | Use a purpose-built local server instead | Compare vLLM vs Ollama vs TGI |
| Cloud GPU testing | Rent before buying hardware | See our GPU cloud guide |

Device Discovery and Communication

The Quick Start says devices running EXO automatically discover each other and that each device exposes an API and dashboard at http://localhost:52415. Source: EXO Quick Start.

The README also describes RDMA over Thunderbolt 5 for supported Macs and points to macOS-specific setup caveats. Use that as an advanced Apple Silicon optimization path, not as a requirement for every small test cluster. Source: EXO RDMA section.

Setup Guide

Requirements

The current EXO README documents two main run paths: run from source on macOS/Linux, or use the macOS app. The source path depends on uv, Node.js for the dashboard, Rust nightly for bindings, and platform-specific prerequisites. Sources: EXO Quick Start, uv, Node.js, Rustup.

For Apple Silicon users, the README also lists Xcode and Homebrew in the macOS source setup, and it says the macOS app requires macOS Tahoe 26.2 or later. Sources: EXO macOS source setup, EXO macOS app.

Step 1: Install or Clone EXO

Do not use the old pip install exo-explore instruction. PyPI does not publish an exo-explore package under that name, and the current project README shows source-based setup plus a macOS app path. Source: EXO Quick Start.

For source setup, the README flow is:


git clone https://github.com/exo-explore/exo
cd exo/dashboard && npm install && npm run build && cd ..
uv run exo

If you have Nix installed on macOS, the README also documents:


nix run .#exo

For the macOS app, the README links a DMG and a Homebrew cask:


brew install --cask exo

Check the upstream README before installing. EXO is moving quickly, and where its documented system requirements differ from this article's summary, the README takes precedence. Source: EXO Quick Start.

Step 2: Start the EXO Server

When EXO starts, the README says the dashboard and API run at http://localhost:52415/. Source: EXO Quick Start.

On a multi-Mac setup, start EXO on each machine on the same local network. If you need to isolate clusters on a shared network, the macOS app and environment-variable docs describe EXO_LIBP2P_NAMESPACE for that purpose. Sources: EXO macOS app namespace, EXO environment variables.
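If you launch EXO from a script, one way to apply the namespace is to set the variable in the child process environment. This is a minimal sketch: the helper names and the "exo" checkout path are ours, and it assumes a plain string is an acceptable namespace value, so check the upstream environment-variable docs for any format rules.

```python
import os
import subprocess


def namespaced_env(namespace, base_env=None):
    """Return a copy of the environment with EXO_LIBP2P_NAMESPACE set."""
    env = dict(os.environ if base_env is None else base_env)
    env["EXO_LIBP2P_NAMESPACE"] = namespace
    return env


def launch_exo(namespace, repo_dir="exo"):
    """Run `uv run exo` from a source checkout; blocks until EXO exits.

    repo_dir is a hypothetical path to your clone of the EXO repository.
    """
    return subprocess.run(
        ["uv", "run", "exo"],
        cwd=repo_dir,
        env=namespaced_env(namespace),
        check=False,
    )
```

Calling launch_exo("lab-a") on every machine in one group and launch_exo("lab-b") on another should keep the two groups from joining each other, assuming the namespace behaves as the upstream docs describe.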

Step 3: Create or Load a Model Instance

The current README's API example creates an instance for mlx-community/Llama-3.2-1B-Instruct-4bit, waits for it to become ready, then calls the OpenAI-compatible chat completions endpoint. Source: EXO API example.

A simplified OpenAI-compatible request looks like this:


curl http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Explain model sharding in one paragraph."}]
  }'

Use the model IDs and placement workflow from the upstream README, not stale examples copied from older tutorials. Source: EXO API section.
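The same request can be sent from Python with only the standard library. This sketch mirrors the curl call above; the helper names are ours, and the response parsing assumes EXO returns the usual OpenAI chat completion shape (choices[0].message.content), which is what "OpenAI-compatible" normally implies but is worth confirming against a real response.

```python
import json
import urllib.request

BASE_URL = "http://localhost:52415"  # default address from the EXO Quick Start


def build_chat_request(model, user_message, base_url=BASE_URL):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": user_message}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_json):
    """Pull the assistant text out of an OpenAI-style chat completion body."""
    return response_json["choices"][0]["message"]["content"]


def chat(model, user_message, timeout=120):
    """Send one user message and return the reply text (requires a running EXO node)."""
    request = build_chat_request(model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

With an instance already running, chat("mlx-community/Llama-3.2-1B-Instruct-4bit", "Explain model sharding in one paragraph.") is the Python equivalent of the curl example.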

Step 4: Verify Placement and Performance

Before treating a cluster as usable, check placement and benchmark it. The README documents /instance/previews, /state, and an exo-bench workflow for prompt and generation throughput. Sources: EXO API section, EXO benchmarking.
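To inspect those endpoints from a script, a small helper that fetches JSON and reports only its top-level shape is enough to get started. This sketch deliberately assumes nothing about the response schema or the query parameters each endpoint expects; take those from the upstream README. The helper names are ours.

```python
import json
import urllib.parse
import urllib.request


def endpoint_url(path, params=None, base_url="http://localhost:52415"):
    """Join the EXO base URL with a path and optional query parameters."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return url


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body (requires a running EXO node)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize(payload):
    """Describe the top-level shape of a JSON payload without assuming a schema."""
    if isinstance(payload, dict):
        return "object with keys: " + ", ".join(sorted(payload))
    if isinstance(payload, list):
        return f"list of {len(payload)} items"
    return type(payload).__name__
```

For example, summarize(fetch_json(endpoint_url("/state"))) shows what the state endpoint returns on your version, which is a safer starting point than hard-coding field names from an older tutorial.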

The practical test is not "did the dashboard open?" It is:

  • Does EXO see all the machines you expect?
  • Does the model placement fit without memory errors?
  • Does generation speed meet your use case?
  • Does a node disconnect break the workload you care about?

Run that test on your hardware before buying more machines.
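For the generation-speed question, a rough end-to-end number is easy to compute yourself. The sketch below times a call and divides generated tokens by wall-clock seconds. It is not a replacement for exo-bench: the figure includes prompt processing and network time, so it understates pure generation speed. The helper names and the target value are ours.

```python
import time


def tokens_per_second(completion_tokens, elapsed_seconds):
    """End-to-end throughput: generated tokens divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def time_call(fn):
    """Run fn() and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def meets_target(completion_tokens, elapsed_seconds, target_tps):
    """True if measured throughput reaches the target you set for your use case."""
    return tokens_per_second(completion_tokens, elapsed_seconds) >= target_tps


# Hypothetical measurement: 120 generated tokens in 8 seconds, target of 10 tokens/s.
print(tokens_per_second(120, 8.0))  # 15.0
print(meets_target(120, 8.0, 10))   # True
```

The token count would typically come from the usage.completion_tokens field of an OpenAI-style response, assuming EXO populates it; if it does not, count tokens with the model's tokenizer instead.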

Performance Benchmarks

The older version of this article listed unsourced token-per-second estimates for mixed RTX and Apple Silicon clusters. This refresh removes those estimates. For source-backed numbers, use the benchmark material linked from the EXO README and run exo-bench on your own hardware. Sources: EXO benchmarks, EXO benchmarking tool.

Apple Silicon Clusters (MLX backend)

The current README's benchmark section links examples from Jeff Geerling using 4 × M3 Ultra Mac Studio systems, including Qwen3-235B, DeepSeek v3.1 671B, and Kimi K2 Thinking runs. Sources: EXO benchmarks, Jeff Geerling Mac Studio RDMA post.

That does not mean a small two-Mac cluster will match those results. It means EXO's strongest public proof point is now Apple Silicon plus MLX, especially where large unified memory pools and fast Mac-to-Mac links matter.

NVIDIA and AMD Clusters

For NVIDIA or AMD Linux GPU serving, be cautious. EXO's own hardware support section says Linux currently runs on CPU. If your goal is fast serving on one NVIDIA machine, compare vLLM, Ollama, TensorRT-LLM, or llama.cpp before forcing EXO into that role. Sources: EXO hardware accelerator support, vLLM GitHub, llama.cpp GitHub.

You can still experiment, but treat older "RTX 3090 cluster" advice as experimental unless upstream EXO docs now say otherwise.

Mixed Hardware

Mixed hardware is the hardest case to recommend. EXO's current documentation gives clear Apple Silicon/MLX signals and a clear Linux CPU caveat. If you mix Macs, Linux boxes, Wi-Fi, and different model formats, benchmark first and budget time for troubleshooting. Source: EXO README.

When to Use EXO vs. a Single Large GPU

| Scenario | Recommendation |
|---|---|
| One GPU already fits the model | Use vLLM, Ollama, llama.cpp, or another single-node server |
| You own multiple Apple Silicon Macs with large unified memory | EXO is worth testing |
| You want a local dashboard and OpenAI/Claude/Ollama-compatible APIs | EXO fits that workflow |
| You need low-latency production API serving | Prefer one large GPU or a managed/cloud GPU setup |
| You mainly own NVIDIA/AMD Linux GPUs | Check EXO's hardware support status first; Linux GPU support is not the documented primary path today |

The trade-offs are straightforward:

Latency: distributed inference adds coordination overhead. RDMA over Thunderbolt 5 may help on supported Macs, but it is an advanced path with hardware and macOS requirements. Source: EXO RDMA section.

Throughput: EXO's README reports tensor-parallel speedups, but your result depends on model, placement, memory pressure, and the network between machines. Source: EXO features.

Complexity: one GPU and one inference server are easier to operate than a cluster. That is why single-node tools still matter.

Reliability: any local distributed setup has more failure points than a single box. Test node restarts, network drops, and long prompts before you rely on it.
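A simple way to exercise the reliability point is to probe every node's EXO port while you restart machines or pull cables. This sketch only checks that something answers HTTP on port 52415; it does not confirm the node has joined the cluster or can serve a model. The function names and hostnames are hypothetical.

```python
import urllib.error
import urllib.request


def http_probe(host, port=52415, timeout=3.0):
    """Return True if anything answers HTTP on the node's EXO port."""
    try:
        with urllib.request.urlopen(f"http://{host}:{port}/", timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # an HTTP error status still means a server responded
    except OSError:
        return False  # connection refused, DNS failure, or timeout


def unreachable_nodes(hosts, probe=http_probe):
    """Return the hosts that fail the probe, preserving input order."""
    return [host for host in hosts if not probe(host)]
```

Running unreachable_nodes(["studio-1.local", "studio-2.local"]) in a loop during a long generation gives you a timeline of which node dropped and when. The probe argument is injectable so you can swap in a stricter check later.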

Budget: Existing Apple Silicon Macs First

The best first EXO cluster is the hardware you already own. If you have two or more Apple Silicon Macs with enough unified memory, test EXO there before buying GPUs. EXO's official docs currently make Apple Silicon the clearest GPU-accelerated path. Sources: EXO hardware accelerator support, MLX.

NVIDIA Buyers: Check Whether One GPU Is Simpler

If you are shopping for NVIDIA cards, do not buy a pair just because an old tutorial said EXO can pool them. Check whether your model fits on one GPU first. For hardware sizing, use the RTX 50-series local LLM guide and local AI GPU guide linked near the top of this article.

If you still want used 24GB cards for local AI experiments, compare current listings rather than trusting a hard-coded price: RTX 3090 on Amazon and RTX 4090 on Amazon. NVIDIA's official pages list the RTX 3090 and RTX 4090 as 24GB cards. Sources: NVIDIA RTX 3090, NVIDIA RTX 4090.

Apple Silicon Premium: Mac Studio Cluster

The best-documented EXO showpieces are Mac Studio clusters. The README benchmark section links Jeff Geerling's 4 × M3 Ultra Mac Studio tests over Thunderbolt 5/RDMA. Sources: EXO benchmarks, Jeff Geerling Mac Studio RDMA post.

That is not a budget recommendation. It is proof that EXO's Apple Silicon direction is real. For most readers, the right move is to test with existing Macs before buying a dedicated Mac Studio cluster.

Cloud Testing: Vast.ai Before Hardware

If you need to compare a single large GPU, multiple GPUs in one host, or a cloud fallback before buying hardware, Vast.ai is a practical rental option. Use it to benchmark model size, quantization, and serving stack choices before a hardware purchase. Do not assume cloud GPU pricing in this article is current; verify it on the provider page.

EXO vs. Other Distributed Inference Approaches

EXO vs. Petals: Petals is a distributed inference project that focuses on collaborative/public-style model serving, while EXO is aimed at your own local cluster. Sources: Petals GitHub, EXO README.

EXO vs. vLLM: vLLM is a serving engine for high-throughput LLM inference and is usually the first comparison point for production-style NVIDIA serving. EXO is more interesting when your local Apple Silicon memory is spread across multiple machines. Sources: vLLM GitHub, EXO hardware accelerator support.

EXO vs. llama.cpp: llama.cpp remains a strong single-machine local inference baseline and supports many quantized GGUF workflows. EXO's value is the local cluster abstraction, dashboard, and multiple API-compatible interfaces. Sources: llama.cpp GitHub, EXO API section.

For a broader single-node comparison, use the vLLM vs Ollama vs TGI guide linked above.

Verdict

EXO is worth testing if you own multiple Apple Silicon Macs and want a local AI cluster with a dashboard and familiar APIs. It is not the default answer for every multi-GPU buyer, and the current README's Linux CPU caveat matters.

The practical decision tree is simple: use one GPU or one Mac if it fits; use vLLM, Ollama, or llama.cpp if you want a simpler single-node server; test EXO when memory is spread across multiple Macs or when you specifically want the local-cluster workflow described in the official README.


Frequently Asked Questions

Is EXO still for NVIDIA GPUs?

Not as the primary documented path in the current README. The hardware support section says macOS uses the GPU and Linux currently runs on CPU while accelerator support is being extended. If you mainly want NVIDIA serving, compare vLLM, Ollama, TensorRT-LLM, and llama.cpp first. Source: EXO hardware accelerator support.

How do I install EXO now?

Use the upstream README. The old pip install exo-explore instruction is stale. Current docs show source setup with git clone, dashboard build, and uv run exo, plus a macOS app/Homebrew cask path. Source: EXO Quick Start.

Does EXO expose an OpenAI-compatible API?

Yes. The README lists OpenAI Chat Completions compatibility, plus Claude Messages, OpenAI Responses, and Ollama-compatible APIs. Source: EXO API section.

What hardware makes the most sense for EXO?

Start with multiple Apple Silicon Macs you already own. The current README's clearest GPU-accelerated path is macOS/MLX, and its benchmark section points to Mac Studio cluster examples. Sources: EXO features, EXO benchmarks.

Should I buy GPUs or rent first?

Rent or benchmark first if you are unsure. A single larger GPU may be simpler than a local cluster, and cloud testing can reveal whether your model, quantization, and serving stack are worth a hardware purchase. Vast.ai is one rental option linked above with ToolHalla's referral URL.
