Run LLMs on Raspberry Pi 5: Step-by-Step Setup Guide (2026)
Learn how to run local LLMs on a Raspberry Pi 5 in 2026. Complete setup guide covering Ollama installation, best models (Phi-3, Gemma 3, Llama 3.2, TinyLlama), performance benchmarks, hardware recommendations, and practical AI projects.
Running a large language model on an $80 single-board computer sounded absurd two years ago. In 2026, it's not only possible — it's practical. Thanks to aggressive quantization, optimized inference engines, and a new generation of compact models, you can run an LLM on a Raspberry Pi and get usable results for chatbots, home automation, code assistance, and offline AI projects.
This guide walks you through everything: hardware selection, OS prep, Ollama installation, model choices, real-world performance numbers, and tips to squeeze every last token per second out of your Pi.
Why Run a Local LLM on a Raspberry Pi?
Before we dive into the how, let's cover the why:
- Privacy. Your prompts never leave your network. No cloud, no logging, no third-party access.
- Cost. After the one-time hardware purchase, inference is free — forever. No API bills, no token limits.
- Offline access. Works without internet. Perfect for remote locations, field work, or air-gapped environments.
- Learning. Understanding how LLMs actually run on constrained hardware teaches you more about AI than any API wrapper ever will.
- Home automation. A Pi running a local LLM can power a privacy-first smart home assistant, process voice commands, or classify sensor data — all without cloud dependencies.
If any of those resonate, keep reading.
What You Need: Hardware Shopping List
The Raspberry Pi 5 with 8GB RAM is the minimum viable platform for running LLMs. The 4GB model technically works but runs out of memory with anything beyond the smallest models. Here's the full hardware list:
Essential Hardware
| Component | Why You Need It | Recommended |
|---|---|---|
| Raspberry Pi 5 (8GB) | The 8GB model gives you enough headroom for 1B–3B parameter models with room for the OS | Raspberry Pi 5 8GB on Amazon |
| MicroSD Card (128GB+) | Models, OS, and swap space add up fast. 128GB gives breathing room | 128GB MicroSD Cards on Amazon |
| Active Cooler / Fan | LLM inference pushes all four CPU cores to 100%. Without active cooling, thermal throttling kills performance | Raspberry Pi 5 Active Cooler on Amazon |
| USB-C Power Supply (27W) | The Pi 5 needs a proper 5V/5A supply under sustained AI workloads. Don't cheap out here | Included with most Pi 5 kits |
Strongly Recommended
| Component | Why | Recommended |
|---|---|---|
| USB SSD (256GB+) | Swap on SSD is dramatically faster than microSD swap. Also lets you store more models | USB SSD for Raspberry Pi on Amazon |
| Coral USB Accelerator | Google's Edge TPU accelerates TensorFlow Lite vision models (object detection, classification) — it won't speed up LLM inference, but it's handy if your project mixes in computer vision | Google Coral USB Accelerator on Amazon |
> Budget estimate: A complete Raspberry Pi AI setup runs about $120–$180 depending on whether you add the SSD and case. That's less than two months of GPT-4 API usage for most developers.
Step 1: Prepare Your Raspberry Pi
Flash the OS
Use Raspberry Pi OS (64-bit, Lite) for maximum available RAM. The desktop environment eats 300–500MB that you'll want for model inference.
# Download Raspberry Pi Imager from raspberrypi.com
# Select: Raspberry Pi OS Lite (64-bit) — Bookworm
# Flash to your microSD card
# Enable SSH in the imager settings
First Boot Setup
# Update everything
sudo apt update && sudo apt full-upgrade -y
# Install essentials
sudo apt install -y curl wget git htop
# Set your hostname (optional but nice)
sudo hostnamectl set-hostname pi-llm
Configure Swap on USB SSD
This is critical. MicroSD swap is painfully slow and wears out the card. If you have a USB SSD, set up swap there:
# Identify your USB SSD
lsblk
# Create a swap file on the SSD (adjust path to your mount point)
sudo mkdir -p /mnt/ssd
sudo mount /dev/sda1 /mnt/ssd
# Create 8GB swap file
sudo dd if=/dev/zero of=/mnt/ssd/swapfile bs=1M count=8192
sudo chmod 600 /mnt/ssd/swapfile
sudo mkswap /mnt/ssd/swapfile
sudo swapon /mnt/ssd/swapfile
# Make it permanent — persist the SSD mount too, or the swap entry
# won't activate after a reboot (change ext4 if your SSD uses another filesystem)
echo '/dev/sda1 /mnt/ssd ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab
echo '/mnt/ssd/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
# Disable SD card swap
sudo dphys-swapfile swapoff
sudo systemctl disable dphys-swapfile
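Before moving on, it's worth a quick sanity check that the new swap is actually active and the SD-card swap is gone — a minimal verification step, assuming the commands above completed without errors:
# List active swap devices — only /mnt/ssd/swapfile should appear
swapon --show
# Show memory and swap totals as the kernel sees them
free -h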
Overclock (Optional but Recommended)
The Pi 5's default clock is 2.4GHz. You can safely push it to 2.8–3.0GHz with active cooling:
sudo nano /boot/firmware/config.txt
# Add these lines:
arm_freq=2800
gpu_freq=910
over_voltage_delta=50000
Overclocking typically yields a 15–25% improvement in tokens per second. Make sure you have the active cooler installed — without it, the Pi will thermal-throttle back to stock speeds within minutes.
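To confirm the overclock took effect and the board isn't being throttled back, you can use the firmware's vcgencmd tool — the exact frequency reported will vary with load:
# Current ARM core frequency in Hz (~2800000000 under load after the overclock)
vcgencmd measure_clock arm
# Throttling flags — throttled=0x0 means no under-voltage or thermal throttling
vcgencmd get_throttled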
Step 2: Install Ollama
Ollama is the easiest way to run LLMs on a Raspberry Pi. It handles model downloading, quantization selection, and provides both a CLI and an API server.
# One-line install
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
# Start the Ollama service (it auto-starts, but just in case)
sudo systemctl enable ollama
sudo systemctl start ollama
Ollama Configuration for Pi
Ollama's defaults aren't optimized for low-resource devices. Create or edit the systemd override:
sudo systemctl edit ollama
Add these environment variables:
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_FLASH_ATTENTION=1"
This ensures Ollama only loads one model at a time (saving RAM) and enables flash attention for better memory efficiency.
sudo systemctl daemon-reload
sudo systemctl restart ollama
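If you want to confirm the override actually applied, systemd can print the environment it passes to the service — a quick sanity check:
# Should list OLLAMA_MAX_LOADED_MODELS, OLLAMA_NUM_PARALLEL and OLLAMA_FLASH_ATTENTION
systemctl show ollama --property=Environment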
Step 3: Choose Your Models
Not all models are created equal on 8GB of ARM memory. Here's what actually works, tested and benchmarked on Raspberry Pi 5 (8GB):
Tier 1: Fast and Usable (Recommended)
Gemma 3 1B (Q4_K_M)
ollama pull gemma3:1b
- Speed: ~18–22 tokens/sec
- RAM usage: ~1.2GB
- Verdict: The speed champion. Google's smallest Gemma model is remarkably capable for its size. Best choice for real-time chat applications and home automation triggers.
TinyLlama 1.1B (Q4_K_M)
ollama pull tinyllama
- Speed: ~12–18 tokens/sec
- RAM usage: ~1.0GB
- Verdict: The original Pi-friendly model. Still excellent for simple tasks, summarization, and as a lightweight assistant. Very stable.
Llama 3.2 1B Instruct (Q4_K_M)
ollama pull llama3.2:1b
- Speed: ~15–20 tokens/sec
- RAM usage: ~1.3GB
- Verdict: Meta's smallest Llama brings surprising instruction-following quality for a 1B model. Great balance of speed and capability.
Tier 2: Slower but Smarter
Qwen 2.5 1.5B (Q4_K_M)
ollama pull qwen2.5:1.5b
- Speed: ~10–14 tokens/sec
- RAM usage: ~1.5GB
- Verdict: Alibaba's compact model punches above its weight in reasoning tasks. Worth the slight speed penalty if you need better answers.
Phi-3 Mini 3.8B (Q4_K_M)
ollama pull phi3:mini
- Speed: ~4–7 tokens/sec
- RAM usage: ~3.5GB
- Verdict: Microsoft's Phi-3 Mini is the smartest model you can realistically run on a Pi. It's slow — expect 2–4 second pauses between sentences — but the output quality is notably better than the 1B models. Use it when accuracy matters more than speed.
Llama 3.2 3B Instruct (Q4_K_M)
ollama pull llama3.2:3b
- Speed: ~4–6 tokens/sec
- RAM usage: ~3.2GB
- Verdict: The largest model that runs without swap pressure. Similar speed to Phi-3 Mini with strong general-purpose capability.
Gemma 2 2B (Q4_K_M)
ollama pull gemma2:2b
- Speed: ~8–12 tokens/sec
- RAM usage: ~2.0GB
- Verdict: A solid middle-ground option. Faster than 3B models, smarter than 1B models. Good for structured output and classification.
Tier 3: Technically Possible (Not Recommended)
Models above 3.8B parameters (like Llama 3.1 8B, Mistral 7B, or Gemma 7B) will technically load on a Pi 5 with swap enabled, but expect:
- 1–3 tokens per second
- Heavy swap usage that thrashes your SSD
- Minutes-long time to first token
- System instability under sustained use
If you need 7B-class models, consider a Mini PC with 16–32GB RAM or a machine with a dedicated GPU instead.
Performance Summary Table
| Model | Parameters | Speed (tok/s) | RAM Usage | Quality Rating |
|---|---|---|---|---|
| Gemma 3 1B | 1B | 18–22 | ~1.2GB | ★★★☆☆ |
| TinyLlama | 1.1B | 12–18 | ~1.0GB | ★★☆☆☆ |
| Llama 3.2 1B | 1B | 15–20 | ~1.3GB | ★★★☆☆ |
| Qwen 2.5 1.5B | 1.5B | 10–14 | ~1.5GB | ★★★☆☆ |
| Gemma 2 2B | 2B | 8–12 | ~2.0GB | ★★★☆☆ |
| Llama 3.2 3B | 3B | 4–6 | ~3.2GB | ★★★★☆ |
| Phi-3 Mini | 3.8B | 4–7 | ~3.5GB | ★★★★☆ |
*Benchmarks on Raspberry Pi 5 (8GB), Raspberry Pi OS Lite 64-bit, Ollama, Q4_K_M quantization. Overclocked to 2.8GHz with active cooling.*
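If you want to reproduce these numbers on your own board, ollama run has a --verbose flag that prints timing statistics (including an eval rate in tokens per second) after each response — a quick check rather than a rigorous benchmark, and your numbers will vary with cooling, clock speed, and prompt length:
# Print generation stats (eval rate = tokens/sec) after the response
ollama run gemma3:1b --verbose "Write one paragraph about edge AI."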
Step 4: Test Your Setup
Interactive Chat
# Start a chat session
ollama run gemma3:1b
# Try a prompt
>>> Explain what a Raspberry Pi is in three sentences.
API Access
Ollama runs an API server on port 11434 by default. You can query it from any device on your network:
curl http://pi-llm.local:11434/api/generate -d '{
"model": "gemma3:1b",
"prompt": "What is edge AI?",
"stream": false
}'
This opens up integrations with Home Assistant, Node-RED, custom Python scripts, and tools like Open WebUI for a ChatGPT-like web interface.
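For scripted integrations you usually want just the generated text back, plus a couple of generation options — a small sketch using jq to extract the response field; the model, prompt, and option values here are purely illustrative:
# Ask for a short, low-temperature answer and keep only the generated text
curl -s http://pi-llm.local:11434/api/generate -d '{
  "model": "gemma3:1b",
  "prompt": "Classify this command as lights, music, or other: turn on the kitchen lamp",
  "stream": false,
  "options": { "temperature": 0.1, "num_predict": 20 }
}' | jq -r '.response'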
Open WebUI (Optional Web Interface)
If you want a browser-based chat interface:
# Install Docker
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
# Run Open WebUI
docker run -d --network=host \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Access it at http://pi-llm.local:8080 and connect it to your local Ollama instance.
Step 5: Optimize Performance
Memory Management
# Check how much RAM Ollama is using
ollama ps
# Unload a model to free RAM
ollama stop gemma3:1b
# Monitor system resources
htop
Use llama.cpp for Maximum Speed
If you need every last token per second, llama.cpp gives you 10–20% better performance than Ollama thanks to finer control over thread count, context length, and batch size:
# Build llama.cpp from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j4
# Run a model directly
./build/bin/llama-cli \
-m models/gemma-3-1b-Q4_K_M.gguf \
-t 4 \
-c 2048 \
-p "Explain quantum computing simply"
Thermal Management
Sustained LLM inference generates significant heat. Monitor your CPU temperature:
# Check temperature
vcgencmd measure_temp
# Watch it continuously
watch -n 1 vcgencmd measure_temp
If temperatures exceed 80°C consistently, your active cooler isn't making proper contact, or you may need a case with better airflow.
Practical Raspberry Pi AI Projects
Now that you have a working LLM on your Pi, here are real-world applications:
1. Privacy-First Smart Home Assistant
Connect your Pi LLM to Home Assistant via the Ollama API. Process voice commands locally — no Alexa, no Google, no cloud.
2. Offline Coding Assistant
Use Llama 3.2 3B or Phi-3 Mini as a local code helper. It won't replace GitHub Copilot, but for quick syntax lookups, regex generation, and boilerplate code, it's surprisingly useful.
3. Document Summarizer
Feed local documents into your LLM for summarization. Perfect for processing meeting notes, research papers, or logs without uploading sensitive content to the cloud.
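A minimal way to try this from the shell — the file name is just an example, and long documents will overflow the small context window, so trim them first:
# Summarize a local text file entirely on-device
ollama run llama3.2:3b "Summarize the following notes in five bullet points: $(cat meeting-notes.txt)"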
4. Network Monitoring AI
Use a small model to analyze system logs, detect anomalies, and generate human-readable alerts. Tools like OpenClaw can orchestrate multiple AI agents on a Pi.
5. Educational AI Tutor
Set up a kid-friendly AI tutor that runs entirely offline. No content filtering concerns beyond what you configure locally.
Limitations: What a Pi Can't Do
Let's be honest about the boundaries:
- No image generation. Stable Diffusion needs a GPU. Period.
- No real-time conversation. Even the fastest models have noticeable latency. This is a "type and wait" experience, not a voice assistant replacement (yet).
- Limited context windows. With 8GB RAM, you're practically limited to 2048–4096 token context windows. Long documents and multi-turn conversations will hit this ceiling (see the sketch after this list for pinning a model's context length).
- No fine-tuning. Training or fine-tuning models requires far more RAM and compute than a Pi can offer. You're running pre-trained models only.
- 7B+ models are impractical. They technically run, but at 1–3 tokens per second with heavy swap usage, they're not useful for anything interactive.
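If you keep hitting the context ceiling, you can cap a model's context length explicitly so Ollama never tries to allocate more than the Pi can handle — a small sketch using a custom Modelfile; the tag name is arbitrary:
# Create a variant of Llama 3.2 1B with the context window capped at 2048 tokens
cat > Modelfile <<'EOF'
FROM llama3.2:1b
PARAMETER num_ctx 2048
EOF
ollama create llama3.2-1b-2k -f Modelfile
ollama run llama3.2-1b-2k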
Raspberry Pi 5 vs. Alternatives for Local LLMs
| Platform | Price | RAM | LLM Performance | Best For |
|---|---|---|---|---|
| Raspberry Pi 5 (8GB) | ~$80 | 8GB | 1B–3B models, 4–22 tok/s | Budget AI, learning, IoT |
| Orange Pi 5 (16GB) | ~$120 | 16GB | Up to 7B models | More headroom on a budget |
| Mac Mini M4 | ~$600 | 16–32GB | Up to 32B models (quantized) | Serious local LLM work |
| Mini PC (32GB) | ~$400 | 32GB | Up to 13B models | Home server, multi-model |
The Pi is the entry point, not the endgame. If you outgrow it, check our Best Hardware for Local LLMs guide for upgrade paths.
Hailo AI HAT+ (Advanced: Hardware Acceleration)
Raspberry Pi's official AI HAT+ with the Hailo-8L chip offers hardware-accelerated inference. As of early 2026, it works with select models via the hailo-ollama compatibility layer:
# Install Hailo model zoo
wget https://dev-public.hailo.ai/2025_12/Hailo10/hailo_gen_ai_model_zoo_5.1.1_arm64.deb
sudo dpkg -i hailo_gen_ai_model_zoo_*.deb
This is still an evolving ecosystem. If you're primarily interested in LLMs (not computer vision), stick with CPU-based Ollama for now. The Hailo HAT shines more for vision tasks like object detection and image classification.
Troubleshooting Common Issues
"Out of memory" errors
- Use smaller models (1B–1.5B)
- Ensure swap is configured on SSD, not microSD
- Set OLLAMA_MAX_LOADED_MODELS=1
- Close any unnecessary services
Extremely slow first response
- First token latency is normal (2–5 seconds for small models, 10–30 seconds for 3B+)
- Subsequent tokens are faster
- Ensure the model is fully loaded (check with ollama ps)
Thermal throttling
- Install the active cooler
- Ensure proper thermal paste/pad contact
- Use a case with ventilation
- Consider undervolting if noise is a concern
Ollama won't start
# Check logs
journalctl -u ollama --no-pager -n 50
# Restart the service
sudo systemctl restart ollama
Conclusion
Running LLMs on a Raspberry Pi in 2026 is no longer a novelty — it's a legitimate way to deploy private, cost-free AI for home automation, learning, and lightweight assistant tasks. The sweet spot is Gemma 3 1B for speed or Phi-3 Mini for quality, both running through Ollama on a Raspberry Pi 5 (8GB) with an active cooler and a USB SSD for swap.
You won't replace GPT-4 or Claude with a Pi. But for private, always-on, zero-cost AI that runs on your desk and respects your data? A Raspberry Pi is hard to beat.
Next Steps
- Scale up? Read our Home AI Server Build Guide
- Compare inference engines? Check Ollama vs LM Studio vs llama.cpp
- Run AI agents on your Pi? See our OpenClaw Raspberry Pi Setup Guide
*This article contains affiliate links. If you purchase through these links, ToolHalla earns a small commission at no extra cost to you. We only recommend products we've tested and believe in. See our affiliate disclosure for details.*
Related Articles
- How to Run DeepSeek R1 Locally: Complete Setup Guide (2026)
- Dual GPU Setup Guide for Local LLMs (2026): Double Your VRAM
FAQ
Can you run LLMs on a Raspberry Pi?
Yes — a Raspberry Pi 5 (8GB) runs 1B–3.8B models via Ollama or llama.cpp at roughly 4–22 tokens/sec depending on model size. For practical use, 1B models (Gemma 3 1B, Llama 3.2 1B, TinyLlama) are the sweet spot; Phi-3 Mini 3.8B works if you're patient. Not fast, but fully offline and private.
What is the best LLM for Raspberry Pi?
Gemma 3 1B offers the best balance of speed and quality at Pi-friendly sizes, with Llama 3.2 1B and Qwen 2.5 1.5B close behind. Phi-3 Mini 3.8B runs on an 8GB Pi 5 at ~4–7 tok/s — the smartest option, usable for batch tasks but slow for chat.
How much RAM do you need for LLMs on Pi?
8GB is the minimum for anything useful — the Pi 5 8GB is the only model recommended for LLM inference. 4GB Pi models are too memory-constrained. The Pi's CPU is the primary bottleneck.
What is the fastest way to run LLMs on Raspberry Pi?
llama.cpp with -ngl 0 (CPU-only, no GPU layers) and a thread count matched to the hardware (-t 4 for the Pi 5's four cores) gives the best performance. NEON SIMD is used automatically when you build llama.cpp natively on a 64-bit ARM OS, so no extra compile flags are needed.
Can a Raspberry Pi run a local AI assistant offline?
Yes — Ollama on Pi 5 with a 1.5B model gives a fully offline, private AI assistant. Response time is 5-15 seconds per message. Practical for low-frequency use cases: home automation triggers, offline notes assistant, or scheduled text summarization.