Run LLMs on Raspberry Pi 5: Step-by-Step Setup Guide (2026)
Learn how to run local LLMs on a Raspberry Pi 5 in 2026. Complete setup guide covering Ollama installation, best models (Phi-3, Gemma 3, Llama 3.2, TinyLlama), performance benchmarks, hardware recommendations, and practical AI projects.
Running a large language model on an $80 single-board computer sounded absurd two years ago. In 2026, it's not only possible — it's practical. Thanks to aggressive quantization, optimized inference engines, and a new generation of compact models, you can run an LLM on a Raspberry Pi and get usable results for chatbots, home automation, code assistance, and offline AI projects.
This guide walks you through everything: hardware selection, OS prep, Ollama installation, model choices, real-world performance numbers, and tips to squeeze every last token per second out of your Pi.
Why Run a Local LLM on a Raspberry Pi?
Before we dive into the how, let's cover the why:
- Privacy. Your prompts never leave your network. No cloud, no logging, no third-party access.
- Cost. After the one-time hardware purchase, inference is free — forever. No API bills, no token limits.
- Offline access. Works without internet. Perfect for remote locations, field work, or air-gapped environments.
- Learning. Understanding how LLMs actually run on constrained hardware teaches you more about AI than any API wrapper ever will.
- Home automation. A Pi running a local LLM can power a privacy-first smart home assistant, process voice commands, or classify sensor data — all without cloud dependencies.
If any of those resonate, keep reading.
What You Need: Hardware Shopping List
The Raspberry Pi 5 with 8GB RAM is the minimum viable platform for running LLMs. The 4GB model technically works but runs out of memory with anything beyond the smallest models. Here's the full hardware list:
Essential Hardware
| Component | Why You Need It | Recommended |
|---|---|---|
| Raspberry Pi 5 (8GB) | The 8GB model gives you enough headroom for 1B–3B parameter models with room for the OS | Raspberry Pi 5 8GB on Amazon |
| MicroSD Card (128GB+) | Models, OS, and swap space add up fast. 128GB gives breathing room | 128GB MicroSD Cards on Amazon |
| Active Cooler / Fan | LLM inference pushes all four CPU cores to 100%. Without active cooling, thermal throttling kills performance | Raspberry Pi 5 Active Cooler on Amazon |
| USB-C Power Supply (27W) | The Pi 5 needs a proper 5V/5A supply under sustained AI workloads. Don't cheap out here | Included with most Pi 5 kits |
Strongly Recommended
| Component | Why | Recommended |
|---|---|---|
| USB SSD (256GB+) | Swap on SSD is dramatically faster than microSD swap. Also lets you store more models | USB SSD for Raspberry Pi on Amazon |
| Coral USB Accelerator | Google's Edge TPU accelerates TensorFlow Lite vision models (object detection, classification) — it won't speed up LLM inference, but it's handy if your project mixes in computer vision | Google Coral USB Accelerator on Amazon |
> Budget estimate: A complete Raspberry Pi AI setup runs about $120–$180 depending on whether you add the SSD and case. That's less than two months of GPT-4 API usage for most developers.
Step 1: Prepare Your Raspberry Pi
Flash the OS
Use Raspberry Pi OS (64-bit, Lite) for maximum available RAM. The desktop environment eats 300–500MB that you'll want for model inference.
# Download Raspberry Pi Imager from raspberrypi.com
# Select: Raspberry Pi OS Lite (64-bit) — Bookworm
# Flash to your microSD card
# Enable SSH in the imager settings
First Boot Setup
# Update everything
sudo apt update && sudo apt full-upgrade -y
# Install essentials
sudo apt install -y curl wget git htop
# Set your hostname (optional but nice)
sudo hostnamectl set-hostname pi-llm
Configure Swap on USB SSD
This is critical. MicroSD swap is painfully slow and wears out the card. If you have a USB SSD, set up swap there:
# Identify your USB SSD
lsblk
# Create a swap file on the SSD (adjust path to your mount point)
sudo mkdir -p /mnt/ssd
sudo mount /dev/sda1 /mnt/ssd
# Create 8GB swap file
sudo dd if=/dev/zero of=/mnt/ssd/swapfile bs=1M count=8192
sudo chmod 600 /mnt/ssd/swapfile
sudo mkswap /mnt/ssd/swapfile
sudo swapon /mnt/ssd/swapfile
# Make it permanent — persist the SSD mount too, or the swap entry
# won't activate after a reboot (change ext4 if your SSD uses another filesystem)
echo '/dev/sda1 /mnt/ssd ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab
echo '/mnt/ssd/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
# Disable SD card swap
sudo dphys-swapfile swapoff
sudo systemctl disable dphys-swapfile
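Before moving on, it's worth a quick sanity check that the new swap is actually active and the SD-card swap is gone — a minimal verification step, assuming the commands above completed without errors:
# List active swap devices — only /mnt/ssd/swapfile should appear
swapon --show
# Show memory and swap totals as the kernel sees them
free -h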
Overclock (Optional but Recommended)
The Pi 5's default clock is 2.4GHz. You can safely push it to 2.8–3.0GHz with active cooling:
sudo nano /boot/firmware/config.txt
# Add these lines:
arm_freq=2800
gpu_freq=910
over_voltage_delta=50000
Overclocking typically yields a 15–25% improvement in tokens per second. Make sure you have the active cooler installed — without it, the Pi will thermal-throttle back to stock speeds within minutes.
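To confirm the overclock took effect and the board isn't being throttled back, you can use the firmware's vcgencmd tool — the exact frequency reported will vary with load:
# Current ARM core frequency in Hz (~2800000000 under load after the overclock)
vcgencmd measure_clock arm
# Throttling flags — throttled=0x0 means no under-voltage or thermal throttling
vcgencmd get_throttled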
Step 2: Install Ollama
Ollama is the easiest way to run LLMs on a Raspberry Pi. It handles model downloading, quantization selection, and provides both a CLI and an API server.
# One-line install
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
# Start the Ollama service (it auto-starts, but just in case)
sudo systemctl enable ollama
sudo systemctl start ollama
Ollama Configuration for Pi
Ollama's defaults aren't optimized for low-resource devices. Create or edit the systemd override:
sudo systemctl edit ollama
Add these environment variables:
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_FLASH_ATTENTION=1"
This ensures Ollama only loads one model at a time (saving RAM) and enables flash attention for better memory efficiency.
sudo systemctl daemon-reload
sudo systemctl restart ollama
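If you want to confirm the override actually applied, systemd can print the environment it passes to the service — a quick sanity check:
# Should list OLLAMA_MAX_LOADED_MODELS, OLLAMA_NUM_PARALLEL and OLLAMA_FLASH_ATTENTION
systemctl show ollama --property=Environment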
Step 3: Choose Your Models
Not all models are created equal on 8GB of ARM memory. Here's what actually works, tested and benchmarked on Raspberry Pi 5 (8GB):
Tier 1: Fast and Usable (Recommended)
Gemma 3 1B (Q4_K_M)
ollama pull gemma3:1b
- Speed: ~18–22 tokens/sec
- RAM usage: ~1.2GB
- Verdict: The speed champion. Google's smallest Gemma model is remarkably capable for its size. Best choice for real-time chat applications and home automation triggers.
TinyLlama 1.1B (Q4_K_M)
ollama pull tinyllama
- Speed: ~12–18 tokens/sec
- RAM usage: ~1.0GB
- Verdict: The original Pi-friendly model. Still excellent for simple tasks, summarization, and as a lightweight assistant. Very stable.
Llama 3.2 1B Instruct (Q4_K_M)
ollama pull llama3.2:1b
- Speed: ~15–20 tokens/sec
- RAM usage: ~1.3GB
- Verdict: Meta's smallest Llama brings surprising instruction-following quality for a 1B model. Great balance of speed and capability.
Tier 2: Slower but Smarter
Qwen 2.5 1.5B (Q4_K_M)
ollama pull qwen2.5:1.5b
- Speed: ~10–14 tokens/sec
- RAM usage: ~1.5GB
- Verdict: Alibaba's compact model punches above its weight in reasoning tasks. Worth the slight speed penalty if you need better answers.
Phi-3 Mini 3.8B (Q4_K_M)
ollama pull phi3:mini
- Speed: ~4–7 tokens/sec
- RAM usage: ~3.5GB
- Verdict: Microsoft's Phi-3 Mini is the smartest model you can realistically run on a Pi. It's slow — expect 2–4 second pauses between sentences — but the output quality is notably better than the 1B models. Use it when accuracy matters more than speed.
Llama 3.2 3B Instruct (Q4_K_M)
ollama pull llama3.2:3b
- Speed: ~4–6 tokens/sec
- RAM usage: ~3.2GB
- Verdict: The largest model that runs without swap pressure. Similar speed to Phi-3 Mini with strong general-purpose capability.
Gemma 2 2B (Q4_K_M)
ollama pull gemma2:2b
- Speed: ~8–12 tokens/sec
- RAM usage: ~2.0GB
- Verdict: A solid middle-ground option. Faster than 3B models, smarter than 1B models. Good for structured output and classification.
Tier 3: Technically Possible (Not Recommended)
Models above 3.8B parameters (like Llama 3.1 8B, Mistral 7B, or Gemma 7B) will technically load on a Pi 5 with swap enabled, but expect:
- 1–3 tokens per second
- Heavy swap usage that thrashes your SSD
- Minutes-long time to first token
- System instability under sustained use
If you need 7B-class models, consider a Mini PC with 16–32GB RAM or a machine with a dedicated GPU instead.
Performance Summary Table
| Model | Parameters | Speed (tok/s) | RAM Usage | Quality Rating |
|---|---|---|---|---|
| Gemma 3 1B | 1B | 18–22 | ~1.2GB | ★★★☆☆ |
| TinyLlama | 1.1B | 12–18 | ~1.0GB | ★★☆☆☆ |
| Llama 3.2 1B | 1B | 15–20 | ~1.3GB | ★★★☆☆ |
| Qwen 2.5 1.5B | 1.5B | 10–14 | ~1.5GB | ★★★☆☆ |
| Gemma 2 2B | 2B | 8–12 | ~2.0GB | ★★★☆☆ |
| Llama 3.2 3B | 3B | 4–6 | ~3.2GB | ★★★★☆ |
| Phi-3 Mini | 3.8B | 4–7 | ~3.5GB | ★★★★☆ |
*Benchmarks on Raspberry Pi 5 (8GB), Raspberry Pi OS Lite 64-bit, Ollama, Q4_K_M quantization. Overclocked to 2.8GHz with active cooling.*
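If you want to reproduce these numbers on your own board, ollama run has a --verbose flag that prints timing statistics (including an eval rate in tokens per second) after each response — a quick check rather than a rigorous benchmark, and your numbers will vary with cooling, clock speed, and prompt length:
# Print generation stats (eval rate = tokens/sec) after the response
ollama run gemma3:1b --verbose "Write one paragraph about edge AI."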
Step 4: Test Your Setup
Interactive Chat
# Start a chat session
ollama run gemma3:1b
# Try a prompt
>>> Explain what a Raspberry Pi is in three sentences.
API Access
Ollama runs an API server on port 11434 by default. You can query it from any device on your network:
curl http://pi-llm.local:11434/api/generate -d '{
"model": "gemma3:1b",
"prompt": "What is edge AI?",
"stream": false
}'
This opens up integrations with Home Assistant, Node-RED, custom Python scripts, and tools like Open WebUI for a ChatGPT-like web interface.
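For scripted integrations you usually want just the generated text back, plus a couple of generation options — a small sketch using jq to extract the response field; the model, prompt, and option values here are purely illustrative:
# Ask for a short, low-temperature answer and keep only the generated text
curl -s http://pi-llm.local:11434/api/generate -d '{
  "model": "gemma3:1b",
  "prompt": "Classify this command as lights, music, or other: turn on the kitchen lamp",
  "stream": false,
  "options": { "temperature": 0.1, "num_predict": 20 }
}' | jq -r '.response'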
Open WebUI (Optional Web Interface)
If you want a browser-based chat interface:
# Install Docker
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
# Run Open WebUI
docker run -d --network=host \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Access it at http://pi-llm.local:8080 and connect it to your local Ollama instance.
Step 5: Optimize Performance
Memory Management
# Check how much RAM Ollama is using
ollama ps
# Unload a model to free RAM
ollama stop gemma3:1b
# Monitor system resources
htop
Use llama.cpp for Maximum Speed
If you need every last token per second, llama.cpp gives you 10–20% better performance than Ollama thanks to finer control over thread count, context length, and batch size:
# Build llama.cpp from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j4
# Run a model directly
./build/bin/llama-cli \
-m models/gemma-3-1b-Q4_K_M.gguf \
-t 4 \
-c 2048 \
-p "Explain quantum computing simply"
Thermal Management
Sustained LLM inference generates significant heat. Monitor your CPU temperature:
# Check temperature
vcgencmd measure_temp
# Watch it continuously
watch -n 1 vcgencmd measure_temp
If temperatures exceed 80°C consistently, your active cooler isn't making proper contact, or you may need a case with better airflow.
Practical Raspberry Pi AI Projects
Now that you have a working LLM on your Pi, here are real-world applications:
1. Privacy-First Smart Home Assistant
Connect your Pi LLM to Home Assistant via the Ollama API. Process voice commands locally — no Alexa, no Google, no cloud.
2. Offline Coding Assistant
Use Llama 3.2 3B or Phi-3 Mini as a local code helper. It won't replace GitHub Copilot, but for quick syntax lookups, regex generation, and boilerplate code, it's surprisingly useful.
3. Document Summarizer
Feed local documents into your LLM for summarization. Perfect for processing meeting notes, research papers, or logs without uploading sensitive content to the cloud.
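A minimal way to try this from the shell — the file name is just an example, and long documents will overflow the small context window, so trim them first:
# Summarize a local text file entirely on-device
ollama run llama3.2:3b "Summarize the following notes in five bullet points: $(cat meeting-notes.txt)"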
4. Network Monitoring AI
Use a small model to analyze system logs, detect anomalies, and generate human-readable alerts. Tools like OpenClaw can orchestrate multiple AI agents on a Pi.
5. Educational AI Tutor
Set up a kid-friendly AI tutor that runs entirely offline. No content filtering concerns beyond what you configure locally.
Limitations: What a Pi Can't Do
Let's be honest about the boundaries:
- No image generation. Stable Diffusion needs a GPU. Period.
- No real-time conversation. Even the fastest models have noticeable latency. This is a "type and wait" experience, not a voice assistant replacement (yet).
- Limited context windows. With 8GB RAM, you're practically limited to 2048–4096 token context windows. Long documents and multi-turn conversations will hit this ceiling (see the sketch after this list for pinning a model's context length).
- No fine-tuning. Training or fine-tuning models requires far more RAM and compute than a Pi can offer. You're running pre-trained models only.
- 7B+ models are impractical. They technically run, but at 1–3 tokens per second with heavy swap usage, they're not useful for anything interactive.
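If you keep hitting the context ceiling, you can cap a model's context length explicitly so Ollama never tries to allocate more than the Pi can handle — a small sketch using a custom Modelfile; the tag name is arbitrary:
# Create a variant of Llama 3.2 1B with the context window capped at 2048 tokens
cat > Modelfile <<'EOF'
FROM llama3.2:1b
PARAMETER num_ctx 2048
EOF
ollama create llama3.2-1b-2k -f Modelfile
ollama run llama3.2-1b-2k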
Raspberry Pi 5 vs. Alternatives for Local LLMs
| Platform | Price | RAM | LLM Performance | Best For |
|---|---|---|---|---|
| Raspberry Pi 5 (8GB) | ~$80 | 8GB | 1B–3B models, 4–22 tok/s | Budget AI, learning, IoT |
| Orange Pi 5 (16GB) | ~$120 | 16GB | Up to 7B models | More headroom on a budget |
| Mac Mini M4 | ~$600 | 16–32GB | Up to 32B models (quantized) | Serious local LLM work |
| Mini PC (32GB) | ~$400 | 32GB | Up to 13B models | Home server, multi-model |
The Pi is the entry point, not the endgame. If you outgrow it, check our Best Hardware for Local LLMs guide for upgrade paths.
Hailo AI HAT+ (Advanced: Hardware Acceleration)
Raspberry Pi's official AI HAT+ with the Hailo-8L chip offers hardware-accelerated inference. As of early 2026, it works with select models via the hailo-ollama compatibility layer:
# Install Hailo model zoo
wget https://dev-public.hailo.ai/2025_12/Hailo10/hailo_gen_ai_model_zoo_5.1.1_arm64.deb
sudo dpkg -i hailo_gen_ai_model_zoo_*.deb
This is still an evolving ecosystem. If you're primarily interested in LLMs (not computer vision), stick with CPU-based Ollama for now. The Hailo HAT shines more for vision tasks like object detection and image classification.
Troubleshooting Common Issues
"Out of memory" errors
- Use smaller models (1B–1.5B)
- Ensure swap is configured on SSD, not microSD
- Set OLLAMA_MAX_LOADED_MODELS=1
- Close any unnecessary services
Extremely slow first response
- First token latency is normal (2–5 seconds for small models, 10–30 seconds for 3B+)
- Subsequent tokens are faster
- Ensure the model is fully loaded (check with ollama ps)
Thermal throttling
- Install the active cooler
- Ensure proper thermal paste/pad contact
- Use a case with ventilation
- Consider undervolting if noise is a concern
Ollama won't start
# Check logs
journalctl -u ollama --no-pager -n 50
# Restart the service
sudo systemctl restart ollama
Conclusion
Running LLMs on a Raspberry Pi in 2026 is no longer a novelty — it's a legitimate way to deploy private, cost-free AI for home automation, learning, and lightweight assistant tasks. The sweet spot is Gemma 3 1B for speed or Phi-3 Mini for quality, both running through Ollama on a Raspberry Pi 5 (8GB) with an active cooler and a USB SSD for swap.
You won't replace GPT-4 or Claude with a Pi. But for private, always-on, zero-cost AI that runs on your desk and respects your data? A Raspberry Pi is hard to beat.
Next Steps
- Scale up? Read our Home AI Server Build Guide
- Compare inference engines? Check Ollama vs LM Studio vs llama.cpp
- Run AI agents on your Pi? See our OpenClaw Raspberry Pi Setup Guide
*This article contains affiliate links. If you purchase through these links, ToolHalla earns a small commission at no extra cost to you. We only recommend products we've tested and believe in. See our affiliate disclosure for details.*
Related Articles
- How to Run DeepSeek R1 Locally: Complete Setup Guide (2026)
- Dual GPU Setup Guide for Local LLMs (2026): Double Your VRAM
FAQ
Can you run LLMs on a Raspberry Pi?
Yes — a Raspberry Pi 5 (8GB) runs 1B–3.8B models via Ollama or llama.cpp at roughly 4–22 tokens/sec depending on model size. For practical use, 1B models (Gemma 3 1B, Llama 3.2 1B, TinyLlama) are the sweet spot; Phi-3 Mini 3.8B works if you're patient. Not fast, but fully offline and private.
What is the best LLM for Raspberry Pi?
Gemma 3 1B offers the best balance of speed and quality at Pi-friendly sizes, with Llama 3.2 1B and Qwen 2.5 1.5B close behind. Phi-3 Mini 3.8B runs on an 8GB Pi 5 at ~4–7 tok/s — the smartest option, usable for batch tasks but slow for chat.
How much RAM do you need for LLMs on Pi?
8GB is the minimum for anything useful — the Pi 5 8GB is the only model recommended for LLM inference. 4GB Pi models are too memory-constrained. The Pi's CPU is the primary bottleneck.
What is the fastest way to run LLMs on Raspberry Pi?
llama.cpp with -ngl 0 (CPU-only, no GPU layers) and a thread count matched to the hardware (-t 4 for the Pi 5's four cores) gives the best performance. NEON SIMD is used automatically when you build llama.cpp natively on a 64-bit ARM OS, so no extra compile flags are needed.
Can a Raspberry Pi run a local AI assistant offline?
Yes — Ollama on Pi 5 with a 1.5B model gives a fully offline, private AI assistant. Response time is 5-15 seconds per message. Practical for low-frequency use cases: home automation triggers, offline notes assistant, or scheduled text summarization.