OpenClaw + Ollama: 2026 Self-Hosted AI Agents
Zero cloud costs, full privacy. Learn to run AI agents locally with OpenClaw & Ollama. Hardware, tuning, models, and failure modes covered.
Running AI agents on someone else's infrastructure means sending every prompt, every document, and every decision through a third-party API. For teams that care about privacy or cost control, or that simply want to own their stack, self-hosting is the answer.
OpenClaw is an open-source AI agent framework that handles multi-agent orchestration, memory, tool use, and cross-platform messaging. Ollama is the simplest way to run LLMs locally. Together, they form a complete self-hosted AI agent platform with zero API costs.
This guide covers the production configuration: which models to run, how to size your hardware, how to set up multi-agent workflows, and how to keep everything running reliably.
Why Self-Host AI Agents?
Zero per-token costs. After hardware, every token is free. A team sending 10M tokens/month to Claude spends ~$30-150/month. Self-hosted: $0 marginal cost.
Complete privacy. Prompts, documents, and agent reasoning never leave your network. Required for legal, medical, financial, and defense use cases.
No rate limits. Run as many concurrent agents as your hardware supports. No 429 errors, no queue slots, no throttling.
Full control. Choose your models, customize behavior, modify the framework, integrate with internal tools. No vendor lock-in.
Architecture Overview
┌─────────────────────────────────────────┐
│            OpenClaw Gateway             │
│   (agent orchestration, memory, MCP)    │
├─────────────────┬───────────────────────┤
│ Agent: Skald    │ Agent: Sleipnir       │
│ (content)       │ (research)            │
│ Model: Qwen 3.5 │ Model: Qwen 3.5 14B   │
├─────────────────┼───────────────────────┤
│ Agent: Völundr  │ Agent: Heimdall       │
│ (development)   │ (monitoring)          │
│ Model: Qwen 3.5 │ Model: Qwen 3.5 8B    │
└────────┬────────┴───────────┬───────────┘
         │                    │
    ┌────▼────┐          ┌────▼────┐
    │ Ollama  │          │  Tools  │
    │  (LLM)  │          │  (MCP)  │
    └─────────┘          └─────────┘
OpenClaw manages the agents. Ollama serves the models. MCP servers provide tools (filesystem, web, database, APIs). Each agent can use a different model — match model size to task complexity.
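To make this concrete, here is roughly what a single agent turn looks like at the Ollama API level. OpenClaw's gateway makes calls like this on each agent's behalf; the system message and prompt below are just placeholders.
# One agent turn, expressed as a direct Ollama chat request (illustrative)
curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen3:14b",
  "messages": [
    {"role": "system", "content": "You are the research agent."},
    {"role": "user", "content": "Summarize the open issues in our docs repo."}
  ],
  "stream": false
}' | jq -r '.message.content'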
Hardware Sizing Guide
Minimum: Single Agent, 8B Model
| Component | Spec | Cost |
|---|---|---|
| GPU | RTX 3060 12GB | ~$180 (used) |
| CPU | Any modern quad-core | — |
| RAM | 16GB | — |
| Storage | 256GB NVMe | — |
Runs Qwen 3.5 8B at Q4_K_M (~30 tok/s). Good for a single personal assistant agent. Not enough for concurrent multi-agent workloads.
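If you want to verify that speed on your own card, Ollama's --verbose flag prints generation stats after every run; the prompt below is arbitrary.
ollama run qwen3:8b --verbose "Write a haiku about GPUs."
# The "eval rate" line in the stats output is your tokens/second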
Recommended: Multi-Agent, 14B Model
| Component | Spec | Cost |
|---|---|---|
| GPU | RTX 3090 24GB | ~$600 (used) |
| CPU | Ryzen 7 / i7 (8+ cores) | ~$200 |
| RAM | 64GB DDR5 | ~$150 |
| Storage | 1TB NVMe | ~$80 |
Runs Qwen 3.5 14B at Q4_K_M with room for multiple agents sharing the model. The 24GB VRAM handles context windows up to 16K comfortably. This is the sweet spot for small teams (2-5 agents). See our 24GB GPU model guide for the full model lineup.
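Ollama's default context window is well under 16K, so to actually use that headroom you can bake a larger num_ctx into a model variant. A minimal sketch; the qwen3-14b-16k name is arbitrary, and a larger context means a larger KV cache in VRAM.
# Create a 16K-context variant of the 14B model
cat > Modelfile << 'EOF'
FROM qwen3:14b
PARAMETER num_ctx 16384
EOF
ollama create qwen3-14b-16k -f Modelfile
ollama run qwen3-14b-16k "Hello"   # quick check that the variant loads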
Production: Multi-Agent, 32B Model
| Component | Spec | Cost |
|---|---|---|
| GPU | RTX 4090 24GB | ~$1,200 |
| CPU | Ryzen 9 / i9 (12+ cores) | ~$350 |
| RAM | 128GB DDR5 | ~$300 |
| Storage | 2TB NVMe | ~$120 |
Runs Qwen 3.5 32B at Q4_K_M (~20 tok/s) or 14B at Q8_0 (near-perfect quality). Handles 5-10 concurrent agents with model sharing. For our BerserKI setup running 14 agents, we use this tier with model-level routing.
Cloud Alternative
No hardware? Vast.ai rents RTX 4090s from ~$0.20/hr. Deploy Ollama + OpenClaw on a cloud GPU instance. At 12 hours/day usage, that's ~$75/month — still cheaper than most API setups at volume.
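If you go this route, the quickest deploy on a rented instance is usually the official Ollama Docker image. A sketch, assuming the instance already ships NVIDIA drivers and the container toolkit:
# Run Ollama with GPU access, then pull the model inside the container
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama pull qwen3:14b
curl -s http://localhost:11434/api/tags   # verify the API responds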
Step-by-Step Setup
1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:14b # Primary model
ollama pull qwen3:8b # Fast model for simple tasks
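A quick smoke test confirms the daemon and the pulled models work before you wire up OpenClaw:
ollama --version
ollama list                       # both models should be listed
ollama run qwen3:14b "Say hi."    # first run loads the model into VRAM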
2. Install OpenClaw
npm install -g openclaw
openclaw init
3. Configure Models in OpenClaw
Edit your OpenClaw config to route agents to the right model:
# openclaw.yaml
models:
  default: ollama/qwen3:14b
  fast: ollama/qwen3:8b

agents:
  research:
    model: default   # 14B for complex research
    tools: [web_search, web_fetch, memory]
  assistant:
    model: fast      # 8B for quick responses
    tools: [memory, calendar]
  coding:
    model: default   # 14B for code generation
    tools: [exec, read, write, web_search]
4. Optimize Ollama for Multi-Agent
# Create systemd override
sudo systemctl edit ollama
# Add:
[Service]
# Handle 4 concurrent requests per loaded model
Environment="OLLAMA_NUM_PARALLEL=4"
# Keep up to 2 models loaded in VRAM
Environment="OLLAMA_MAX_LOADED_MODELS=2"
# Flash attention for better KV-cache memory efficiency
Environment="OLLAMA_FLASH_ATTENTION=1"
# Don't unload models between requests
Environment="OLLAMA_KEEP_ALIVE=24h"
sudo systemctl daemon-reload && sudo systemctl restart ollama
OLLAMA_NUM_PARALLEL=4 is the key setting — it lets multiple agents query the same model simultaneously. Without it, agents queue and wait.
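You can check that parallelism is actually active by firing two requests at once; with OLLAMA_NUM_PARALLEL at 2 or higher the total wall time should be close to a single request, not double. A rough sketch, exact timings depend on your GPU:
time (
  curl -s http://localhost:11434/api/generate \
    -d '{"model":"qwen3:14b","prompt":"Count to 20.","stream":false}' > /dev/null &
  curl -s http://localhost:11434/api/generate \
    -d '{"model":"qwen3:14b","prompt":"Count to 20.","stream":false}' > /dev/null &
  wait
)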
5. Set Up Reverse Proxy (Production)
For remote access, use nginx or Caddy:
# Install Caddy (simplest HTTPS)
sudo apt install caddy
# /etc/caddy/Caddyfile
ollama.yourdomain.com {
    reverse_proxy localhost:11434
    basicauth {
        admin $2a$14$... # bcrypt hash of your password
    }
}
sudo systemctl restart caddy
For the cheapest hosting layer, a Hostinger VPS (~$5/month) handles the reverse proxy and SSL termination if your Ollama server is on a home network behind NAT.
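One simple way to bridge a NAT'd home server to that VPS is an SSH reverse tunnel; the hostname and user below are placeholders, and Tailscale or WireGuard work just as well.
# On the home GPU server: publish local Ollama on the VPS's loopback interface
ssh -N -R 127.0.0.1:11434:localhost:11434 youruser@your-vps
# Caddy on the VPS can then reverse_proxy to localhost:11434 exactly as configured above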
Model Selection for Multi-Agent Systems
Not every agent needs the same model. Match model size to task:
| Agent Role | Recommended Model | Why |
|---|---|---|
| Research / analysis | Qwen 3.5 14B Q4 | Needs reasoning depth |
| Code generation | Qwen 2.5 Coder 14B | Still best for code |
| Quick chat / triage | Qwen 3.5 8B Q4 | Speed over depth |
| Content writing | Qwen 3.5 14B Q4 | Quality matters |
| Monitoring / alerts | Qwen 3.5 8B Q3 | Minimal, fast |
| Complex reasoning | Qwen 3.5 14B + /think | Thinking mode for hard problems |
Using multiple model sizes saves VRAM. An 8B model uses ~5GB vs 14B's ~9.5GB. Loading both costs ~15GB total — well within a 24GB GPU.
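You can sanity-check that math by warming up both models and comparing what Ollama reports against nvidia-smi; exact sizes vary with quantization and context length.
# With OLLAMA_MAX_LOADED_MODELS=2 from step 4, both models stay resident
ollama run qwen3:14b "ok" > /dev/null
ollama run qwen3:8b "ok" > /dev/null
ollama ps      # shows each loaded model and its memory footprint
nvidia-smi     # cross-check against actual VRAM usage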
Monitoring and Reliability
Health Checks
# Check Ollama is running
curl -s http://localhost:11434/api/tags | jq '.models[].name'
# Check loaded models
ollama ps
# Monitor GPU utilization
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 5
Auto-Restart on Failure
# Ollama already runs as systemd service with restart
sudo systemctl status ollama
# For OpenClaw, add a systemd service:
sudo tee /etc/systemd/system/openclaw.service << EOF
[Unit]
Description=OpenClaw AI Agent Gateway
After=ollama.service
[Service]
ExecStart=/usr/bin/openclaw start
Restart=always
RestartSec=10
User=openclaw
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable openclaw && sudo systemctl start openclaw
Cost Tracking
Self-hosted doesn't mean free — you pay for electricity and hardware depreciation. Track your actual costs:
| Cost | Monthly | Annual |
|---|---|---|
| Electricity (RTX 3090, 12h/day) | ~$15 | ~$180 |
| Hardware depreciation (3yr) | ~$30 | ~$360 |
| VPS for proxy (optional) | ~$5 | ~$60 |
| Total | ~$50 | ~$600 |
Compare with API costs for equivalent usage (10M tokens/month on Claude): ~$30-150/month. At those prices ($3-15 per million tokens), self-hosting at ~$50/month breaks even somewhere between roughly 3M and 17M tokens/month, and saves significantly above that.
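The break-even arithmetic behind that range, using the monthly total from the table above:
# Break-even volume = monthly self-hosting cost / API cost per million tokens
echo "50 / 15" | bc -l   # ~3.3M tokens/month against a $15-per-million API
echo "50 / 3"  | bc -l   # ~16.7M tokens/month against a $3-per-million API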
Common Issues and Fixes
Ollama OOM (out of memory): Reduce OLLAMA_NUM_PARALLEL or use a smaller model. Check nvidia-smi for actual VRAM usage.
Slow response with multiple agents: Agents are queuing. Increase OLLAMA_NUM_PARALLEL or add a second GPU.
Model not loading: Check ollama ps and journalctl -u ollama. Common cause: insufficient VRAM for the model + KV cache at your context length.
OpenClaw can't reach Ollama: Verify Ollama is listening on the right interface. Default is localhost:11434. For Docker setups, use host.docker.internal:11434.
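For that last case, the usual fix is to make Ollama listen on all interfaces instead of loopback only, via the same systemd override as step 4; the IP in the final check is a placeholder.
sudo systemctl edit ollama
# Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl daemon-reload && sudo systemctl restart ollama
# From the OpenClaw machine:
curl -s http://<ollama-server-ip>:11434/api/tags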
FAQ
Q: How many agents can I run on one GPU?
A: With model sharing (all agents use the same loaded model), 5-10 agents work well on an RTX 3090 with OLLAMA_NUM_PARALLEL=4. Agents don't each load a separate copy of the model; they share the one in VRAM and take turns at inference.
Q: Can I mix cloud and local models in OpenClaw?
A: Yes. Configure some agents to use Ollama (local) and others to use OpenAI/Anthropic APIs. Use local for high-volume, low-stakes tasks and cloud for complex reasoning where frontier model quality matters.
Q: Do I need to run OpenClaw and Ollama on the same machine?
A: No. Ollama can run on a GPU server while OpenClaw runs anywhere that can reach Ollama's API. Connect them over your local network, Tailscale, or WireGuard.
Q: Is OpenClaw production-ready?
A: OpenClaw is actively developed and used in production by multiple teams (including us at BerserKI with 14 agents). It's stable for multi-agent orchestration. Check the OpenClaw GitHub for the latest release.
Q: What's the cheapest possible self-hosted AI agent setup?
A: A used RTX 3060 12GB (~$180) + any modern PC running Ollama + OpenClaw. Total: ~$300-400. Runs Qwen 3.5 8B at ~30 tok/s — enough for a personal AI assistant running 24/7. For more options, see our best hardware for local LLMs guide.
*More self-hosted AI guides: Best LLMs for 24GB GPUs · Best Hardware for Local LLMs · Ollama vs LM Studio vs llama.cpp · Multi-Agent Orchestration Guide*
*Disclosure: This article contains affiliate links. ToolHalla may earn a commission at no extra cost to you.*