OpenClaw + Ollama Production Config: Local AI Agents (2026)
Run OpenClaw agents on Ollama with GPU sizing, model routing, OLLAMA_NUM_PARALLEL tuning, health checks, cloud fallback, and failure-mode fixes.
In short: OpenClaw plus Ollama gives a self-hosted multi-agent stack with zero per-token costs and full privacy. Size hardware to your models (an RTX 3090 24GB runs Qwen 3.5 14B for 2-5 agents), set OLLAMA_NUM_PARALLEL so agents share a loaded model, and add health checks and auto-restart for reliability.
Running AI agents on someone else's infrastructure means sending every prompt, every document, and every decision through a third-party API. For teams that care about privacy or cost control, or that simply want to own their stack, self-hosting is the answer.
OpenClaw is an open-source AI agent framework that handles multi-agent orchestration, memory, tool use, and cross-platform messaging. Ollama is the simplest way to run LLMs locally. Together, they form a complete self-hosted AI agent platform with zero API costs.
This guide covers the production configuration: which models to run, how to size your hardware, how to set up multi-agent workflows, and how to keep everything running reliably.
Why Self-Host AI Agents?
Zero per-token costs. Once the hardware is paid for, there is no per-token charge. A team sending 10M tokens/month to Claude spends ~$30-150/month. Self-hosted, the only marginal cost is electricity.
Complete privacy. Prompts, documents, and agent reasoning never leave your network. Required for legal, medical, financial, and defense use cases.
No rate limits. Run as many concurrent agents as your hardware supports. No 429 errors, no queue slots, no throttling.
Full control. Choose your models, customize behavior, modify the framework, integrate with internal tools. No vendor lock-in.
Architecture Overview
┌─────────────────────────────────────────┐
│            OpenClaw Gateway             │
│   (agent orchestration, memory, MCP)    │
├─────────────────┬───────────────────────┤
│ Agent: Skald    │ Agent: Sleipnir       │
│ (content)       │ (research)            │
│ Model: Qwen 3.5 │ Model: Qwen 3.5 14B   │
├─────────────────┼───────────────────────┤
│ Agent: Völundr  │ Agent: Heimdall       │
│ (development)   │ (monitoring)          │
│ Model: Qwen 3.5 │ Model: Qwen 3.5 8B    │
└────────┬────────┴───────────┬───────────┘
         │                    │
    ┌────▼────┐          ┌────▼────┐
    │ Ollama  │          │  Tools  │
    │  (LLM)  │          │  (MCP)  │
    └─────────┘          └─────────┘
OpenClaw manages the agents. Ollama serves the models. MCP servers provide tools (filesystem, web, database, APIs). Each agent can use a different model — match model size to task complexity.
Hardware Sizing Guide
Minimum: Single Agent, 8B Model
| Component | Spec | Cost |
|---|---|---|
| GPU | RTX 3060 12GB | ~$180 (used) |
| CPU | Any modern quad-core | — |
| RAM | 16GB | — |
| Storage | 256GB NVMe | — |
Runs Qwen 3.5 8B at Q4_K_M (~30 tok/s). Good for a single personal assistant agent. Not enough for concurrent multi-agent workloads.
Recommended: Multi-Agent, 14B Model
| Component | Spec | Cost |
|---|---|---|
| GPU | RTX 3090 24GB | ~$600 (used) |
| CPU | Ryzen 7 / i7 (8+ cores) | ~$200 |
| RAM | 64GB DDR5 | ~$150 |
| Storage | 1TB NVMe | ~$80 |
Runs Qwen 3.5 14B at Q4_K_M with room for multiple agents sharing the model. The 24GB VRAM handles context windows up to 16K comfortably. This is the sweet spot for small teams (2-5 agents). See our 24GB GPU model guide for the full model lineup.
Production: Multi-Agent, 32B Model
| Component | Spec | Cost |
|---|---|---|
| GPU | RTX 4090 24GB | ~$1,200 |
| CPU | Ryzen 9 / i9 (12+ cores) | ~$350 |
| RAM | 128GB DDR5 | ~$300 |
| Storage | 2TB NVMe | ~$120 |
Runs Qwen 3.5 32B at Q4_K_M (~20 tok/s) or 14B at Q8_0 (near-perfect quality). Handles 5-10 concurrent agents with model sharing. For our BerserKI setup running 14 agents, we use this tier with model-level routing.
Cloud Alternative
No hardware? Vast.ai rents RTX 4090s from ~$0.20/hr. Deploy Ollama + OpenClaw on a cloud GPU instance. At 12 hours/day of usage, that's ~$75/month ($0.20 × 12 h × 30 days ≈ $72), still cheaper than most API setups at volume.
Step-by-Step Setup
1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:14b # Primary model
ollama pull qwen3:8b # Fast model for simple tasks
2. Install OpenClaw
npm install -g openclaw
openclaw init
3. Configure Models in OpenClaw
Edit your OpenClaw config to route agents to the right model:
# openclaw.yaml
models:
  default: ollama/qwen3:14b
  fast: ollama/qwen3:8b
agents:
  research:
    model: default   # 14B for complex research
    tools: [web_search, web_fetch, memory]
  assistant:
    model: fast      # 8B for quick responses
    tools: [memory, calendar]
  coding:
    model: default   # 14B for code generation
    tools: [exec, read, write, web_search]
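To make the alias indirection concrete, here is a minimal Python sketch of how `model: default` could resolve to a concrete model string. The dictionaries mirror the config above. `resolve_models` is a hypothetical helper for illustration only, not part of OpenClaw, and OpenClaw's own resolution logic may differ.

```python
# Mirrors the openclaw.yaml above as plain dicts (no YAML parser needed).
MODELS = {
    "default": "ollama/qwen3:14b",
    "fast": "ollama/qwen3:8b",
}
AGENTS = {
    "research": {"model": "default", "tools": ["web_search", "web_fetch", "memory"]},
    "assistant": {"model": "fast", "tools": ["memory", "calendar"]},
    "coding": {"model": "default", "tools": ["exec", "read", "write", "web_search"]},
}


def resolve_models(models: dict, agents: dict) -> dict:
    """Map each agent name to the concrete model its alias points at.

    Hypothetical helper: raises KeyError on an unknown alias so a typo
    fails at startup rather than on the first request.
    """
    resolved = {}
    for name, spec in agents.items():
        alias = spec["model"]
        if alias not in models:
            raise KeyError(f"agent {name!r} uses unknown model alias {alias!r}")
        resolved[name] = models[alias]
    return resolved


if __name__ == "__main__":
    for agent, model in resolve_models(MODELS, AGENTS).items():
        print(f"{agent}: {model}")
```

Because two of the three agents resolve to the same model string, they can share a single loaded copy in Ollama rather than each loading their own.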
4. Optimize Ollama for Multi-Agent
# Create systemd override
sudo systemctl edit ollama
# Add (systemd unit files need comments on their own lines):
[Service]
# Handle 4 concurrent requests
Environment="OLLAMA_NUM_PARALLEL=4"
# Keep 2 models in VRAM
Environment="OLLAMA_MAX_LOADED_MODELS=2"
# Better memory efficiency
Environment="OLLAMA_FLASH_ATTENTION=1"
# Don't unload models
Environment="OLLAMA_KEEP_ALIVE=24h"
sudo systemctl daemon-reload && sudo systemctl restart ollama
OLLAMA_NUM_PARALLEL=4 is the key setting: it lets up to four requests run against the same loaded model at once. Without it, Ollama may process one request at a time, so agents queue and wait for each other.
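One way to check whether parallelism is actually taking effect is to fire several requests at once and time them. The sketch below uses only the Python standard library and Ollama's `/api/generate` endpoint with `stream: false`. The model tag `qwen3:14b` comes from the install step, and `run_concurrent` is a hypothetical helper name.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default address


def ollama_generate(prompt: str, model: str = "qwen3:14b") -> str:
    """Send one non-streaming request to Ollama and return the reply text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["response"]


def run_concurrent(prompts, send=ollama_generate, workers=4):
    """Issue all prompts at once; return (replies in input order, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        replies = list(pool.map(send, prompts))
    return replies, time.perf_counter() - start


if __name__ == "__main__":
    # Requires a running Ollama instance with the model already pulled.
    replies, elapsed = run_concurrent(["Say hi."] * 4)
    print(f"4 requests finished in {elapsed:.1f}s")
```

Run it once with OLLAMA_NUM_PARALLEL=1 and once with 4. If the total time barely changes, the requests are still being served one after another.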
5. Set Up Reverse Proxy (Production)
For remote access, use nginx or Caddy:
# Install Caddy (simplest HTTPS)
sudo apt install caddy
# /etc/caddy/Caddyfile
ollama.yourdomain.com {
    reverse_proxy localhost:11434
    basicauth {
        admin $2a$14$... # bcrypt hash of your password
    }
}
sudo systemctl restart caddy
For the cheapest hosting layer, a Hostinger VPS (~$5/month) handles the reverse proxy and SSL termination if your Ollama server is on a home network behind NAT.
Model Selection for Multi-Agent Systems
Not every agent needs the same model. Match model size to task:
| Agent Role | Recommended Model | Why |
|---|---|---|
| Research / analysis | Qwen 3.5 14B Q4 | Needs reasoning depth |
| Code generation | Qwen 2.5 Coder 14B | Still best for code |
| Quick chat / triage | Qwen 3.5 8B Q4 | Speed over depth |
| Content writing | Qwen 3.5 14B Q4 | Quality matters |
| Monitoring / alerts | Qwen 3.5 8B Q3 | Minimal, fast |
| Complex reasoning | Qwen 3.5 14B + /think | Thinking mode for hard problems |
Routing simple tasks to a smaller model keeps VRAM use down: an 8B model uses ~5GB versus ~9.5GB for 14B. Loading both takes ~14.5GB before KV cache, which fits well within a 24GB GPU.
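The VRAM arithmetic can be sketched in a few lines of Python. The sizes are the approximate figures quoted in this guide. `vram_budget` and its `reserve_gb` parameter are hypothetical names, and the reserve for KV cache and runtime overhead is an assumption you should replace with what `nvidia-smi` actually reports.

```python
# Approximate weight sizes quoted in this guide (GB, Q4 quantization).
MODEL_VRAM_GB = {"qwen 8B": 5.0, "qwen 14B": 9.5}


def vram_budget(loaded, total_gb, reserve_gb=0.0):
    """Return (used_gb, free_gb) after loading the named models.

    reserve_gb is an assumed allowance for KV cache and runtime overhead.
    """
    used = sum(MODEL_VRAM_GB[name] for name in loaded) + reserve_gb
    return used, total_gb - used


if __name__ == "__main__":
    used, free = vram_budget(["qwen 8B", "qwen 14B"], total_gb=24)
    print(f"weights: {used:.1f} GB, left for KV cache: {free:.1f} GB")
```

Whatever is left over is the budget for KV cache, which grows with both context length and the number of parallel requests.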
Monitoring and Reliability
Health Checks
# Check Ollama is running
curl -s http://localhost:11434/api/tags | jq '.models[].name'
# Check loaded models
ollama ps
# Monitor GPU utilization
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 5
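For unattended checks (cron, a systemd timer, or an external monitor), the same `/api/tags` probe can be scripted so that it exits non-zero when Ollama is down or a required model is missing. This is a minimal standard-library sketch. `missing_models` and `check_ollama` are hypothetical helper names, and the required model tags are the ones pulled during installation.

```python
import json
import urllib.request

TAGS_URL = "http://localhost:11434/api/tags"  # same endpoint as the curl check


def missing_models(tags_payload: dict, required: list) -> list:
    """Return required model names absent from an /api/tags response."""
    available = {m["name"] for m in tags_payload.get("models", [])}
    return [name for name in required if name not in available]


def check_ollama(required, url=TAGS_URL, timeout=5):
    """Return (healthy, detail). Never raises on connection problems."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError) as exc:  # URLError subclasses OSError
        return False, f"Ollama unreachable: {exc}"
    missing = missing_models(payload, required)
    if missing:
        return False, f"missing models: {', '.join(missing)}"
    return True, "ok"


if __name__ == "__main__":
    import sys

    healthy, detail = check_ollama(["qwen3:14b", "qwen3:8b"])
    print(detail)
    sys.exit(0 if healthy else 1)
```

The exit code makes the script usable from any scheduler or alerting tool that treats a non-zero status as a failure.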
Auto-Restart on Failure
# Ollama already runs as systemd service with restart
sudo systemctl status ollama
# For OpenClaw, add a systemd service:
sudo tee /etc/systemd/system/openclaw.service << EOF
[Unit]
Description=OpenClaw AI Agent Gateway
After=ollama.service
[Service]
ExecStart=/usr/bin/openclaw start
Restart=always
RestartSec=10
User=openclaw
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable openclaw && sudo systemctl start openclaw
Cost Tracking
Self-hosted doesn't mean free — you pay for electricity and hardware depreciation. Track your actual costs:
| Cost | Monthly | Annual |
|---|---|---|
| Electricity (RTX 3090, 12h/day) | ~$15 | ~$180 |
| Hardware depreciation (3yr) | ~$30 | ~$360 |
| VPS for proxy (optional) | ~$5 | ~$60 |
| Total | ~$50 | ~$600 |
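Compare with API costs for equivalent usage (10M tokens/month on Claude): ~$30-150/month, or $3-15 per million tokens. Against the ~$50/month total above, self-hosting breaks even at ~5M tokens/month if you pay about $10 per million tokens. Across the full price range, break-even falls between roughly 3M and 17M tokens/month, and savings grow above that.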
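The break-even point depends on which API price you compare against. Here is a minimal sketch of the arithmetic, using the ~$50/month total from the table and the $30-150 per 10M tokens range quoted above. `break_even_tokens` is a hypothetical helper name.

```python
def break_even_tokens(monthly_fixed_usd: float, api_usd_per_million: float) -> float:
    """Tokens/month (in millions) at which API spend equals self-hosting cost."""
    if api_usd_per_million <= 0:
        raise ValueError("API price must be positive")
    return monthly_fixed_usd / api_usd_per_million


if __name__ == "__main__":
    SELF_HOSTED_USD = 50.0  # monthly total from the cost table
    # $30-150 per 10M tokens implies $3-15 per million tokens.
    for price in (3.0, 10.0, 15.0):
        millions = break_even_tokens(SELF_HOSTED_USD, price)
        print(f"${price:.0f}/M tokens: break-even at {millions:.1f}M tokens/month")
```

At $10 per million tokens the function returns 5.0, which matches the ~5M tokens/month figure. Cheaper API pricing pushes the break-even volume higher.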
Common Issues and Fixes
Ollama OOM (out of memory): Reduce OLLAMA_NUM_PARALLEL or use a smaller model. Check nvidia-smi for actual VRAM usage.
Slow response with multiple agents: Agents are queuing. Increase OLLAMA_NUM_PARALLEL or add a second GPU.
Model not loading: Check ollama ps and journalctl -u ollama. Common cause: insufficient VRAM for the model + KV cache at your context length.
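OpenClaw can't reach Ollama: Verify Ollama is listening on the right interface. By default it binds to localhost only (127.0.0.1:11434); set OLLAMA_HOST=0.0.0.0 to accept connections from other machines or containers. For Docker setups, use host.docker.internal:11434.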
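The OOM and model-loading issues above usually come down to KV cache, which grows with context length and with the number of parallel slots. Below is a rough estimate using the standard per-token formula (one key and one value vector per layer and KV head). The architecture numbers in the example are illustrative assumptions, so check your model card. `bytes_per_value=2` assumes a 16-bit cache.

```python
def kv_cache_gib(layers, kv_heads, head_dim, context_tokens,
                 parallel=1, bytes_per_value=2):
    """Estimate KV cache size in GiB.

    Per token, each layer stores one key and one value vector per KV head
    (hence the factor 2). Context is allocated for every parallel slot,
    so the total scales with `parallel`.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * parallel / 2**30


if __name__ == "__main__":
    # Illustrative architecture values (assumed, not taken from this guide).
    for parallel in (1, 2, 4):
        gib = kv_cache_gib(layers=40, kv_heads=8, head_dim=128,
                           context_tokens=16_384, parallel=parallel)
        print(f"OLLAMA_NUM_PARALLEL={parallel}: ~{gib:.1f} GiB KV cache")
```

With these assumed values, a 16K context costs about 2.5 GiB per slot, so four slots need about 10 GiB on top of the model weights. That is why lowering OLLAMA_NUM_PARALLEL or the context length is the first thing to try when a 24GB card runs out of memory.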
FAQ
Q: How many agents can I run on one GPU?
A: With model sharing (all agents use the same loaded model), 5-10 agents work well on an RTX 3090 with OLLAMA_NUM_PARALLEL=4. Agents don't each load a separate copy: they share one model in VRAM, up to four requests run at once, and the rest wait in the queue.
Q: Can I mix cloud and local models in OpenClaw?
A: Yes. Configure some agents to use Ollama (local) and others to use OpenAI/Anthropic APIs. Use local for high-volume, low-stakes tasks and cloud for complex reasoning where frontier model quality matters.
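A local-first policy with cloud fallback can be sketched independently of any particular client. In the minimal example below, both backends are plain callables. `generate_with_fallback` is a hypothetical helper, not an OpenClaw API, and the retry count is an arbitrary assumption.

```python
from typing import Callable, Tuple

Generate = Callable[[str], str]


def generate_with_fallback(
    prompt: str, local: Generate, cloud: Generate, retries: int = 1
) -> Tuple[str, str]:
    """Try the local backend first; fall back to cloud if it keeps failing.

    Returns (backend_used, reply). Both backends are plain callables so any
    client (an Ollama HTTP call, a vendor SDK) can be plugged in.
    """
    for _ in range(retries + 1):
        try:
            return "local", local(prompt)
        except Exception:  # connection refused, timeout, server error, ...
            continue
    return "cloud", cloud(prompt)


if __name__ == "__main__":
    def offline(prompt: str) -> str:
        raise ConnectionError("Ollama is down")

    backend, reply = generate_with_fallback(
        "ping", local=offline, cloud=lambda p: f"cloud reply to {p!r}"
    )
    print(backend, reply)
```

Keep in mind that any fallback sends the prompt off your network, so agents that handle private data may be better off failing than falling back.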
Q: Do I need to run OpenClaw and Ollama on the same machine?
A: No. Ollama can run on a GPU server while OpenClaw runs anywhere that can reach Ollama's API. Connect them over your local network, Tailscale, or WireGuard.
Q: Is OpenClaw production-ready?
A: OpenClaw is actively developed and used in production by multiple teams (including us at BerserKI with 14 agents). It's stable for multi-agent orchestration. Check the OpenClaw GitHub for the latest release.
Q: What's the cheapest possible self-hosted AI agent setup?
A: A used RTX 3060 12GB (~$180) + any modern PC running Ollama + OpenClaw. Total: ~$300-400. Runs Qwen 3.5 8B at ~30 tok/s — enough for a personal AI assistant running 24/7. For more options, see our best hardware for local LLMs guide.
Q: What should OLLAMA_NUM_PARALLEL be for local agents?
A: Start at 2 on 12GB GPUs and 4 on 24GB GPUs, then watch VRAM and latency with ollama ps and nvidia-smi. Higher values only help if the model plus KV cache still fits in memory. If requests get slower or fail with OOM, lower parallelism before changing models.
Q: When should I use a cloud GPU for OpenClaw instead of local hardware?
A: Use Vast.ai for short benchmark runs, temporary 32B/70B testing, or a team pilot before buying a 24GB card. Keep steady private workloads local, especially when agent tools can read source code, customer data, or internal documents.
Q: Which model should I start with for a small OpenClaw deployment?
A: Start with Qwen 3.5 8B for triage and assistant tasks, then add Qwen 2.5 Coder 14B or Qwen 3.5 14B for coding and research agents. The Qwen 3.5 vs 2.5 comparison explains when to keep the older model as the stable fallback.
*More self-hosted AI guides: Best LLMs for 24GB GPUs · Best Hardware for Local LLMs · Ollama vs LM Studio vs llama.cpp · Multi-Agent Orchestration Guide*
*Disclosure: This article contains affiliate links. ToolHalla may earn a commission at no extra cost to you.*