
OpenClaw + Ollama: 2026 Self-Hosted AI Agents

Zero cloud costs, full privacy. Learn to run AI agents locally with OpenClaw & Ollama. Hardware, tuning, models, and failure modes covered.

March 9, 2026·5 min read·1,097 words

Running AI agents on someone else's infrastructure means sending every prompt, every document, and every decision through a third-party API. For teams that care about privacy or cost control, or that simply want to own their stack, self-hosting is the answer.

OpenClaw is an open-source AI agent framework that handles multi-agent orchestration, memory, tool use, and cross-platform messaging. Ollama is the simplest way to run LLMs locally. Together, they form a complete self-hosted AI agent platform with zero API costs.

This guide covers the production configuration: which models to run, how to size your hardware, how to set up multi-agent workflows, and how to keep everything running reliably.

Why Self-Host AI Agents?

Zero per-token costs. Once the hardware is paid for, there is no per-token charge; what remains is electricity and depreciation. A team sending 10M tokens/month to Claude spends ~$30-150/month. Self-hosted: $0 marginal cost per token.

Complete privacy. Prompts, documents, and agent reasoning never leave your network. Required for legal, medical, financial, and defense use cases.

No rate limits. Run as many concurrent agents as your hardware supports. No 429 errors, no queue slots, no throttling.

Full control. Choose your models, customize behavior, modify the framework, integrate with internal tools. No vendor lock-in.

Architecture Overview


┌─────────────────────────────────────────┐
│          OpenClaw Gateway               │
│  (agent orchestration, memory, MCP)     │
├─────────────────┬───────────────────────┤
│  Agent: Skald   │  Agent: Sleipnir      │
│  (content)      │  (research)           │
│  Model: Qwen 3.5│  Model: Qwen 3.5 14B  │
├─────────────────┼───────────────────────┤
│  Agent: Völundr │  Agent: Heimdall      │
│  (development)  │  (monitoring)         │
│  Model: Qwen 3.5│  Model: Qwen 3.5 8B   │
└────────┬────────┴───────────┬───────────┘
         │                    │
    ┌────▼────┐          ┌────▼────┐
    │ Ollama  │          │  Tools  │
    │ (LLM)   │          │  (MCP)  │
    └─────────┘          └─────────┘

OpenClaw manages the agents. Ollama serves the models. MCP servers provide tools (filesystem, web, database, APIs). Each agent can use a different model — match model size to task complexity.

Hardware Sizing Guide

Minimum: Single Agent, 8B Model

Component	Spec	Cost
GPU	RTX 3060 12GB	~$180 (used)
CPU	Any modern quad-core	—
RAM	16GB	—
Storage	256GB NVMe	—

Runs Qwen 3.5 8B at Q4_K_M (~30 tok/s). Good for a single personal assistant agent. Not enough for concurrent multi-agent workloads.

Recommended: Multi-Agent, 14B Model

Component	Spec	Cost
GPU	RTX 3090 24GB	~$600 (used)
CPU	Ryzen 7 / i7 (8+ cores)	~$200
RAM	64GB DDR5	~$150
Storage	1TB NVMe	~$80

Runs Qwen 3.5 14B at Q4_K_M with room for multiple agents sharing the model. The 24GB VRAM handles context windows up to 16K comfortably. This is the sweet spot for small teams (2-5 agents). See our 24GB GPU model guide for the full model lineup.
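To make the link between context length and VRAM concrete, here is a rough sketch of the usual KV-cache size formula (one key vector and one value vector per layer, per token). The layer count, KV-head count, and head dimension in the example are illustrative placeholders, not confirmed figures for any Qwen model; take the real values from the model card before relying on the result.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size in bytes.

    The leading 2 accounts for storing both a key and a value per layer per token.
    bytes_per_elem=2 assumes a 16-bit cache; a quantized cache would be smaller.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def to_gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 1024**3


# Hypothetical architecture numbers, for illustration only.
size = kv_cache_bytes(n_tokens=16_384, n_layers=40, n_kv_heads=8, head_dim=128)
print(f"KV cache at 16K context: {to_gib(size):.2f} GiB")
```

The cache grows linearly with context length, so doubling the context doubles this figure on top of the model weights.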

Production: Multi-Agent, 32B Model

Component	Spec	Cost
GPU	RTX 4090 24GB	~$1,200
CPU	Ryzen 9 / i9 (12+ cores)	~$350
RAM	128GB DDR5	~$300
Storage	2TB NVMe	~$120

Runs Qwen 3.5 32B at Q4_K_M (~20 tok/s) or 14B at Q8_0 (near-perfect quality). Handles 5-10 concurrent agents with model sharing. For our BerserKI setup running 14 agents, we use this tier with model-level routing.

Cloud Alternative

No hardware? Vast.ai rents RTX 4090s from ~$0.20/hr. Deploy Ollama + OpenClaw on a cloud GPU instance. At 12 hours/day usage, that's ~$75/month — still cheaper than most API setups at volume.
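The rental estimate is simple arithmetic; a small sketch makes it easy to plug in your own rate and usage. The 30-day month is an assumption, and the hourly rate is the "from" price quoted above, so real listings may cost more.

```python
def monthly_rental_cost(rate_per_hour, hours_per_day, days_per_month=30):
    """Estimated monthly cost of renting a GPU instance for part of each day."""
    return rate_per_hour * hours_per_day * days_per_month


cost = monthly_rental_cost(rate_per_hour=0.20, hours_per_day=12)
print(f"~${cost:.0f}/month")  # about $72 with these inputs, in line with the ~$75 above
```

Running the instance 24 hours a day at the same rate doubles the figure, so shutting the instance down when idle matters as much as the hourly price.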

Step-by-Step Setup

1. Install Ollama


curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:14b      # Primary model
ollama pull qwen3:8b        # Fast model for simple tasks
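Once the models are pulled, a quick smoke test against Ollama's HTTP API confirms that the server answers before any agent framework is involved. This is a minimal sketch using only the Python standard library; it assumes Ollama is on its default address and uses the `qwen3:8b` tag pulled above. The helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default listen address


def build_generate_request(model, prompt, base_url=OLLAMA_URL):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the request and return the generated text (needs a running Ollama server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("qwen3:8b", "Reply with one word: ready")` should return a short string; a connection error at this point means Ollama itself is not reachable.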

2. Install OpenClaw


npm install -g openclaw
openclaw init

3. Configure Models in OpenClaw

Edit your OpenClaw config to route agents to the right model:


# openclaw.yaml
models:
  default: ollama/qwen3:14b
  fast: ollama/qwen3:8b

agents:
  research:
    model: default          # 14B for complex research
    tools: [web_search, web_fetch, memory]
  assistant:
    model: fast             # 8B for quick responses
    tools: [memory, calendar]
  coding:
    model: default          # 14B for code generation
    tools: [exec, read, write, web_search]
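The alias indirection in this config (agents name `default` or `fast`, and the `models` section maps those to concrete tags) can be sketched in a few lines. This is not OpenClaw's actual resolution code; it is a hypothetical illustration of the lookup, with the config mirrored as a plain dictionary.

```python
CONFIG = {
    "models": {"default": "ollama/qwen3:14b", "fast": "ollama/qwen3:8b"},
    "agents": {
        "research": {"model": "default", "tools": ["web_search", "web_fetch", "memory"]},
        "assistant": {"model": "fast", "tools": ["memory", "calendar"]},
        "coding": {"model": "default", "tools": ["exec", "read", "write", "web_search"]},
    },
}


def resolve_model(config, agent_name):
    """Return (provider, model_tag) for an agent by following its model alias."""
    alias = config["agents"][agent_name]["model"]
    # Fall back to the raw value so a literal "provider/tag" also works.
    target = config["models"].get(alias, alias)
    provider, _, tag = target.partition("/")
    return provider, tag


for name in CONFIG["agents"]:
    print(name, "->", resolve_model(CONFIG, name))
```

Keeping aliases in one place means swapping the 14B model for another tag is a one-line change instead of an edit per agent.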

4. Optimize Ollama for Multi-Agent


# Create systemd override
sudo systemctl edit ollama

# Add the following. systemd does not treat a trailing "#" on an
# Environment= line as a comment, so each note sits on its own line:
[Service]
# Handle 4 concurrent requests
Environment="OLLAMA_NUM_PARALLEL=4"
# Keep 2 models in VRAM
Environment="OLLAMA_MAX_LOADED_MODELS=2"
# Better memory efficiency
Environment="OLLAMA_FLASH_ATTENTION=1"
# Don't unload models
Environment="OLLAMA_KEEP_ALIVE=24h"

sudo systemctl daemon-reload && sudo systemctl restart ollama

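OLLAMA_NUM_PARALLEL=4 is the key setting: it lets up to four requests run against the same loaded model at once. With only one slot, agents queue and wait. The trade-off is memory, because Ollama allocates context for every parallel slot, so higher values increase VRAM use.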
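The effect of the slot count on waiting time can be approximated with a simple wave model: requests are served in groups the size of the slot count. This is an idealised sketch that assumes every request takes the same time and that running requests in parallel does not slow each one down, which real GPUs only approximate.

```python
import math


def estimated_completion_s(n_requests, parallel_slots, seconds_per_request):
    """Idealised finish time when requests are served in waves of `parallel_slots`."""
    waves = math.ceil(n_requests / parallel_slots)
    return waves * seconds_per_request


for slots in (1, 4):
    total = estimated_completion_s(n_requests=10, parallel_slots=slots, seconds_per_request=5)
    print(f"{slots} slot(s): {total} s for 10 requests")
```

In practice per-request throughput drops as slots fill, so treat the result as a lower bound on latency, not a prediction.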

5. Set Up Reverse Proxy (Production)

For remote access, use nginx or Caddy:


# Install Caddy (simplest HTTPS)
sudo apt install caddy

# /etc/caddy/Caddyfile
ollama.yourdomain.com {
    reverse_proxy localhost:11434
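    # Caddy 2.8 and later prefer the spelling basic_auth for this directive
    basicauth {
        # bcrypt hash of your password (generate with: caddy hash-password)
        admin $2a$14$...
    }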
}

sudo systemctl restart caddy

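For the cheapest hosting layer, a Hostinger VPS (~$5/month) can handle the reverse proxy and SSL termination if your Ollama server is on a home network behind NAT. The VPS then needs a private route back to the home machine, such as a WireGuard or Tailscale tunnel, and reverse_proxy must point at that tunnel address instead of localhost.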
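Clients that call Ollama through the proxy have to send the Basic auth credentials with every request. A minimal sketch of building that header with the standard library follows; the username, password, and helper names are placeholders, and the hostname is the example domain from the Caddyfile.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Return the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_tags_request(base_url, username, password):
    """Build an authenticated GET request for Ollama's /api/tags behind the proxy."""
    request = urllib.request.Request(f"{base_url}/api/tags")
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


# Placeholder credentials; the plaintext password is what the bcrypt hash was made from.
req = build_tags_request("https://ollama.yourdomain.com", "admin", "secret")
print(req.get_header("Authorization"))
```

Basic auth only encodes the credentials and does not encrypt them, which is why the HTTPS termination in front of it matters.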

Model Selection for Multi-Agent Systems

Not every agent needs the same model. Match model size to task:

Agent Role	Recommended Model	Why
Research / analysis	Qwen 3.5 14B Q4	Needs reasoning depth
Code generation	Qwen 2.5 Coder 14B	Still best for code
Quick chat / triage	Qwen 3.5 8B Q4	Speed over depth
Content writing	Qwen 3.5 14B Q4	Quality matters
Monitoring / alerts	Qwen 3.5 8B Q3	Minimal, fast
Complex reasoning	Qwen 3.5 14B + /think	Thinking mode for hard problems

Using multiple model sizes saves VRAM. An 8B model uses ~5GB vs 14B's ~9.5GB. Loading both costs ~15GB total — well within a 24GB GPU.
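The same budget check can be written down so it is easy to re-run when the model mix changes. This sketch uses the approximate sizes quoted above; the headroom value reserved for KV cache and runtime overhead is an assumption, not a measured figure.

```python
def fits_in_vram(model_sizes_gb, vram_gb, headroom_gb=0.0):
    """Return (total_model_gb, fits) for a set of models loaded at the same time."""
    total = sum(model_sizes_gb.values())
    return total, total + headroom_gb <= vram_gb


sizes = {"8B Q4": 5.0, "14B Q4": 9.5}  # approximate sizes from the text above
total, ok = fits_in_vram(sizes, vram_gb=24, headroom_gb=4.0)  # headroom is a guess
print(f"models: {total} GB, fits in 24 GB with headroom: {ok}")
```

Running the same check with `vram_gb=12` shows why the minimum tier cannot hold both models at once.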

Monitoring and Reliability

Health Checks


# Check Ollama is running
curl -s http://localhost:11434/api/tags | jq '.models[].name'

# Check loaded models
ollama ps

# Monitor GPU utilization
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 5
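For scripted monitoring, the same `/api/tags` check can be done in Python and compared against the models the agents expect. A minimal sketch, assuming the default Ollama address; the function names are illustrative, and the payload shape is the one the `jq` filter above reads.

```python
import json
import urllib.request


def model_names(tags_payload):
    """Extract model names from an /api/tags response body."""
    return [model["name"] for model in tags_payload.get("models", [])]


def missing_models(tags_payload, required):
    """Return the required model tags that are not installed, sorted."""
    return sorted(set(required) - set(model_names(tags_payload)))


def fetch_tags(base_url="http://localhost:11434", timeout=5):
    """Fetch the installed-model list (needs a running Ollama server)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.loads(resp.read())
```

A cron job or alerting agent could call `missing_models(fetch_tags(), ["qwen3:14b", "qwen3:8b"])` and raise an alert when the list is not empty.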

Auto-Restart on Failure


# Ollama already runs as systemd service with restart
sudo systemctl status ollama

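# For OpenClaw, add a systemd service. Create the openclaw user first, and
# adjust ExecStart if `which openclaw` reports a different path:
sudo tee /etc/systemd/system/openclaw.service > /dev/null << 'EOF'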
[Unit]
Description=OpenClaw AI Agent Gateway
After=ollama.service

[Service]
ExecStart=/usr/bin/openclaw start
Restart=always
RestartSec=10
User=openclaw

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable openclaw && sudo systemctl start openclaw
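Note that `After=ollama.service` only orders the start of the two units; it does not guarantee that Ollama's API is already accepting requests. A client-side readiness loop is one way to cover that gap. The sketch below is a generic retry helper, not part of OpenClaw; the probe is any callable that returns True once the server responds.

```python
import time


def wait_until_ready(probe, attempts=5, delay_s=2.0, sleep=time.sleep):
    """Call probe() until it returns True or attempts run out.

    Returns True on success and False if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return True
        except OSError:
            pass  # connection refused or timed out: treat as "not ready yet"
        if attempt < attempts:
            sleep(delay_s)
    return False
```

A probe could be as small as a function that opens `http://localhost:11434/api/tags` with `urllib.request.urlopen` and returns True when the call succeeds; `urllib` connection errors derive from `OSError`, so they count as "not ready".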

Cost Tracking

Self-hosted doesn't mean free — you pay for electricity and hardware depreciation. Track your actual costs:

Cost	Monthly	Annual
Electricity (RTX 3090, 12h/day)	~$15	~$180
Hardware depreciation (3yr)	~$30	~$360
VPS for proxy (optional)	~$5	~$60
Total	~$50	~$600

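Compare with API costs for equivalent usage (10M tokens/month on Claude): ~$30-150/month, or roughly $3-15 per million tokens. Against the ~$50/month total above, self-hosting breaks even somewhere between ~3M and ~17M tokens/month depending on which end of that price range applies, and saves significantly above that.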
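The break-even volume follows directly from the table's ~$50/month total and the $30-150 per 10M tokens range quoted above. A short sketch of the arithmetic, so the inputs can be replaced with your own electricity and API prices:

```python
def break_even_million_tokens(self_hosted_monthly_usd, api_usd_per_million_tokens):
    """Monthly volume (in millions of tokens) at which API spend equals self-hosted cost."""
    return self_hosted_monthly_usd / api_usd_per_million_tokens


SELF_HOSTED_MONTHLY = 50.0   # total from the cost table
LOW_PRICE = 30 / 10          # $3 per 1M tokens (bottom of the quoted range)
HIGH_PRICE = 150 / 10        # $15 per 1M tokens (top of the quoted range)

print(f"at $15/M: {break_even_million_tokens(SELF_HOSTED_MONTHLY, HIGH_PRICE):.1f}M tokens/month")
print(f"at $3/M:  {break_even_million_tokens(SELF_HOSTED_MONTHLY, LOW_PRICE):.1f}M tokens/month")
```

The spread between the two results shows how sensitive the break-even point is to the API price being replaced.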

Common Issues and Fixes

Ollama OOM (out of memory): Reduce OLLAMA_NUM_PARALLEL or use a smaller model. Check nvidia-smi for actual VRAM usage.

Slow response with multiple agents: Agents are queuing. Increase OLLAMA_NUM_PARALLEL or add a second GPU.

Model not loading: Check ollama ps and journalctl -u ollama. Common cause: insufficient VRAM for the model + KV cache at your context length.

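OpenClaw can't reach Ollama: Verify Ollama is listening on the right interface. Default is localhost:11434. For Docker setups, use host.docker.internal:11434. On Linux this also requires starting the container with --add-host=host.docker.internal:host-gateway and setting OLLAMA_HOST=0.0.0.0 on the Ollama service so it accepts connections from outside the loopback interface.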

FAQ

Q: How many agents can I run on one GPU?

A: With model sharing (all agents use the same loaded model), 5-10 agents work well on an RTX 3090 with OLLAMA_NUM_PARALLEL=4. Each agent doesn't load a separate copy. They share the model in VRAM: up to four requests run at the same time, and the rest wait in the queue.

Q: Can I mix cloud and local models in OpenClaw?

A: Yes. Configure some agents to use Ollama (local) and others to use OpenAI/Anthropic APIs. Use local for high-volume, low-stakes tasks and cloud for complex reasoning where frontier model quality matters.

Q: Do I need to run OpenClaw and Ollama on the same machine?

A: No. Ollama can run on a GPU server while OpenClaw runs anywhere that can reach Ollama's API. Connect them over your local network, Tailscale, or WireGuard.

Q: Is OpenClaw production-ready?

A: OpenClaw is actively developed and used in production by multiple teams (including us at BerserKI with 14 agents). It's stable for multi-agent orchestration. Check the OpenClaw GitHub for the latest release.

Q: What's the cheapest possible self-hosted AI agent setup?

A: A used RTX 3060 12GB (~$180) + any modern PC running Ollama + OpenClaw. Total: ~$300-400. Runs Qwen 3.5 8B at ~30 tok/s — enough for a personal AI assistant running 24/7. For more options, see our best hardware for local LLMs guide.

*More self-hosted AI guides: Best LLMs for 24GB GPUs · Best Hardware for Local LLMs · Ollama vs LM Studio vs llama.cpp · Multi-Agent Orchestration Guide*

*Disclosure: This article contains affiliate links. ToolHalla may earn a commission at no extra cost to you.*
