
OpenClaw + Ollama: 2026 Self-Hosted AI Agents

Zero cloud costs, full privacy. Learn to run AI agents locally with OpenClaw & Ollama. Hardware, tuning, models, and failure modes covered.

March 9, 2026

Running AI agents on someone else's infrastructure means sending every prompt, every document, and every decision through a third-party API. For teams that care about privacy, cost control, or simply owning their stack, self-hosting is the answer.

OpenClaw is an open-source AI agent framework that handles multi-agent orchestration, memory, tool use, and cross-platform messaging. Ollama is the simplest way to run LLMs locally. Together, they form a complete self-hosted AI agent platform with zero API costs.

This guide covers the production configuration: which models to run, how to size your hardware, how to set up multi-agent workflows, and how to keep everything running reliably.

Why Self-Host AI Agents?

Zero per-token costs. After hardware, every token is free. A team sending 10M tokens/month to Claude spends ~$30-150/month. Self-hosted: $0 marginal cost.

Complete privacy. Prompts, documents, and agent reasoning never leave your network. Required for legal, medical, financial, and defense use cases.

No rate limits. Run as many concurrent agents as your hardware supports. No 429 errors, no queue slots, no throttling.

Full control. Choose your models, customize behavior, modify the framework, integrate with internal tools. No vendor lock-in.

Architecture Overview


┌─────────────────────────────────────────┐
│          OpenClaw Gateway               │
│  (agent orchestration, memory, MCP)     │
├─────────────────┬───────────────────────┤
│  Agent: Skald   │  Agent: Sleipnir      │
│  (content)      │  (research)           │
│  Model: Qwen 3.5│  Model: Qwen 3.5 14B  │
├─────────────────┼───────────────────────┤
│  Agent: Völundr │  Agent: Heimdall      │
│  (development)  │  (monitoring)         │
│  Model: Qwen 3.5│  Model: Qwen 3.5 8B   │
└────────┬────────┴───────────┬───────────┘
         │                    │
    ┌────▼────┐          ┌────▼────┐
    │ Ollama  │          │  Tools  │
    │ (LLM)   │          │  (MCP)  │
    └─────────┘          └─────────┘

OpenClaw manages the agents. Ollama serves the models. MCP servers provide tools (filesystem, web, database, APIs). Each agent can use a different model — match model size to task complexity.
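Under the hood, every agent turn is an HTTP call to Ollama. A minimal sketch of the request an agent runtime sends to Ollama's /api/generate endpoint (model tag and prompt are illustrative):

```shell
# Build the JSON body an agent runtime would POST to Ollama's /api/generate.
# Model tag and prompt are illustrative, not OpenClaw internals.
model="qwen3:8b"
payload=$(printf '{"model":"%s","prompt":"%s","stream":false}' \
  "$model" "Summarize the last 24h of alerts")
echo "$payload"
# With Ollama running locally:
# curl -s http://localhost:11434/api/generate -d "$payload"
```

Because every agent speaks this same API, swapping a model means changing one tag in the config, not rewriting the agent.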

Hardware Sizing Guide

Minimum: Single Agent, 8B Model

| Component | Spec | Cost |
|-----------|------|------|
| GPU | RTX 3060 12GB | ~$180 (used) |
| CPU | Any modern quad-core | |
| RAM | 16GB | |
| Storage | 256GB NVMe | |

Runs Qwen 3.5 8B at Q4_K_M (~30 tok/s). Good for a single personal assistant agent. Not enough for concurrent multi-agent workloads.

Recommended: Small Team, 14B Model

| Component | Spec | Cost |
|-----------|------|------|
| GPU | RTX 3090 24GB | ~$600 (used) |
| CPU | Ryzen 7 / i7 (8+ cores) | ~$200 |
| RAM | 64GB DDR5 | ~$150 |
| Storage | 1TB NVMe | ~$80 |

Runs Qwen 3.5 14B at Q4_K_M with room for multiple agents sharing the model. The 24GB VRAM handles context windows up to 16K comfortably. This is the sweet spot for small teams (2-5 agents). See our 24GB GPU model guide for the full model lineup.

Production: Multi-Agent, 32B Model

| Component | Spec | Cost |
|-----------|------|------|
| GPU | RTX 4090 24GB | ~$1,200 |
| CPU | Ryzen 9 / i9 (12+ cores) | ~$350 |
| RAM | 128GB DDR5 | ~$300 |
| Storage | 2TB NVMe | ~$120 |

Runs Qwen 3.5 32B at Q4_K_M (~20 tok/s) or 14B at Q8_0 (near-perfect quality). Handles 5-10 concurrent agents with model sharing. For our BerserKI setup running 14 agents, we use this tier with model-level routing.

Cloud Alternative

No hardware? Vast.ai rents RTX 4090s from ~$0.20/hr. Deploy Ollama + OpenClaw on a cloud GPU instance. At 12 hours/day usage, that's ~$75/month — still cheaper than most API setups at volume.
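That figure is straightforward arithmetic (rates from above; real bills add a little for storage and egress):

```shell
# Rented-GPU cost estimate: dollars/hr * hours/day * days/month.
monthly=$(awk 'BEGIN { printf "%.0f", 0.20 * 12 * 30 }')
echo "~\$${monthly}/month"
# Roughly the ~$75/month cited above once storage is included.
```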

Step-by-Step Setup

1. Install Ollama


curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:14b   # Primary model
ollama pull qwen3:8b    # Fast model for simple tasks

2. Install OpenClaw


npm install -g openclaw
openclaw init

3. Configure Models in OpenClaw

Edit your OpenClaw config to route agents to the right model:


# openclaw.yaml
models:
  default: ollama/qwen3:14b
  fast: ollama/qwen3:8b

agents:
  research:
    model: default          # 14B for complex research
    tools: [web_search, web_fetch, memory]
  assistant:
    model: fast             # 8B for quick responses
    tools: [memory, calendar]
  coding:
    model: default          # 14B for code generation
    tools: [exec, read, write, web_search]
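Before starting the gateway, it's worth checking that every model the config references has actually been pulled. A small sketch (the check_models helper is hypothetical, not part of OpenClaw):

```shell
# Sketch: warn if a model referenced in openclaw.yaml is not pulled yet.
# check_models takes the installed-tag list first, then the tags to verify.
check_models() {
  installed="$1"; shift
  for tag in "$@"; do
    case "$installed" in
      *"$tag"*) echo "ok: $tag" ;;
      *) echo "missing: $tag (run: ollama pull $tag)" ;;
    esac
  done
}
# Real use: check_models "$(ollama list)" qwen3:14b qwen3:8b
result=$(check_models "qwen3:14b" qwen3:14b qwen3:8b)
echo "$result"
```

Wiring this into a startup script catches a missing pull before agents start timing out against an absent model.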

4. Optimize Ollama for Multi-Agent


# Create systemd override
sudo systemctl edit ollama

# Add:
[Service]
# Allow 4 concurrent requests instead of queueing them
Environment="OLLAMA_NUM_PARALLEL=4"
# Keep up to 2 models resident in VRAM
Environment="OLLAMA_MAX_LOADED_MODELS=2"
# Flash attention reduces KV-cache memory use
Environment="OLLAMA_FLASH_ATTENTION=1"
# Keep models loaded instead of unloading after idle
Environment="OLLAMA_KEEP_ALIVE=24h"

sudo systemctl daemon-reload && sudo systemctl restart ollama

OLLAMA_NUM_PARALLEL=4 is the key setting — it lets multiple agents query the same model simultaneously. Without it, agents queue and wait.
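Back-of-the-envelope math shows the effect (numbers are illustrative; parallel slots share GPU throughput, so real latency lands between the two bounds):

```shell
# Illustrative queueing math: 4 agents, each request needs ~10s of inference.
agents=4; secs=10
serial=$((agents * secs))
echo "NUM_PARALLEL=1: last agent waits ~${serial}s"
echo "NUM_PARALLEL=4: all four finish in roughly ${secs}-$((2 * secs))s"
```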

5. Set Up Reverse Proxy (Production)

For remote access, use nginx or Caddy:


# Install Caddy (simplest HTTPS)
sudo apt install caddy

# /etc/caddy/Caddyfile
ollama.yourdomain.com {
    reverse_proxy localhost:11434
    basic_auth {
        admin $2a$14$... # bcrypt hash of your password
    }
}

sudo systemctl restart caddy

For the cheapest hosting layer, a Hostinger VPS (~$5/month) handles the reverse proxy and SSL termination if your Ollama server is on a home network behind NAT.

Model Selection for Multi-Agent Systems

Not every agent needs the same model. Match model size to task:

| Agent Role | Recommended Model | Why |
|------------|-------------------|-----|
| Research / analysis | Qwen 3.5 14B Q4 | Needs reasoning depth |
| Code generation | Qwen 2.5 Coder 14B | Still best for code |
| Quick chat / triage | Qwen 3.5 8B Q4 | Speed over depth |
| Content writing | Qwen 3.5 14B Q4 | Quality matters |
| Monitoring / alerts | Qwen 3.5 8B Q3 | Minimal, fast |
| Complex reasoning | Qwen 3.5 14B + /think | Thinking mode for hard problems |

Using multiple model sizes saves VRAM. An 8B model uses ~5GB vs 14B's ~9.5GB. Loading both costs ~15GB total — well within a 24GB GPU.
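Those footprints follow a rough heuristic: weights cost around 0.6 GB per billion parameters at Q4_K_M, plus some runtime overhead, with the KV cache growing on top as context fills. A sketch (the constants are approximations, not measurements):

```shell
# Rough VRAM heuristic for a Q4_K_M model:
#   ~0.6 GB per billion params for weights + ~0.5 GB runtime overhead.
# KV cache is extra and grows with context length.
v8=$(awk 'BEGIN { printf "%.1f", 8 * 0.6 + 0.5 }')
v14=$(awk 'BEGIN { printf "%.1f", 14 * 0.6 + 0.5 }')
echo "8B model: ~${v8} GB, 14B model: ~${v14} GB"
```

The results land in the same ballpark as the figures above; check nvidia-smi for the real numbers on your hardware.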

Monitoring and Reliability

Health Checks


# Check Ollama is running
curl -s http://localhost:11434/api/tags | jq '.models[].name'

# Check loaded models
ollama ps

# Monitor GPU utilization
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 5

Auto-Restart on Failure


# Ollama already runs as systemd service with restart
sudo systemctl status ollama

# For OpenClaw, add a systemd service:
sudo tee /etc/systemd/system/openclaw.service << EOF
[Unit]
Description=OpenClaw AI Agent Gateway
After=ollama.service

[Service]
ExecStart=/usr/bin/openclaw start
Restart=always
RestartSec=10
User=openclaw

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable openclaw && sudo systemctl start openclaw

Cost Tracking

Self-hosted doesn't mean free — you pay for electricity and hardware depreciation. Track your actual costs:

| Cost | Monthly | Annual |
|------|---------|--------|
| Electricity (RTX 3090, 12h/day) | ~$15 | ~$180 |
| Hardware depreciation (3yr) | ~$30 | ~$360 |
| VPS for proxy (optional) | ~$5 | ~$60 |
| Total | ~$50 | ~$600 |

Compare with API costs for equivalent usage (10M tokens/month on Claude): ~$30-150/month. Self-hosting breaks even at ~5M tokens/month and saves significantly above that.
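The break-even point is just fixed cost divided by the blended API rate (the $10/M-token rate is an assumption; your model mix will shift it):

```shell
# Break-even volume = fixed monthly cost / blended API price per M tokens.
self_hosted=50   # $/month total from the table above
api_rate=10      # assumed blended $/M tokens (varies widely by model mix)
breakeven=$((self_hosted / api_rate))
echo "break-even: ~${breakeven}M tokens/month"
```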

Common Issues and Fixes

Ollama OOM (out of memory): Reduce OLLAMA_NUM_PARALLEL or use a smaller model. Check nvidia-smi for actual VRAM usage.

Slow response with multiple agents: Agents are queuing. Increase OLLAMA_NUM_PARALLEL or add a second GPU.

Model not loading: Check ollama ps and journalctl -u ollama. Common cause: insufficient VRAM for the model + KV cache at your context length.

OpenClaw can't reach Ollama: Verify Ollama is listening on the right interface. Default is localhost:11434. For Docker setups, use host.docker.internal:11434.
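If OpenClaw runs in Docker or on another machine, bind Ollama to all interfaces using the same systemd override shown earlier (OLLAMA_HOST is a standard Ollama setting; keep it behind a firewall or the reverse proxy if exposed):

```shell
# Add to the [Service] section via `sudo systemctl edit ollama`:
#   Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl restart ollama
# Verify from another machine on the LAN (replace the address):
# curl -s http://192.168.1.50:11434/api/tags
```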

FAQ

Q: How many agents can I run on one GPU?

A: With model sharing (all agents use the same loaded model), 5-10 agents work well on an RTX 3090 with OLLAMA_NUM_PARALLEL=4. Each agent doesn't load a separate copy — they share the model in VRAM and take turns with inference.

Q: Can I mix cloud and local models in OpenClaw?

A: Yes. Configure some agents to use Ollama (local) and others to use OpenAI/Anthropic APIs. Use local for high-volume, low-stakes tasks and cloud for complex reasoning where frontier model quality matters.

Q: Do I need to run OpenClaw and Ollama on the same machine?

A: No. Ollama can run on a GPU server while OpenClaw runs anywhere that can reach Ollama's API. Connect them over your local network, Tailscale, or WireGuard.
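One simple way to bridge the two machines is an SSH tunnel (hostname is illustrative; Tailscale or WireGuard achieve the same thing at the network layer):

```shell
# Forward the GPU box's Ollama port to this machine.
# -N: no remote command; -L: local port forward.
gpu_host="user@gpu-server"
tunnel_cmd="ssh -N -L 11434:localhost:11434 $gpu_host"
echo "$tunnel_cmd"
# Run it in the background, then point OpenClaw at http://localhost:11434
# exactly as if Ollama were local.
```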

Q: Is OpenClaw production-ready?

A: OpenClaw is actively developed and used in production by multiple teams (including us at BerserKI with 14 agents). It's stable for multi-agent orchestration. Check the OpenClaw GitHub for the latest release.

Q: What's the cheapest possible self-hosted AI agent setup?

A: A used RTX 3060 12GB (~$180) + any modern PC running Ollama + OpenClaw. Total: ~$300-400. Runs Qwen 3.5 8B at ~30 tok/s — enough for a personal AI assistant running 24/7. For more options, see our best hardware for local LLMs guide.


*More self-hosted AI guides: Best LLMs for 24GB GPUs · Best Hardware for Local LLMs · Ollama vs LM Studio vs llama.cpp · Multi-Agent Orchestration Guide*

*Disclosure: This article contains affiliate links. ToolHalla may earn a commission at no extra cost to you.*

#openclaw #ollama #local-llm #production #docker