OpenClaw + Ollama Production Config 2026: Run AI Agents on Local Hardware
Running AI agents through cloud APIs works — until it doesn't. Rate limits hit at 2 AM. A provider outage kills your automation mid-task. Monthly bills climb past $200 for what's essentially autocomplete with tool calling. And every prompt you send leaves your machine.
The alternative: run the entire stack locally. OpenClaw as the agent runtime, Ollama as the inference engine, your own hardware doing the work. Zero API costs, zero latency to an external server, complete privacy, and uptime that depends only on your power supply.
This guide covers the complete production setup — hardware selection, Ollama configuration, OpenClaw integration, performance tuning, and the failure modes that'll bite you if you don't plan for them.
Why Run Locally in 2026?
The local LLM ecosystem matured fast. Three things changed the calculus:
1. Models got efficient. Qwen3 8B and Llama 3.1 8B deliver GPT-3.5-class quality at 40+ tokens/second on consumer GPUs. For agent tasks — tool calling, code generation, structured output — that's more than sufficient.
2. Quantization stopped hurting. Q4_K_M and Q5_K_M quantizations retain 95%+ of full-precision quality on benchmarks that matter for agent work. You're losing fractions of a percent on MMLU in exchange for 4× less VRAM.
3. Agent runtimes learned to work with smaller models. OpenClaw's context engineering — aggressive summarization, memory files, structured scratchpads — means a 32K-context local model can handle tasks that would require 128K+ tokens of raw history. See our context engineering deep dive for the architectural patterns.
The result: a $1,500 desktop can run a genuinely useful AI agent 24/7. A $3,000 build makes it comfortable.
Hardware Requirements
Minimum Viable Setup (8B Models)
For running 8B-parameter models (Llama 3.1, Qwen3, Mistral) at usable speeds:
- GPU: NVIDIA RTX 3060 12GB or RTX 4060 Ti 16GB
- RAM: 32 GB DDR4/DDR5
- Storage: 500 GB NVMe SSD (models + OS)
- CPU: Any modern 8-core (Ryzen 5 / Intel i5 or better)
This runs 8B Q4_K_M at 30-50 tok/s decode. Enough for single-agent workflows where you aren't watching the output in real time. Prefill (processing input tokens) will be 200-400 tok/s, so long prompts take a noticeable beat.
Recommended Setup (8B–32B Models)
For running multiple model sizes and handling longer contexts:
- GPU: NVIDIA RTX 4080 16GB or RTX 4090 24GB
- RAM: 64 GB DDR5
- Storage: 1 TB NVMe SSD
- CPU: Ryzen 7 / Intel i7 or better
The RTX 4090 remains the sweet spot for local inference in 2026. Its 24 GB VRAM fits 32B Q4_K_M models entirely in GPU memory, and the memory bandwidth (1 TB/s) delivers 50-80 tok/s on 8B models. For multi-agent setups where two agents might run concurrently, the extra VRAM headroom matters.
High-End Setup (70B+ Models)
For running larger models or multiple concurrent agents:
- GPU: 2× RTX 4090 or RTX 5090 32GB, or NVIDIA DGX Spark for its 128 GB unified memory
- RAM: 128 GB DDR5
- Storage: 2 TB NVMe SSD
- CPU: Ryzen 9 / Intel i9 / Threadripper
At this tier, 70B Q4_K_M (roughly 42 GB of weights) is too large for any single 32 GB card, so run it across two 4090s via tensor parallelism or in the DGX Spark's 128 GB unified memory. Decode speed will be 10-20 tok/s — usable for batch agent work, but not snappy for interactive use.
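A back-of-envelope check for whether a quantized model fits: weights take roughly params × bits-per-weight ÷ 8 bytes, plus about 10% runtime overhead, and the KV cache comes on top of that. A minimal sketch, assuming ~4.8 bits/weight as a rough Q4_K_M average:

```bash
# Estimate weight memory for a 70B model at ~4.8 bits/weight (rough Q4_K_M
# average), with ~10% runtime overhead; KV cache is extra on top of this.
python3 -c 'p, bits = 70e9, 4.8; print(f"{p * bits / 8 * 1.10 / 1e9:.0f} GB")'
# Prints ~46 GB: too big for one 32 GB card, comfortable across two 4090s.
```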
Apple Silicon Alternative
Mac Studio with M4 Max (128 GB) or M4 Ultra (192 GB) handles large models gracefully thanks to unified memory and 800+ GB/s bandwidth. Ollama runs natively on macOS. The trade-off: no CUDA, so some Ollama optimizations (Flash Attention 2 on NVIDIA) don't apply, and throughput per dollar is lower than a well-configured Linux GPU box.
Installing and Configuring Ollama
Installation
```bash
# Linux (recommended for production)
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

# Start the service (systemd)
sudo systemctl enable ollama
sudo systemctl start ollama
```
Pull Your Models
Start with proven agent-capable models:
```bash
# Primary agent model — excellent tool calling
ollama pull qwen3:8b

# Alternative with strong instruction following
ollama pull llama3.1:8b-instruct-q5_K_M

# Larger model for complex reasoning tasks
ollama pull qwen3:32b-q4_K_M

# Embedding model for RAG/memory (optional)
ollama pull nomic-embed-text
```
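Before wiring anything up, confirm what actually landed on disk:

```bash
# List pulled models and their on-disk sizes
ollama list

# Inspect one model's parameters, context length, and chat template
ollama show qwen3:8b
```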
Ollama Production Configuration
The default Ollama config is tuned for casual use. Production agent workloads need adjustments.
Create or edit /etc/systemd/system/ollama.service.d/override.conf:
```ini
[Service]
# Keep models loaded — agent workflows send frequent requests
Environment="OLLAMA_KEEP_ALIVE=24h"

# Raise the context window (the 2048-4096 default is far too small for agents)
Environment="OLLAMA_CONTEXT_LENGTH=32768"

# Set parallel request handling for multi-agent setups
Environment="OLLAMA_NUM_PARALLEL=2"

# Bind to localhost only (security — don't expose to network)
Environment="OLLAMA_HOST=127.0.0.1:11434"

# GPU offload is automatic: Ollama loads as many layers as fit in VRAM;
# to force a specific count, set the num_gpu parameter per model instead

# Flash attention — significant speedup on supported GPUs
Environment="OLLAMA_FLASH_ATTENTION=1"
```
Apply:
```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
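It's worth verifying that systemd actually picked up the drop-in; a mis-named directory or file fails silently:

```bash
# Show the unit together with any drop-ins systemd loaded
systemctl cat ollama

# Print the environment the service will run with
sudo systemctl show ollama --property=Environment
```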
Critical setting: OLLAMA_KEEP_ALIVE. By default, Ollama unloads models after 5 minutes of inactivity. Agent workflows have bursty patterns — a coding agent might go 10 minutes reading files before firing off a batch of LLM calls. Setting 24h keeps the model hot in VRAM. The trade-off is that you can't easily switch models without manually unloading.
Critical setting: OLLAMA_CONTEXT_LENGTH. The default context (2,048 tokens in older Ollama builds, 4,096 in newer ones) is useless for agents. 32K is the practical sweet spot — large enough for most agent tasks, small enough that 8B models still run fast. Going beyond 32K on 8B models causes noticeable speed degradation and rarely helps for structured agent work.
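Context can also be raised per request rather than globally, which is useful when one agent needs a bigger window than the rest. A quick check against the native API, assuming qwen3:8b is pulled:

```bash
# Per-request context override via the native Ollama API
curl -s http://127.0.0.1:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Reply with one word: ready",
  "stream": false,
  "options": { "num_ctx": 32768 }
}'
```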
Creating a Custom Modelfile
For agent-specific tuning, create a Modelfile:
```
FROM qwen3:8b

# System prompt baked into model config
SYSTEM """You are a helpful AI assistant running locally via OpenClaw. You have access to tools and should use them when appropriate. Be concise and action-oriented."""

# Temperature — lower for agent work (deterministic tool calls)
PARAMETER temperature 0.3

# Context window
PARAMETER num_ctx 32768

# Repeat penalty — helps prevent agent loops
PARAMETER repeat_penalty 1.1
```

Then build it:

```bash
ollama create openclaw-agent -f Modelfile
```
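Smoke-test the custom model before pointing OpenClaw at it:

```bash
# One-off check that the baked-in system prompt and parameters took effect
ollama run openclaw-agent "In one sentence, what are you configured to do?"
```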
Configuring OpenClaw for Local Ollama
Basic Configuration
OpenClaw connects to Ollama via the OpenAI-compatible API. In your OpenClaw config:
```yaml
# openclaw.yaml
providers:
  ollama:
    type: openai
    baseUrl: http://127.0.0.1:11434/v1
    apiKey: "ollama"  # Ollama doesn't require auth, but the field is mandatory
    models:
      - qwen3:8b
      - qwen3:32b-q4_K_M

defaultModel: ollama/qwen3:8b
```
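Before starting OpenClaw, verify that the OpenAI-compatible endpoint responds, since this is the exact surface OpenClaw will talk to:

```bash
# Hit Ollama's OpenAI-compatible chat endpoint directly
curl -s http://127.0.0.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ollama" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Reply with the word: pong"}]
  }'
```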
Model Routing
For production, you'll want different models for different tasks. OpenClaw supports per-agent model overrides:
```yaml
# Use the fast 8B for routine tasks
defaultModel: ollama/qwen3:8b

# Override for complex reasoning
agents:
  researcher:
    model: ollama/qwen3:32b-q4_K_M
  coder:
    model: ollama/qwen3:8b  # speed matters more for coding loops
```
Context Window Management
Local models have smaller context windows than cloud APIs. This is actually fine — most agent tasks don't need 128K tokens if you manage context well. OpenClaw's memory system handles this:
- Aggressive summarization compresses conversation history, keeping the context window focused
- MEMORY.md scratchpads persist state across context resets — the agent writes what matters to disk and reads it back
- Tool output truncation prevents a single large file read from consuming the entire context
A 32K local model with good context engineering outperforms a 128K cloud model stuffed with unstructured history. That's not cope — it's what the research on context window failures consistently shows.
Performance Optimization
1. GPU Memory Management
Monitor VRAM usage to prevent OOM crashes:
```bash
# Watch GPU utilization and memory
watch -n 1 nvidia-smi

# Check Ollama's loaded models
ollama ps
```
If you're running close to VRAM limits, Ollama will silently offload layers to CPU — and performance craters. A model that runs at 50 tok/s fully on GPU might drop to 5 tok/s with even a few layers on CPU. Either quantize harder or use a smaller model.
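The PROCESSOR column of ollama ps makes offloading visible: 100% GPU is what you want, and any CPU share means the slowdown described above. A small check, assuming the column layout of current Ollama releases:

```bash
# Warn if any loaded model reports a CPU share in ollama ps
# (output format assumed from current Ollama releases)
if ollama ps | tail -n +2 | grep -qi 'cpu'; then
  echo "WARNING: partial CPU offload detected; expect a large slowdown"
else
  echo "OK: loaded models are fully on GPU"
fi
```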
2. Concurrent Request Handling
OLLAMA_NUM_PARALLEL controls how many requests Ollama processes simultaneously. Each parallel slot shares the context memory, so:
- NUM_PARALLEL=1: Full context window available, maximum quality
- NUM_PARALLEL=2: Context split between slots, fine for most agent work
- NUM_PARALLEL=4: Only viable on 24GB+ GPUs with 8B models
For multi-agent orchestration where a supervisor dispatches to worker agents, NUM_PARALLEL=2 is the sweet spot. The supervisor and one worker can run concurrently without context starvation.
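A quick way to confirm both slots are live: fire two requests at once and check that the wall-clock times are similar rather than one queuing behind the other.

```bash
# Two concurrent generations; with OLLAMA_NUM_PARALLEL=2 both should finish
# in roughly the same wall-clock time instead of serializing
for i in 1 2; do
  ( time curl -s http://127.0.0.1:11434/api/generate \
      -d '{"model":"qwen3:8b","prompt":"Count to five.","stream":false}' \
      > /dev/null ) &
done
wait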
3. Model Quantization Choices
For agent workloads specifically:
| Quantization | VRAM (8B) | Speed | Quality | Agent Verdict |
|---|---|---|---|---|
| Q8_0 | ~9 GB | Baseline | Best | Overkill — if you need this quality, use a full-precision (FP16) cloud model instead |
| Q5_K_M | ~6 GB | +15% | Excellent | Best quality/speed balance |
| Q4_K_M | ~5 GB | +25% | Great | Production sweet spot |
| Q3_K_M | ~4 GB | +35% | Good | Acceptable if VRAM-constrained |
| Q2_K | ~3 GB | +40% | Fair | Tool calling starts degrading |
Below Q3_K_M, structured output reliability drops — the model starts malforming JSON and missing tool call syntax. For agents, that's a dealbreaker.
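If you'd rather measure this on your own hardware than trust the table, a crude spot-check is to request JSON repeatedly without Ollama's format constraint and count parse failures. It needs jq, 20 samples is indicative rather than rigorous, and models that wrap answers in reasoning preambles will count as failures, which is arguably the right call for agent work:

```bash
# Count malformed-JSON responses out of 20 unconstrained generations
fails=0
for i in $(seq 20); do
  out=$(curl -s http://127.0.0.1:11434/api/generate -d '{
    "model": "qwen3:8b",
    "prompt": "Return only a JSON object with keys \"name\" and \"id\". No prose.",
    "stream": false
  }' | jq -r '.response')
  echo "$out" | jq -e . >/dev/null 2>&1 || fails=$((fails + 1))
done
echo "malformed: $fails/20"
```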
4. Keep-Alive and Preloading
Avoid cold-start latency by preloading your agent model on boot:
```
# Add to crontab or systemd timer
@reboot sleep 30 && curl -s http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"hello","stream":false}' > /dev/null
```
This fires a dummy request that loads the model into VRAM. The first real agent request then gets an instant response instead of waiting 5-10 seconds for model loading.
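An alternative that skips cron entirely: the API's keep_alive field pins a model per request, and a negative value means never unload, so a single call after any restart does the job.

```bash
# Load qwen3:8b and pin it in VRAM indefinitely (negative keep_alive = never unload);
# an empty /api/generate call loads the model without generating anything
curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model": "qwen3:8b", "keep_alive": -1}' > /dev/null

# The UNTIL column should now show the model pinned
ollama ps
```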
Troubleshooting
Model Loading Fails or Runs on CPU
Symptom: Ollama loads the model but inference is painfully slow (2-5 tok/s for an 8B model).
Cause: Model doesn't fit in VRAM, so layers are offloaded to CPU.
Fix:
```bash
# Check what's actually loaded
ollama ps

# Check GPU memory
nvidia-smi

# Use a more aggressive quantization
ollama pull qwen3:8b-q4_K_M
```
Context Window Errors
Symptom: Agent responses become incoherent or Ollama returns truncation errors.
Cause: Agent sent more tokens than num_ctx allows.
Fix: Increase OLLAMA_CONTEXT_LENGTH or (better) configure OpenClaw's context management to summarize more aggressively. A well-configured agent should rarely exceed 16K tokens of actual context.
Agent Tool Calls Malformed
Symptom: The agent tries to call tools but the JSON is broken or the tool name is hallucinated.
Cause: Model too small or quantization too aggressive for reliable structured output.
Fix: Upgrade to Q4_K_M minimum. Switch to Qwen3 — it has the best tool calling reliability in the 8B class as of mid-2026. If the problem persists, try the 32B variant.
High Latency Between Turns
Symptom: Each agent turn takes 10+ seconds even for short responses.
Cause: Model is being unloaded and reloaded between requests (KEEP_ALIVE too short).
Fix: Set OLLAMA_KEEP_ALIVE=24h and verify with ollama ps that the model stays loaded.
Memory Leaks on Long Sessions
Symptom: Ollama's memory usage grows over hours/days until the system runs out of RAM.
Cause: Known issue with some Ollama versions when NUM_PARALLEL > 1 and context accumulates.
Fix: Schedule a daily Ollama restart:
```
# /etc/cron.d/ollama-restart
0 4 * * * root systemctl restart ollama
```
Pick a time when your agents are idle. The model reload takes 5-10 seconds.
Security Considerations
Running agents locally doesn't automatically mean "secure." A few things to lock down:
1. Bind Ollama to localhost. Never expose port 11434 to the network unless you've added authentication (Ollama has none built-in). The default OLLAMA_HOST=127.0.0.1:11434 is correct; see the check after this list.
2. Firewall the inference port. Belt and suspenders:
```bash
sudo ufw deny 11434
```
3. Isolate agent workspaces. OpenClaw agents can read and write files. Run them in a dedicated user account or container with limited filesystem access.
4. Monitor resource usage. A runaway agent loop can peg your GPU at 100% for hours. Set up alerts for sustained high utilization.
5. Update regularly. Both Ollama and the models receive security patches. ollama pull updates to the latest version of a model tag.
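The loopback check referenced in point 1:

```bash
# Verify Ollama listens on loopback only; 0.0.0.0 here means network-exposed
ss -tlnp | grep 11434
```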
Cost Comparison: Local vs Cloud
For an agent running 8 hours/day, processing ~500K tokens daily:
| | Cloud (GPT-4o) | Cloud (Claude Sonnet) | Local (Ollama + RTX 4090) |
|---|---|---|---|
| Monthly token cost | ~$75 | ~$45 | $0 |
| Hardware amortized (36 mo) | $0 | $0 | ~$50/mo |
| Electricity | $0 | $0 | ~$15/mo |
| Total monthly | ~$75 | ~$45 | ~$65 |
At these numbers, the local setup breaks even against GPT-4o in roughly 30 months: $1,800 of hardware divided by $60/month in net savings ($75 in avoided token costs minus $15 of electricity). Against cheaper cloud models, it takes longer. The real value isn't cost savings — it's zero rate limits, zero downtime dependency, and complete data privacy. If your agent processes sensitive code, financial data, or personal information, running locally isn't just cheaper; it's the only responsible option.
Recommended Production Stack
Here's the full stack we recommend for a reliable local AI agent setup:
| Component | Choice | Why |
|---|---|---|
| Inference | Ollama | Mature, fast, great model library |
| Agent Runtime | OpenClaw | Context engineering, tool calling, memory |
| Primary Model | Qwen3 8B Q4_K_M | Best tool calling in class |
| Reasoning Model | Qwen3 32B Q4_K_M | For complex multi-step tasks |
| Embeddings | nomic-embed-text | Fast, good quality, runs on CPU |
| GPU | RTX 4090 24GB | VRAM sweet spot for 8B-32B models |
| OS | Ubuntu 24.04 LTS | Best NVIDIA driver support |
What's Next
Once your local stack is running, explore:
- Multi-agent orchestration — run supervisor + worker patterns entirely on local hardware
- Agent memory systems — persistent context that survives across sessions
- Prompt caching — while you're not paying per-token locally, caching still speeds up prefill significantly
- RAG pipelines with local embeddings — keep your documents searchable without sending them to an API
The local AI agent stack in 2026 is production-ready. Not "works if you squint" — genuinely reliable for daily use. The models are good enough, the tooling is mature enough, and the hardware is affordable enough. The only question is whether your use case justifies running your own infrastructure versus paying for cloud convenience.
For most developers and small teams, the answer is increasingly: yes.
*Disclosure: Links above are affiliate links. ToolHalla may earn a commission at no extra cost to you. We only recommend hardware we'd actually use.*
*Building AI agents? Start with our context engineering guide for the architectural foundations, or jump straight to building a coding agent for a hands-on tutorial.*
FAQ
What are the main types of memory in AI agents?
Four types: (1) In-context memory — the active conversation window; (2) External memory — vector databases and key-value stores; (3) Episodic memory — logs of past interactions; (4) Semantic memory — facts extracted from past experiences.
What is the difference between short-term and long-term memory in AI agents?
Short-term memory is the active context window. Long-term memory is persisted storage (vector DBs, SQL, files) that survives session resets. Effective agents move important info to long-term memory before the context fills.
How do AI agents remember things between sessions?
Agents persist memory by writing key facts to external storage at session end: (1) Summarize and store as vector embedding; (2) Extract entities to a knowledge graph; (3) Store structured facts to a database.
What is a vector database used for in AI agents?
Vector databases store embeddings for semantic retrieval. Agents query: 'what do I know about this topic?' and get relevant stored memories ranked by similarity. Qdrant, ChromaDB, and Pinecone are common choices.
How much memory does an AI agent actually need?
Depends on scope. Conversational agents: mainly session context. Research agents: deep document retrieval via vector DB. Personal assistants: long-term episodic memory for user preferences and past decisions.