
OpenClaw + Ollama Production Config 2026: Run AI Agents on Local Hardware

March 21, 2026·10 min read·2,112 words

Running AI agents through cloud APIs works — until it doesn't. Rate limits hit at 2 AM. A provider outage kills your automation mid-task. Monthly bills climb past $200 for what's essentially autocomplete with tool calling. And every prompt you send leaves your machine.

The alternative: run the entire stack locally. OpenClaw as the agent runtime, Ollama as the inference engine, your own hardware doing the work. Zero API costs, zero latency to an external server, complete privacy, and uptime that depends only on your power supply.

This guide covers the complete production setup — hardware selection, Ollama configuration, OpenClaw integration, performance tuning, and the failure modes that'll bite you if you don't plan for them.

Why Run Locally in 2026?

The local LLM ecosystem matured fast. Three things changed the calculus:

1. Models got efficient. Qwen3 8B and Llama 3.3 8B deliver GPT-3.5-class quality at 40+ tokens/second on consumer GPUs. For agent tasks — tool calling, code generation, structured output — that's more than sufficient.

2. Quantization stopped hurting. Q4_K_M and Q5_K_M quantizations retain 95%+ of full-precision quality on benchmarks that matter for agent work. You're losing fractions of a percent on MMLU in exchange for 4× less VRAM.

3. Agent runtimes learned to work with smaller models. OpenClaw's context engineering — aggressive summarization, memory files, structured scratchpads — means a 32K-context local model can handle tasks that would require 128K+ tokens of raw history. See our context engineering deep dive for the architectural patterns.

The result: a $1,500 desktop can run a genuinely useful AI agent 24/7. A $3,000 build makes it comfortable.

Hardware Requirements

Minimum Viable Setup (8B Models)

For running 8B-parameter models (Llama 3.3, Qwen3, Mistral) at usable speeds:

  • GPU: NVIDIA RTX 3060 12GB or RTX 4060 Ti 16GB
  • RAM: 32 GB DDR4/DDR5
  • Storage: 500 GB NVMe SSD (models + OS)
  • CPU: Any modern 8-core (Ryzen 5 / Intel i5 or better)

This runs 8B Q4_K_M at 30-50 tok/s decode. Enough for single-agent workflows where you're not waiting in real-time. Prefill (processing input tokens) will be 200-400 tok/s, so long prompts take a noticeable beat.
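Those two rates translate into a rough turn-latency estimate. A back-of-envelope sketch using the midpoints of the ranges above; real numbers vary by card, prompt, and quantization:

```python
PREFILL_TOKS_PER_S = 300  # midpoint of the 200-400 tok/s prefill range above
DECODE_TOKS_PER_S = 40    # midpoint of the 30-50 tok/s decode range above

def turn_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    """Seconds from request to last token, ignoring scheduling and network overhead."""
    return prompt_tokens / PREFILL_TOKS_PER_S + output_tokens / DECODE_TOKS_PER_S

# A typical agent turn: 8K tokens of context in, a 300-token tool call out
print(round(turn_latency_s(8000, 300), 1))  # 34.2 -- long prompts are prefill-dominated
```

Note where the time goes: over 26 of those ~34 seconds are prefill, which is why the context management techniques later in this guide matter as much for speed as for quality.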

Recommended Setup (8B-32B Models)

For running multiple model sizes and handling longer contexts:

  • GPU: NVIDIA RTX 4080 16GB or RTX 4090 24GB
  • RAM: 64 GB DDR5
  • Storage: 1 TB NVMe SSD
  • CPU: Ryzen 7 / Intel i7 or better

The RTX 4090 remains the sweet spot for local inference in 2026. Its 24 GB VRAM fits 32B Q4_K_M models entirely in GPU memory, and the memory bandwidth (1 TB/s) delivers 50-80 tok/s on 8B models. For multi-agent setups where two agents might run concurrently, the extra VRAM headroom matters.
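The bandwidth figure matters because decode is largely memory-bandwidth-bound: each generated token streams the full weight set from VRAM once. That gives a quick theoretical ceiling (a sketch, not a benchmark):

```python
def est_decode_toks_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed: bandwidth divided by on-disk model size."""
    return bandwidth_gb_s / model_size_gb

# RTX 4090: ~1000 GB/s bandwidth; an 8B Q4_K_M model weighs ~5 GB
print(est_decode_toks_per_s(1000, 5))  # 200.0 theoretical ceiling
# Attention, KV-cache reads, and kernel overhead bring real throughput
# down to the 50-80 tok/s quoted above.
```

The same arithmetic explains why larger models are slower: a 32B Q4_K_M at ~19 GB caps out around 50 tok/s on the same card before any overhead.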

High-End Setup (70B+ Models)

For running larger models or multiple concurrent agents:

  • GPU: 2× RTX 4090 or RTX 5090 32GB, or NVIDIA DGX Spark for its 128 GB unified memory
  • RAM: 128 GB DDR5
  • Storage: 2 TB NVMe SSD
  • CPU: Ryzen 9 / Intel i9 / Threadripper

At this tier, 70B Q4_K_M fits in a single RTX 5090 or across two 4090s via tensor parallelism. Decode speed will be 10-20 tok/s — usable for batch agent work, but not snappy for interactive use.

Apple Silicon Alternative

Mac Studio with M4 Max (128 GB) or M4 Ultra (192 GB) handles large models gracefully thanks to unified memory and 800+ GB/s bandwidth. Ollama runs natively on macOS. The trade-off: no CUDA, so some Ollama optimizations (Flash Attention 2 on NVIDIA) don't apply, and throughput per dollar is lower than a well-configured Linux GPU box.

Installing and Configuring Ollama

Installation


# Linux (recommended for production)
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

# Start the service (systemd)
sudo systemctl enable ollama
sudo systemctl start ollama

Pull Your Models

Start with proven agent-capable models:


# Primary agent model — excellent tool calling
ollama pull qwen3:8b

# Alternative with strong instruction following
ollama pull llama3.3:8b-instruct-q5_K_M

# Larger model for complex reasoning tasks
ollama pull qwen3:32b-q4_K_M

# Embedding model for RAG/memory (optional)
ollama pull nomic-embed-text

Ollama Production Configuration

The default Ollama config is tuned for casual use. Production agent workloads need adjustments.

Create or edit /etc/systemd/system/ollama.service.d/override.conf:


[Service]
# Keep models loaded — agent workflows send frequent requests
Environment="OLLAMA_KEEP_ALIVE=24h"

# Increase context window (default is 2048 — far too small for agents)
Environment="OLLAMA_NUM_CTX=32768"

# Set parallel request handling for multi-agent setups
Environment="OLLAMA_NUM_PARALLEL=2"

# Bind to localhost only (security — don't expose to network)
Environment="OLLAMA_HOST=127.0.0.1:11434"

# GPU layers — load entire model to GPU (adjust if model doesn't fit)
Environment="OLLAMA_GPU_LAYERS=999"

# Flash attention — significant speedup on supported GPUs
Environment="OLLAMA_FLASH_ATTENTION=1"

Apply:


sudo systemctl daemon-reload
sudo systemctl restart ollama

Critical setting: OLLAMA_KEEP_ALIVE. By default, Ollama unloads models after 5 minutes of inactivity. Agent workflows have bursty patterns — a coding agent might go 10 minutes reading files before firing off a batch of LLM calls. Setting 24h keeps the model hot in VRAM. The trade-off is that you can't easily switch models without manually unloading.

Critical setting: OLLAMA_NUM_CTX. The default 2048-token context is useless for agents. 32K is the practical sweet spot — large enough for most agent tasks, small enough that 8B models still run fast. Going beyond 32K on 8B models causes noticeable speed degradation and rarely helps for structured agent work.
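To confirm the keep-alive is actually holding, poll Ollama's /api/ps endpoint and check the expiry. A minimal sketch that parses a ps-style response; the `models`/`name`/`expires_at` field names follow Ollama's API but should be verified against your version:

```python
import json
from datetime import datetime, timezone

def loaded_models(ps_json: str, now: datetime) -> dict:
    """Hours of keep-alive remaining per loaded model, from an /api/ps response."""
    out = {}
    for m in json.loads(ps_json).get("models", []):
        expires = datetime.fromisoformat(m["expires_at"])
        out[m["name"]] = round((expires - now).total_seconds() / 3600, 1)
    return out

# Simulated response from: curl -s http://127.0.0.1:11434/api/ps
sample = '{"models": [{"name": "qwen3:8b", "expires_at": "2026-03-22T10:00:00+00:00"}]}'
now = datetime(2026, 3, 21, 10, 0, tzinfo=timezone.utc)
print(loaded_models(sample, now))  # {'qwen3:8b': 24.0} -- KEEP_ALIVE=24h is holding
```

If the hours-remaining figure keeps resetting to minutes instead of hours, the override file isn't being picked up; re-check the daemon-reload step.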

Creating a Custom Modelfile

For agent-specific tuning, create a Modelfile:


FROM qwen3:8b

# System prompt baked into model config
SYSTEM """You are a helpful AI assistant running locally via OpenClaw. You have access to tools and should use them when appropriate. Be concise and action-oriented."""

# Temperature — lower for agent work (deterministic tool calls)
PARAMETER temperature 0.3

# Context window
PARAMETER num_ctx 32768

# Repeat penalty — helps prevent agent loops
PARAMETER repeat_penalty 1.1

ollama create openclaw-agent -f Modelfile

Configuring OpenClaw for Local Ollama

Basic Configuration

OpenClaw connects to Ollama via the OpenAI-compatible API. In your OpenClaw config:


# openclaw.yaml
providers:
  ollama:
    type: openai
    baseUrl: http://127.0.0.1:11434/v1
    apiKey: "ollama"  # Ollama doesn't require auth, but the field is mandatory
    models:
      - qwen3:8b
      - qwen3:32b-q4_K_M

defaultModel: ollama/qwen3:8b

Model Routing

For production, you'll want different models for different tasks. OpenClaw supports per-agent model overrides:


# Use the fast 8B for routine tasks
defaultModel: ollama/qwen3:8b

# Override for complex reasoning
agents:
  researcher:
    model: ollama/qwen3:32b-q4_K_M
  coder:
    model: ollama/qwen3:8b  # Speed matters more for coding loops

Context Window Management

Local models have smaller context windows than cloud APIs. This is actually fine — most agent tasks don't need 128K tokens if you manage context well. OpenClaw's memory system handles this:

  • Aggressive summarization compresses conversation history, keeping the context window focused
  • MEMORY.md scratchpads persist state across context resets — the agent writes what matters to disk and reads it back
  • Tool output truncation prevents a single large file read from consuming the entire context
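The truncation pattern in the last bullet is simple to sketch: keep the head and tail of the output and elide the middle, since the middle of a long file read is usually the least informative part. An illustration of the idea, not OpenClaw's actual implementation:

```python
def truncate_tool_output(text: str, max_chars: int = 4000) -> str:
    """Bound tool output by keeping the head and tail and eliding the middle."""
    if len(text) <= max_chars:
        return text
    head, tail = max_chars * 2 // 3, max_chars // 3
    omitted = len(text) - head - tail
    return text[:head] + f"\n... [{omitted} chars omitted] ...\n" + text[-tail:]

long_output = "x" * 10_000
print(len(truncate_tool_output(long_output)) < 4200)  # True -- bounded regardless of input
```

The elision marker tells the model that content was cut, so it can re-read a specific range with a targeted tool call if it needs the middle.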

A 32K local model with good context engineering outperforms a 128K cloud model stuffed with unstructured history. That's not cope — it's what the research on context window failures consistently shows.

Performance Optimization

1. GPU Memory Management

Monitor VRAM usage to prevent OOM crashes:


# Watch GPU utilization and memory
watch -n 1 nvidia-smi

# Check Ollama's loaded models
ollama ps

If you're running close to VRAM limits, Ollama will silently offload layers to CPU — and performance craters. A model that runs at 50 tok/s fully on GPU might drop to 5 tok/s with even a few layers on CPU. Either quantize harder or use a smaller model.

2. Concurrent Request Handling

OLLAMA_NUM_PARALLEL controls how many requests Ollama processes simultaneously. Each parallel slot shares the context memory, so:

  • NUM_PARALLEL=1: Full context window available, maximum quality
  • NUM_PARALLEL=2: Context split between slots, fine for most agent work
  • NUM_PARALLEL=4: Only viable on 24GB+ GPUs with 8B models

For multi-agent orchestration where a supervisor dispatches to worker agents, NUM_PARALLEL=2 is the sweet spot. The supervisor and one worker can run concurrently without context starvation.
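Under the split model described above, the per-request budget is simple arithmetic (this assumes slots divide the configured window evenly, as described here; check your Ollama version's actual allocation behavior):

```python
def per_slot_ctx(num_ctx: int, num_parallel: int) -> int:
    """Effective context window available to each concurrent request."""
    return num_ctx // num_parallel

print(per_slot_ctx(32768, 2))  # 16384 -- supervisor and worker each get a 16K window
```

At NUM_PARALLEL=4 each slot drops to 8K, which is why that setting only makes sense with small models and aggressive context management.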

3. Model Quantization Choices

For agent workloads specifically:

| Quantization | VRAM (8B) | Speed | Quality | Agent Verdict |
|---|---|---|---|---|
| Q8_0 | ~9 GB | Baseline | Best | Overkill — use FP16 API instead |
| Q5_K_M | ~6 GB | +15% | Excellent | Best quality/speed balance |
| Q4_K_M | ~5 GB | +25% | Great | Production sweet spot |
| Q3_K_M | ~4 GB | +35% | Good | Acceptable if VRAM-constrained |
| Q2_K | ~3 GB | +40% | Fair | Tool calling starts degrading |

Below Q3_K_M, structured output reliability drops — the model starts malforming JSON and missing tool call syntax. For agents, that's a dealbreaker.
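The VRAM column follows from bits-per-weight arithmetic: weights cost roughly params × bits / 8 bytes, plus an allowance for KV cache and runtime buffers. A rough estimator; the effective bits per weight for the K-quant formats and the flat overhead are approximations:

```python
def est_vram_gb(params_b: float, bits_per_weight: float, overhead_gb: float = 0.5) -> float:
    """Weight memory (params * bits / 8) plus a flat KV-cache/buffer allowance."""
    return params_b * bits_per_weight / 8 + overhead_gb

# Approximate effective bits per weight for each format
for name, bits in [("Q8_0", 8.0), ("Q5_K_M", 5.5), ("Q4_K_M", 4.5), ("Q3_K_M", 3.4)]:
    print(f"{name}: ~{est_vram_gb(8, bits):.1f} GB")
# Q8_0: ~8.5 GB, Q5_K_M: ~6.0 GB, Q4_K_M: ~5.0 GB, Q3_K_M: ~3.9 GB
```

The same formula scales up: a 32B model at Q4_K_M lands around 18-19 GB of weights, which is why it fits a 24 GB card but leaves little headroom for long contexts.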

4. Keep-Alive and Preloading

Avoid cold-start latency by preloading your agent model on boot:


# Add to crontab or systemd timer
@reboot sleep 30 && curl -s http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"hello","stream":false}' > /dev/null

This fires a dummy request that loads the model into VRAM. The first real agent request then gets an instant response instead of waiting 5-10 seconds for the model to load.

Troubleshooting

Model Loading Fails or Runs on CPU

Symptom: Ollama loads the model but inference is painfully slow (2-5 tok/s for an 8B model).

Cause: Model doesn't fit in VRAM, so layers are offloaded to CPU.

Fix:


# Check what's actually loaded
ollama ps

# Check GPU memory
nvidia-smi

# Use a more aggressive quantization
ollama pull qwen3:8b-q4_K_M

Context Window Errors

Symptom: Agent responses become incoherent or Ollama returns truncation errors.

Cause: Agent sent more tokens than num_ctx allows.

Fix: Increase OLLAMA_NUM_CTX or (better) configure OpenClaw's context management to summarize more aggressively. A well-configured agent should rarely exceed 16K tokens of actual context.
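A cheap pre-flight check catches oversized prompts before Ollama does. The ~4-characters-per-token rule of thumb below is a heuristic for English text and code, not a tokenizer:

```python
def fits_context(prompt: str, num_ctx: int = 32768, reply_budget: int = 1024) -> bool:
    """True if the estimated prompt tokens plus a reply allowance fit the window."""
    est_tokens = len(prompt) // 4  # ~4 chars/token heuristic
    return est_tokens + reply_budget <= num_ctx

print(fits_context("word " * 40_000))  # False -- ~50K estimated tokens vs a 32K window
```

Running this check before dispatch lets the agent trigger summarization proactively instead of discovering the overflow as garbled output.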

Agent Tool Calls Malformed

Symptom: The agent tries to call tools but the JSON is broken or the tool name is hallucinated.

Cause: Model too small or quantization too aggressive for reliable structured output.

Fix: Upgrade to Q4_K_M minimum. Switch to Qwen3 — it has the best tool calling reliability in the 8B class as of mid-2026. If the problem persists, try the 32B variant.

High Latency Between Turns

Symptom: Each agent turn takes 10+ seconds even for short responses.

Cause: Model is being unloaded and reloaded between requests (KEEP_ALIVE too short).

Fix: Set OLLAMA_KEEP_ALIVE=24h and verify with ollama ps that the model stays loaded.

Memory Leaks on Long Sessions

Symptom: Ollama's memory usage grows over hours/days until the system runs out of RAM.

Cause: Known issue with some Ollama versions when NUM_PARALLEL > 1 and context accumulates.

Fix: Schedule a daily Ollama restart:


# /etc/cron.d/ollama-restart
0 4 * * * root systemctl restart ollama

Pick a time when your agents are idle. The model reload takes 5-10 seconds.

Security Considerations

Running agents locally doesn't automatically mean "secure." A few things to lock down:

1. Bind Ollama to localhost. Never expose port 11434 to the network unless you've added authentication (Ollama has none built-in). The default OLLAMA_HOST=127.0.0.1:11434 is correct.

2. Firewall the inference port. Belt and suspenders:

sudo ufw deny 11434

3. Isolate agent workspaces. OpenClaw agents can read and write files. Run them in a dedicated user account or container with limited filesystem access.

4. Monitor resource usage. A runaway agent loop can peg your GPU at 100% for hours. Set up alerts for sustained high utilization.

5. Update regularly. Both Ollama and the models receive security patches. Re-running ollama pull on a model tag fetches its latest version.

Cost Comparison: Local vs Cloud

For an agent running 8 hours/day, processing ~500K tokens daily:

| | Cloud (GPT-4o) | Cloud (Claude Sonnet) | Local (Ollama + RTX 4090) |
|---|---|---|---|
| Monthly token cost | ~$75 | ~$45 | $0 |
| Hardware amortized (36 mo) | $0 | $0 | ~$50/mo |
| Electricity | $0 | $0 | ~$15/mo |
| Total monthly | ~$75 | ~$45 | ~$65 |

The local setup breaks even against GPT-4o in about 30 months at these rates (roughly $1,800 of hardware recouped at $60/month of net savings). Against cheaper cloud models, it takes longer. The real value isn't cost savings — it's zero rate limits, zero downtime dependency, and complete data privacy. If your agent processes sensitive code, financial data, or personal information, running locally isn't just cheaper; it's the only responsible option.
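The break-even arithmetic follows from the table's own figures: ~$1,800 of hardware up front (the $50/mo amortized over 36 months) and $15/mo of electricity against the cloud bill. A sketch using those assumptions:

```python
def breakeven_month(cloud_monthly: float, hardware_cost: float, local_monthly: float) -> int:
    """First month where cumulative cloud spend catches up to hardware + running costs."""
    month = 0
    while cloud_monthly * month < hardware_cost + local_monthly * month:
        month += 1
    return month

# GPT-4o at ~$75/mo vs ~$1,800 of hardware plus ~$15/mo electricity
print(breakeven_month(75, 1800, 15))  # 30 months
```

Plug in your own token volumes and hardware quote; at Claude Sonnet's ~$45/mo the same build takes 60 months to pay off, which is why privacy and uptime, not cost, are the stronger arguments.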

Here's the full stack we recommend for a reliable local AI agent setup:

| Component | Choice | Why |
|---|---|---|
| Inference | Ollama | Mature, fast, great model library |
| Agent Runtime | OpenClaw | Context engineering, tool calling, memory |
| Primary Model | Qwen3 8B Q4_K_M | Best tool calling in class |
| Reasoning Model | Qwen3 32B Q4_K_M | For complex multi-step tasks |
| Embeddings | nomic-embed-text | Fast, good quality, runs on CPU |
| GPU | RTX 4090 | 24GB VRAM sweet spot for 8B-32B models |
| OS | Ubuntu 24.04 LTS | Best NVIDIA driver support |

What's Next

Once your local stack is running, explore:

  • Multi-agent orchestration — run supervisor + worker patterns entirely on local hardware
  • Agent memory systems — persistent context that survives across sessions
  • Prompt caching — while you're not paying per-token locally, caching still speeds up prefill significantly
  • RAG pipelines with local embeddings — keep your documents searchable without sending them to an API

The local AI agent stack in 2026 is production-ready. Not "works if you squint" — genuinely reliable for daily use. The models are good enough, the tooling is mature enough, and the hardware is affordable enough. The only question is whether your use case justifies running your own infrastructure versus paying for cloud convenience.

For most developers and small teams, the answer is increasingly: yes.


*Disclosure: Links above are affiliate links. ToolHalla may earn a commission at no extra cost to you. We only recommend hardware we'd actually use.*

*Building AI agents? Start with our context engineering guide for the architectural foundations, or jump straight to building a coding agent for a hands-on tutorial.*

FAQ

What are the main types of memory in AI agents?

Four types: (1) In-context memory — the active conversation window; (2) External memory — vector databases and key-value stores; (3) Episodic memory — logs of past interactions; (4) Semantic memory — facts extracted from past experiences.

What is the difference between short-term and long-term memory in AI agents?

Short-term memory is the active context window. Long-term memory is persisted storage (vector DBs, SQL, files) that survives session resets. Effective agents move important info to long-term memory before the context fills.

How do AI agents remember things between sessions?

Agents persist memory by writing key facts to external storage at session end: (1) Summarize and store as vector embedding; (2) Extract entities to a knowledge graph; (3) Store structured facts to a database.

What is a vector database used for in AI agents?

Vector databases store embeddings for semantic retrieval. Agents query: 'what do I know about this topic?' and get relevant stored memories ranked by similarity. Qdrant, ChromaDB, and Pinecone are common choices.

How much memory does an AI agent actually need?

Depends on scope. Conversational agents: mainly session context. Research agents: deep document retrieval via vector DB. Personal assistants: long-term episodic memory for user preferences and past decisions.

