Context Engineering for AI Agents: The Complete Guide (2026)
Prompt engineering was about finding the right words. Context engineering is about curating the right *information* — at the right time, in the right amount, for every step an AI agent takes.
Andrej Karpathy called it *"the delicate art and science of filling the context window with just the right information for the next step."* By 2026, Anthropic, LangChain, and Cognition have all published detailed frameworks for it. Context engineering has become the single most important skill for building capable AI agents — whether you're running Claude on the API or Qwen 3 locally with Ollama.
This guide covers what context engineering is, the four failure modes that break agents, and the practical patterns to fix them.
What Is Context Engineering?
Every time you call an LLM, you send a context window — the full set of tokens the model sees. That includes system prompts, tool definitions, conversation history, retrieved documents, and agent state (memory, scratchpads, goals).
Prompt engineering focuses on the instruction text. Context engineering focuses on all of it — continuously deciding what enters the context, what stays, and what gets removed or compressed.
Think of it like computer architecture. The LLM is the CPU. The context window is RAM. Context engineering is the operating system — managing what fits in working memory so the processor can do its best work.
Why Not Just Use the Whole Context Window?
Modern models support 128K–1M+ tokens. The temptation: throw everything in and let the model sort it out. This fails for three reasons:
1. Context rot. Chroma Research showed that every LLM degrades as context grows, often hitting a cliff around 32K-64K tokens. A focused 20K-token context outperforms a bloated 200K-token context on virtually every task.
2. Attention budget. Transformers create n² pairwise relationships between tokens. At 100K tokens, that's 10 billion relationships. Each gets proportionally less attention.
3. Cost and speed. More tokens mean higher API bills. For local inference, they mean slower generation — measurably so on an RTX 4090 running Qwen 3 14B.
Context engineering exists because what you leave out matters as much as what you put in.
The 4 Context Failure Modes
Drew Breunig identified four specific failure modes that break AI agents.
1. Context Poisoning
A hallucination enters the context and gets treated as ground truth. Google DeepMind documented this with their Gemini Pokémon agent — one hallucinated game state propagated into goals and summaries, causing nonsensical strategies that persisted for the entire session.
This is especially dangerous for local LLMs, which hallucinate more often. Without validation, errors cascade through every turn.
2. Context Distraction
Accumulated information overwhelms training-time knowledge. A Databricks study found accuracy drops around 32K tokens for Llama 3.1 405B — earlier for smaller models. The Gemini agent started repeating past actions from history beyond ~100K tokens, literally forgetting how to play because its history was louder than its training.
For local models (32K–128K windows), distraction hits after 15-20 complex tool interactions.
3. Context Confusion
Irrelevant information influences decisions. The Berkeley Function-Calling Leaderboard shows that every model performs worse with more tools. A quantized Llama 3.1 8B given 46 tools failed a task it solved easily with 19 — even though everything fit in context. The model *must* attend to everything you put in.
4. Context Clash
Different parts of the context disagree. A Microsoft/Salesforce team showed that splitting the same benchmark across multiple turns caused a 39% accuracy drop; OpenAI's o3 fell from 98.1 to 64.1. Premature answers remain in context and act as strong priors for the final answer — *"When LLMs take a wrong turn, they get lost and do not recover."*
The 4 Context Engineering Operations
The industry has converged on four core strategies. Each maps directly to one or more failure modes.
1. Write — Persist Information Outside the Context
Write important information to external storage; retrieve it on demand.
Why it works: The context window doesn't grow unboundedly. Critical state survives context resets. And the act of writing forces the agent to synthesize — which itself improves quality.
Patterns:
- MEMORY.md scratchpads. The agent maintains a structured file — goals, observations, decisions, blockers — and reads it at the start of each session. Claude Code does this extensively, tracking progress and recording decisions across complex tasks. We use the same pattern at BerserKI with OpenClaw agents, where each agent maintains its own accumulated knowledge base.
- Structured task logs. For multi-step tasks, write each completed step and its result to a log file. The agent reads only the log summary instead of keeping the full tool output history in context.
- Memory databases. Mem0, Letta, or Redis for semantic memory queries at scale. More powerful than flat files when your agent accumulates hundreds of facts. We cover architectural patterns for this in depth in How to Build Agent Memory That Actually Works.
Even a 7B model can maintain useful memory files — the key is keeping them short and structured.
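Here's a minimal sketch of the scratchpad pattern in Python. The section layout and helper names are our illustration — Claude Code's internals differ — but the core loop is the same: write decisions out, read them back in at session start.

```python
# memory_scratchpad.py — minimal MEMORY.md write/read loop.
# The goals/observations/decisions/blockers layout and helper
# names are illustrative assumptions, not any tool's internals.
from datetime import date
from pathlib import Path

MEMORY = Path("MEMORY.md")

def read_memory() -> str:
    """Load the scratchpad at session start; empty string on first run."""
    return MEMORY.read_text() if MEMORY.exists() else ""

def append_entry(section: str, note: str) -> None:
    """Append a dated bullet under a section header, creating it if needed."""
    text = read_memory()
    header = f"## {section}"
    line = f"- {date.today().isoformat()}: {note}"
    if header in text:
        # Insert the new bullet right under the existing section header.
        text = text.replace(header, f"{header}\n{line}", 1)
    else:
        text += f"\n{header}\n{line}\n"
    MEMORY.write_text(text)

# Usage: persist a decision outside the context window, then prepend
# read_memory() to the system prompt on the next session.
append_entry("Decisions", "Use ChromaDB for retrieval; flat files too slow at 500+ docs.")
```

The act of writing the note is itself the synthesis step — the agent has to compress its reasoning into one line before it can persist it.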
2. Select — Retrieve Only What's Relevant
Don't pre-load everything. Pull data just-in-time so you pay the token cost only when information is actually needed.
Patterns:
- Tool-based retrieval. Give the agent grep, file-read, and search tools. It decides what to load based on the current task. Claude Code uses this — writing targeted queries and using head/tail to extract specific sections rather than loading entire files.
- RAG (Retrieval-Augmented Generation). Embed your knowledge base into ChromaDB or Qdrant and retrieve relevant chunks at query time. ChromaDB runs entirely on-device with zero API cost, pairing well with Ollama for fully local setups.
- Progressive disclosure. Let the agent explore incrementally. File names suggest relevance. Directory structure hints at organization. The agent builds understanding layer by layer.
- Hybrid retrieval. Drop a small, always-relevant context file (like CLAUDE.md or AGENTS.md) into the system prompt for speed, and use RAG for everything else. This is the approach Claude Code and OpenClaw both take.
The difference between giving a 14B model 50K tokens of noise and 5K tokens of signal is night and day.
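A local RAG pipeline for the Select step is only a few lines. This sketch assumes the chromadb and ollama Python clients are installed and nomic-embed-text has been pulled locally; the collection name, chunking, and query are placeholders:

```python
# local_rag.py — just-in-time retrieval with ChromaDB + Ollama embeddings.
# Assumptions: `pip install chromadb ollama` and a locally pulled
# `nomic-embed-text` embedding model.
import chromadb
import ollama

client = chromadb.PersistentClient(path="./kb")
collection = client.get_or_create_collection(name="agent_kb")

def embed(text: str) -> list[float]:
    """Embed a chunk on-device via Ollama — zero API cost."""
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

def index(doc_id: str, chunk: str) -> None:
    """Add one chunk to the local vector store."""
    collection.add(ids=[doc_id], documents=[chunk], embeddings=[embed(chunk)])

def select(query: str, k: int = 4) -> list[str]:
    """Return only the k most relevant chunks — signal, not whole files."""
    hits = collection.query(query_embeddings=[embed(query)], n_results=k)
    return hits["documents"][0]

# The agent pays the token cost only for what the task actually needs:
context_chunks = select("How does the auth middleware validate tokens?")
```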
3. Compress — Reduce What's Already There
When context grows too large, compress without losing critical information. Without compression, every agent eventually hits the limit or degrades below useful accuracy.
Patterns:
- Conversation compaction. Summarize the conversation history and reinitiate with the compressed summary. Claude Code preserves architectural decisions and unresolved bugs while discarding redundant tool outputs. After compaction, it continues with the summary plus the five most recently accessed files.
- Tool result clearing. Replace old tool outputs with brief summaries or remove entirely. Anthropic launched this as a platform feature — it's one of the safest forms of compression because old results are rarely re-referenced.
- Sliding window with anchors. Keep system prompt + last N turns; summarize everything between. "Anchors" are turns with critical decisions or state changes that survive even aggressive compression.
- Incremental summarization. Instead of compressing everything at once, maintain a running summary updated every K turns. This avoids the quality loss of compressing a very long conversation in one shot.
For 32K local models: Set a compaction threshold at 75% usage (~24K tokens). Summarize into a structured brief (goals, progress, blockers, key findings), reinitiate with brief + system prompt + last 3 turns, write the full detailed state to MEMORY.md for reference. This lets a 32K model handle tasks that would otherwise require 200K+ tokens of history.
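Here's that compaction loop as a sketch. The chars/4 token estimate and the summary prompt are assumptions — swap in your model's tokenizer and tune the prompt to what your tasks actually need preserved:

```python
# compaction.py — 75% compaction threshold for a 32K local model.
# Assumptions: the Ollama Python client, a rough chars/4 token
# heuristic, and `qwen3:14b` as the summarizer.
import ollama

WINDOW = 32_000
THRESHOLD = int(WINDOW * 0.75)  # compact at ~24K tokens

def estimate_tokens(messages: list[dict]) -> int:
    return sum(len(m["content"]) for m in messages) // 4

def compact(messages: list[dict]) -> list[dict]:
    """Summarize everything except the system prompt + last 3 messages."""
    if estimate_tokens(messages) < THRESHOLD:
        return messages
    system, middle, tail = messages[0], messages[1:-3], messages[-3:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in middle)
    summary = ollama.chat(
        model="qwen3:14b",
        messages=[{
            "role": "user",
            "content": "Compress this agent transcript into a structured brief "
                       "(goals, progress, blockers, key findings):\n" + transcript,
        }],
    )["message"]["content"]
    # Full detail goes to MEMORY.md (see the Write pattern);
    # the context keeps only the brief plus the recent turns.
    return [system, {"role": "user", "content": f"[Compacted history]\n{summary}"}, *tail]
```

Keeping the last three turns verbatim matters: the model needs the immediate exchange intact, not a paraphrase of it.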
4. Isolate — Split Work Across Separate Contexts
Use multiple agents with clean, focused context windows instead of one agent trying to hold everything. This is the nuclear option against all four failure modes.
Patterns:
- Sub-agent delegation. A coordinator agent delegates focused tasks to specialized sub-agents, each with its own clean context. The sub-agent processes, returns a summary, and its full context is discarded. The parent's context stays lean.
- Fan-out/fan-in. For research tasks, spin up multiple agents in parallel to investigate different aspects. Each works independently, then a coordinator synthesizes. Anthropic's multi-agent research system works this way — though Cognition cautions against parallel agents for tightly coupled coding work, where their decisions tend to conflict.
- Tool-scoped isolation. When using complex tools that return large outputs (web search, code analysis), make the tool call in an isolated inference. Return only a compressed result to the main agent.
- Context quarantines. Process potentially conflicting external data in an isolated call first. The quarantine agent evaluates, reconciles, and returns a clean synthesis — keeping contradictions out of the main context.
Three Qwen 3 14B agents with 8K of focused context each will outperform a single 70B model drowning in 100K tokens. OpenClaw makes this practical — run a coordinator on a smarter model and delegate subtasks to faster, smaller models via Ollama.
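A sketch of that delegation pattern over Ollama — the model names and the summary-only protocol are our assumptions, but the key property is structural: each sub-agent's full context dies with it, and only a compressed result reaches the coordinator.

```python
# delegate.py — sub-agent delegation with clean, isolated contexts.
# Assumptions: the Ollama Python client and locally pulled
# `qwen3:14b` / `qwen3:32b` models; the protocol is illustrative.
import ollama

def run_subagent(task: str, model: str = "qwen3:14b") -> str:
    """Each sub-agent starts from a fresh context and returns only a summary."""
    response = ollama.chat(model=model, messages=[
        {"role": "system", "content": "Complete the task. Reply with a concise summary only."},
        {"role": "user", "content": task},
    ])
    return response["message"]["content"]  # full sub-context is discarded here

def coordinate(goal: str, subtasks: list[str]) -> str:
    """Fan out to focused sub-agents on small models, fan in through a smarter one."""
    findings = [run_subagent(t) for t in subtasks]  # could run in parallel
    return run_subagent(
        f"Goal: {goal}\nSynthesize these findings:\n" + "\n".join(findings),
        model="qwen3:32b",  # the coordinator only ever sees compressed results
    )
```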
Context Engineering vs Prompt Engineering
| | Prompt Engineering | Context Engineering |
|---|---|---|
| Focus | The instruction text | Everything the model sees |
| When | Before deployment | Continuously at runtime |
| Static/Dynamic | Mostly static | Always dynamic |
| Scale | Single interaction | Multi-turn, multi-tool, multi-source |
| Key skill | Writing clearly | Systems design |
Prompt engineering doesn't disappear — it becomes 3% of the token budget, not 100% of the discipline.
Anti-Patterns to Avoid
- Dumping entire files into context. Retrieve the specific lines you need.
- No summarization strategy. Every agent running 10+ turns needs compaction.
- Too many tool definitions. Curate 5-8 tools max; load others dynamically.
- Ignoring early errors. Context poisoning compounds — validate critical facts.
- System prompts covering every edge case. Aim for principles, not scripts.
- Treating window size as a target. A 1M-token window is capacity, not an instruction to fill it.
Token Budget: Practical Framework
128K context window:
| Component | Budget | % |
|---|---|---|
| System prompt + rules | 3,000–4,000 | 3% |
| Tool definitions (5-8) | 2,000–3,000 | 2% |
| Few-shot examples | 3,000–5,000 | 3% |
| Retrieved context (RAG) | 15,000–20,000 | 15% |
| Conversation history (compacted) | 25,000–35,000 | 25% |
| Working memory | 30,000–40,000 | 28% |
| Safety buffer | 25,000–30,000 | 22% |
32K local model:
| Component | Budget | % |
|---|---|---|
| System prompt | 1,500 | 5% |
| Tools (3-5) | 1,000 | 3% |
| Retrieved context | 5,000 | 16% |
| History (compacted) | 8,000 | 25% |
| Working memory | 10,000 | 31% |
| Buffer | 6,500 | 20% |
The buffer isn't wasted — it's insurance against unexpected tool outputs.
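You can enforce an allocation like this mechanically. A rough sketch, using a chars/4 token estimate (swap in a real tokenizer) and the component names from the 32K table:

```python
# budget.py — a simple guard for the 32K allocation above.
# Assumptions: the chars/4 estimate is a stand-in for a real tokenizer;
# component names mirror the table.
BUDGET_32K = {
    "system": 1_500, "tools": 1_000, "retrieved": 5_000,
    "history": 8_000, "working": 10_000, "buffer": 6_500,
}

class ContextBudget:
    def __init__(self, allocations: dict[str, int]):
        self.allocations = allocations
        self.used = {name: 0 for name in allocations}

    def try_add(self, component: str, text: str) -> bool:
        """Admit text only while the component's allocation has room."""
        cost = len(text) // 4  # rough chars-per-token estimate
        if self.used[component] + cost > self.allocations[component]:
            return False  # caller must select harder or compress first
        self.used[component] += cost
        return True

budget = ContextBudget(BUDGET_32K)
assert sum(BUDGET_32K.values()) == 32_000  # allocations cover the window exactly
```

Rejecting an addition is a signal, not an error — it tells the agent to retrieve less or compact before continuing.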
Local LLMs: What Changes
Running agents with Ollama? The principles apply, but constraints are tighter:
- Smaller windows demand better engineering. 32K–128K is usable, but you can't be lazy. These techniques are requirements, not nice-to-haves.
- Smaller context = faster inference. A focused 8K context generates noticeably faster than 30K on the same hardware. Prompt caching amplifies this further.
- Write aggressively. MEMORY.md, task logs, structured notes — keep the window free.
- Local RAG is cheap. ChromaDB + Ollama embeddings add negligible overhead for dramatically better context management.
- Compaction is non-negotiable at 32K. You'll hit the wall after 10-15 complex tool interactions.
FAQ
What is context engineering and how is it different from prompt engineering?
Context engineering manages all information entering an LLM's context window — system prompts, tool definitions, history, documents, and memory. Prompt engineering focuses only on instructions. Context engineering is continuous and dynamic, running at every inference step.
Why does my AI agent get worse after many turns?
Context rot — LLM accuracy degrades as context length increases. Accumulated history overwhelms the attention mechanism. The fix: compaction, summarization, and external memory.
What are the four context failure modes?
Context poisoning (hallucinations compound), distraction (history overwhelms training), confusion (irrelevant info influences decisions), and clash (contradictory sources cause reasoning errors).
How do I manage context on local LLMs with Ollama?
Use MEMORY.md for state persistence. Compact at 75% usage. Keep tools minimal (3-5). Use local RAG with ChromaDB. Prefer Write + Select over keeping history in context.
Is RAG still useful with million-token windows?
Yes. Focused retrieved context consistently outperforms large unfocused dumps. The 2026 approach is hybrid: small pre-loaded core + RAG for everything else.
How many tools should an AI agent have?
5-8 with clear, non-overlapping purposes. More tools = worse performance across all models. Use dynamic tool loading if you need a larger set.
*Sources: Anthropic — Effective Context Engineering · Chroma — Context Rot · Drew Breunig — How Contexts Fail · LangChain — Context Engineering · Cognition — Don't Build Multi-Agents*
Related Articles
- AI Hallucination Guardrails That Actually Work (2026)
- The Reflection Pattern: How AI Agents Self-Correct
Recommended Hardware
- NVIDIA GeForce RTX 5090 GPU — Essential for running powerful AI models locally, providing the necessary computational power to handle large context windows efficiently.
- HP Z8 G5 Workstation — A robust server option that can accommodate multiple GPUs and large amounts of RAM, ideal for developing and deploying AI agents that require significant processing power.
- WD My Cloud EX2 Ultra — Provides reliable network-attached storage (NAS) for managing and retrieving large datasets and documents, crucial for context engineering in AI development.