
4 Ways Your AI Agent Context Window Fails (And How to Fix Them)

March 16, 2026·11 min read·2,288 words

Your AI agent works perfectly for ten turns. By turn thirty, it's calling the wrong tools, repeating actions, and making decisions based on information that doesn't exist. You haven't changed the prompt. The model hasn't changed. But the output is measurably worse.

This isn't random. Every agent fails in one of four specific, predictable ways — and each has a concrete fix. Drew Breunig identified these four failure modes in June 2025, and subsequent research from Anthropic, Google DeepMind, and Berkeley has confirmed and quantified each one.

If you're building AI agents — on the API, with frameworks like LangChain or CrewAI, or locally with Ollama — you will hit these failures. This guide explains how to diagnose each one and what to do about it.

For the broader framework behind these fixes, see our Complete Guide to Context Engineering.


Failure Mode 1: Context Poisoning

What It Looks Like

The agent confidently acts on information that is completely wrong — and keeps doing it. Not once, but for every subsequent turn. It invented a fact, wrote it into its own history, and now treats it as ground truth.

Why It Happens

LLMs don't distinguish between information *you* provided and information *they* generated. Once a hallucinated fact enters the conversation history, it sits alongside verified data with equal weight. The model has no mechanism to flag its own prior output as uncertain.

This is uniquely dangerous for agents because they loop: act → observe → plan → act. One bad observation propagates through every future planning step.

The Research

Google DeepMind documented this extensively in the Gemini 2.5 technical report. Their Pokémon-playing agent hallucinated a game state early in a session. That hallucination propagated into the goals section of its context. The result: *"nonsensical strategies and repeated behaviors in pursuit of a goal that cannot be met."* The agent didn't recover — it couldn't, because the poisoned context was indistinguishable from valid context.

The OpenAI Cookbook notes the same pattern in tool-using agents: when an API call returns malformed data and the agent misinterprets it, the wrong interpretation persists across every subsequent reasoning step.

How to Fix It

1. Validate critical outputs. After any tool call that returns external data, add a lightweight verification step. This can be as simple as a second LLM call that checks: "Does this output look reasonable given the input?" For local setups, a fast model like Qwen 3 8B can validate outputs from a larger reasoning model. A minimal sketch of this check follows the list.

2. Use structured state, not raw history. Instead of letting the agent reason from its full conversation history, maintain a structured state object — a scratchpad or MEMORY.md — that gets explicitly updated. When facts change, the old version gets overwritten, not appended.


## Agent State (updated turn 14)
- Target: repository security audit
- Found: SQL injection in auth.py line 42 (VERIFIED via grep)
- Status: patch drafted, tests passing
- Blocked: awaiting review approval

3. Context quarantine for external data. Process untrusted inputs (API responses, scraped web content, user-uploaded documents) in an isolated sub-agent call. The quarantine agent validates, cleans, and summarizes — only the verified summary enters the main context.

4. Implement rollback points. Periodically snapshot the agent's state. When you detect degraded output quality, roll back to the last known-good state rather than trying to debug forward through poisoned context.
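
To make the first fix concrete, here's a minimal sketch of the validation pass, assuming an OpenAI-compatible endpoint (Ollama exposes one at the URL shown). The model name, prompt wording, and the validate_tool_output helper are illustrative, not a prescribed API.

# Fix 1 sketch: a small, fast model sanity-checks a tool result before it is
# written into the agent's state. Endpoint, model, and prompt are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def validate_tool_output(tool_name: str, tool_input: str, tool_output: str) -> bool:
    """Return True if the output looks plausible for the given input."""
    resp = client.chat.completions.create(
        model="qwen3:8b",  # any fast local model can play validator
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Tool: {tool_name}\nInput: {tool_input}\nOutput: {tool_output}\n"
                "Does this output look plausible and well-formed for this input? "
                "Answer YES or NO."
            ),
        }],
    )
    return "YES" in resp.choices[0].message.content.upper()

Results that fail the check get retried or marked [UNVERIFIED] rather than entering the state document as ground truth.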


Failure Mode 2: Context Distraction

What It Looks Like

The agent stops being creative. It repeats strategies it already tried. It copies patterns from its recent history instead of reasoning from its training. A coding agent that was writing clean solutions early in the session starts producing cookie-cutter copies of its own prior code — even when the new task is fundamentally different.

Why It Happens

As conversation history grows, it becomes the dominant signal in the context window. The model's training-time knowledge — everything it learned during pre-training — gets drowned out by the sheer volume of in-context examples. The model over-indexes on "what I've been doing" rather than "what I know how to do."

The Research

A Databricks study measured this precisely: model correctness begins falling around 32K tokens for Llama 3.1 405B, and the drop-off comes even earlier for smaller models. This isn't the context window limit — it's a *distraction ceiling* well below the technical maximum.

Chroma Research's context rot study confirmed the pattern across 18 models: accuracy drops of 20–50% between 10K and 100K tokens. The degradation isn't linear — it often hits a cliff around 32K–64K tokens. We covered the full data in our context rot deep dive.

Google DeepMind observed it in practice: beyond ~100K tokens, their Gemini agent *"showed a tendency toward repeating actions from its vast history rather than synthesizing novel plans."*

How to Fix It

1. Aggressive summarization on a schedule. Set a compaction threshold — 75% of your context window for local models, 50% for API models where you want peak performance. When you hit it, summarize older history into a structured brief and continue with fresh context.


Compaction at turn 25:
[SUMMARY of turns 1-20]
- Analyzed 3 repos, found 2 vulnerabilities
- CVE-2026-1234 patched and tested
- CVE-2026-5678 requires upstream fix, ticket filed
[FULL HISTORY of turns 21-25]

2. Fresh-context delegation. When the agent needs to do something genuinely new (different task type, different domain), delegate to a sub-agent with a clean context. Pass only the relevant output back. Three agents with 8K of focused context each will outperform one agent drowning in 100K tokens.

3. Periodic re-grounding. Every N turns, re-inject the original system prompt and task description at high priority. This reminds the model of its core instructions and counteracts the gravitational pull of accumulated history.

4. Monitor token count, not just turn count. Tool calls with large outputs can push you past the distraction ceiling in just a few turns. Track cumulative tokens, not just conversation length; the sketch below combines this check with the compaction threshold from step 1.
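
Here is a minimal sketch of that trigger, assuming an OpenAI-style message list. The four-characters-per-token estimate is a rough stand-in for your model's tokenizer, and summarize is a caller-supplied function that makes the actual LLM summarization call.

# Fix items 1 and 4 combined: track cumulative tokens and compact once the
# history crosses the threshold. Numbers follow the guidance above.
CONTEXT_WINDOW = 32_768          # e.g. a local 32K model
COMPACTION_THRESHOLD = 0.75      # 75% for local models, ~50% for API models

def estimate_tokens(messages: list[dict]) -> int:
    # Rough approximation; swap in your model's tokenizer for real counts.
    return sum(len(m["content"]) // 4 for m in messages)

def maybe_compact(messages: list[dict], summarize) -> list[dict]:
    """Fold older turns into a structured summary once over the threshold."""
    if estimate_tokens(messages) < CONTEXT_WINDOW * COMPACTION_THRESHOLD:
        return messages
    system, history = messages[0], messages[1:]
    if len(history) <= 5:
        return messages                       # nothing old enough to fold away
    summary = summarize(history[:-5])         # caller-supplied LLM call
    brief = {"role": "user", "content": f"[SUMMARY of earlier turns]\n{summary}"}
    return [system, brief] + history[-5:]     # recent turns stay verbatim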


Failure Mode 3: Context Confusion

What It Looks Like

The agent calls tools it shouldn't. It brings in information from a different task. It answers a question about database performance by referencing a completely unrelated API spec that happened to be in context. The agent isn't wrong about the facts — it's wrong about which facts are relevant.

Why It Happens

Transformer attention is indiscriminate — every token attends to every other token. The model can't ignore information in its context window. Irrelevant content doesn't just waste space; it actively interferes with reasoning by competing for the model's attention budget.

The Research

The Berkeley Function-Calling Leaderboard proved this systematically. Every model they tested performed worse when given more tools — even when the extra tools were irrelevant to the task. The results were striking: a quantized Llama 3.1 8B failed a benchmark when given 46 tools that it solved easily with only 19. Same model, same task, same context window. Fewer distractions, dramatically better results.

The "Less is More" paper quantified the impact: using an LLM-powered tool recommender to dynamically select relevant tools improved Llama 3.1 8B performance by 44%. Even when dynamic selection didn't improve accuracy, it reduced power consumption by 18% and improved speed by 77%.

How to Fix It

1. Dynamic tool loading. Don't register all your tools at agent startup. Store tool definitions in a vector database and retrieve only the 5–8 most relevant ones per turn. The RAG MCP paper describes this pattern in detail; a concrete sketch with a local vector store follows this list.


# Instead of loading all 50 MCP tools:
relevant_tools = tool_index.query(
    current_task_description,
    top_k=6
)
agent.set_tools(relevant_tools)

2. Context sectioning with clear boundaries. Structure your context into labeled sections — ## TASK, ## TOOLS, ## REFERENCE DATA, ## HISTORY. Models attend to structure. Clear boundaries reduce cross-contamination between sections.

3. Remove completed sub-task context. When a sub-task finishes, strip its working data from context and keep only the result. The intermediate steps of "researching competitor pricing" don't need to persist when the agent moves on to "drafting the report."

4. Tool description hygiene. Write tool descriptions that are specific and non-overlapping. Vague descriptions like "searches for information" cause confusion. Specific descriptions like "queries the PostgreSQL product database by SKU or name, returns price and stock count" give the model clear boundaries.

If you're using MCP servers, audit the tool descriptions they expose. Third-party servers often have vague or overlapping definitions that need tightening.
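
Here's one way the dynamic-loading pattern might look with a local vector store (Chroma in this sketch, though any embedding index works). The tool names and descriptions are made up for illustration.

# Fix 1 sketch: index tool descriptions once, then retrieve only the relevant
# handful per turn instead of registering all 50 tools up front.
import chromadb

chroma_client = chromadb.Client()
tool_index = chroma_client.create_collection(name="tools")

tool_index.add(
    ids=["query_product_db", "send_slack_message", "run_security_scan"],
    documents=[
        "Queries the PostgreSQL product database by SKU or name, returns price and stock count.",
        "Posts a message to a Slack channel. Requires channel ID and message text.",
        "Runs a static security scan over a repository and returns findings.",
    ],
)

# Per turn: look up the tools that match the current task description.
results = tool_index.query(
    query_texts=["audit auth.py for SQL injection"],
    n_results=2,
)
active_tool_ids = results["ids"][0]   # pass only these definitions to the agent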


Failure Mode 4: Context Clash

What It Looks Like

The agent contradicts itself. It says the deployment succeeded in one paragraph and lists the failure errors in the next. It recommends option A, then switches to option B mid-response with no explanation. The reasoning feels incoherent — because it is.

Why It Happens

Real agent workflows gather information incrementally. Data arrives across multiple turns from multiple sources — tool calls, document retrieval, sub-agent reports, user corrections. This information frequently disagrees with itself. When conflicting facts coexist in context, the model can't reliably resolve them. Worse: early wrong answers create strong priors that resist correction by later, accurate information.

The Research

A Microsoft and Salesforce study demonstrated this with surgical precision. They took standard benchmarks and split ("sharded") the same information across multiple conversational turns — mimicking how real conversations work. The result: an average accuracy drop of 39% across all models tested. OpenAI's o3 dropped from 98.1 to 64.1.

The mechanism: when information arrives piecemeal, the model forms preliminary conclusions in early turns. These preliminary answers remain in context and act as strong anchors. When later turns deliver the correct (or complete) information, the model struggles to override its own prior reasoning. *"When LLMs take a wrong turn in a conversation, they get lost and do not recover."*

How to Fix It

1. Single-source-of-truth pattern. Maintain one authoritative state document that gets *overwritten*, not appended to. When new information arrives, update the document — replacing the old version. The agent always reads the latest version, never a contradictory history. A minimal sketch of this pattern follows the list.

2. Reconciliation before injection. When gathering data from multiple sources, don't inject all of it raw into context. Run a reconciliation step first: a focused LLM call whose only job is to identify contradictions and produce a single coherent summary.


## Reconciled findings (3 sources checked)
- Pricing: $0.50/1M tokens (confirmed by API docs + billing page)
- Rate limit: 100 RPM (API docs say 100, changelog says 120 — using API docs as authoritative)
- Context window: 128K (consistent across all sources)

3. Explicit uncertainty markers. When information is preliminary or unverified, mark it clearly in context: [UNVERIFIED], [PRELIMINARY — awaiting confirmation]. This gives the model a textual signal to weight the information lower.

4. Late-turn summarization. Before the agent produces its final output, insert a summarization step that synthesizes all gathered information into a clean, non-contradictory brief. The final answer draws from the summary, not from the messy accumulated history.
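
A minimal sketch of the single-source-of-truth state from fix 1, with the uncertainty markers from fix 3 folded in. The file path, field names, and helper functions are illustrative; the point is that facts are overwritten in place, never appended.

# Fix items 1 and 3: one state document, overwritten on every update, with
# unverified values explicitly flagged. Names here are illustrative.
from pathlib import Path

STATE_FILE = Path("MEMORY.md")
state = {"pricing_usd_per_1m_tokens": 0.50, "rate_limit_rpm": 100}

def update_fact(key: str, value, verified: bool = False) -> None:
    """Overwrite (never append) a fact; unverified values are marked as such."""
    state[key] = value if verified else f"[UNVERIFIED] {value}"
    render()

def render() -> None:
    # The agent reads this file each turn; only the latest version of each fact exists.
    lines = ["## Agent State"] + [f"- {k}: {v}" for k, v in state.items()]
    STATE_FILE.write_text("\n".join(lines))

update_fact("rate_limit_rpm", 120)                  # changelog claims 120, marked unverified
update_fact("rate_limit_rpm", 100, verified=True)   # API docs confirm 100, old value replaced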


A Diagnostic Checklist

When your agent degrades, use this to identify which failure mode you're hitting:

| Symptom | Likely Failure | First Fix |
| --- | --- | --- |
| Agent acts on invented facts | Poisoning | Add output validation |
| Agent repeats old strategies | Distraction | Summarize + compact history |
| Agent uses wrong tools / irrelevant data | Confusion | Reduce tool count, section context |
| Agent contradicts itself | Clash | Reconcile sources, use single state doc |
| All of the above after turn 30+ | Context rot | Full context engineering overhaul |

For local LLM setups on RTX 4090 or similar hardware, these failures hit earlier and harder. A 14B model at 32K context has no margin for sloppy context management. The upside: fixing these issues also makes local inference faster, because smaller contexts generate tokens faster.


FAQ

What are the 4 context window failure modes for AI agents?

The four modes are context poisoning (hallucinations compound as ground truth), context distraction (history overwhelms training knowledge), context confusion (irrelevant information interferes with decisions), and context clash (contradictory data causes incoherent reasoning). Each has distinct symptoms and fixes.

Why does my AI agent get worse after many conversation turns?

Two mechanisms: context rot causes accuracy to degrade as the context window fills — Chroma Research measured 20–50% drops across 18 models. Additionally, accumulated history creates distraction, pulling the model toward repeating past patterns instead of reasoning fresh.

How many tools should I give my AI agent?

Research from the Berkeley Function-Calling Leaderboard shows every model performs worse with more tools. Keep 5–8 active tools with clear, non-overlapping descriptions. If you need more, use dynamic tool loading — store definitions in a vector database and retrieve relevant ones per task.

How do I fix context poisoning in AI agents?

Validate critical outputs with a second check (can be a fast, small model). Use structured state documents that get overwritten rather than appended. Quarantine untrusted external data in isolated sub-agent calls before injecting results into the main context.

Does context window size matter for agent reliability?

Yes, but not how you'd think. Bigger windows don't prevent failures — they delay them. A 1M-token window still degrades past 32K–64K tokens. The fix is context engineering: managing what enters, stays, and leaves the context regardless of window size.

Can local LLMs handle long agent workflows?

Yes, with disciplined context management. Compact at 75% of window capacity, use MEMORY.md for persistent state, keep tools minimal, and delegate sub-tasks to fresh contexts. A well-managed 32K context outperforms a bloated 128K context on agent tasks.


*This article is part of the Context Engineering content cluster. Next: how to build a practical memory system for AI agents using MEMORY.md, RAG, and structured state.*

*Sources: Drew Breunig — How Contexts Fail · Chroma Research — Context Rot · Anthropic — Context Engineering · Berkeley Function-Calling Leaderboard · Microsoft/Salesforce — Sharded Benchmarks · OpenAI Cookbook — Building Agents*



