
Context Rot: Why Your AI Agent Gets Dumber Over Time (And How to Fix It)

March 16, 2026·12 min read·2,497 words

You've built an AI agent. It works brilliantly for the first few tasks. Then, twenty turns into a complex workflow, it starts making bizarre decisions — calling the wrong tools, repeating actions it already tried, hallucinating facts that don't exist. You haven't changed anything. The model hasn't changed. But the output is measurably worse.

You've just experienced context rot.

The term was coined by Chroma Research in their June 2025 study, and it names something every AI engineer has felt but couldn't quantify: as your context window fills up, your model gets dumber. Not gradually. Not gracefully. It degrades in specific, predictable, and often catastrophic ways.

This isn't a theoretical problem. We run a multi-agent system at BerserKI with specialized agents handling research, content, and infrastructure (see our multi-agent orchestration guide for the architecture). Context rot nearly broke our entire pipeline before we learned to manage it. This article covers what context rot actually is, why every LLM suffers from it, and the specific techniques we — and the broader AI engineering community — use to prevent it.

The Data: Every LLM Degrades

Chroma Research tested 18 large language models on needle-in-a-haystack retrieval tasks across varying context lengths. The results were unambiguous:

  • Every single model showed accuracy degradation as context length increased
  • Average accuracy drops ranged from 20% to 50% between 10K and 100K tokens
  • Even the best models (Claude, GPT-4o, Gemini) showed measurable degradation
  • The degradation isn't linear — it often hits a cliff around 32K-64K tokens

Anthropic confirmed this in their context engineering guide: "As the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases." They describe context as a finite resource with diminishing marginal returns — every new token you add depletes the model's "attention budget."

The architectural reason is straightforward: transformers use self-attention where every token attends to every other token. That creates n² pairwise relationships for n tokens. As context grows, the model's ability to capture these relationships gets stretched thin. Models are also trained primarily on shorter sequences, meaning they have fewer specialized parameters for long-range dependencies.

The n² scaling problem is important to understand. At 10K tokens, the model manages 100 million pairwise relationships. At 100K tokens, that jumps to 10 billion. The model doesn't run out of memory — it runs out of *attention precision*. Each individual relationship gets proportionally less focus.

Position encoding techniques like RoPE interpolation allow models to handle longer sequences than they were trained on, but with increasing position uncertainty. The model can still *read* tokens at position 90,000 — it just can't *attend to them as precisely* as tokens at position 5,000.

The practical implication: A million-token context window doesn't mean you should use a million tokens. A focused 20K-token context will outperform a bloated 200K-token context on almost every task. Treat context like RAM in an embedded system — technically available, but budget it like it's scarce.
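
To make the budget concrete, here is a minimal sketch of enforcing one explicitly, using the tiktoken tokenizer as a stand-in for your model's own. The 20K figure and the helper names are illustrative choices, not a standard API:

```python
# Minimal sketch: treat context as a hard budget, not an append-only log.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
CONTEXT_BUDGET = 20_000  # deliberately far below the advertised window

def tokens(text: str) -> int:
    return len(enc.encode(text))

def trim_to_budget(sections: list[str]) -> list[str]:
    # Drop the oldest sections first until the assembled prompt fits.
    kept = list(sections)
    while kept and sum(tokens(s) for s in kept) > CONTEXT_BUDGET:
        kept.pop(0)
    return kept
```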

The 4 Context Failure Modes

Drew Breunig published an excellent taxonomy of context failures that every agent builder should memorize. These aren't abstract categories — they're the exact failure modes you'll encounter when shipping agents in production.

1. Context Poisoning

What happens: A hallucination or error enters the context, and the model keeps referencing it in subsequent turns.

The DeepMind team documented this in the Gemini 2.5 technical report. When their Pokémon-playing agent hallucinated a game state, it poisoned the goals section of its context. The agent then developed "nonsensical strategies and repeated behaviors in pursuit of a goal that cannot be met."

Why it's dangerous for agents: Agents loop. They take actions, observe results, and plan next steps — all within the same context. One hallucinated observation can cascade through dozens of subsequent turns. The model doesn't know the information is wrong; it treats everything in context as ground truth.

Real-world example: We saw this in our own BerserKI system. A research agent hallucinated a URL during web scraping. That bad URL propagated into a summary, which was handed to the content agent, which wrote an article citing a source that didn't exist. Three agents, one poisoned context, one broken output.
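
One cheap defense is to validate tool outputs before they enter shared context. A minimal sketch of that idea: check that any URL a research agent reports actually resolves before passing it downstream. The helper names are ours; `requests` is the only real dependency.

```python
# Sketch: verify URLs before they enter shared context, so a hallucinated
# source fails fast instead of poisoning downstream agents.
import requests

def verify_url(url: str, timeout: float = 5.0) -> bool:
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        return resp.status_code < 400
    except requests.RequestException:
        return False

def sanitize_sources(sources: list[str]) -> list[str]:
    verified = [u for u in sources if verify_url(u)]
    dropped = set(sources) - set(verified)
    if dropped:
        # Surface the failure now; poisoned context is invisible later.
        print(f"Dropped unverifiable sources: {sorted(dropped)}")
    return verified
```

Had a gate like this existed in our pipeline, the bad URL would have failed at agent one instead of surfacing in a published article.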

2. Context Distraction

What happens: The context grows so long that the model over-focuses on its own history, neglecting what it learned during training.

Gemini's Pokémon agent demonstrated this clearly: "As the context grew significantly beyond 100K tokens, the agent showed a tendency toward repeating actions from its vast history rather than synthesizing novel plans."

A Databricks study found that model correctness begins falling around 32K tokens for Llama 3.1 405B, and even earlier for smaller models. The models don't hit their context window limit — they hit a *distraction ceiling* well before that.

Why it matters: If your agent runs for 50+ turns, it's probably past the distraction ceiling. It's no longer reasoning from first principles — it's copying patterns from its own recent history.

3. Context Confusion

What happens: Irrelevant information in the context influences the model's response.

The Berkeley Function-Calling Leaderboard proved this systematically: every model performs worse when given more tools. And when given tools that aren't relevant, models will sometimes call them anyway.

A striking example: quantized Llama 3.1 8B failed a benchmark when given 46 tools, even though the context was well within the window. When given only 19 tools, it succeeded. Same model, same task, fewer distractions — dramatically better results.

The MCP problem: This is why the dream of "connect every MCP server and let the model figure it out" doesn't work. 50 tool definitions in your context isn't power — it's confusion. Every tool description competes for the model's attention budget. We covered MCP's evolving architecture in our Server-Side MCP guide — the protocol is powerful, but it requires disciplined context management to avoid confusion.

4. Context Clash

What happens: Accumulated information in the context contradicts itself, and the model can't resolve the conflict.

A Microsoft and Salesforce study demonstrated this with devastating clarity. They took benchmark prompts and "sharded" information across multiple turns (simulating how real conversations work — you don't dump everything in one message). The result: an average accuracy drop of 39%. OpenAI's o3 dropped from 98.1 to 64.1.

The reason: when information arrives in stages, the model makes assumptions and attempts partial answers in early turns. These wrong early answers stay in the context and conflict with later, correct information. The model can't recover: "When LLMs take a wrong turn in a conversation, they get lost and do not recover."

Why agents are vulnerable: Agents gather information from multiple sources — tool calls, document retrieval, sub-agent reports. This information can easily disagree with itself. Combine this with MCP tool descriptions from third parties, and you have a recipe for context clash.

6 Proven Techniques to Prevent Context Rot

These aren't theoretical — they're drawn from Drew Breunig's follow-up, Anthropic's engineering guide, and our own production experience at BerserKI.

1. RAG: Load Only What's Relevant

Don't stuff your entire knowledge base into the context. Use retrieval-augmented generation to pull in only the documents that are relevant to the current task.

Every time a new model ships with a bigger context window, someone declares "RAG is dead." It's not. A focused 5K-token RAG injection outperforms a 500K-token full-document dump on virtually every retrieval task.

Tools:

  • Chroma — Open-source, runs locally, Python-native
  • pgvector — Vector search in PostgreSQL (use what you already run)
  • Pinecone — Managed, scales to billions of vectors
  • Weaviate — Open-source with hybrid search

For local setups, Chroma + Ollama is the simplest stack. Embed with a local model, retrieve with vector similarity, inject into context. Done.
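
Here is a minimal sketch of that stack, assuming a local Ollama server with an embedding model such as nomic-embed-text already pulled. The collection name and model choice are ours; the `ollama.embeddings` call is the ollama Python package's embeddings endpoint.

```python
# Minimal local RAG sketch: Chroma for storage, Ollama for embeddings.
import chromadb
import ollama

client = chromadb.PersistentClient(path="./rag_db")
collection = client.get_or_create_collection("kb")

def embed(text: str) -> list[float]:
    # Requires `ollama pull nomic-embed-text` beforehand.
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

def index(doc_id: str, chunk: str) -> None:
    collection.add(ids=[doc_id], documents=[chunk], embeddings=[embed(chunk)])

def retrieve(query: str, k: int = 3) -> list[str]:
    hits = collection.query(query_embeddings=[embed(query)], n_results=k)
    return hits["documents"][0]

# Only the top-k chunks are injected; the knowledge base never enters the prompt.
prompt_context = "\n\n".join(retrieve("How do we rotate API keys?"))
```

The point is the final line: only the top-k chunks reach the prompt, no matter how large the knowledge base grows.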

2. Tool Loadout: Curate Your Tools Per Task

Don't give your agent 50 tools and hope for the best. Select a subset of relevant tools per task — a "loadout."

Research from the "Less is More" paper shows that using an LLM-powered tool recommender to dynamically select tools improved Llama 3.1 8B performance by 44%. Even when dynamic selection didn't improve accuracy, it reduced power consumption by 18% and improved speed by 77%.

Practical approach: The RAG MCP paper stores tool descriptions in a vector database and retrieves only the most relevant ones for each prompt. If you have more than 15-20 tools, you need this pattern.
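
A minimal sketch of that pattern, reusing Chroma and its default embedding function; the tool catalog shape and the top-k value are illustrative, not part of the paper:

```python
# Sketch: dynamic tool loadout. Index tool descriptions once, then pull only
# the most relevant handful into the context for each request.
import chromadb

client = chromadb.PersistentClient(path="./tools_db")
tools = client.get_or_create_collection("tool_descriptions")

def register_tools(catalog: dict[str, str]) -> None:
    # catalog maps tool name -> natural-language description; indexing
    # relies on Chroma's default embedding function.
    tools.add(ids=list(catalog.keys()), documents=list(catalog.values()))

def select_loadout(task: str, top_k: int = 5) -> list[str]:
    hits = tools.query(query_texts=[task], n_results=top_k)
    return hits["ids"][0]  # the only tool names exposed for this task
```

The model then sees five tool definitions per request instead of fifty.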

3. Context Quarantine: Isolate Sub-Tasks

Break complex tasks into isolated sub-contexts. Each sub-agent gets its own clean context with only the information it needs.

Anthropic's multi-agent research system demonstrated that a multi-agent system with Claude Opus 4 as lead and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on research tasks. The key insight: "Subagents facilitate compression by operating in parallel with their own context windows."

This is the approach we use at BerserKI. Each agent (Skald for content, Sleipnir for research, Völundr for infrastructure) operates in its own context. They share results through structured handoffs — not by dumping everything into one massive prompt.

When to quarantine: Any task that involves gathering information from 3+ sources, or any workflow exceeding ~30 turns.
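
A minimal sketch of quarantine as we practice it: each sub-task runs against a fresh message list, and only a structured summary crosses back. `call_llm` is a hypothetical stand-in for whatever model client you use.

```python
# Sketch: each sub-task gets a clean, isolated context. Only the structured
# result, never the sub-agent's full transcript, returns to the lead agent.
from dataclasses import dataclass

@dataclass
class HandoffResult:
    task: str
    summary: str  # a few hundred tokens, never the full transcript

def call_llm(messages: list[dict]) -> str:
    ...  # hypothetical stand-in for your model client

def run_subagent(task: str, briefing: str) -> HandoffResult:
    # Fresh message list: none of the lead agent's history leaks in.
    messages = [
        {"role": "system", "content": briefing},
        {"role": "user", "content": task},
    ]
    answer = call_llm(messages)
    recap = call_llm([{"role": "user",
                       "content": "Summarize for handoff, keep sources and "
                                  "open questions:\n\n" + answer}])
    return HandoffResult(task=task, summary=recap)
```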

4. Context Pruning: Remove the Cruft

Periodically review your context and remove information that's no longer relevant.

Provence is a compact (1.75GB) context pruner that can cut 95% of irrelevant content from a document while preserving the information needed to answer a specific question. It's fast, accurate, and works as a pre-processing step before context injection.

Simpler approach: Maintain your context as a structured dictionary (sections for goals, history, documents, tool results). Before each LLM call, rebuild the context string from the dictionary — dropping old tool results, removing completed sub-tasks, and keeping only active goals.
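
A minimal sketch of that rebuild-per-turn approach; the section names and retention windows are our conventions, not a standard:

```python
# Sketch: context as a structured dict, rebuilt before every call instead of
# appended to forever. Stale tool results and finished goals simply never
# make it into the next prompt.
context = {
    "goals": ["Fix the auth bug in repo-2"],
    "history": [],       # list of (turn_number, text)
    "tool_results": [],  # list of (turn_number, text)
}

def rebuild_prompt(current_turn: int, keep_turns: int = 10) -> str:
    recent = [t for n, t in context["history"] if n > current_turn - keep_turns]
    fresh = [t for n, t in context["tool_results"] if n > current_turn - 3]
    return "\n\n".join([
        "GOALS:\n" + "\n".join(context["goals"]),
        "RECENT HISTORY:\n" + "\n".join(recent),
        "TOOL RESULTS:\n" + "\n".join(fresh),
    ])
```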

5. Context Summarization: Compress, Don't Accumulate

When your agent's conversation history grows past 20-30 turns, summarize the older portions instead of carrying them verbatim.

This is the oldest technique in the book — ChatGPT has done this since early versions. But the implementation matters. Naive summarization loses critical details. Better: hierarchical summarization where you keep the last 5-10 turns verbatim and summarize everything before that into a structured recap.


[SYSTEM] Context summary (turns 1-45):
- User requested analysis of 3 repositories
- Found security vulnerability in repo-2/auth.py (SQL injection)
- Patch drafted and tested, awaiting review
- Repo-1 and repo-3: no issues found

[RECENT] Turns 46-50: (full conversation)
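
A minimal sketch of producing that recap; the turn threshold is illustrative and `call_llm` is the same hypothetical client stand-in as above:

```python
# Sketch: hierarchical summarization. Keep the last few turns verbatim;
# compress everything older into one structured recap.
def call_llm(messages: list[dict]) -> str:
    ...  # hypothetical stand-in for your model client

def compact_history(turns: list[str], keep_verbatim: int = 8) -> list[str]:
    if len(turns) <= keep_verbatim:
        return turns
    old, recent = turns[:-keep_verbatim], turns[-keep_verbatim:]
    recap = call_llm([{
        "role": "user",
        "content": "Summarize as bullets. Keep decisions, findings, and open "
                   "tasks; drop small talk:\n\n" + "\n".join(old),
    }])
    return [f"[SYSTEM] Context summary (turns 1-{len(old)}):\n{recap}", *recent]
```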

6. Context Offloading: External Memory

Store information outside the context window entirely and retrieve it on demand through tool calls.

Memory systems:

  • Mem0 — Memory layer for AI apps, supports personalization
  • Zep — Long-term memory for chat agents
  • Letta (MemGPT) — Self-editing memory with virtual context management

The MEMORY.md pattern: The simplest form of context offloading is a structured markdown file that the agent reads at startup and updates after significant events. This is what we use in our BerserKI agents — each agent has a MEMORY.md with key decisions, active tasks, and constraints. It's read fresh on each heartbeat, keeping the context small and accurate.
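
A minimal sketch of the pattern; the file layout and prompt wording are our conventions:

```python
# Sketch: MEMORY.md as external memory, read fresh each cycle and
# appended to after significant events.
from datetime import date
from pathlib import Path

MEMORY = Path("MEMORY.md")

def load_memory() -> str:
    return MEMORY.read_text() if MEMORY.exists() else "# Memory\n"

def record(note: str) -> None:
    # Append after significant events; the file stays small and curated.
    with MEMORY.open("a") as f:
        f.write(f"\n- {date.today().isoformat()}: {note}")

# Each cycle: durable memory enters the prompt, the old transcript does not.
system_prompt = "You are the infra agent. Durable memory follows.\n\n" + load_memory()
```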

For more sophisticated setups, pair context offloading with Anthropic's prompt caching or Gemini's context caching — both let you reuse precomputed state for the static portions of your context, reducing latency and cost.

The Prevention Checklist

Use this before deploying any agent to production:

  • [ ] Measure your effective context length. Test your model at 10K, 32K, 64K, and 128K. Find where accuracy drops and stay below that (a probe sketch follows this checklist).
  • [ ] Audit your tool count. If you have more than 15 tools, implement dynamic tool loading.
  • [ ] Implement summarization. Set a turn threshold (15-30 turns) where older history gets compressed.
  • [ ] Validate agent outputs. Add a verification step after critical tool calls to catch poisoned context early.
  • [ ] Structure your context. Use sections (goals, history, documents, tools) and rebuild per-turn, not append-only.
  • [ ] Monitor degradation. Log agent accuracy over conversation length. If performance drops after turn N, you know where to intervene.
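
For the first item, here is a minimal needle-in-a-haystack probe. The filler text, the needle, and the `call_llm` stand-in are all placeholders, and words are used as a rough token proxy; real harnesses like Chroma's are far more rigorous.

```python
# Sketch: probe where accuracy starts dropping as context grows. Plant a
# fact at a random depth, pad to the target size, then ask for it back.
import random

NEEDLE = "The deploy password is zx-4471."
QUESTION = "What is the deploy password?"

def call_llm(messages: list[dict]) -> str:
    ...  # hypothetical stand-in for your model client

def probe(target_tokens: int) -> bool:
    filler = "lorem ipsum dolor sit amet "
    # Word count as a rough token proxy; swap in a real tokenizer for precision.
    words = (filler * (target_tokens // 5 + 1)).split()[:target_tokens]
    words.insert(random.randrange(len(words)), NEEDLE)
    answer = call_llm([{"role": "user",
                        "content": " ".join(words) + "\n\n" + QUESTION}])
    return "zx-4471" in answer

for size in (10_000, 32_000, 64_000, 128_000):
    hits = sum(probe(size) for _ in range(20))
    print(f"{size:>7} tokens: {hits}/20 correct")
```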

When Long Context Actually Works

Context rot doesn't mean long context windows are useless. They're excellent for:

  • Summarization — Feed in a long document, get a summary. The model reads once, doesn't loop.
  • Fact retrieval — "Find X in this document." Needle-in-a-haystack works well up to the distraction ceiling.
  • Code analysis — Single-pass analysis of large codebases where the model reads and responds once.

Long context fails for:

  • Multi-turn agent workflows — Context accumulates errors over time
  • Tool-heavy tasks — Too many tools confuse the model
  • Multi-source synthesis — Contradictory information causes context clash
  • Iterative refinement — Early wrong answers poison later turns

The pattern is clear: single-pass, read-once tasks tolerate long context. Multi-turn, accumulative workflows don't.

Conclusion

Context rot isn't a bug in any specific model — it's an architectural property of how transformers work. Every model suffers from it. The question isn't whether your agent will experience context rot, but when, and whether you've built defenses against it.

The good news: the techniques are well-understood. RAG for selective retrieval. Tool loadouts for focused action spaces. Context quarantine for isolated sub-tasks. Pruning and summarization for long-running workflows. External memory for persistent state.

The shift from prompt engineering to context engineering is the defining challenge of AI agent development in 2026. The teams that master it will build agents that work reliably over hundreds of turns. The teams that ignore it will keep wondering why their agents "get dumber" halfway through a task.

Start by measuring. Find your model's distraction ceiling. Then build your context management strategy around staying below it.

If you're running agents on local hardware, the cost of long context is double — you pay in both quality degradation *and* inference speed. A 14B model on an RTX 4090 generates ~45 tok/s at 4K context. Push that to 64K context and you'll see both speed and quality drop. Context engineering isn't just an accuracy tool — it's a performance optimization.

The era of "just make the context window bigger" is over. The era of context engineering has begun.


*For more on running AI models locally, check our Best Hardware for Local LLMs and Best Local LLMs for 24GB GPUs guides. Running Ollama? See our Ollama vs LM Studio comparison.*

FAQ

What is context rot in AI agents?

Context rot happens when an agent's performance degrades as the context window fills with irrelevant, outdated, or contradictory information. The model loses track of the current goal and makes worse decisions over time.

How do you prevent context rot?

Key strategies: (1) Summarize completed steps instead of keeping full history; (2) Use working memory for current task state; (3) Apply context compression before each LLM call; (4) Implement sliding expiry for old messages.

At what context length does context rot start?

Most models show measurable degradation around 50-70% context fill. At 80%+ capacity, retrieval of early instructions degrades significantly — the 'lost in the middle' problem is well-documented.

Does context rot affect all LLMs equally?

No. Models trained with longer windows (Gemini 1.5, Claude 3.5+) are more resistant. The threshold is higher for 128K+ context, but context rot still occurs eventually.

What is the difference between context rot and hallucination?

Hallucination is generating false information not in the context. Context rot is loss of coherence due to context saturation, causing the agent to 'forget' instructions, which can trigger hallucination-like behavior.
