How to Build Agent Memory That Actually Works
Every LLM forgets everything between sessions. Close the conversation, and the model loses all context — what it learned, what it decided, what worked, what failed. Open a new session and it starts from zero.
This is the fundamental problem of agent memory. And it's why most AI agents feel like they have amnesia: they can reason brilliantly within a single session but can't learn or accumulate knowledge across sessions.
The solution isn't a single technique. It's a three-tier memory architecture — the same pattern used by Claude Code, Devin (Cognition), and the production agent systems we've been writing about in this series. Each tier handles a different type of information, at a different speed, with different persistence guarantees.
This guide walks through the full architecture, with implementation details you can use today — whether you're building on APIs, with frameworks, or locally with Ollama.
Why Memory Is Hard for LLMs
Before building, understand why naive approaches fail:
Context windows aren't memory. A 200K-token context window feels like memory — the model can reference anything in it. But it's working memory at best: expensive, temporary, and subject to context rot. Chroma Research showed accuracy drops 20–50% as context fills. Using the context window as your primary memory store guarantees degradation.
Conversation history isn't knowledge. Saving and replaying the full conversation transcript is the simplest "memory" — and the worst. A 50-turn conversation contains maybe 5% actionable information and 95% intermediate reasoning, tool outputs, and dead ends. Replaying it all causes context distraction: the model over-indexes on its own prior patterns instead of reasoning fresh.
Embeddings don't understand structure. Vector databases are powerful for semantic search, but they don't capture relationships, hierarchies, or temporal ordering. "The deployment failed on March 5th because the config was wrong" and "the config was fixed on March 6th" live as separate vectors with no causal link. Retrieval might return the failure without the fix.
The three-tier architecture addresses each of these limitations with a different mechanism.
The Three-Tier Architecture
Tier 1: Working Memory (In-Context)
What it is: The information present in the active context window during a session.
Lifetime: One session (or until compaction).
Contents:
- System prompt and rules
- Active task description
- Recent conversation turns (last 3–5)
- Current tool definitions
- Retrieved data relevant to the current step
Budget: 15K–30K tokens for API models, 8K–15K for local 32K-window models.
The key principle: Working memory should contain only what's needed for the *current* reasoning step. Everything else belongs in Tier 2 or 3.
Implementation: Session State Management
The most effective working memory pattern is explicit state tracking. Instead of relying on raw conversation history, maintain a structured state block that gets updated each turn:
## Session State (updated turn 12)
### Active Task
Refactoring auth module — migrating from JWT to session tokens
### Key Decisions
- Using Redis for session store (decided turn 3, benchmarked turn 7)
- Keeping backward compatibility for 30 days (stakeholder requirement)
### Current Step
Writing migration script for existing user sessions
### Blockers
None
### Files Modified This Session
- src/auth/session.py (new)
- src/auth/jwt.py (deprecated, not deleted)
- tests/auth/test_session.py (new)
This state block replaces 12 turns of raw history with ~200 tokens of structured context. The model knows exactly what's happening without wading through old tool outputs and intermediate reasoning.
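As an illustration, the state block above can be generated from a plain dict each turn rather than maintained by hand. A minimal sketch — the function name and dict shape are illustrative, not from any particular framework:

```python
def render_state(state: dict, turn: int) -> str:
    """Render a structured session-state block from a plain dict."""
    lines = [f"## Session State (updated turn {turn})"]
    for section, items in state.items():
        lines.append(f"### {section}")
        if isinstance(items, list):
            lines.extend(f"- {item}" for item in items)  # bulleted sections
        else:
            lines.append(str(items))  # free-text sections
    return "\n".join(lines)
```

The agent loop updates the dict, re-renders the block, and injects it in place of raw history.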
Compaction trigger: When working memory exceeds 75% of budget, compress older turns into a summary. Preserve: decisions, findings, blockers, and current state. Discard: intermediate reasoning, superseded tool outputs, exploratory dead ends.
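A minimal compaction trigger might look like this sketch. It assumes a `summarize` callable (in practice, a summarization prompt to your model) and uses a rough 4-characters-per-token heuristic; both are illustrative choices, not fixed requirements:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose
    return len(text) // 4

def maybe_compact(turns, budget_tokens, summarize, keep_recent=5):
    """Compress older turns into one summary once usage passes 75% of budget."""
    used = sum(estimate_tokens(t) for t in turns)
    if used <= budget_tokens * 0.75 or len(turns) <= keep_recent:
        return turns  # under threshold: leave working memory untouched
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    # The summarizer should preserve decisions, findings, and blockers,
    # and discard intermediate reasoning and superseded tool outputs.
    summary = summarize(old)
    return [f"[Summary of turns 1-{len(old)}]\n{summary}"] + recent
```

Recent turns stay verbatim; only the older ones collapse into the summary.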
Tier 2: Session Memory (Structured Files)
What it is: Persistent structured files that survive across sessions. The agent reads them at startup and updates them during work.
Lifetime: Days to weeks. Updated frequently.
Contents:
- MEMORY.md — accumulated knowledge, decisions, preferences
- Task logs — what was done, what worked, what failed
- Project state — current status of ongoing work
- Scratchpads — working notes for complex multi-session tasks
The key principle: Session memory is *synthesized*, not logged. It contains conclusions, not conversations.
Implementation: The MEMORY.md Pattern
This pattern, popularized by Claude Code and used in OpenClaw agents, is the simplest and most effective session memory mechanism:
# MEMORY.md
Last updated: 2026-03-18
## Project: API Refactoring
- Status: Phase 2 of 3 (auth module)
- Architecture: Redis session store, 30-day JWT backward compatibility
- Completed: User model migration, database schema update
- Next: Session token implementation, integration tests
- Blockers: Redis cluster config needs DevOps review
## Learned Preferences
- User prefers TypeScript over JavaScript for new files
- Test framework: Vitest (not Jest)
- PR descriptions should include performance impact
## Key Decisions (with rationale)
- Redis over Memcached: need persistence for session recovery (decided 2026-03-15)
- 30-day compat window: stakeholder requirement from product team (2026-03-16)
- No breaking API changes: mobile app v2.1 still uses old auth (2026-03-16)
## Known Issues
- Rate limiter doesn't account for session token requests yet
- Staging environment Redis is single-node (prod is clustered)
Rules for MEMORY.md
1. Overwrite, don't append. When information changes, replace the old entry. MEMORY.md reflects *current* state, not history.
2. Keep it under 2K tokens. If it grows beyond that, split into topic-specific files (memory/auth-refactor.md, memory/preferences.md).
3. Include rationale. "Redis over Memcached" means nothing in 3 weeks. "Redis over Memcached: need persistence for session recovery" is useful forever.
4. Date significant decisions. Temporal context prevents the agent from reopening settled questions.
5. Read at session start. The agent's first action in any new session is reading MEMORY.md. This is its "where was I?" moment.
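Rules 2 and 5 fit in a few lines of startup code. A hedged sketch, using the same ~4-chars-per-token heuristic to flag when the file outgrows its 2K-token budget (names are illustrative):

```python
from pathlib import Path

def load_memory(path="MEMORY.md", max_tokens=2000):
    """Read session memory at startup; flag when it outgrows its budget."""
    p = Path(path)
    if not p.exists():
        return "", False  # first session: no memory yet
    text = p.read_text()
    over_budget = len(text) // 4 > max_tokens  # ~4 chars/token heuristic
    return text, over_budget
```

When `over_budget` comes back true, that's the cue to split into topic-specific files.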
Task Logs: What MEMORY.md Isn't
MEMORY.md is for state, not history. For historical records — what the agent did and when — use separate task logs:
# logs/2026-03-18-auth-migration.md
## Session Summary
- Duration: 45 minutes (32 turns)
- Task: Implement Redis session store
- Outcome: Core implementation complete, 14/14 tests passing
- Files created: session.py, test_session.py, redis_config.py
- Files modified: auth_middleware.py, requirements.txt
- Issues encountered: Redis connection pooling needed explicit max_connections
- Follow-up needed: Load testing, DevOps review of cluster config
Task logs feed into MEMORY.md updates but don't live there. The agent can retrieve them via Tier 3 when it needs historical detail.
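Writing a log like the one above can be automated at session end. A sketch with illustrative field names — the exact fields are whatever your agent tracks:

```python
from datetime import date
from pathlib import Path

def write_task_log(task, outcome, files, follow_ups, log_dir="logs"):
    """Write a dated session summary that later feeds MEMORY.md updates."""
    Path(log_dir).mkdir(exist_ok=True)
    slug = task.lower().replace(" ", "-")[:40]
    path = Path(log_dir) / f"{date.today().isoformat()}-{slug}.md"
    lines = [
        "## Session Summary",
        f"- Task: {task}",
        f"- Outcome: {outcome}",
        f"- Files: {', '.join(files)}",
        f"- Follow-up needed: {', '.join(follow_ups) or 'none'}",
    ]
    path.write_text("\n".join(lines))
    return path
```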
Tier 3: Long-Term Memory (Vector Database + Semantic Search)
What it is: A searchable database of all accumulated knowledge — past conversations, documents, research findings, error resolutions.
Lifetime: Weeks to months. Grows continuously.
Contents:
- Embedded conversation summaries
- Document chunks from knowledge bases
- Error/resolution pairs
- Research findings and analysis results
- Anything that might be relevant in a future session
The key principle: Long-term memory is *retrieved by relevance*, not loaded by default. The agent queries it when needed, pulling only the specific information that matches the current task.
Implementation: ChromaDB + Embeddings
For local setups, ChromaDB is the standard:
import chromadb
from chromadb.utils import embedding_functions

# Use Ollama embeddings for fully local operation
ollama_ef = embedding_functions.OllamaEmbeddingFunction(
    model_name="nomic-embed-text",
    url="http://localhost:11434",
)

client = chromadb.PersistentClient(path="./agent_memory")
collection = client.get_or_create_collection(
    name="agent_knowledge",
    embedding_function=ollama_ef,
)

# Store a session summary
collection.add(
    documents=["Redis session store implementation complete. Used connection "
               "pooling with max_connections=50. Tests passing. Key insight: "
               "need explicit pool cleanup on app shutdown to avoid leaked connections."],
    metadatas=[{"type": "session_summary", "date": "2026-03-18",
                "project": "auth-refactor"}],
    ids=["session-2026-03-18-auth"],
)

# Retrieve relevant memories for a new session
results = collection.query(
    query_texts=["Redis connection issues"],
    n_results=3,
)
# Returns the session summary above + any other Redis-related memories
What to Embed
Not everything belongs in long-term memory. Embed:
- Session summaries (not raw transcripts). ~200 words per session capturing outcomes and insights.
- Error/resolution pairs. "Error X happened because of Y, fixed by Z." These are gold for future sessions.
- Research findings. Conclusions from investigation, with sources.
- Configuration decisions with context. Why something was configured a specific way.
Don't embed:
- Raw conversation history (too noisy, too long)
- Intermediate reasoning steps (only conclusions matter)
- Tool outputs (summarize first, embed the summary)
- Temporary information (meeting times, one-off requests)
Retrieval Strategy
The agent queries long-term memory when:
1. Starting a new session on an existing project → retrieve recent session summaries
2. Encountering an error → search for similar errors and their resolutions
3. Making a decision that might have precedent → check for prior decisions on the same topic
4. User asks about prior work → semantic search over session summaries
Limit retrieval to 3–5 results, gated by a relevance threshold. Injecting 10 "maybe relevant" memories causes the same context distraction as any other irrelevant content.
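A threshold gate can sit between the ChromaDB query and the prompt. This sketch assumes the query result includes distances (lower = more similar; the actual metric depends on your collection's configured space), and the 0.4 cutoff is an illustrative starting point, not a recommendation:

```python
def filter_memories(results, max_distance=0.4, max_results=3):
    """Keep only close matches from a ChromaDB-style query result."""
    docs = results["documents"][0]       # first (and only) query's documents
    dists = results["distances"][0]      # lower distance = more similar
    kept = [(doc, d) for doc, d in zip(docs, dists) if d <= max_distance]
    kept.sort(key=lambda pair: pair[1])  # most relevant first
    return [doc for doc, _ in kept[:max_results]]
```

Start with the threshold tight; an empty result list is better than injecting noise.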
How the Tiers Work Together
A complete session flow:
Agent starts new session
│
├─ 1. Read MEMORY.md (Tier 2) → load current state, decisions, blockers
│
├─ 2. Read task log for yesterday's session (Tier 2) → where did I leave off?
│
├─ 3. Load system prompt + rules + task description (Tier 1)
│ Context so far: ~5K tokens. Focused and relevant.
│
├─ User: "Continue the Redis migration. I think we had connection pool issues?"
│
├─ 4. Query ChromaDB (Tier 3): "Redis connection pool issues"
│ → Returns: session summary mentioning max_connections fix
│ → Injected into working memory: ~200 tokens
│
├─ 5. Agent continues work with full context of prior sessions
│ Working memory: ~6K tokens. Knows state, history, blockers.
│
├─ ... (30 turns of productive work) ...
│
├─ 6. Compaction trigger at 75% (Tier 1 → Tier 2)
│ Summarize turns 1-25. Update MEMORY.md. Continue with compressed state.
│
├─ Session ends
│
├─ 7. Write session summary to task log (Tier 2)
│
└─ 8. Embed session summary into ChromaDB (Tier 3)
Each tier handles what it's best at. Working memory handles the current reasoning. Session memory handles structured state. Long-term memory handles the full knowledge history.
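The flow above can be sketched as a small orchestrator with the tiers injected as plain callables, so the same startup logic works against markdown files, Redis, or ChromaDB. Names here are illustrative:

```python
def start_session(read_memory, read_last_log, query_longterm, user_query=""):
    """Assemble a focused starting context from all three tiers."""
    context = [read_memory()]        # Tier 2: current state (MEMORY.md)
    context.append(read_last_log())  # Tier 2: where did I leave off?
    if user_query:                   # Tier 3: retrieve only on demand
        context.extend(query_longterm(user_query))
    return "\n\n".join(part for part in context if part)
```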
Memory Anti-Patterns
1. The Infinite Append Log
Appending every fact to MEMORY.md without curation. After a week, it's 10K tokens of contradictory notes. The fix: overwrite stale entries. MEMORY.md is a snapshot, not a changelog.
2. The Context Dump Resume
Replaying the full conversation from the last session as "memory." This is the most common mistake and causes immediate context rot. The fix: write a structured summary, not a transcript.
3. The Over-Retriever
Querying the vector database on every turn, injecting 10+ chunks "just in case." This fills working memory with noise and triggers context distraction. The fix: query only when the current task genuinely needs historical context. Most turns don't.
4. The Never-Forgetter
Keeping everything forever. Long-term memory should have a decay mechanism. Session summaries from 6 months ago about a deprecated feature aren't helping. The fix: periodic review and pruning, or metadata-based expiry.
5. The Missing Rationale
Storing decisions without reasons. "Using Redis" is useless in 3 weeks. "Using Redis because we need persistent sessions for recovery after server restarts" is useful forever. The fix: always store the *why* alongside the *what*.
Framework-Specific Implementation
Claude Code / OpenClaw
Both use the MEMORY.md pattern natively. The agent reads workspace files at session start and can write to them freely. For Tier 3, add a ChromaDB instance and expose it as a tool:
# OpenClaw agent config snippet
tools:
  - name: memory_search
    description: "Search long-term memory for relevant past work"
    # Backed by ChromaDB query
  - name: memory_store
    description: "Store a finding or decision in long-term memory"
    # Backed by ChromaDB add
LangChain / LangGraph
Use ConversationSummaryBufferMemory for Tier 1 (auto-compaction), a filesystem or Redis store for Tier 2, and ChromaDB with langchain_chroma for Tier 3. The RunnableWithMessageHistory class handles the plumbing.
Local Ollama Agents
The three-tier pattern works identically with local models. ChromaDB runs locally with zero API cost. nomic-embed-text generates embeddings on CPU in ~50ms. The only adjustment: smaller Tier 1 budgets (8K–15K tokens for 32K-window models) make Tier 2 and Tier 3 even more important.
Practical Checklist
Starting from zero? Implement in this order:
1. MEMORY.md (30 minutes). Create a structured state file. Read at session start, update at session end. This alone provides 80% of the value.
2. Session state block (1 hour). Add explicit state tracking in your agent loop. Update a structured block each turn instead of relying on raw history.
3. Compaction (2 hours). Implement summarization when working memory exceeds 75% of your context budget. Compress older turns, preserve recent ones.
4. Task logs (1 hour). Write a summary after each session. Date, outcome, files changed, follow-ups needed.
5. ChromaDB (half a day). Set up a persistent vector store. Embed session summaries automatically. Add a search tool to your agent.
6. Retrieval tuning (ongoing). Adjust chunk sizes, result counts, and relevance thresholds based on real usage. Start conservative (3 results, high threshold) and loosen as needed.
Steps 1–3 require no external dependencies. You can build effective agent memory with nothing but markdown files and a summarization prompt.
FAQ
What is the three-tier agent memory architecture?
Three tiers: Working memory (current context window — system prompt, recent turns, retrieved data), session memory (persistent structured files like MEMORY.md that track state across sessions), and long-term memory (vector database for semantic search over accumulated knowledge). Each tier handles different types of information at different speeds.
How does MEMORY.md work for AI agents?
MEMORY.md is a structured file the agent reads at session start and updates during work. It contains current project state, key decisions with rationale, learned preferences, and blockers. The critical rule: overwrite stale entries rather than appending, keeping it a current snapshot under 2K tokens.
Do I need a vector database for agent memory?
Not immediately. MEMORY.md + task logs (Tier 1 and 2) provide 80% of the value with zero infrastructure. Add a vector database (ChromaDB is the simplest) when your agent accumulates enough knowledge that structured files can't cover it — typically after 50+ sessions or when working across multiple projects.
How is agent memory different from RAG?
RAG retrieves from a static knowledge base (documentation, product data). Agent memory retrieves from the agent's *own* accumulated experience — past sessions, decisions, errors, and findings. The retrieval technology (embeddings + vector search) is similar, but the data source and update patterns are fundamentally different.
What's the best vector database for local AI agent memory?
ChromaDB with nomic-embed-text embeddings via Ollama is the standard local stack. Zero API cost, embeddings in ~50ms on CPU, persistent storage, and Python-native. For higher scale (millions of vectors), consider Qdrant, which also runs locally.
How often should an agent update its memory?
Working memory (Tier 1): every turn. Session memory (Tier 2): at major milestones and session end. Long-term memory (Tier 3): at session end (embed session summary) and when encountering significant new knowledge. Don't update on every turn — the overhead isn't worth it for Tier 2 and 3.
*This article completes the Context Engineering content cluster. Series: Context Rot · Context Window Failures · Single vs Multi-Agent · RAG vs Long Context*
*Sources: Anthropic — Context Engineering · Cognition — Don't Build Multi-Agents · Chroma Research — Context Rot · LangChain — Context Engineering · Letta — Agent Memory*