How to Build Agent Memory That Actually Works
Every LLM forgets everything between sessions. Close the conversation, and the model loses all context — what it learned, what it decided, what worked, what failed. Open a new session and it starts from zero.
This is the fundamental problem of agent memory. And it's why most AI agents feel like they have amnesia: they can reason brilliantly within a single session but can't learn or accumulate knowledge across sessions.
The solution isn't a single technique. It's a three-tier memory architecture — the same pattern used by Claude Code, Devin (Cognition), and the production agent systems we've been writing about in this series. Each tier handles a different type of information, at a different speed, with different persistence guarantees.
This guide walks through the full architecture, with implementation details you can use today — whether you're building on APIs, with frameworks, or locally with Ollama.
Why Memory Is Hard for LLMs
Before building, understand why naive approaches fail:
Context windows aren't memory. A 200K-token context window feels like memory — the model can reference anything in it. But it's working memory at best: expensive, temporary, and subject to context rot. Chroma Research showed accuracy drops 20–50% as context fills. Using the context window as your primary memory store guarantees degradation.
Conversation history isn't knowledge. Saving and replaying the full conversation transcript is the simplest "memory" — and the worst. A 50-turn conversation contains maybe 5% actionable information and 95% intermediate reasoning, tool outputs, and dead ends. Replaying it all causes context distraction: the model over-indexes on its own prior patterns instead of reasoning fresh.
Embeddings don't understand structure. Vector databases are powerful for semantic search, but they don't capture relationships, hierarchies, or temporal ordering. "The deployment failed on March 5th because the config was wrong" and "the config was fixed on March 6th" live as separate vectors with no causal link. Retrieval might return the failure without the fix.
The three-tier architecture addresses each of these limitations with a different mechanism.
The Three-Tier Architecture
Tier 1: Working Memory (In-Context)
What it is: The information present in the active context window during a session.
Lifetime: One session (or until compaction).
Contents:
- System prompt and rules
- Active task description
- Recent conversation turns (last 3–5)
- Current tool definitions
- Retrieved data relevant to the current step
Budget: 15K–30K tokens for API models, 8K–15K for local 32K-window models.
The key principle: Working memory should contain only what's needed for the *current* reasoning step. Everything else belongs in Tier 2 or 3.
Implementation: Session State Management
The most effective working memory pattern is explicit state tracking. Instead of relying on raw conversation history, maintain a structured state block that gets updated each turn:
## Session State (updated turn 12)
### Active Task
Refactoring auth module — migrating from JWT to session tokens
### Key Decisions
- Using Redis for session store (decided turn 3, benchmarked turn 7)
- Keeping backward compatibility for 30 days (stakeholder requirement)
### Current Step
Writing migration script for existing user sessions
### Blockers
None
### Files Modified This Session
- src/auth/session.py (new)
- src/auth/jwt.py (deprecated, not deleted)
- tests/auth/test_session.py (new)
This state block replaces 12 turns of raw history with ~200 tokens of structured context. The model knows exactly what's happening without wading through old tool outputs and intermediate reasoning.
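As an illustration, the state block above can be generated from a plain dict each turn rather than maintained by hand. A minimal sketch — the function name and dict shape are illustrative, not from any particular framework:

```python
def render_state(state: dict, turn: int) -> str:
    """Render a structured session-state block from a plain dict."""
    lines = [f"## Session State (updated turn {turn})"]
    for section, items in state.items():
        lines.append(f"### {section}")
        if isinstance(items, list):
            lines.extend(f"- {item}" for item in items)  # bulleted sections
        else:
            lines.append(str(items))  # free-text sections
    return "\n".join(lines)
```

The agent loop updates the dict, re-renders the block, and injects it in place of raw history.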
Compaction trigger: When working memory exceeds 75% of budget, compress older turns into a summary. Preserve: decisions, findings, blockers, and current state. Discard: intermediate reasoning, superseded tool outputs, exploratory dead ends.
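A minimal compaction trigger might look like this sketch. It assumes a `summarize` callable (in practice, a summarization prompt to your model) and uses a rough 4-characters-per-token heuristic; both are illustrative choices, not fixed requirements:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose
    return len(text) // 4

def maybe_compact(turns, budget_tokens, summarize, keep_recent=5):
    """Compress older turns into one summary once usage passes 75% of budget."""
    used = sum(estimate_tokens(t) for t in turns)
    if used <= budget_tokens * 0.75 or len(turns) <= keep_recent:
        return turns  # under threshold: leave working memory untouched
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    # The summarizer should preserve decisions, findings, and blockers,
    # and discard intermediate reasoning and superseded tool outputs.
    summary = summarize(old)
    return [f"[Summary of turns 1-{len(old)}]\n{summary}"] + recent
```

Recent turns stay verbatim; only the older ones collapse into the summary.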
Tier 2: Session Memory (Structured Files)
What it is: Persistent structured files that survive across sessions. The agent reads them at startup and updates them during work.
Lifetime: Days to weeks. Updated frequently.
Contents:
- MEMORY.md — accumulated knowledge, decisions, preferences
- Task logs — what was done, what worked, what failed
- Project state — current status of ongoing work
- Scratchpads — working notes for complex multi-session tasks
The key principle: Session memory is *synthesized*, not logged. It contains conclusions, not conversations.
Implementation: The MEMORY.md Pattern
This pattern, popularized by Claude Code and used in OpenClaw agents, is the simplest and most effective session memory mechanism:
# MEMORY.md
Last updated: 2026-03-18
## Project: API Refactoring
- Status: Phase 2 of 3 (auth module)
- Architecture: Redis session store, 30-day JWT backward compatibility
- Completed: User model migration, database schema update
- Next: Session token implementation, integration tests
- Blockers: Redis cluster config needs DevOps review
## Learned Preferences
- User prefers TypeScript over JavaScript for new files
- Test framework: Vitest (not Jest)
- PR descriptions should include performance impact
## Key Decisions (with rationale)
- Redis over Memcached: need persistence for session recovery (decided 2026-03-15)
- 30-day compat window: stakeholder requirement from product team (2026-03-16)
- No breaking API changes: mobile app v2.1 still uses old auth (2026-03-16)
## Known Issues
- Rate limiter doesn't account for session token requests yet
- Staging environment Redis is single-node (prod is clustered)
Rules for MEMORY.md
1. Overwrite, don't append. When information changes, replace the old entry. MEMORY.md reflects *current* state, not history.
2. Keep it under 2K tokens. If it grows beyond that, split into topic-specific files (memory/auth-refactor.md, memory/preferences.md).
3. Include rationale. "Redis over Memcached" means nothing in 3 weeks. "Redis over Memcached: need persistence for session recovery" is useful forever.
4. Date significant decisions. Temporal context prevents the agent from reopening settled questions.
5. Read at session start. The agent's first action in any new session is reading MEMORY.md. This is its "where was I?" moment.
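Rules 2 and 5 fit in a few lines of startup code. A hedged sketch, using the same ~4-chars-per-token heuristic to flag when the file outgrows its 2K-token budget (names are illustrative):

```python
from pathlib import Path

def load_memory(path="MEMORY.md", max_tokens=2000):
    """Read session memory at startup; flag when it outgrows its budget."""
    p = Path(path)
    if not p.exists():
        return "", False  # first session: no memory yet
    text = p.read_text()
    over_budget = len(text) // 4 > max_tokens  # ~4 chars/token heuristic
    return text, over_budget
```

When `over_budget` comes back true, that's the cue to split into topic-specific files.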
Task Logs: What MEMORY.md Isn't
MEMORY.md is for state, not history. For historical records — what the agent did and when — use separate task logs:
# logs/2026-03-18-auth-migration.md
## Session Summary
- Duration: 45 minutes (32 turns)
- Task: Implement Redis session store
- Outcome: Core implementation complete, 14/14 tests passing
- Files created: session.py, test_session.py, redis_config.py
- Files modified: auth_middleware.py, requirements.txt
- Issues encountered: Redis connection pooling needed explicit max_connections
- Follow-up needed: Load testing, DevOps review of cluster config
Task logs feed into MEMORY.md updates but don't live there. The agent can retrieve them via Tier 3 when it needs historical detail.
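Writing a log like the one above can be automated at session end. A sketch with illustrative field names — the exact fields are whatever your agent tracks:

```python
from datetime import date
from pathlib import Path

def write_task_log(task, outcome, files, follow_ups, log_dir="logs"):
    """Write a dated session summary that later feeds MEMORY.md updates."""
    Path(log_dir).mkdir(exist_ok=True)
    slug = task.lower().replace(" ", "-")[:40]
    path = Path(log_dir) / f"{date.today().isoformat()}-{slug}.md"
    lines = [
        "## Session Summary",
        f"- Task: {task}",
        f"- Outcome: {outcome}",
        f"- Files: {', '.join(files)}",
        f"- Follow-up needed: {', '.join(follow_ups) or 'none'}",
    ]
    path.write_text("\n".join(lines))
    return path
```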
Tier 3: Long-Term Memory (Vector Database + Semantic Search)
What it is: A searchable database of all accumulated knowledge — past conversations, documents, research findings, error resolutions.
Lifetime: Weeks to months. Grows continuously.
Contents:
- Embedded conversation summaries
- Document chunks from knowledge bases
- Error/resolution pairs
- Research findings and analysis results
- Anything that might be relevant in a future session
The key principle: Long-term memory is *retrieved by relevance*, not loaded by default. The agent queries it when needed, pulling only the specific information that matches the current task.
Implementation: ChromaDB + Embeddings
For local setups, ChromaDB is the standard:
import chromadb
from chromadb.utils import embedding_functions

# Use Ollama embeddings for fully local operation
ollama_ef = embedding_functions.OllamaEmbeddingFunction(
    model_name="nomic-embed-text",
    url="http://localhost:11434",
)

client = chromadb.PersistentClient(path="./agent_memory")
collection = client.get_or_create_collection(
    name="agent_knowledge",
    embedding_function=ollama_ef,
)

# Store a session summary
collection.add(
    documents=["Redis session store implementation complete. Used connection "
               "pooling with max_connections=50. Tests passing. Key insight: "
               "need explicit pool cleanup on app shutdown to avoid leaked connections."],
    metadatas=[{"type": "session_summary", "date": "2026-03-18",
                "project": "auth-refactor"}],
    ids=["session-2026-03-18-auth"],
)

# Retrieve relevant memories for a new session
results = collection.query(
    query_texts=["Redis connection issues"],
    n_results=3,
)
# Returns the session summary above + any other Redis-related memories
What to Embed
Not everything belongs in long-term memory. Embed:
- Session summaries (not raw transcripts). ~200 words per session capturing outcomes and insights.
- Error/resolution pairs. "Error X happened because of Y, fixed by Z." These are gold for future sessions.
- Research findings. Conclusions from investigation, with sources.
- Configuration decisions with context. Why something was configured a specific way.
Don't embed:
- Raw conversation history (too noisy, too long)
- Intermediate reasoning steps (only conclusions matter)
- Tool outputs (summarize first, embed the summary)
- Temporary information (meeting times, one-off requests)
Retrieval Strategy
The agent queries long-term memory when:
1. Starting a new session on an existing project → retrieve recent session summaries
2. Encountering an error → search for similar errors and their resolutions
3. Making a decision that might have precedent → check for prior decisions on the same topic
4. User asks about prior work → semantic search over session summaries
Limit retrieval to 3–5 results, gated by a relevance threshold. Injecting 10 "maybe relevant" memories causes the same context distraction as any other irrelevant content.
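A threshold gate can sit between the ChromaDB query and the prompt. This sketch assumes the query result includes distances (lower = more similar; the actual metric depends on your collection's configured space), and the 0.4 cutoff is an illustrative starting point, not a recommendation:

```python
def filter_memories(results, max_distance=0.4, max_results=3):
    """Keep only close matches from a ChromaDB-style query result."""
    docs = results["documents"][0]       # first (and only) query's documents
    dists = results["distances"][0]      # lower distance = more similar
    kept = [(doc, d) for doc, d in zip(docs, dists) if d <= max_distance]
    kept.sort(key=lambda pair: pair[1])  # most relevant first
    return [doc for doc, _ in kept[:max_results]]
```

Start with the threshold tight; an empty result list is better than injecting noise.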
How the Tiers Work Together
A complete session flow:
Agent starts new session
│
├─ 1. Read MEMORY.md (Tier 2) → load current state, decisions, blockers
│
├─ 2. Read task log for yesterday's session (Tier 2) → where did I leave off?
│
├─ 3. Load system prompt + rules + task description (Tier 1)
│ Context so far: ~5K tokens. Focused and relevant.
│
├─ User: "Continue the Redis migration. I think we had connection pool issues?"
│
├─ 4. Query ChromaDB (Tier 3): "Redis connection pool issues"
│ → Returns: session summary mentioning max_connections fix
│ → Injected into working memory: ~200 tokens
│
├─ 5. Agent continues work with full context of prior sessions
│ Working memory: ~6K tokens. Knows state, history, blockers.
│
├─ ... (30 turns of productive work) ...
│
├─ 6. Compaction trigger at 75% (Tier 1 → Tier 2)
│ Summarize turns 1-25. Update MEMORY.md. Continue with compressed state.
│
├─ Session ends
│
├─ 7. Write session summary to task log (Tier 2)
│
└─ 8. Embed session summary into ChromaDB (Tier 3)
Each tier handles what it's best at. Working memory handles the current reasoning. Session memory handles structured state. Long-term memory handles the full knowledge history.
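The flow above can be sketched as a small orchestrator with the tiers injected as plain callables, so the same startup logic works against markdown files, Redis, or ChromaDB. Names here are illustrative:

```python
def start_session(read_memory, read_last_log, query_longterm, user_query=""):
    """Assemble a focused starting context from all three tiers."""
    context = [read_memory()]        # Tier 2: current state (MEMORY.md)
    context.append(read_last_log())  # Tier 2: where did I leave off?
    if user_query:                   # Tier 3: retrieve only on demand
        context.extend(query_longterm(user_query))
    return "\n\n".join(part for part in context if part)
```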
Memory Anti-Patterns
1. The Infinite Append Log
Appending every fact to MEMORY.md without curation. After a week, it's 10K tokens of contradictory notes. The fix: overwrite stale entries. MEMORY.md is a snapshot, not a changelog.
2. The Context Dump Resume
Replaying the full conversation from the last session as "memory." This is the most common mistake and causes immediate context rot. The fix: write a structured summary, not a transcript.
3. The Over-Retriever
Querying the vector database on every turn, injecting 10+ chunks "just in case." This fills working memory with noise and triggers context distraction. The fix: query only when the current task genuinely needs historical context. Most turns don't.
4. The Never-Forgetter
Keeping everything forever. Long-term memory should have a decay mechanism. Session summaries from 6 months ago about a deprecated feature aren't helping. The fix: periodic review and pruning, or metadata-based expiry.
5. The Missing Rationale
Storing decisions without reasons. "Using Redis" is useless in 3 weeks. "Using Redis because we need persistent sessions for recovery after server restarts" is useful forever. The fix: always store the *why* alongside the *what*.
Framework-Specific Implementation
Claude Code / OpenClaw
Both use the MEMORY.md pattern natively. The agent reads workspace files at session start and can write to them freely. For Tier 3, add a ChromaDB instance and expose it as a tool:
# OpenClaw agent config snippet
tools:
  - name: memory_search
    description: "Search long-term memory for relevant past work"
    # Backed by ChromaDB query
  - name: memory_store
    description: "Store a finding or decision in long-term memory"
    # Backed by ChromaDB add
LangChain / LangGraph
Use ConversationSummaryBufferMemory for Tier 1 (auto-compaction), a filesystem or Redis store for Tier 2, and ChromaDB with langchain_chroma for Tier 3. The RunnableWithMessageHistory class handles the plumbing.
Local Ollama Agents
The three-tier pattern works identically with local models. ChromaDB runs locally with zero API cost. nomic-embed-text generates embeddings on CPU in ~50ms. The only adjustment: smaller Tier 1 budgets (8K–15K tokens for 32K-window models) make Tier 2 and Tier 3 even more important.
Practical Checklist
Starting from zero? Implement in this order:
1. MEMORY.md (30 minutes). Create a structured state file. Read at session start, update at session end. This alone provides 80% of the value.
2. Session state block (1 hour). Add explicit state tracking in your agent loop. Update a structured block each turn instead of relying on raw history.
3. Compaction (2 hours). Implement summarization when working memory exceeds 75% of your context budget. Compress older turns, preserve recent ones.
4. Task logs (1 hour). Write a summary after each session. Date, outcome, files changed, follow-ups needed.
5. ChromaDB (half a day). Set up a persistent vector store. Embed session summaries automatically. Add a search tool to your agent.
6. Retrieval tuning (ongoing). Adjust chunk sizes, result counts, and relevance thresholds based on real usage. Start conservative (3 results, high threshold) and loosen as needed.
Steps 1–3 require no external dependencies. You can build effective agent memory with nothing but markdown files and a summarization prompt.
FAQ
What is the three-tier agent memory architecture?
Three tiers: Working memory (current context window — system prompt, recent turns, retrieved data), session memory (persistent structured files like MEMORY.md that track state across sessions), and long-term memory (vector database for semantic search over accumulated knowledge). Each tier handles different types of information at different speeds.
How does MEMORY.md work for AI agents?
MEMORY.md is a structured file the agent reads at session start and updates during work. It contains current project state, key decisions with rationale, learned preferences, and blockers. The critical rule: overwrite stale entries rather than appending, keeping it a current snapshot under 2K tokens.
Do I need a vector database for agent memory?
Not immediately. MEMORY.md + task logs (Tier 1 and 2) provide 80% of the value with zero infrastructure. Add a vector database (ChromaDB is the simplest) when your agent accumulates enough knowledge that structured files can't cover it — typically after 50+ sessions or when working across multiple projects.
How is agent memory different from RAG?
RAG retrieves from a static knowledge base (documentation, product data). Agent memory retrieves from the agent's *own* accumulated experience — past sessions, decisions, errors, and findings. The retrieval technology (embeddings + vector search) is similar, but the data source and update patterns are fundamentally different.
What's the best vector database for local AI agent memory?
ChromaDB with nomic-embed-text embeddings via Ollama is the standard local stack. Zero API cost, embeddings in ~50ms on CPU, persistent storage, and Python-native. For higher scale (millions of vectors), consider Qdrant, which also runs locally.
How often should an agent update its memory?
Working memory (Tier 1): every turn. Session memory (Tier 2): at major milestones and session end. Long-term memory (Tier 3): at session end (embed session summary) and when encountering significant new knowledge. Don't update on every turn — the overhead isn't worth it for Tier 2 and 3.
*This article completes the Context Engineering content cluster. Series: Context Rot · Context Window Failures · Single vs Multi-Agent · RAG vs Long Context*
*Sources: Anthropic — Context Engineering · Cognition — Don't Build Multi-Agents · Chroma Research — Context Rot · LangChain — Context Engineering · Letta — Agent Memory*