Prompt Caching: Cut Your AI Costs by Up to 90%
If you're running AI agents or LLM-powered applications in production, your API bill is probably your second biggest line item after salaries. The culprit? Sending the same tokens over and over—system prompts, tool definitions, conversation history, and reference documents that barely change between requests.
Prompt caching fixes this. Every major provider now offers it, and the savings are real: up to 90% off input token costs with Anthropic and Google, and 50% automatically with OpenAI.
This guide breaks down exactly how caching works across providers, when to use each strategy, and how to structure your prompts for maximum cache hits. If you're building AI agents with context engineering, this is table stakes.
What Is Prompt Caching?
Prompt caching stores the computed representation (KV cache) of your input tokens on the provider's servers. When your next request starts with the same prefix, the provider skips the expensive computation and reads from cache instead.
Think of it like a compiled binary vs. interpreting source code every time. The model doesn't re-process tokens it has already seen—it jumps straight to the new part.
The result:
- Lower cost: Cached tokens are billed at a fraction of regular input pricing
- Lower latency: Skip re-computation of the cached prefix, often 50-80% faster
- Same output quality: Caching affects processing, not model behavior
How Each Provider Handles It
The three major providers take meaningfully different approaches. Understanding the differences is key to optimizing your setup.
Anthropic: Explicit Control, Maximum Savings
Anthropic gives you explicit control over what gets cached. You mark specific content blocks with cache_control, and the system caches everything up to that breakpoint.
Pricing:
- Cache writes: 25% more expensive than base input tokens (one-time cost)
- Cache reads: 90% cheaper than base input tokens
- Default TTL: 5 minutes (refreshed on each hit)
- Extended TTL: 1 hour (available at additional cost)
Minimum cacheable tokens: 1,024 for Sonnet and Opus, 2,048 for Haiku.
How it works in practice:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a senior code reviewer. Here are the project guidelines: ...[2000+ tokens of context]...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Review this pull request: ..."}
    ]
)

# Check cache performance
print(response.usage.cache_creation_input_tokens)  # First request: tokens written
print(response.usage.cache_read_input_tokens)      # Subsequent requests: tokens read from cache
The first request pays the 25% write premium. Every subsequent request within the TTL window gets 90% off those cached tokens. For a system prompt of 4,000 tokens hit 20 times, the math is:
- Without caching: 4,000 × 20 = 80,000 tokens at full price
- With caching: 4,000 × 1.25 (write) + 4,000 × 19 × 0.10 (reads) = 5,000 + 7,600 = 12,600 effective tokens
- Savings: 84%
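To make that arithmetic reusable for your own workloads, here's a minimal sketch in plain Python (the 1.25 write premium and 0.10 read multiplier are Anthropic's published rates; the function name is illustrative):

def effective_tokens(prompt_tokens: int, requests: int,
                     write_premium: float = 1.25,
                     read_multiplier: float = 0.10) -> float:
    """One cache write plus discounted reads for the remaining requests."""
    return (prompt_tokens * write_premium
            + prompt_tokens * (requests - 1) * read_multiplier)

baseline = 4_000 * 20                 # 80,000 tokens at full price
cached = effective_tokens(4_000, 20)  # 5,000 + 7,600 = 12,600
print(f"Savings: {1 - cached / baseline:.0%}")  # -> Savings: 84%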
For longer-running sessions with consistent system prompts, extend the TTL:
"cache_control": {"type": "ephemeral", "ttl": "1h"}
OpenAI: Automatic, Zero Effort
OpenAI takes the opposite approach—caching is fully automatic. No code changes, no opt-in, no cache markers. If your request shares a prefix with a recent request, you get the discount.
Pricing:
- Cache writes: No additional cost
- Cache reads: 50% cheaper than base input tokens
- TTL: Varies (typically 5-10 minutes for low traffic, longer for high volume)
- Minimum tokens: 1,024 (shorter prompts are not cached)
Supported models: GPT-4o, GPT-4o-mini, o1, o3, and newer models.
How it works:
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are a senior code reviewer. Here are the project guidelines: ...[long context]..."
        },
        {
            "role": "user",
            "content": "Review this pull request: ..."
        }
    ]
)

# Check if caching was applied
cached = response.usage.prompt_tokens_details.cached_tokens
print(f"Cached tokens: {cached}")
The beauty of OpenAI's approach is simplicity. The trade-off: you get less control and a smaller discount (50% vs. 90%).
Key optimization: OpenAI matches prefixes from the start. Structure your messages with static content first:
1. System prompt (most stable)
2. Tool/function definitions (rarely change)
3. Reference documents (change occasionally)
4. Conversation history (grows each turn)
5. Current user message (always changes)
This ordering maximizes the cacheable prefix length.
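As a sketch, message assembly following that order might look like this (SYSTEM_PROMPT and build_messages are illustrative names, not part of OpenAI's API; tool definitions go in the separate tools parameter and also count toward the prefix):

SYSTEM_PROMPT = "You are a senior code reviewer. Guidelines: ..."  # placeholder

def build_messages(reference_docs: str, history: list[dict],
                   user_message: str) -> list[dict]:
    """Assemble messages static-first so the shared prefix is as long as possible."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},                    # 1. most stable
        {"role": "system", "content": f"Reference:\n{reference_docs}"},  # 3. changes occasionally
        *history,                                                        # 4. grows each turn
        {"role": "user", "content": user_message},                       # 5. always changes
    ]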
Google Gemini: Both Implicit and Explicit
Google offers two modes: implicit caching (automatic, like OpenAI) and explicit context caching (controlled, like Anthropic).
Implicit caching (enabled by default since May 2025):
- Automatic, no code changes
- Savings passed on transparently
- Works on most Gemini models
Explicit context caching:
- Create named cache objects with custom TTLs
- Up to 90% discount on Gemini 2.5+ models (75% on 2.0 models)
- Minimum 32,768 tokens for explicit caches
- Storage cost: based on token count and cache duration
from google import genai

client = genai.Client()

# Create an explicit cache
cache = client.caches.create(
    model="gemini-2.5-pro",
    config={
        "contents": [
            {
                "role": "user",
                "parts": [{"text": "Here is the entire codebase: ...[large context]..."}]
            }
        ],
        "ttl": "3600s",
        "display_name": "codebase-context"
    }
)

# Use the cache in requests
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Find security vulnerabilities in the authentication module.",
    config={"cached_content": cache.name}
)
Google's explicit caching shines for very large contexts—think full codebases, legal document sets, or research paper collections that you query repeatedly over hours.
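Because explicit caches bill for storage as long as they exist, it pays to manage their lifecycle deliberately. A sketch continuing from the example above, assuming the caches.update and caches.delete methods of the google-genai SDK:

# Extend the TTL mid-session so the cache outlives the original window
client.caches.update(
    name=cache.name,
    config={"ttl": "7200s"},  # keep the cache alive for two more hours
)

# Delete the cache when the session ends to stop accruing storage cost
client.caches.delete(name=cache.name)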
Provider Comparison at a Glance
Anthropic Claude:
- Activation: Explicit (cache_control)
- Read discount: 90%
- Write cost: +25%
- Default TTL: 5 minutes
- Min tokens: 1,024–2,048
- Best for: Controlled, high-frequency agent loops
OpenAI GPT-4o / o-series:
- Activation: Automatic
- Read discount: 50%
- Write cost: Free
- Default TTL: ~5–10 minutes
- Min tokens: ~1,024
- Best for: Drop-in savings, minimal optimization
Google Gemini 2.5+:
- Activation: Implicit + explicit
- Read discount: Up to 90%
- Write cost: Varies (explicit has storage cost)
- Default TTL: Configurable (explicit)
- Min tokens: 32,768 (explicit)
- Best for: Massive context windows, long-lived caches
Practical Strategies That Maximize Cache Hits
1. Front-Load Static Content
The single most impactful change: put everything that doesn't change at the beginning of your prompt. All providers use prefix-based matching.
✅ System prompt → Tool definitions → Reference docs → History → User query
❌ User query → System prompt → Reference docs → History
This applies to all providers but matters most for OpenAI (automatic prefix matching) and Anthropic (cache breakpoints apply to preceding content).
2. Separate Stable vs. Volatile Content
Structure your prompts into clearly separated blocks:
# Block 1: System identity (changes: never)
system_prompt = "You are a financial analyst..."
# Block 2: Tool definitions (changes: on deployment)
tools = [{"name": "get_stock_price", ...}, ...]
# Block 3: Knowledge base (changes: daily)
knowledge = "Current market data as of 2026-03-18: ..."
# Block 4: Conversation (changes: every turn)
messages = [...]
With Anthropic, place cache_control breakpoints after blocks 1, 2, and 3. With OpenAI, just keep this order consistent.
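A minimal sketch of those breakpoints, reusing the block variables above (an Anthropic client is assumed; the API assembles the cached prefix as tools → system → messages and allows up to four breakpoints):

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=tools[:-1] + [
        {**tools[-1], "cache_control": {"type": "ephemeral"}}  # breakpoint: block 2 (all tools)
    ],
    system=[
        {"type": "text", "text": system_prompt,
         "cache_control": {"type": "ephemeral"}},  # breakpoint: block 1 (identity)
        {"type": "text", "text": knowledge,
         "cache_control": {"type": "ephemeral"}},  # breakpoint: block 3 (knowledge base)
    ],
    messages=messages,  # block 4: changes every turn, left uncached
)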
3. Use Multi-Turn Caching for Agents
AI agents running multi-turn conversations benefit enormously. Each turn re-sends the full conversation history, meaning the cumulative system prompt + earlier turns are identical across requests.
For a 10-turn agent conversation with a 3,000-token system prompt:
- Without caching: ~30,000 system prompt tokens re-processed
- With Anthropic caching: 3,000 × 1.25 (write) + 27,000 × 0.10 (reads) = 3,750 + 2,700 = 6,450 effective tokens
- Savings: ~78%
This compounds with conversation history caching—each new turn caches everything before it.
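One way to structure that with Anthropic, as a sketch (client and the cached system_blocks are assumed from the earlier examples; only the newest message carries a breakpoint because the API allows at most four):

conversation = []

def run_turn(user_input: str) -> str:
    # Drop the previous turn's breakpoint so only the newest message has one
    for msg in conversation:
        if msg["role"] == "user":
            msg["content"][-1].pop("cache_control", None)

    # Mark the newest message: everything up to and including it becomes
    # the cached prefix for the next turn
    conversation.append({
        "role": "user",
        "content": [{"type": "text", "text": user_input,
                     "cache_control": {"type": "ephemeral"}}],
    })

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=system_blocks,  # assumed: cached system blocks as shown earlier
        messages=conversation,
    )
    conversation.append({"role": "assistant", "content": response.content})
    return response.content[0].text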
4. Batch Similar Requests
If you're processing multiple items against the same context (e.g., reviewing 50 PRs against the same coding standards), batch them in quick succession. Cache TTLs are typically 5 minutes, so sending requests within that window maximizes hits.
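A sketch of this pattern with Anthropic (pull_requests and coding_standards are placeholders):

# Send the reviews back-to-back so each request lands inside the 5-minute
# TTL window; every cache hit also refreshes the TTL.
for pr in pull_requests:  # hypothetical list of PR diffs
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[{"type": "text", "text": coding_standards,  # shared, must exceed min cacheable size
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": f"Review this PR:\n{pr}"}],
    )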
5. Monitor Cache Hit Rates
All providers return cache metrics in API responses. Track them:
# Anthropic
cache_hits = response.usage.cache_read_input_tokens
cache_misses = response.usage.cache_creation_input_tokens
# OpenAI
cached = response.usage.prompt_tokens_details.cached_tokens
# Use these to calculate effective cost savings
hit_rate = cache_hits / (cache_hits + cache_misses)
If your hit rate is below 70%, your prompt structure likely has volatile content too early in the prefix.
When Caching Won't Help (and What to Do Instead)
Prompt caching isn't a silver bullet. It won't help when:
- Every request is unique: If prompts share no common prefix, there's nothing to cache
- Below minimum token thresholds: Short prompts under 1,024 tokens won't trigger caching
- Infrequent requests: If requests are more than 5-10 minutes apart, the cache expires
- High context variance: If your "static" content actually changes frequently
For these cases, focus on avoiding context window failures by reducing total token usage through better context selection, summarization, and retrieval strategies instead.
Integrating with the Vercel AI SDK
If you're using the Vercel AI SDK (popular for Next.js AI applications), caching integrates cleanly:
Anthropic caching via Vercel AI SDK:
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

const result = await streamText({
  model: anthropic('claude-sonnet-4-20250514'),
  messages: [
    {
      role: 'system',
      content: 'You are a helpful assistant with access to a large knowledge base...',
      providerMetadata: {
        anthropic: {
          cacheControl: { type: 'ephemeral' }
        }
      }
    },
    ...userMessages
  ]
});
Vercel AI Gateway automatic caching:
import { streamText } from 'ai';

const result = await streamText({
  model: 'anthropic/claude-sonnet-4-20250514',
  system: 'You are a helpful assistant...',
  prompt: userMessage,
  providerOptions: {
    gateway: {
      caching: 'auto'
    }
  }
});
The Gateway approach works across providers and automatically applies the best caching strategy for each.
Real-World Cost Savings: A Worked Example
Let's calculate savings for a typical AI agent handling customer support:
Setup:
- 4,000-token system prompt
- 2,000 tokens of tool definitions
- Average 8-turn conversations
- 500 conversations per day
- Using Claude Sonnet at $3/MTok input
Without caching:
- Per conversation: 6,000 base tokens × 8 turns = 48,000 input tokens (re-sent base)
- Daily: 48,000 × 500 = 24M tokens
- Daily cost: 24M × $3/MTok = $72/day
With Anthropic prompt caching (90% read discount):
- First turn: 6,000 tokens at 1.25× = 7,500 effective
- Turns 2-8: 6,000 tokens at 0.10× each = 4,200 effective
- Per conversation: 7,500 + 4,200 = 11,700 effective tokens (vs. 48,000)
- Daily: 11,700 × 500 = 5.85M effective tokens
- Daily cost: ~$17.55/day
Monthly savings: ~$1,634 or roughly 75% reduction.
And this only accounts for the system prompt + tools. Conversation history caching pushes savings higher as turns accumulate.
FAQ
Does prompt caching affect output quality?
No. Caching stores the intermediate computation (KV cache) from processing input tokens. The model produces identical outputs whether tokens are cached or freshly processed. There is zero quality difference.
Can I use prompt caching with streaming responses?
Yes. All providers support caching with both streaming and non-streaming responses. The caching happens at the input processing stage, before any output generation begins.
What happens when the cache expires?
The next request pays full input price (or the write premium on Anthropic) and creates a new cache entry. There's no error or degradation—the request just costs more and takes slightly longer.
Is there a maximum cache size?
Anthropic and OpenAI don't publish hard limits, but practical limits align with model context windows (200K tokens for Claude, 128K for GPT-4o). Google's explicit caching supports up to the full context window of the model.
Should I use Anthropic or OpenAI caching?
It depends on your usage pattern. If you have high-frequency, repetitive requests (agent loops, batch processing), Anthropic's 90% discount outweighs the 25% write premium quickly. For moderate usage where simplicity matters, OpenAI's automatic 50% discount requires zero effort. For very large, long-lived contexts, consider Google's explicit caching.
Does caching work with function/tool calling?
Yes. Tool definitions are part of the prompt and are cached like any other content. In fact, tool-heavy prompts benefit significantly because tool definitions are static and often large.
How do I know if caching is actually working?
All providers return cache metrics in API responses. Check cache_read_input_tokens (Anthropic), prompt_tokens_details.cached_tokens (OpenAI), or the usage object (Google). If these values are zero, your prompts likely aren't sharing a common prefix.
Getting Started Today
The fastest path to savings:
1. OpenAI users: You're probably already getting 50% savings. Check cached_tokens in your API responses to confirm
2. Anthropic users: Add cache_control: { type: "ephemeral" } to your system prompt and tool definitions. Monitor cache_read_input_tokens
3. Google users: Implicit caching is on by default. For large, reusable contexts, explore explicit CachedContent objects
Then restructure your prompts with static content first, monitor your cache hit rates, and adjust TTLs based on your request frequency.
Prompt caching is one of the highest-ROI optimizations in the context engineering toolkit. It's low effort, high reward, and available today across all major providers.
*Building AI agents? Read our guide to context engineering for AI agents and learn how to avoid context window failures that silently break your applications.*
> 💡 Cutting inference costs locally? Qwen 3.5 vs Qwen 2.5: Should You Upgrade? covers per-token speed and VRAM tradeoffs for local setups.