Prompt Caching: Cut Your AI Costs by Up to 90%
If you're running AI agents or LLM-powered applications in production, your API bill is probably your second biggest line item after salaries. The culprit? Sending the same tokens over and over—system prompts, tool definitions, conversation history, and reference documents that barely change between requests.
Prompt caching fixes this. Every major provider now offers it, and the savings are real: up to 90% off input token costs with Anthropic and Google, and 50% automatically with OpenAI.
This guide breaks down exactly how caching works across providers, when to use each strategy, and how to structure your prompts for maximum cache hits. If you're building AI agents with context engineering, this is table stakes.
What Is Prompt Caching?
Prompt caching stores the computed representation (KV cache) of your input tokens on the provider's servers. When your next request starts with the same prefix, the provider skips the expensive computation and reads from cache instead.
Think of it like a compiled binary vs. interpreting source code every time. The model doesn't re-process tokens it has already seen—it jumps straight to the new part.
The result:
- Lower cost: Cached tokens are billed at a fraction of regular input pricing
- Lower latency: Skip re-computation of the cached prefix, often 50-80% faster
- Same output quality: Caching affects processing, not model behavior
How Each Provider Handles It
The three major providers take meaningfully different approaches. Understanding the differences is key to optimizing your setup.
Anthropic: Explicit Control, Maximum Savings
Anthropic gives you explicit control over what gets cached. You mark specific content blocks with cache_control, and the system caches everything up to that breakpoint.
Pricing:
- Cache writes: 25% more expensive than base input tokens (one-time cost)
- Cache reads: 90% cheaper than base input tokens
- Default TTL: 5 minutes (refreshed on each hit)
- Extended TTL: 1 hour (available at additional cost)
Minimum cacheable tokens: 1,024 for Sonnet and Opus, 2,048 for Haiku.
How it works in practice:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a senior code reviewer. Here are the project guidelines: ...[2000+ tokens of context]...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Review this pull request: ..."}
    ]
)

# Check cache performance
print(response.usage.cache_creation_input_tokens)  # First request: tokens written
print(response.usage.cache_read_input_tokens)      # Subsequent requests: tokens read from cache
The first request pays the 25% write premium. Every subsequent request within the TTL window gets 90% off those cached tokens. For a system prompt of 4,000 tokens hit 20 times, the math is:
- Without caching: 4,000 × 20 = 80,000 tokens at full price
- With caching: 4,000 × 1.25 (write) + 4,000 × 19 × 0.10 (reads) = 5,000 + 7,600 = 12,600 effective tokens
- Savings: 84%
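To make that arithmetic reusable for your own workloads, here's a minimal sketch in plain Python (the 1.25 write premium and 0.10 read multiplier are Anthropic's published rates; the function name is illustrative):

def effective_tokens(prompt_tokens: int, requests: int,
                     write_premium: float = 1.25,
                     read_multiplier: float = 0.10) -> float:
    """One cache write plus discounted reads for the remaining requests."""
    return (prompt_tokens * write_premium
            + prompt_tokens * (requests - 1) * read_multiplier)

baseline = 4_000 * 20                 # 80,000 tokens at full price
cached = effective_tokens(4_000, 20)  # 5,000 + 7,600 = 12,600
print(f"Savings: {1 - cached / baseline:.0%}")  # -> Savings: 84%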
For longer-running sessions with consistent system prompts, extend the TTL:
"cache_control": {"type": "ephemeral", "ttl": "1h"}
OpenAI: Automatic, Zero Effort
OpenAI takes the opposite approach—caching is fully automatic. No code changes, no opt-in, no cache markers. If your request shares a prefix with a recent request, you get the discount.
Pricing:
- Cache writes: No additional cost
- Cache reads: 50% cheaper than base input tokens
- TTL: Varies (typically 5-10 minutes for low traffic, longer for high volume)
- Minimum tokens: 1,024 (shorter prompts are not cached)
Supported models: GPT-4o, GPT-4o-mini, o1, o3, and newer models.
How it works:
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are a senior code reviewer. Here are the project guidelines: ...[long context]..."
        },
        {
            "role": "user",
            "content": "Review this pull request: ..."
        }
    ]
)

# Check if caching was applied
cached = response.usage.prompt_tokens_details.cached_tokens
print(f"Cached tokens: {cached}")
The beauty of OpenAI's approach is simplicity. The trade-off: you get less control and a smaller discount (50% vs. 90%).
Key optimization: OpenAI matches prefixes from the start. Structure your messages with static content first:
1. System prompt (most stable)
2. Tool/function definitions (rarely change)
3. Reference documents (change occasionally)
4. Conversation history (grows each turn)
5. Current user message (always changes)
This ordering maximizes the cacheable prefix length.
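As a sketch, message assembly following that order might look like this (SYSTEM_PROMPT and build_messages are illustrative names, not part of OpenAI's API; tool definitions go in the separate tools parameter and also count toward the prefix):

SYSTEM_PROMPT = "You are a senior code reviewer. Guidelines: ..."  # placeholder

def build_messages(reference_docs: str, history: list[dict],
                   user_message: str) -> list[dict]:
    """Assemble messages static-first so the shared prefix is as long as possible."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},                    # 1. most stable
        {"role": "system", "content": f"Reference:\n{reference_docs}"},  # 3. changes occasionally
        *history,                                                        # 4. grows each turn
        {"role": "user", "content": user_message},                       # 5. always changes
    ]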
Google Gemini: Both Implicit and Explicit
Google offers two modes: implicit caching (automatic, like OpenAI) and explicit context caching (controlled, like Anthropic).
Implicit caching (enabled by default since May 2025):
- Automatic, no code changes
- Savings passed on transparently
- Works on most Gemini models
Explicit context caching:
- Create named cache objects with custom TTLs
- Up to 90% discount on Gemini 2.5+ models (75% on 2.0 models)
- Minimum 32,768 tokens for explicit caches
- Storage cost: based on token count and cache duration
from google import genai

client = genai.Client()

# Create an explicit cache
cache = client.caches.create(
    model="gemini-2.5-pro",
    config={
        "contents": [
            {
                "role": "user",
                "parts": [{"text": "Here is the entire codebase: ...[large context]..."}]
            }
        ],
        "ttl": "3600s",
        "display_name": "codebase-context"
    }
)

# Use the cache in requests
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Find security vulnerabilities in the authentication module.",
    config={"cached_content": cache.name}
)
Google's explicit caching shines for very large contexts—think full codebases, legal document sets, or research paper collections that you query repeatedly over hours.
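Because explicit caches bill for storage as long as they exist, it pays to manage their lifecycle deliberately. A sketch continuing from the example above, assuming the caches.update and caches.delete methods of the google-genai SDK:

# Extend the TTL mid-session so the cache outlives the original window
client.caches.update(
    name=cache.name,
    config={"ttl": "7200s"},  # keep the cache alive for two more hours
)

# Delete the cache when the session ends to stop accruing storage cost
client.caches.delete(name=cache.name)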
Provider Comparison at a Glance
Anthropic Claude:
- Activation: Explicit (cache_control)
- Read discount: 90%
- Write cost: +25%
- Default TTL: 5 minutes
- Min tokens: 1,024–2,048
- Best for: Controlled, high-frequency agent loops
OpenAI GPT-4o / o-series:
- Activation: Automatic
- Read discount: 50%
- Write cost: Free
- Default TTL: ~5–10 minutes
- Min tokens: ~1,024
- Best for: Drop-in savings, minimal optimization
Google Gemini 2.5+:
- Activation: Implicit + explicit
- Read discount: Up to 90%
- Write cost: Varies (explicit has storage cost)
- Default TTL: Configurable (explicit)
- Min tokens: 32,768 (explicit)
- Best for: Massive context windows, long-lived caches
Practical Strategies That Maximize Cache Hits
1. Front-Load Static Content
The single most impactful change: put everything that doesn't change at the beginning of your prompt. All providers use prefix-based matching.
✅ System prompt → Tool definitions → Reference docs → History → User query
❌ User query → System prompt → Reference docs → History
This applies to all providers but matters most for OpenAI (automatic prefix matching) and Anthropic (cache breakpoints apply to preceding content).
2. Separate Stable vs. Volatile Content
Structure your prompts into clearly separated blocks:
# Block 1: System identity (changes: never)
system_prompt = "You are a financial analyst..."
# Block 2: Tool definitions (changes: on deployment)
tools = [{"name": "get_stock_price", ...}, ...]
# Block 3: Knowledge base (changes: daily)
knowledge = "Current market data as of 2026-03-18: ..."
# Block 4: Conversation (changes: every turn)
messages = [...]
With Anthropic, place cache_control breakpoints after blocks 1, 2, and 3. With OpenAI, just keep this order consistent.
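A minimal sketch of those breakpoints, reusing the block variables above (an Anthropic client is assumed; the API assembles the cached prefix as tools → system → messages and allows up to four breakpoints):

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=tools[:-1] + [
        {**tools[-1], "cache_control": {"type": "ephemeral"}}  # breakpoint: block 2 (all tools)
    ],
    system=[
        {"type": "text", "text": system_prompt,
         "cache_control": {"type": "ephemeral"}},  # breakpoint: block 1 (identity)
        {"type": "text", "text": knowledge,
         "cache_control": {"type": "ephemeral"}},  # breakpoint: block 3 (knowledge base)
    ],
    messages=messages,  # block 4: changes every turn, left uncached
)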
3. Use Multi-Turn Caching for Agents
AI agents running multi-turn conversations benefit enormously. Each turn re-sends the full conversation history, meaning the cumulative system prompt + earlier turns are identical across requests.
For a 10-turn agent conversation with a 3,000-token system prompt:
- Without caching: ~30,000 system prompt tokens re-processed
- With Anthropic caching: 3,000 × 1.25 (write) + 27,000 × 0.10 (reads) = 3,750 + 2,700 = 6,450 effective tokens
- Savings: ~78%
This compounds with conversation history caching—each new turn caches everything before it.
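One way to structure that with Anthropic, as a sketch (client and the cached system_blocks are assumed from the earlier examples; only the newest message carries a breakpoint because the API allows at most four):

conversation = []

def run_turn(user_input: str) -> str:
    # Drop the previous turn's breakpoint so only the newest message has one
    for msg in conversation:
        if msg["role"] == "user":
            msg["content"][-1].pop("cache_control", None)

    # Mark the newest message: everything up to and including it becomes
    # the cached prefix for the next turn
    conversation.append({
        "role": "user",
        "content": [{"type": "text", "text": user_input,
                     "cache_control": {"type": "ephemeral"}}],
    })

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=system_blocks,  # assumed: cached system blocks as shown earlier
        messages=conversation,
    )
    conversation.append({"role": "assistant", "content": response.content})
    return response.content[0].text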
4. Batch Similar Requests
If you're processing multiple items against the same context (e.g., reviewing 50 PRs against the same coding standards), batch them in quick succession. Cache TTLs are typically 5 minutes, so sending requests within that window maximizes hits.
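A sketch of this pattern with Anthropic (pull_requests and coding_standards are placeholders):

# Send the reviews back-to-back so each request lands inside the 5-minute
# TTL window; every cache hit also refreshes the TTL.
for pr in pull_requests:  # hypothetical list of PR diffs
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[{"type": "text", "text": coding_standards,  # shared, must exceed min cacheable size
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": f"Review this PR:\n{pr}"}],
    )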
5. Monitor Cache Hit Rates
All providers return cache metrics in API responses. Track them:
# Anthropic
cache_hits = response.usage.cache_read_input_tokens
cache_misses = response.usage.cache_creation_input_tokens
# OpenAI
cached = response.usage.prompt_tokens_details.cached_tokens
# Use these to calculate effective cost savings
hit_rate = cache_hits / (cache_hits + cache_misses)
If your hit rate is below 70%, your prompt structure likely has volatile content too early in the prefix.
When Caching Won't Help (and What to Do Instead)
Prompt caching isn't a silver bullet. It won't help when:
- Every request is unique: If prompts share no common prefix, there's nothing to cache
- Below minimum token thresholds: Short prompts under 1,024 tokens won't trigger caching
- Infrequent requests: If requests are more than 5-10 minutes apart, the cache expires
- High context variance: If your "static" content actually changes frequently
For these cases, focus on avoiding context window failures by reducing total token usage through better context selection, summarization, and retrieval strategies instead.
Integrating with the Vercel AI SDK
If you're using the Vercel AI SDK (popular for Next.js AI applications), caching integrates cleanly:
Anthropic caching via Vercel AI SDK:
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

const result = await streamText({
  model: anthropic('claude-sonnet-4-20250514'),
  messages: [
    {
      role: 'system',
      content: 'You are a helpful assistant with access to a large knowledge base...',
      providerMetadata: {
        anthropic: {
          cacheControl: { type: 'ephemeral' }
        }
      }
    },
    ...userMessages
  ]
});
Vercel AI Gateway automatic caching:
import { streamText } from 'ai';

const result = await streamText({
  model: 'anthropic/claude-sonnet-4-20250514',
  system: 'You are a helpful assistant...',
  prompt: userMessage,
  providerOptions: {
    gateway: {
      caching: 'auto'
    }
  }
});
The Gateway approach works across providers and automatically applies the best caching strategy for each.
Real-World Cost Savings: A Worked Example
Let's calculate savings for a typical AI agent handling customer support:
Setup:
- 4,000-token system prompt
- 2,000 tokens of tool definitions
- Average 8-turn conversations
- 500 conversations per day
- Using Claude Sonnet at $3/MTok input
Without caching:
- Per conversation: 6,000 base tokens × 8 turns = 48,000 input tokens (re-sent base)
- Daily: 48,000 × 500 = 24M tokens
- Daily cost: 24M × $3/MTok = $72/day
With Anthropic prompt caching (90% read discount):
- First turn: 6,000 tokens at 1.25× = 7,500 effective
- Turns 2-8: 6,000 tokens at 0.10× each = 4,200 effective
- Per conversation: 7,500 + 4,200 = 11,700 effective tokens (vs. 48,000)
- Daily: 11,700 × 500 = 5.85M effective tokens
- Daily cost: ~$17.55/day
Monthly savings: ~$1,634 or roughly 75% reduction.
And this only accounts for the system prompt + tools. Conversation history caching pushes savings higher as turns accumulate.
FAQ
Does prompt caching affect output quality?
No. Caching stores the intermediate computation (KV cache) from processing input tokens. The model produces identical outputs whether tokens are cached or freshly processed. There is zero quality difference.
Can I use prompt caching with streaming responses?
Yes. All providers support caching with both streaming and non-streaming responses. The caching happens at the input processing stage, before any output generation begins.
What happens when the cache expires?
The next request pays full input price (or the write premium on Anthropic) and creates a new cache entry. There's no error or degradation—the request just costs more and takes slightly longer.
Is there a maximum cache size?
Anthropic and OpenAI don't publish hard limits, but practical limits align with model context windows (200K tokens for Claude, 128K for GPT-4o). Google's explicit caching supports up to the full context window of the model.
Should I use Anthropic or OpenAI caching?
It depends on your usage pattern. If you have high-frequency, repetitive requests (agent loops, batch processing), Anthropic's 90% discount outweighs the 25% write premium quickly. For moderate usage where simplicity matters, OpenAI's automatic 50% discount requires zero effort. For very large, long-lived contexts, consider Google's explicit caching.
Does caching work with function/tool calling?
Yes. Tool definitions are part of the prompt and are cached like any other content. In fact, tool-heavy prompts benefit significantly because tool definitions are static and often large.
How do I know if caching is actually working?
All providers return cache metrics in API responses. Check cache_read_input_tokens (Anthropic), prompt_tokens_details.cached_tokens (OpenAI), or the usage object (Google). If these values are zero, your prompts likely aren't sharing a common prefix.
Getting Started Today
The fastest path to savings:
1. OpenAI users: You're probably already getting 50% savings. Check cached_tokens in your API responses to confirm
2. Anthropic users: Add cache_control: { type: "ephemeral" } to your system prompt and tool definitions. Monitor cache_read_input_tokens
3. Google users: Implicit caching is on by default. For large, reusable contexts, explore explicit CachedContent objects
Then restructure your prompts with static content first, monitor your cache hit rates, and adjust TTLs based on your request frequency.
Prompt caching is one of the highest-ROI optimizations in the context engineering toolkit. It's low effort, high reward, and available today across all major providers.
*Building AI agents? Read our guide to context engineering for AI agents and learn how to avoid context window failures that silently break your applications.*
> 💡 Cutting inference costs locally? Qwen 3.5 vs Qwen 2.5: Should You Upgrade? covers per-token speed and VRAM tradeoffs for local setups.