Prompt caching in 2026 — Anthropic, OpenAI, and Gemini compared
Prompt caching cuts the cost of cached input tokens by up to 90%. Anthropic requires explicit markers, OpenAI caches automatically, and Gemini bills hourly for explicit cache storage. Here is which one fits your workload.
By Ethan
Sending a 50,000-token system prompt on every API call costs $0.15 per call in input tokens at Sonnet 4.6’s base rate. At 10,000 calls a day, that is $1,500 a day spent on context you resend unchanged every time. All three major LLM providers now offer prompt caching: cache that prefix and read it back at 10% of the original cost. The mechanics differ enough to matter: Anthropic requires explicit markers, OpenAI caches automatically with no code changes, and Gemini adds an hourly storage fee that changes the math for low-traffic workloads.
Pick Anthropic if you need verified latency numbers, explicit control over cache breakpoints, and predictable per-token costs. Pick OpenAI if you want caching that works without touching your API calls. Pick Gemini if you are caching large multimodal assets — video, PDFs — and can absorb the storage billing model.
Who this is for
Developers adding caching to an existing LLM integration, or choosing a provider with caching as a requirement. If your prompts are below the provider’s minimum cacheable size (1,024 tokens at the low end, higher on some models), caching does not apply.
What is prompt caching
When you call an LLM, the model runs prefill on every input token before generating the first output token. That prefill step dominates time-to-first-token (TTFT). Prompt caching stores the key-value (KV) tensor representation of a prompt prefix on the provider’s servers. On a subsequent request with an identical prefix, the model skips prefill for those tokens entirely and reads the cached tensors directly.
The result: lower TTFT and a lower per-token cost for the cached portion. The catch: the prefix must be byte-identical — one changed character is a cache miss. Per-request context like timestamps, user IDs, or session tokens must go after the cache boundary.
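The placement rule above can be sketched in a few lines. This is a minimal illustration rather than any provider's API: `STATIC_PREFIX`, `build_prompt`, and `prefix_fingerprint` are hypothetical names, and the prompt text is a placeholder. The idea is to keep static text first, put per-request fields last, and log a hash of the prefix so drift shows up before it shows up as cache misses.

```python
import hashlib
from datetime import datetime, timezone

# Static text only: nothing here may vary between requests.
STATIC_PREFIX = (
    "You are a support assistant.\n"
    "<long, unchanging policy handbook goes here>\n"
)


def prefix_fingerprint(prefix: str) -> str:
    """Hash the exact bytes of the prefix; any changed byte changes the value."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:16]


def build_prompt(user_id: str, question: str, now: datetime) -> tuple[str, str]:
    """Return (cacheable prefix, per-request suffix placed after the boundary)."""
    suffix = f"Time: {now.isoformat()}\nUser: {user_id}\nQuestion: {question}"
    return STATIC_PREFIX, suffix


a_prefix, a_suffix = build_prompt(
    "u1", "Where is my order?", datetime(2026, 1, 1, tzinfo=timezone.utc)
)
b_prefix, b_suffix = build_prompt(
    "u2", "Cancel my plan.", datetime(2026, 1, 2, tzinfo=timezone.utc)
)

# Two different requests share one prefix fingerprint...
same_prefix = prefix_fingerprint(a_prefix) == prefix_fingerprint(b_prefix)
# ...while a single trailing space would produce a different one.
drift_detected = prefix_fingerprint(a_prefix + " ") != prefix_fingerprint(a_prefix)
print(same_prefix, drift_detected)
```

Logging the fingerprint next to each request makes it straightforward to see whether a deploy or a template change altered the prefix.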
Anthropic — explicit prefix caching
Anthropic’s caching is developer-controlled. You place a cache_control: {type: "ephemeral"} marker in your API request to define where the cached prefix ends. The system hashes everything up to that block and checks the cache store. On a hit, those tokens cost $0.30/M for Sonnet 4.6 instead of $3.00/M — a 90% discount.
Mechanics
Cache hierarchy: tools → system → messages. A change to tools invalidates system and messages cache. A change to system invalidates messages. Static system prompts belong in system; per-user context goes in messages after the breakpoint.
You can set up to 4 breakpoints per request. The cache looks back up to 20 blocks from each breakpoint to find a prior write, so growing conversation threads get hits even as they accumulate turns. Each cache hit resets the 5-minute TTL at no additional cost.
For workloads with gaps longer than 5 minutes, a 1-hour TTL is available; cache writes then cost 2× the base input rate instead of 1.25×.
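One way to lay out breakpoints along the tools → system → messages hierarchy is sketched below. It only builds the request payload as plain dictionaries and makes no API call. The tool definition, prompt text, and the helpers `with_breakpoint` and `count_breakpoints` are illustrative assumptions; only the `cache_control` marker and the limit of 4 breakpoints come from the text above.

```python
def with_breakpoint(block: dict) -> dict:
    """Return a copy of a block carrying an ephemeral cache marker."""
    return {**block, "cache_control": {"type": "ephemeral"}}


# Hypothetical tool definition; the marker on the last tool caches all tools.
tools = [
    {
        "name": "lookup_order",
        "description": "Fetch an order by ID.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
]
tools[-1] = with_breakpoint(tools[-1])  # breakpoint 1: end of tools

system = [
    {"type": "text", "text": "You are a support assistant."},
    # breakpoint 2: end of the static system context
    with_breakpoint({"type": "text", "text": "<long static policy handbook>"}),
]

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Where is order 1234?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Let me check."}]},
    # breakpoint 3: end of the conversation so far
    {"role": "user", "content": [with_breakpoint({"type": "text", "text": "Any update?"})]},
]

request = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "tools": tools,
    "system": system,
    "messages": messages,
}


def count_breakpoints(req: dict) -> int:
    """Count cache markers across tools, system blocks, and message blocks."""
    total = sum("cache_control" in tool for tool in req["tools"])
    total += sum("cache_control" in block for block in req["system"])
    for message in req["messages"]:
        total += sum("cache_control" in block for block in message["content"])
    return total


breakpoints = count_breakpoints(request)
if breakpoints > 4:
    raise ValueError("at most 4 cache breakpoints per request")
print(breakpoints)
```

With the Anthropic Python SDK, a payload shaped like this would be sent as `client.messages.create(**request)`. Because a change to tools invalidates everything after it, the most stable content gets the earliest breakpoint.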
Pricing (per 1M tokens)
| Model | Base input | 5-min write | 1-hour write | Cache read | Output |
|---|---|---|---|---|---|
| Claude Opus 4.8 | $5.00 | $6.25 | $10.00 | $0.50 | $25.00 |
| Claude Sonnet 4.6 | $3.00 | $3.75 | $6.00 | $0.30 | $15.00 |
| Claude Haiku 4.5 | $1.00 | $1.25 | $2.00 | $0.10 | $5.00 |
Write cost is 1.25× base for the 5-minute option and 2× base for the 1-hour option. Read cost is 0.10× base.
Minimum token size: 1,024 for Opus 4.8 and Sonnet 4.6; 4,096 for Haiku 4.5 and older Opus variants.
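To see how the write premium and read discount play out, here is a rough cost model using the Sonnet 4.6 rates from the table. It is a sketch under a simplifying assumption: one cache write followed by nothing but hits, with no expiry between calls. The function names are illustrative.

```python
def token_cost(tokens: int, rate_per_million: float) -> float:
    """Dollar cost of `tokens` at a per-million-token rate."""
    return tokens / 1_000_000 * rate_per_million


def cost_without_cache(prefix_tokens: int, calls: int, base_rate: float) -> float:
    """Every call pays the full base rate for the prefix."""
    return calls * token_cost(prefix_tokens, base_rate)


def cost_with_cache(
    prefix_tokens: int,
    calls: int,
    base_rate: float,
    write_multiplier: float = 1.25,  # 5-minute TTL; use 2.0 for the 1-hour TTL
    read_multiplier: float = 0.10,
) -> float:
    """One cache write, then (calls - 1) cache reads. Assumes every later call hits."""
    if calls <= 0:
        return 0.0
    write = token_cost(prefix_tokens, base_rate * write_multiplier)
    reads = (calls - 1) * token_cost(prefix_tokens, base_rate * read_multiplier)
    return write + reads


SONNET_BASE = 3.00  # dollars per 1M input tokens, from the table above

daily_uncached = cost_without_cache(50_000, 10_000, SONNET_BASE)
daily_cached = cost_with_cache(50_000, 10_000, SONNET_BASE)
print(f"uncached: ${daily_uncached:,.2f}  cached: ${daily_cached:,.2f}")

# Break-even check: with the 5-minute write rate, caching is already cheaper
# on the second call; with the 1-hour write rate it takes a third call.
two_calls_5m = cost_with_cache(50_000, 2, SONNET_BASE)
two_calls_1h = cost_with_cache(50_000, 2, SONNET_BASE, write_multiplier=2.0)
three_calls_1h = cost_with_cache(50_000, 3, SONNET_BASE, write_multiplier=2.0)
```

Real traffic will include expired entries and repeated writes, so treat the cached figure as a lower bound on spend.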
Latency impact (Anthropic-published benchmarks)
| Scenario | Cached tokens | TTFT before | TTFT after | Reduction |
|---|---|---|---|---|
| Chat with a 100k-token book | ~100,000 | 11.5 s | 2.4 s | 79% |
| Multi-turn with long system prompt | Long system | ~10 s | ~2.5 s | 75% |
| Many-shot prompting | ~10,000 | 1.6 s | 1.1 s | 31% |
These numbers come from Anthropic’s benchmark post at claude.com/blog/prompt-caching. Cost reductions in the same scenarios: 90%, 53%, and 86%.
Gotchas
Exact match required. Any byte change — a trailing space, a different image detail setting, a timestamp in the prompt — is a cache miss. Audit your prompt construction code before assuming the prefix is stable.
Concurrent first requests. The cache entry only becomes available after the first response begins streaming. Parallel requests that land before that point will all be cache misses. Warm the cache with one serial request before opening the floodgates.
Thinking blocks. On some model variants, non-tool-result user content strips previous thinking blocks from the cache. Check the model-specific documentation before relying on cached multi-turn thinking chains.
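The warm-up advice for concurrent first requests can be sketched as a small helper: send one request serially, then fan the rest out in parallel. `warm_then_fan_out` and `fake_call` are hypothetical; `fake_call` stands in for a real API call that shares a cached prefix.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

call_log: List[str] = []


def fake_call(question: str) -> str:
    """Stand-in for a real API call; records the order in which calls start."""
    call_log.append(question)
    return f"answer: {question}"


def warm_then_fan_out(
    questions: List[str],
    call: Callable[[str], str],
    max_workers: int = 8,
) -> List[str]:
    """Run the first request alone so it can write the cache, then parallelize."""
    if not questions:
        return []
    first = call(questions[0])  # serial: returns only after the response is complete
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rest = list(pool.map(call, questions[1:]))  # map preserves input order
    return [first, *rest]


results = warm_then_fan_out(["q0", "q1", "q2", "q3"], fake_call)
print(results)
```

Waiting for the first call to return is stricter than necessary, since the entry becomes available once the response begins, but it keeps the logic simple for non-streaming calls.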
Code example
import anthropic

client = anthropic.Anthropic()

# First request: cache write (pays 1.25× base on cached tokens)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are an expert analyzing legal documents."},
        {
            "type": "text",
            "text": "<full legal document, 50,000 tokens>",
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        },
    ],
    messages=[{"role": "user", "content": "Summarize the key liability clauses."}],
)

# Check usage to confirm cache write
print(response.usage.cache_creation_input_tokens)  # > 0 on first call
print(response.usage.cache_read_input_tokens)  # 0 on first call

# Subsequent request within 5 minutes: cache read (pays 0.10× base)
response2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are an expert analyzing legal documents."},
        {
            "type": "text",
            "text": "<full legal document, identical to above>",
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What are the termination conditions?"}],
)

print(response2.usage.cache_read_input_tokens)  # > 0 on cache hit
print(response2.usage.cache_creation_input_tokens)  # 0 on cache hit
For a complete implementation walkthrough — covering prompt caching alongside tool use and the Message Batches API — see Claude API in 2026: Prompt Caching, Tool Use & Batches.
OpenAI — automatic caching
OpenAI’s caching is fully automatic. No API flag required, no code changes. The system routes requests based on a hash of the prompt prefix (approximately the first 256 tokens) and reuses the cached prefix when a matching one is found on the same server.
Mechanics
Two opt-in parameters improve hit rates.
prompt_cache_key: A routing hint that groups requests sharing a long common prefix onto the same server. Without it, requests can land on different machines and miss the cache for identical prefixes. Keep each (prefix, cache_key) combination under 15 requests per minute; above that, requests overflow to non-cached servers.
prompt_cache_retention: "24h": Extends the cache from in-memory (5–10 minute inactivity, max 1 hour) to 24-hour GPU-local storage. Available on GPT-5.5, GPT-5.5-pro, GPT-5.4, GPT-5.2, GPT-5.1, GPT-5, GPT-5-codex, and GPT-4.1 as of mid-2026. This is the most significant competitive advantage OpenAI has in this space: overnight batch workflows and serverless deployments can stay warm across the idle gap without any warming logic.
Pricing
No additional write fee. Cache reads are priced at up to 90% off normal input token cost. OpenAI’s pricing page returned HTTP 403 during the data collection for this article, so exact per-model rates were not independently verified. The 90% reduction figure is stated in the developer docs at developers.openai.com/api/docs/guides/prompt-caching.
Minimum tokens: 1,024. Below that, cached_tokens: 0 appears in the response — no error, no penalty.
Latency impact
OpenAI claims up to 80% latency reduction. No model-specific benchmark table or methodology is published. Treat this as an upper-bound estimate.
Gotchas
No manual cache clearing. Eviction is automatic on inactivity. If your prompt prefix is stale and producing wrong answers, you cannot force a refresh — you wait for the TTL to expire or change the prefix.
Rate-induced misses. Above 15 requests per minute for the same prefix, overflow to non-cached servers increases. Use prompt_cache_key to keep traffic concentrated on one routing path.
Rate limits still apply. Cached tokens count toward your TPM limit the same as uncached tokens.
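usage.prompt_tokens_details.cached_tokens in a Chat Completions response shows how many tokens were served from cache; the Responses API reports the same count as usage.input_tokens_details.cached_tokens. Monitor this in production to confirm the cache is working.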
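For monitoring, a small reader that tolerates either usage shape can help. This is a sketch: `cached_tokens_from_usage` is a hypothetical helper, and the sample objects are hand-built stand-ins rather than real SDK responses. It assumes the count lives under `prompt_tokens_details` (Chat Completions) or `input_tokens_details` (Responses API) and returns 0 when neither is present.

```python
from types import SimpleNamespace


def cached_tokens_from_usage(usage) -> int:
    """Return the cached-token count from a usage object, or 0 if it is absent."""
    for field in ("prompt_tokens_details", "input_tokens_details"):
        details = getattr(usage, field, None)
        if details is not None:
            return getattr(details, "cached_tokens", 0) or 0
    return 0


# Hand-built stand-ins for the two usage shapes (not real API responses).
chat_usage = SimpleNamespace(
    prompt_tokens=2300,
    prompt_tokens_details=SimpleNamespace(cached_tokens=2048),
)
responses_usage = SimpleNamespace(
    input_tokens=2300,
    input_tokens_details=SimpleNamespace(cached_tokens=1920),
)
short_prompt_usage = SimpleNamespace(
    prompt_tokens=400,
    prompt_tokens_details=SimpleNamespace(cached_tokens=0),
)

for usage in (chat_usage, responses_usage, short_prompt_usage):
    print(cached_tokens_from_usage(usage))
```

A prompt under the 1,024-token minimum simply reports 0, so a zero here is not an error on its own; it only signals a problem on prompts you expect to be cacheable.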
Code example
from openai import OpenAI

client = OpenAI()

STATIC_SYSTEM = """You are a senior financial analyst specializing in equity research.
[... 2,000+ tokens of background knowledge ...]"""

# First call: automatic cache miss (no code change required)
response = client.responses.create(
    model="gpt-5",
    prompt_cache_key="equity-analyst-v1",  # routing hint for consistent hashing
    prompt_cache_retention="24h",  # extended retention; supported models only
    input=[
        {"role": "system", "content": STATIC_SYSTEM},
        {"role": "user", "content": "Analyze AAPL Q2 2026 earnings."},
    ],
)

# Subsequent calls: cache hit when prefix matches
response2 = client.responses.create(
    model="gpt-5",
    prompt_cache_key="equity-analyst-v1",
    prompt_cache_retention="24h",
    input=[
        {"role": "system", "content": STATIC_SYSTEM},  # must be byte-identical
        {"role": "user", "content": "Compare MSFT Q2 2026 to AAPL."},
    ],
)

# The Responses API reports cached tokens under usage.input_tokens_details
cached = response2.usage.input_tokens_details.cached_tokens
print(f"Tokens from cache: {cached}")
Gemini — two modes, one extra bill
Gemini has two caching modes. Implicit caching (Gemini 2.5+ only) requires no code changes and passes savings on when a cache hit occurs. Explicit caching lets you create a named CachedContent object, reference it in requests, and pay a separate hourly storage fee while the cache is alive.
Pricing (per 1M tokens)
| Model | Base input | Cache read | Storage per hour |
|---|---|---|---|
| Gemini 2.5 Flash | $0.30 | $0.03 | $1.00 |
| Gemini 2.5 Pro (≤200k ctx) | $1.25 | $0.125 | $4.50 |
| Gemini 2.5 Pro (>200k ctx) | $2.50 | $0.25 | $4.50 |
| Gemini 3.5 Flash | $1.50 | $0.15 | $1.00 |
| Gemini 3.1 Pro Preview (≤200k) | $2.00 | $0.20 | $4.50 |
Cache reads are 90% off base input price — same multiplier as Anthropic and OpenAI. The difference is the storage charge: Gemini explicit caching costs $1.00–$4.50 per 1M cached tokens per hour, regardless of how many requests read the cache.
Break-even on Gemini 2.5 Pro: Storage costs $4.50/M/hr. Each cache read saves $1.125/M ($1.25 − $0.125). Both figures scale with the size of the cache, so the break-even is $4.50 ÷ $1.125 = 4 reads per hour whatever the cache size, before counting the cost of creating the cache. For low-traffic workloads, explicit caching can cost more than re-sending the tokens.
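The break-even arithmetic generalizes to any row of the pricing table. The sketch below uses the rates listed above; `break_even_reads_per_hour` is an illustrative helper, and the model deliberately ignores the one-time cost of creating the cache.

```python
def break_even_reads_per_hour(
    storage_per_m_hr: float,
    base_input_per_m: float,
    cache_read_per_m: float,
) -> float:
    """Reads per hour at which per-read savings equal the hourly storage fee.

    All three rates are per 1M tokens, so the cache size cancels out.
    """
    saving_per_read = base_input_per_m - cache_read_per_m
    if saving_per_read <= 0:
        raise ValueError("cache reads must be cheaper than base input")
    return storage_per_m_hr / saving_per_read


def hourly_net_saving(
    reads_per_hour: float,
    cached_tokens_m: float,
    storage_per_m_hr: float,
    base_input_per_m: float,
    cache_read_per_m: float,
) -> float:
    """Dollars saved (positive) or lost (negative) per hour of keeping the cache."""
    saved = reads_per_hour * cached_tokens_m * (base_input_per_m - cache_read_per_m)
    return saved - cached_tokens_m * storage_per_m_hr


pro_break_even = break_even_reads_per_hour(4.50, 1.25, 0.125)
flash_break_even = break_even_reads_per_hour(1.00, 0.30, 0.03)
print(f"2.5 Pro: {pro_break_even:.2f} reads/hr, 2.5 Flash: {flash_break_even:.2f} reads/hr")

# A 1M-token cache on 2.5 Pro read twice an hour loses money; ten reads an hour saves.
low_traffic = hourly_net_saving(2, 1.0, 4.50, 1.25, 0.125)
high_traffic = hourly_net_saving(10, 1.0, 4.50, 1.25, 0.125)
```

Running the same numbers against your own request rate is a quick way to decide between explicit caching and simply re-sending the tokens.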
Minimum tokens: 2,048 for Gemini 2.5 Flash and Pro; 4,096 for Gemini 3.5 Flash and 3.1 Pro Preview.
Latency impact
No latency figures published. Google notes there is no latency SLA tied to implicit cache hits. usage_metadata.cached_content_token_count in the response tells you how many tokens came from cache, but TTFT benchmarks are not part of the documentation.
Gotchas
Storage accumulates. A 1M-token cache on Gemini 2.5 Pro costs $4.50/hr in storage. Leave it running 8 hours and you have spent $36 before a single query. Delete caches when done.
Ordering is mandatory. Cached content is always a prefix. Static context must come at the start of the prompt — you cannot cache a middle section.
No content inspection. You cannot retrieve what is in a cache — only metadata: name, model, expire_time. Debug cache construction before creating it, not after.
ttl and expire_time are mutually exclusive on cache updates. Pass one or the other, not both.
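The mutual-exclusion rule for ttl and expire_time is easy to enforce before a request is built. The helper below is a hypothetical guard, not part of any SDK: `cache_update_fields` and its parameter names are assumptions, and it only returns a plain dictionary of the single field to send.

```python
from datetime import datetime, timezone
from typing import Optional


def cache_update_fields(
    ttl_seconds: Optional[int] = None,
    expire_time: Optional[datetime] = None,
) -> dict:
    """Return exactly one of {'ttl': ...} or {'expire_time': ...}.

    Raises ValueError if both or neither are given.
    """
    if (ttl_seconds is None) == (expire_time is None):
        raise ValueError("pass exactly one of ttl_seconds or expire_time")
    if ttl_seconds is not None:
        if ttl_seconds <= 0:
            raise ValueError("ttl_seconds must be positive")
        return {"ttl": f"{ttl_seconds}s"}  # same string form as ttl="3600s"
    if expire_time.tzinfo is None:
        raise ValueError("expire_time must be timezone-aware")
    return {"expire_time": expire_time}


ttl_update = cache_update_fields(ttl_seconds=300)
deadline_update = cache_update_fields(
    expire_time=datetime(2026, 6, 30, 18, 0, tzinfo=timezone.utc)
)
print(ttl_update, deadline_update)
```

With the google-genai SDK, fields like these would typically be passed through an update config on a cache update call; check the SDK reference for the exact signature before relying on it.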
Code example
from google import genai
from google.genai import types
import io
import httpx

client = genai.Client()

# Upload large document once (pay full cost once)
pdf_bytes = httpx.get("https://example.com/large-report.pdf").content
document = client.files.upload(
    file=io.BytesIO(pdf_bytes),
    config=dict(mime_type="application/pdf"),
)

# Create cache: starts the hourly storage clock
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        display_name="quarterly-report-2026",
        system_instruction="You are a financial analyst. Answer based on the provided report.",
        contents=[document],
        ttl="3600s",
    ),
)

# Multiple queries: each pays cache read rate, not full input price
for question in [
    "What was total revenue in Q1?",
    "Summarize the risk factors.",
    "Compare operating margins to prior year.",
]:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=question,
        config=types.GenerateContentConfig(cached_content=cache.name),
    )
    meta = response.usage_metadata
    print(f"Cached tokens: {meta.cached_content_token_count}")

# Delete when done: stop paying storage
client.caches.delete(name=cache.name)
Comparison
| Feature | Anthropic Claude | OpenAI | Google Gemini |
|---|---|---|---|
| Caching type | Explicit (developer-controlled) | Automatic | Implicit (auto, 2.5+) + explicit |
| Min tokens | 1,024 (most models); 4,096 (Haiku 4.5, older Opus) | 1,024 | 2,048 (2.5 Flash/Pro); 4,096 (others) |
| TTL | 5 min default; 1 hr at 2× base input cost | 5–10 min inactivity, max 1 hr; 24 hr on GPT-5.x | 1 hr default; configurable |
| Extra write cost | +25% over base (5-min); +100% (1-hr) | None | Hourly storage ($1.00–$4.50/M/hr) |
| Cache read discount | 90% off | Up to 90% off | 90% off |
| Published TTFT reduction | 31–79% (benchmarked) | Up to 80% (claimed, no methodology) | None published |
| Max cache breakpoints | 4 per request | N/A (automatic) | N/A (one prefix per request) |
| Manual cache management | Partial (TTL refresh on hit) | None | Full CRUD |
| Multimodal support | Images, documents | Images, tools | Video, PDF, audio, images |
| Extended retention | 1 hr (opt-in) | 24 hr on GPT-5.x | Configurable TTL |
| Cross-request sharing | Workspace-level (Claude API) | Organization-level | Project-level |
When to pick which
Anthropic if you want the most control and the most transparent data. The benchmarks are real numbers from a public, citable source — you can run the same workload and compare. Four explicit breakpoints per request let you cache tools separately from system context separately from shared conversation history. The 90% read discount is on par with competitors. The main limitations: you have to instrument your prompts with markers, and the 5-minute default TTL is short for infrequently-used sessions (the 1-hour option helps, at a cost).
OpenAI if you want caching with no code changes, particularly for serverless or edge workloads where warming a cache ahead of time is not practical. The 24-hour retention on GPT-5.x is a concrete advantage for overnight batch pipelines. What you give up: no published latency benchmarks, no granular control over cache breakpoints, no way to force a cache refresh when the cached content is stale.
Gemini if you are caching large multimodal assets. Uploading a 2-hour video once and querying it 50 times is exactly the workload Gemini’s explicit caching was built for: re-sending large binary content per request is prohibitively expensive, and the storage fee is irrelevant when the alternative is gigabytes of data over the wire. For text-only workloads with modest traffic, run the break-even calculation before adding explicit caching: at Gemini 2.5 Pro rates, the cache needs at least 4 reads per hour to cover its storage cost.
To understand what prompt caching costs look like inside a running production system, see the real cost of running an AI agent team in 2026.
Measure what you cache
The cache only saves money if it is actually being hit. Add these to every response log:
- Anthropic: usage.cache_creation_input_tokens and usage.cache_read_input_tokens
- OpenAI: usage.prompt_tokens_details.cached_tokens
- Gemini: usage_metadata.cached_content_token_count
If your hit rate is below 60% on prompts you expect to be cacheable, something is changing the prefix between requests: timestamps in the static portion of the prompt, per-request metadata placed before the breakpoint, or a request setting such as image detail that varies from call to call. Fix the construction logic before concluding caching is not working.
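One reasonable way to compute that hit rate is token-weighted: cached input tokens divided by total input tokens across logged responses. The sketch below assumes Anthropic-style usage fields, where input_tokens counts only the uncached remainder, so the total is the sum of all three fields. The function names and sample log are illustrative.

```python
from typing import Dict, Iterable, List


def normalize_anthropic(usage: Dict[str, int]) -> Dict[str, int]:
    """Convert Anthropic-style usage into cached and total input token counts.

    Assumes input_tokens excludes cached tokens, so the three fields are summed.
    """
    cached = usage.get("cache_read_input_tokens", 0)
    total = (
        usage.get("input_tokens", 0)
        + usage.get("cache_creation_input_tokens", 0)
        + cached
    )
    return {"cached": cached, "total": total}


def token_hit_rate(records: Iterable[Dict[str, int]]) -> float:
    """Token-weighted hit rate: cached input tokens / total input tokens."""
    records = list(records)
    cached = sum(r["cached"] for r in records)
    total = sum(r["total"] for r in records)
    return cached / total if total else 0.0


# Illustrative log: one cache write followed by two cache reads.
usage_log: List[Dict[str, int]] = [
    {"input_tokens": 20, "cache_creation_input_tokens": 50_000, "cache_read_input_tokens": 0},
    {"input_tokens": 25, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 50_000},
    {"input_tokens": 30, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 50_000},
]

rate = token_hit_rate(normalize_anthropic(u) for u in usage_log)
needs_attention = rate < 0.60  # threshold from the text above
print(f"hit rate: {rate:.1%}, investigate: {needs_attention}")
```

For OpenAI and Gemini the cached count is reported alongside a total prompt token count, so the normalization step would read those two fields directly rather than summing.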
If caching reduces your per-call cost but your overall API spend is still high, LLM cost routing can cut costs further by directing simpler queries to cheaper models — an additional 80% reduction is common on classification or summarisation workloads.
Observability tools like Braintrust and LangSmith expose cache hit rates and cost attribution across sessions, which makes it easier to identify which part of a prompt is responsible for misses.
Primary sources
- Anthropic caching mechanics and pricing: platform.claude.com/docs/en/docs/build-with-claude/prompt-caching
- Anthropic TTFT and cost benchmarks: claude.com/blog/prompt-caching
- OpenAI caching guide: developers.openai.com/api/docs/guides/prompt-caching
- Gemini caching mechanics: ai.google.dev/gemini-api/docs/caching
- Gemini pricing: ai.google.dev/pricing