Prompt caching in 2026 — Anthropic, OpenAI, and Gemini compared

Prompt caching cuts the cost of repeated input tokens by up to 90%. Anthropic requires explicit markers, OpenAI caches automatically, Gemini bills hourly. Here is which one fits your workload.

Sending a 50,000-token system prompt on every API call costs $0.15 per call at Sonnet 4.6’s base input rate of $3.00 per million tokens. At 10,000 calls a day, that is $1,500 a day spent on context you resend unchanged. All three major LLM providers now offer prompt caching: cache that prefix and read it back at 10% of the original cost. The mechanics differ enough to matter: Anthropic requires explicit markers, OpenAI caches automatically with no code changes, and Gemini adds an hourly storage fee that changes the math for low-traffic workloads.

Pick Anthropic if you need verified latency numbers, explicit control over cache breakpoints, and predictable per-token costs. Pick OpenAI if you want caching that works without touching your API calls. Pick Gemini if you are caching large multimodal assets — video, PDFs — and can absorb the storage billing model.

Who this is for

Developers adding caching to an existing LLM integration, or choosing a provider with caching as a requirement. If your prompts are below the minimum cacheable size (1,024 tokens on most models covered here, up to 4,096 on some), caching does not apply.

What is prompt caching

When you call an LLM, the model runs prefill on every input token before generating the first output token. That prefill step dominates time-to-first-token (TTFT). Prompt caching stores the key-value (KV) tensor representation of a prompt prefix on the provider’s servers. On a subsequent request with an identical prefix, the model skips prefill for those tokens entirely and reads the cached tensors directly.

The result: lower TTFT and a lower per-token cost for the cached portion. The catch: the prefix must be byte-identical — one changed character is a cache miss. Per-request context like timestamps, user IDs, or session tokens must go after the cache boundary.

Anthropic — explicit prefix caching

Anthropic’s caching is developer-controlled. You place a cache_control: {type: "ephemeral"} marker in your API request to define where the cached prefix ends. The system hashes everything up to that block and checks the cache store. On a hit, those tokens cost $0.30/M for Sonnet 4.6 instead of $3.00/M — a 90% discount.

Mechanics

Cache hierarchy: tools → system → messages. A change to tools invalidates system and messages cache. A change to system invalidates messages. Static system prompts belong in system; per-user context goes in messages after the breakpoint.
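The invalidation order described above can be sketched as a small lookup. This is an illustrative model of the rule only, not Anthropic's implementation; `CACHE_LEVELS` and `invalidated_levels` are hypothetical names.

```python
# Order in which Anthropic builds the cached prefix, per the rule above.
CACHE_LEVELS = ("tools", "system", "messages")


def invalidated_levels(changed: str) -> tuple[str, ...]:
    """Return the levels whose cached prefix is lost when `changed` is edited.

    A change at one level invalidates that level and everything after it,
    because later levels are hashed on top of earlier ones.
    """
    start = CACHE_LEVELS.index(changed)  # raises ValueError for unknown levels
    return CACHE_LEVELS[start:]


for level in CACHE_LEVELS:
    print(level, "->", invalidated_levels(level))
```

The practical consequence is that the content that changes least often should sit earliest in the hierarchy.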

You can set up to 4 breakpoints per request. The cache looks back up to 20 blocks from each breakpoint to find a prior write, so growing conversation threads get hits even as they accumulate turns. Each cache hit resets the 5-minute TTL at no additional cost.

For workloads with gaps longer than 5 minutes, a 1-hour TTL is available at 2× the base input cost.
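To see whether the 5-minute or the 1-hour TTL fits a traffic pattern, a rough simulation can help. The sketch below is a simplified model under stated assumptions: every hit or write restarts the TTL clock, requests are strictly sequential, and the exact boundary behavior is a guess. `count_cache_hits` is a hypothetical helper.

```python
def count_cache_hits(request_times_s: list[float], ttl_s: float = 300.0) -> tuple[int, int]:
    """Return (hits, writes) for a sorted list of request times in seconds.

    Simplified model: the first request writes the cache; a request that
    arrives before the current entry expires is a hit and refreshes the TTL;
    any later request is a miss that writes the cache again.
    """
    hits = 0
    writes = 0
    expires_at = None
    for t in request_times_s:
        if expires_at is not None and t <= expires_at:
            hits += 1
        else:
            writes += 1
        expires_at = t + ttl_s  # both hits and writes restart the clock
    return hits, writes


gappy_traffic = [0, 120, 400, 1000, 1100]  # seconds since the first request
print("5-minute TTL:", count_cache_hits(gappy_traffic, ttl_s=300))
print("1-hour TTL:  ", count_cache_hits(gappy_traffic, ttl_s=3600))
```

With this sample traffic the 5-minute TTL pays for two writes and the 1-hour TTL for one, so the choice comes down to whether the avoided rewrite is worth the higher 1-hour write price.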

Pricing (per 1M tokens)

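| Model | Base input | 5-min write | 1-hour write | Cache read | Output |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.8 | $5.00 | $6.25 | $10.00 | $0.50 | $25.00 |
| Claude Sonnet 4.6 | $3.00 | $3.75 | $6.00 | $0.30 | $15.00 |
| Claude Haiku 4.5 | $1.00 | $1.25 | $2.00 | $0.10 | $5.00 |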

Write cost is 1.25× base for the 5-minute option. Read cost is 0.10× base.
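The multipliers above translate into a simple cost model. The sketch below uses the Sonnet 4.6 prices from the pricing table and assumes the ideal case where every call after the first is a cache hit (no TTL expiry, no prefix drift). `CacheRates` and `prefix_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheRates:
    """Per-million-token prices in USD."""
    base_input: float
    write_5m: float
    cache_read: float


# Sonnet 4.6 figures as listed in the pricing table.
SONNET_4_6 = CacheRates(base_input=3.00, write_5m=3.75, cache_read=0.30)


def prefix_cost(rates: CacheRates, prefix_tokens: int, calls: int, cached: bool) -> float:
    """Cost in USD of sending the same prefix `calls` times.

    With caching, the first call pays the write rate and every later call
    pays the read rate. Assumes every later call is a hit.
    """
    if calls <= 0:
        return 0.0
    millions = prefix_tokens / 1_000_000
    if not cached:
        return calls * millions * rates.base_input
    return millions * (rates.write_5m + (calls - 1) * rates.cache_read)


uncached = prefix_cost(SONNET_4_6, 50_000, 10_000, cached=False)
cached = prefix_cost(SONNET_4_6, 50_000, 10_000, cached=True)
print(f"uncached: ${uncached:,.2f}  cached: ${cached:,.2f}")
```

Under these assumptions the 50,000-token prefix at 10,000 calls drops from $1,500 to roughly $150. A single call costs more with caching than without, and the second call already recovers the write premium.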

Minimum token size: 1,024 for Opus 4.8 and Sonnet 4.6; 4,096 for Haiku 4.5 and older Opus variants.

Latency impact (Anthropic-published benchmarks)

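| Scenario | Cached tokens | TTFT before | TTFT after | Reduction |
| --- | --- | --- | --- | --- |
| Chat with a 100k-token book | ~100,000 | 11.5 s | 2.4 s | 79% |
| Multi-turn with long system prompt | Long system prompt | ~10 s | ~2.5 s | 75% |
| Many-shot prompting | ~10,000 | 1.6 s | 1.1 s | 31% |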

These numbers come from Anthropic’s benchmark post at claude.com/blog/prompt-caching. Cost reductions in the same scenarios: 90%, 53%, and 86%.

Gotchas

Exact match required. Any byte change — a trailing space, a different image detail setting, a timestamp in the prompt — is a cache miss. Audit your prompt construction code before assuming the prefix is stable.
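One way to audit prefix stability is to log a local fingerprint of the blocks you intend to cache and compare it across requests. This is a sketch only: the hash is your own diagnostic, not the hash the provider computes, and `prefix_fingerprint` is a hypothetical helper.

```python
import hashlib
import json


def prefix_fingerprint(system_blocks: list[dict]) -> str:
    """SHA-256 of a JSON encoding of the blocks meant to be cached.

    Keys are deliberately not sorted, so a change in key order also changes
    the fingerprint. That errs on the side of false alarms.
    """
    encoded = json.dumps(system_blocks, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


stable = [{"type": "text", "text": "You are an expert analyzing legal documents."}]
drifted = [{"type": "text", "text": "You are an expert analyzing legal documents. "}]

# A single trailing space is enough to produce a different fingerprint.
print(prefix_fingerprint(stable) == prefix_fingerprint(drifted))
```

Logging the fingerprint next to the cache usage fields makes it easy to tell whether a miss came from a changed prefix or from an expired TTL.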

Concurrent first requests. The cache entry only becomes available once the first response begins streaming. Parallel requests that land before that point will all be cache misses. Warm the cache with one serial request before opening the floodgates.
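The warm-then-parallelize pattern can be sketched as follows. `call_model` is a hypothetical wrapper around your own API call that sends the shared cached prefix plus one question and returns the reply text. The sketch waits for the first response to finish completely, which is more conservative than strictly necessary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def warm_then_fan_out(
    call_model: Callable[[str], str],
    questions: Sequence[str],
    max_workers: int = 8,
) -> list[str]:
    """Send the first question alone, then the remaining ones in parallel.

    Results are returned in the same order as `questions`.
    """
    if not questions:
        return []
    first = call_model(questions[0])  # serial request writes the cache
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rest = list(pool.map(call_model, questions[1:]))  # these can read it
    return [first, *rest]
```

A thread pool is used here because API calls are I/O-bound; an asyncio version would follow the same shape, with one awaited call before a gather.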

Thinking blocks. On some model variants, non-tool-result user content strips previous thinking blocks from the cache. Check the model-specific documentation before relying on cached multi-turn thinking chains.

Code example

import anthropic

client = anthropic.Anthropic()

# First request — cache write (pays 1.25× base on cached tokens)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are an expert analyzing legal documents."},
        {
            "type": "text",
            "text": "<full legal document — 50,000 tokens>",
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        },
    ],
    messages=[{"role": "user", "content": "Summarize the key liability clauses."}],
)

# Check usage to confirm cache write
print(response.usage.cache_creation_input_tokens)  # > 0 on first call
print(response.usage.cache_read_input_tokens)       # 0 on first call

# Subsequent request within 5 minutes — cache read (pays 0.10× base)
response2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are an expert analyzing legal documents."},
        {
            "type": "text",
            "text": "<full legal document — identical to above>",
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What are the termination conditions?"}],
)

print(response2.usage.cache_read_input_tokens)        # > 0 on cache hit
print(response2.usage.cache_creation_input_tokens)    # 0 on cache hit

For a complete implementation walkthrough — covering prompt caching alongside tool use and the Message Batches API — see Claude API in 2026: Prompt Caching, Tool Use & Batches.

OpenAI — automatic caching

OpenAI’s caching is fully automatic. No API flag required, no code changes. The system routes requests based on a hash of the prompt prefix (approximately the first 256 tokens) and serves cached results when a matching prefix is found on the same server.

Mechanics

Two opt-in parameters improve hit rates.

prompt_cache_key: A routing hint that groups requests sharing a long common prefix onto the same server. Without it, requests can land on different machines and miss the cache for identical prefixes. Keep each (prefix, cache_key) combination under 15 requests per minute; above that, requests overflow to non-cached servers.
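If a single prefix sees more traffic than the 15 requests per minute mentioned above, one possible approach is to spread requests over several cache keys. This is an assumption on our part rather than a documented OpenAI recipe, and `sharded_cache_key` is a hypothetical helper. The trade-off is that each shard has to warm its own cache entry.

```python
import hashlib
import math


def sharded_cache_key(
    base_key: str,
    user_id: str,
    expected_rpm: float,
    max_rpm_per_key: float = 15.0,
) -> str:
    """Derive a stable prompt_cache_key so each key stays under the rate cap.

    The same user always maps to the same shard, so that user's requests
    keep landing on the same routing path.
    """
    shards = max(1, math.ceil(expected_rpm / max_rpm_per_key))
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % shards
    return f"{base_key}-{shard}"


print(sharded_cache_key("equity-analyst-v1", "user-42", expected_rpm=60))
```

The returned string would be passed as the `prompt_cache_key` argument on each request.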

prompt_cache_retention: "24h": Extends the cache from in-memory (5–10 minute inactivity, max 1 hour) to 24-hour GPU-local storage. Available on GPT-5.5, GPT-5.5-pro, GPT-5.4, GPT-5.2, GPT-5.1, GPT-5, GPT-5-codex, and GPT-4.1 as of mid-2026. This is the most significant competitive advantage OpenAI has in this space: overnight batch workflows and serverless deployments can stay warm across the idle gap without any warming logic.

Pricing

No additional write fee. Cache reads are priced at up to 90% off normal input token cost. OpenAI’s pricing page returned HTTP 403 during the data collection for this article, so exact per-model rates were not independently verified. The 90% reduction figure is stated in the developer docs at developers.openai.com/api/docs/guides/prompt-caching.

Minimum tokens: 1,024. Below that, cached_tokens: 0 appears in the response — no error, no penalty.

Latency impact

OpenAI claims up to 80% latency reduction. No model-specific benchmark table or methodology is published. Treat this as an upper-bound estimate.

Gotchas

No manual cache clearing. Eviction is automatic on inactivity. If your prompt prefix is stale and producing wrong answers, you cannot force a refresh — you wait for the TTL to expire or change the prefix.

Rate-induced misses. Above 15 requests per minute for the same prefix, overflow to non-cached servers increases. Use prompt_cache_key to keep traffic concentrated on one routing path.

Rate limits still apply. Cached tokens count toward your TPM limit the same as uncached tokens.

usage.prompt_tokens_details.cached_tokens (Chat Completions) or usage.input_tokens_details.cached_tokens (Responses API) in the response shows how many tokens were served from cache. Monitor this in production to confirm the cache is working.

Code example

from openai import OpenAI

client = OpenAI()

STATIC_SYSTEM = """You are a senior financial analyst specializing in equity research.
[... 2,000+ tokens of background knowledge ...]"""

# First call — automatic cache miss (no code change required)
response = client.responses.create(
    model="gpt-5",
    prompt_cache_key="equity-analyst-v1",     # routing hint for consistent hashing
    prompt_cache_retention="24h",             # supported models only (GPT-5.x, GPT-4.1)
    input=[
        {"role": "system", "content": STATIC_SYSTEM},
        {"role": "user",   "content": "Analyze AAPL Q2 2026 earnings."},
    ],
)

# Subsequent calls — cache hit when prefix matches
response2 = client.responses.create(
    model="gpt-5",
    prompt_cache_key="equity-analyst-v1",
    prompt_cache_retention="24h",
    input=[
        {"role": "system", "content": STATIC_SYSTEM},  # must be byte-identical
        {"role": "user",   "content": "Compare MSFT Q2 2026 to AAPL."},
    ],
)

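# The Responses API reports cached tokens under input_tokens_details
# (prompt_tokens_details is the Chat Completions field name).
cached = response2.usage.input_tokens_details.cached_tokens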
print(f"Tokens from cache: {cached}")

Gemini — two modes, one extra bill

Gemini has two caching modes. Implicit caching (Gemini 2.5+ only) requires no code changes and passes savings on when a cache hit occurs. Explicit caching lets you create a named CachedContent object, reference it in requests, and pay a separate hourly storage fee while the cache is alive.

Pricing (per 1M tokens)

| Model | Base input | Cache read | Storage per hour |
| --- | --- | --- | --- |
| Gemini 2.5 Flash | $0.30 | $0.03 | $1.00 |
| Gemini 2.5 Pro (≤200k ctx) | $1.25 | $0.125 | $4.50 |
| Gemini 2.5 Pro (>200k ctx) | $2.50 | $0.25 | $4.50 |
| Gemini 3.5 Flash | $1.50 | $0.15 | $1.00 |
| Gemini 3.1 Pro Preview (≤200k) | $2.00 | $0.20 | $4.50 |

Cache reads are 90% off base input price — same multiplier as Anthropic and OpenAI. The difference is the storage charge: Gemini explicit caching costs $1.00–$4.50 per 1M cached tokens per hour, regardless of how many requests read the cache.

Break-even on Gemini 2.5 Pro: Storage costs $4.50/M/hr. Each cache read saves $1.125/M ($1.25 − $0.125). You need roughly 4 reads per hour per 1M cached tokens before storage cost breaks even. For low-traffic workloads, explicit caching can cost more than re-sending the tokens.
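The break-even arithmetic generalizes to any row of the pricing table. The sketch below covers only the hourly storage charge against the per-read saving; it ignores the one-time cost of creating the cache. `breakeven_reads_per_hour` is a hypothetical helper.

```python
def breakeven_reads_per_hour(base_input: float, cache_read: float, storage_per_hour: float) -> float:
    """Reads per hour, per 1M cached tokens, at which storage cost is covered.

    All prices are USD per 1M tokens (storage is per 1M tokens per hour).
    """
    saving_per_read = base_input - cache_read
    if saving_per_read <= 0:
        raise ValueError("cache read price must be below base input price")
    return storage_per_hour / saving_per_read


# Gemini 2.5 Pro (<=200k context) and Gemini 2.5 Flash, prices from the table.
print("2.5 Pro:  ", breakeven_reads_per_hour(1.25, 0.125, 4.50))
print("2.5 Flash:", round(breakeven_reads_per_hour(0.30, 0.03, 1.00), 2))
```

Because both storage and savings scale with cache size, the break-even read rate is the same for a 100k-token cache as for a 1M-token one.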

Minimum tokens: 2,048 for Gemini 2.5 Flash and Pro; 4,096 for Gemini 3.5 Flash and 3.1 Pro Preview.

Latency impact

No latency figures published. Google notes there is no latency SLA tied to implicit cache hits. usage_metadata.cached_content_token_count in the response tells you how many tokens came from cache, but TTFT benchmarks are not part of the documentation.

Gotchas

Storage accumulates. A 1M-token cache on Gemini 2.5 Pro costs $4.50/hr in storage. Leave it running 8 hours and you have spent $36 before a single query. Delete caches when done.

Ordering is mandatory. Cached content is always a prefix. Static context must come at the start of the prompt — you cannot cache a middle section.

No content inspection. You cannot retrieve what is in a cache — only metadata: name, model, expire_time. Debug cache construction before creating it, not after.

ttl and expire_time are mutually exclusive on cache updates. Pass one or the other, not both.
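A small guard can enforce the one-or-the-other rule before the update call is made. This is a sketch: `cache_expiry_update` is a hypothetical helper, and the assumption is that the returned dict is passed as the `config` argument of `client.caches.update(name=..., config=...)`.

```python
from datetime import datetime, timezone
from typing import Optional


def cache_expiry_update(
    ttl_seconds: Optional[int] = None,
    expire_time: Optional[datetime] = None,
) -> dict:
    """Build an update config containing exactly one expiry field."""
    if (ttl_seconds is None) == (expire_time is None):
        raise ValueError("pass exactly one of ttl_seconds or expire_time")
    if ttl_seconds is not None:
        if ttl_seconds <= 0:
            raise ValueError("ttl_seconds must be positive")
        return {"ttl": f"{ttl_seconds}s"}
    if expire_time.tzinfo is None:
        raise ValueError("expire_time must be timezone-aware")
    return {"expire_time": expire_time}


print(cache_expiry_update(ttl_seconds=300))
print(cache_expiry_update(expire_time=datetime(2026, 7, 1, tzinfo=timezone.utc)))
```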

Code example

from google import genai
from google.genai import types
import io
import httpx

client = genai.Client()

# Upload large document once (pay full cost once)
pdf_bytes = httpx.get("https://example.com/large-report.pdf").content
document = client.files.upload(
    file=io.BytesIO(pdf_bytes),
    config=dict(mime_type="application/pdf"),
)

# Create cache — starts the hourly storage clock
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        display_name="quarterly-report-2026",
        system_instruction="You are a financial analyst. Answer based on the provided report.",
        contents=[document],
        ttl="3600s",
    ),
)

# Multiple queries — each pays cache read rate, not full input price
for question in [
    "What was total revenue in Q1?",
    "Summarize the risk factors.",
    "Compare operating margins to prior year.",
]:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=question,
        config=types.GenerateContentConfig(cached_content=cache.name),
    )
    meta = response.usage_metadata
    print(f"Cached tokens: {meta.cached_content_token_count}")

# Delete when done — stop paying storage
client.caches.delete(name=cache.name)

Comparison

| Feature | Anthropic Claude | OpenAI | Google Gemini |
| --- | --- | --- | --- |
| Caching type | Explicit (developer-controlled) | Automatic | Implicit (auto, 2.5+) + explicit |
| Min tokens | 1,024 (most models); 4,096 (Haiku 4.5, older Opus) | 1,024 | 2,048 (2.5 Flash/Pro); 4,096 (others) |
| TTL | 5 min default; 1 hr at 2× base input cost | 5–10 min inactivity, max 1 hr; 24 hr on GPT-5.x | 1 hr default; configurable |
| Extra write cost | +25% over base (5-min); +100% (1-hr) | None | Hourly storage ($1.00–$4.50/M/hr) |
| Cache read discount | 90% off | Up to 90% off | 90% off |
| Published TTFT reduction | 31–79% (benchmarked) | Up to 80% (claimed, no methodology) | None published |
| Max cache breakpoints | 4 per request | N/A (automatic) | N/A (one prefix per request) |
| Manual cache management | Partial (TTL refresh on hit) | None | Full CRUD |
| Multimodal support | Images, documents | Images, tools | Video, PDF, audio, images |
| Extended retention | 1 hr (opt-in) | 24 hr on GPT-5.x | Configurable TTL |
| Cross-request sharing | Workspace-level (Claude API) | Organization-level | Project-level |

When to pick which

Anthropic if you want the most control and the most transparent data. The benchmarks are real numbers from a public, citable source, so you can run the same workload and compare. Four explicit breakpoints per request let you cache tools, system context, and shared conversation history separately. The 90% read discount is on par with competitors. The main limitations: you have to instrument your prompts with markers, and the 5-minute default TTL is short for infrequently used sessions (the 1-hour option helps, at a cost).

OpenAI if you want caching with no code changes, particularly for serverless or edge workloads where warming a cache ahead of time is not practical. The 24-hour retention on GPT-5.x is a concrete advantage for overnight batch pipelines. What you give up: no published latency benchmarks, no granular control over cache breakpoints, no way to force a cache refresh when the cached content is stale.

Gemini if you are caching large multimodal assets. Uploading a 2-hour video once and querying it 50 times is exactly the workload Gemini’s explicit caching was built for — re-sending large binary content per request is prohibitively expensive, and the storage fee is irrelevant when the alternative is gigabytes of data over the wire. For text-only workloads with modest traffic, run the break-even calculation before adding explicit caching: at Gemini 2.5 Pro rates, you need at least 4 reads per hour per 1M cached tokens to cover the storage cost.

To understand what prompt caching costs look like inside a running production system, see the real cost of running an AI agent team in 2026.

Measure what you cache

The cache only saves money if it is actually being hit. Add these to every response log:

  • Anthropic: usage.cache_creation_input_tokens and usage.cache_read_input_tokens
  • OpenAI: usage.prompt_tokens_details.cached_tokens (Chat Completions) or usage.input_tokens_details.cached_tokens (Responses API)
  • Gemini: usage_metadata.cached_content_token_count

If your hit rate is below 60% on prompts you expect to be cacheable, something is breaking cache keys: timestamps in the static portion of the prompt, per-request metadata placed before the breakpoint, or an image detail setting that varies by request. Fix the construction logic before concluding caching is not working.
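A minimal way to track this is to log the cached-token count from each response and compute a request-level hit rate. The sketch below assumes you have already extracted that count from whichever provider field applies; `cache_hit_rate` and `needs_investigation` are hypothetical helpers, and the 60% threshold is the rule of thumb stated above.

```python
def cache_hit_rate(cached_token_counts: list[int]) -> float:
    """Fraction of logged requests that read at least one token from cache."""
    if not cached_token_counts:
        return 0.0
    hits = sum(1 for count in cached_token_counts if count > 0)
    return hits / len(cached_token_counts)


def needs_investigation(cached_token_counts: list[int], threshold: float = 0.60) -> bool:
    """True when the hit rate falls below the threshold."""
    return cache_hit_rate(cached_token_counts) < threshold


# One entry per request: cached tokens reported in that response.
recent = [0, 1200, 1200, 0, 1200]
print(cache_hit_rate(recent), needs_investigation(recent))
```

A request-level rate is the simplest signal; a token-weighted rate would track cost savings more closely, but input token totals are reported differently by each provider.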

If caching reduces your per-call cost but your overall API spend is still high, LLM cost routing can cut costs further by directing simpler queries to cheaper models — an additional 80% reduction is common on classification or summarisation workloads.

Observability tools like Braintrust and LangSmith expose cache hit rates and cost attribution across sessions, which makes it easier to identify which part of a prompt is responsible for misses.

Primary sources