Prompt Caching: When It Pays for Itself and When It Doesn't


You are making repeated API calls and the bill is climbing, or responses feel sluggish. Prompt caching is the first thing worth reaching for — but it only works if your calls are structured in a particular way.

The jargon

Prompt caching: Anthropic stores a snapshot of your processed prompt prefix on its servers. If the next request starts with an identical prefix, Claude skips re-processing it. You pay less and wait less.

Cache prefix: the part of your prompt you mark as "worth caching." It covers everything from the start of the prompt (tools first, then system, then messages) up to and including that marker.

TTL (time to live): how long the cache snapshot is kept. The default is five minutes, and the clock restarts each time the cache is read. A one-hour option is also available at a higher cache-write price.

The lesson

Caching works on prefixes, not arbitrary chunks. Anthropic reads your prompt left to right and checks whether the beginning matches a stored snapshot. The moment there is a mismatch, caching stops. So the structure of your prompt determines whether caching helps at all.

The golden rule: put what changes at the end, not the beginning.

Take a system prompt containing a 10,000-word document, followed by a short user question that changes every call. That is a good candidate. The long document sits at the start and gets cached, and the variable question goes at the end. You pay full price to process the document once, then a much lower cached-read rate on each later call, as long as the calls keep arriving within the TTL.

A prompt where you personalise the opening ("Hi Sarah, here is your report…") and the document comes after — caching gives you nothing. The prefix never matches.
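
To make the rule concrete, here is a small sketch that assembles the same request two ways. The helper names and sample strings are made up for illustration. Only the cache_control field and the general system/messages shape come from the API. Nothing is sent over the network. The code just compares what would sit in the cached prefix for two different users.

```python
def build_stable_first_request(document: str, user_name: str, question: str) -> dict:
    """Hypothetical helper: stable document first, per-call details last."""
    return {
        "system": [
            {
                "type": "text",
                "text": "You are a report assistant. Reference document:\n\n" + document,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Everything that varies per call lives after the cached block.
        "messages": [
            {"role": "user", "content": f"My name is {user_name}. {question}"}
        ],
    }


def build_personalised_first_request(document: str, user_name: str, question: str) -> dict:
    """Hypothetical helper: greeting first, so the prefix differs per user."""
    return {
        "system": [
            {
                "type": "text",
                "text": f"Hi {user_name}, here is your report.\n\n" + document,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }


document = "Quarterly figures... " * 50  # stand-in for a long document

good_a = build_stable_first_request(document, "Sarah", "What changed?")
good_b = build_stable_first_request(document, "Tom", "What is the total?")
bad_a = build_personalised_first_request(document, "Sarah", "What changed?")
bad_b = build_personalised_first_request(document, "Tom", "What is the total?")

# Identical system blocks mean the second request can reuse the first one's cache.
print(good_a["system"] == good_b["system"])  # True
# The name sits at the very start, so the two prefixes diverge immediately.
print(bad_a["system"] == bad_b["system"])    # False
```

If the personalisation is needed, moving it into the user message keeps the system block byte-for-byte identical across users.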

How it works

You opt in by adding "cache_control": {"type": "ephemeral"} to the content block you want cached. Here is a minimal example using the Python SDK:

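import anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

# long_contract_text and user_question are assumed to be defined elsewhere.
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            # Stable content: identical on every call, so it can be cached.
            "text": "You are a contract analyst. Here is the full contract:\n\n" + long_contract_text,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    # Variable content: comes after the cached prefix.
    messages=[{"role": "user", "content": user_question}]
)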

The cache marker goes on the last content block you want included in the cached prefix. Everything up to and including that block is stored. Subsequent blocks are processed fresh each call.

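Check response.usage to confirm it worked. On the first call, cache_creation_input_tokens should be non-zero because the cache is being written. On the second call, cache_read_input_tokens should be non-zero because the cache is being read. In both cases input_tokens counts only the uncached remainder of the prompt, so it stays small.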
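
A small helper can turn those usage fields into a one-word verdict. This is a sketch, not part of the SDK: describe_cache_usage is a hypothetical name, and the SimpleNamespace objects stand in for response.usage so the snippet runs without an API key. The field names are the ones the API reports.

```python
from types import SimpleNamespace


def describe_cache_usage(usage) -> str:
    """Classify one response's usage as 'hit', 'write', or 'none'.

    Missing or None fields are treated as zero, since responses without
    caching may not populate them.
    """
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return "hit"    # at least part of the prefix came from the cache
    if created > 0:
        return "write"  # the prefix was stored for later calls
    return "none"       # caching did not apply (too short, or no marker)


# Stand-ins for response.usage on a first and second call (made-up numbers).
first_call = SimpleNamespace(
    input_tokens=15, cache_creation_input_tokens=8200, cache_read_input_tokens=0
)
second_call = SimpleNamespace(
    input_tokens=18, cache_creation_input_tokens=0, cache_read_input_tokens=8200
)

print(describe_cache_usage(first_call))   # write
print(describe_cache_usage(second_call))  # hit
```

In real code you would pass response.usage straight in, for example describe_cache_usage(response.usage).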

When to reach for it / when not to

Good fit: a long system prompt or reference document that stays constant across many calls in a short window — legal docs, codebases, product catalogues, lengthy instructions.

Poor fit: prompts below the minimum cacheable size (roughly 1,024 tokens on many models, higher on some, so check the figure for the model you use), one-off single calls, or anything where the "constant" part keeps changing between calls. You pay a small cache-write surcharge on the first call, so a cache that never gets read costs you slightly more, not less.
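
The break-even point is easy to check with a little arithmetic. The sketch below assumes a cache write costs 125% of the normal input price and a cache read costs 10%. Those were the published multipliers for five-minute caching at the time of writing, so treat them as assumptions and check current pricing. It looks only at the cached prefix, since the variable suffix and the output cost the same either way. The function name is hypothetical.

```python
def prefix_cost_in_token_equivalents(
    prefix_tokens: int,
    calls: int,
    write_pct: int = 125,  # assumed: cache write = 125% of base input price
    read_pct: int = 10,    # assumed: cache read = 10% of base input price
) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) for the prefix alone.

    Costs are expressed in "full-price input tokens" so no dollar
    figures are needed. Assumes every call after the first is a hit.
    """
    uncached = float(prefix_tokens * calls)
    # One write on the first call, then (calls - 1) cached reads.
    cached = prefix_tokens * (write_pct + read_pct * (calls - 1)) / 100
    return uncached, cached


for calls in (1, 2, 10):
    uncached, cached = prefix_cost_in_token_equivalents(10_000, calls)
    verdict = "caching wins" if cached < uncached else "caching loses"
    print(f"{calls:>2} call(s): uncached={uncached:,.0f} cached={cached:,.0f} ({verdict})")
```

Under these assumptions a single call costs more with caching, and the second call within the TTL already puts you ahead.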

Also remember the five-minute TTL. If your calls are spaced further apart than that, the cache expires and you pay to write it again each time.
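
To see how call spacing interacts with the TTL, here is a toy simulation. It assumes a 300-second TTL and that each call, whether it writes or reads, restarts the clock. The function name and timestamps are made up, and the real service may behave differently at the exact expiry boundary.

```python
def count_cache_writes(call_times_s: list[float], ttl_s: float = 300.0) -> int:
    """Count how many calls would have to write the cache from scratch.

    call_times_s: seconds since some arbitrary start, one entry per call.
    Toy model: a call is a hit if it arrives before the current expiry,
    and every call (write or hit) pushes the expiry out by ttl_s.
    """
    writes = 0
    expires_at = None  # no cache exists before the first call
    for t in sorted(call_times_s):
        if expires_at is None or t >= expires_at:
            writes += 1  # cache missing or expired: pay the write surcharge
        expires_at = t + ttl_s  # the clock restarts on every call
    return writes


steady = [0, 200, 400, 600]  # each call lands inside the previous window
sparse = [0, 400, 800]       # each call arrives after the cache has expired

print(count_cache_writes(steady))  # 1
print(count_cache_writes(sparse))  # 3
```

The steady stream pays for one write and three reads. The sparse one pays the write surcharge every time and never gets a read.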

Try it

Take any n8n workflow or script that calls the Claude API more than once with the same system prompt. Add the cache_control block to that system prompt, run two calls in quick succession, and compare usage in both responses. The numbers will tell you immediately whether the cache hit.
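
One way to run that experiment from a script is sketched below. run_cache_experiment is a hypothetical helper, not part of the SDK. It takes the client as an argument, so you can pass a real anthropic.Anthropic() instance. The usage lines at the bottom are commented out because they need an API key and a system text long enough to be cacheable.

```python
def run_cache_experiment(client, model: str, system_text: str, questions: list[str]) -> list[dict]:
    """Send one request per question with the same cached system prompt.

    Returns one dict of usage counters per call so they can be compared.
    """
    rows = []
    for question in questions:
        response = client.messages.create(
            model=model,
            max_tokens=256,
            system=[
                {
                    "type": "text",
                    "text": system_text,  # identical on every call
                    "cache_control": {"type": "ephemeral"},
                }
            ],
            messages=[{"role": "user", "content": question}],
        )
        usage = response.usage
        rows.append(
            {
                "input_tokens": usage.input_tokens,
                # Treat missing or None counters as zero.
                "cache_creation_input_tokens": getattr(usage, "cache_creation_input_tokens", 0) or 0,
                "cache_read_input_tokens": getattr(usage, "cache_read_input_tokens", 0) or 0,
            }
        )
    return rows


# Example usage (requires the anthropic package and ANTHROPIC_API_KEY):
#
# import anthropic
# client = anthropic.Anthropic()
# long_text = open("reference.txt").read()  # hypothetical file
# for row in run_cache_experiment(client, "claude-opus-4-5", long_text,
#                                 ["Summarise section 1.", "Summarise section 2."]):
#     print(row)
```

If the first row shows a non-zero cache_creation_input_tokens and the second a non-zero cache_read_input_tokens, the cache hit. If both are zero on both rows, the system text is probably below the minimum cacheable size.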

