How to cut Claude API costs with prompt caching

Cache a large stable prefix so repeated requests pay roughly a tenth of the input price for the cached part.

10 min · Intermediate

If many of your requests share a large fixed chunk, such as a long system prompt, a document, or a set of examples, you are paying full input price for the same tokens over and over. Prompt caching stores that prefix so later requests read it at roughly a tenth of the cost. This guide shows how to add caching correctly and verify it is actually hitting.

What you need:

The Anthropic SDK and an API key
A request with a large prefix that repeats across calls
A prefix above the cacheable minimum (4096 tokens on Opus, 2048 on Sonnet and Fable)

Step 1: Understand the one rule

Caching is a prefix match. Any byte change anywhere before a breakpoint invalidates the cache from that point on. Render order is tools, then system, then messages. So put stable content first and volatile content, like timestamps or the user's question, last.
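A minimal sketch of that ordering, assuming a hypothetical helper named build_request and placeholder instruction text. It only builds the system and messages arguments (model and max_tokens are left out) and sends nothing.

```python
from datetime import date

# Hypothetical stable instructions; in practice this is your large fixed prefix.
STABLE_INSTRUCTIONS = "You answer questions about the employee handbook."


def build_request(question: str, today: date) -> dict:
    """Build the system and messages arguments with stable content first."""
    return {
        "system": [
            {
                "type": "text",
                "text": STABLE_INSTRUCTIONS,  # identical bytes on every call
                "cache_control": {"type": "ephemeral"},  # breakpoint closes the stable part
            }
        ],
        "messages": [
            {
                "role": "user",
                # Volatile values sit after the breakpoint, so they never touch the prefix.
                "content": f"Today is {today.isoformat()}.\n\n{question}",
            }
        ],
    }


a = build_request("What is the PTO policy?", date(2025, 1, 6))
b = build_request("How do I file expenses?", date(2025, 1, 7))
print(a["system"] == b["system"])      # the cacheable prefix is unchanged
print(a["messages"] == b["messages"])  # only the tail differs
```

Moving the date out of the system prompt and into the user turn is the design choice here: the model still sees it, but the cached prefix stays byte-identical.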

Silent cache killers

A datetime.now() in your system prompt, an unsorted JSON dump, a per-request UUID, or a tool set that changes per user will each quietly prevent the cache from ever being read. The request still works; it just never gets cheaper.
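One way to guard against the unsorted-JSON and shifting-tool-order cases is to serialize deterministically before anything reaches the prompt. The helper names stable_json and stable_tools below are illustrative, not part of the SDK.

```python
import json


def stable_json(data: dict) -> str:
    """Serialize with sorted keys and fixed separators so equal data gives equal bytes."""
    return json.dumps(data, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def stable_tools(tools: list[dict]) -> list[dict]:
    """Return tool definitions in a fixed order (sorted by name)."""
    return sorted(tools, key=lambda tool: tool["name"])


# Same data, different insertion order.
first = {"region": "EU", "plan": "pro"}
second = {"plan": "pro", "region": "EU"}

print(json.dumps(first) == json.dumps(second))    # False: key order leaks into the bytes
print(stable_json(first) == stable_json(second))  # True: the serialized form is identical
```

Sorting only helps when the set of tools is the same for every user; a tool list that genuinely differs per user still produces a different prefix.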

Step 2: Add a cache breakpoint

Mark the last block of the stable section with cache_control. Here the large system prompt is cached, and the per-request question stays uncached at the end.

cached.py

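from pathlib import Path

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
# Must be byte-identical on every request, or the cache never hits.
BIG_CONTEXT = Path("handbook.txt").read_text(encoding="utf-8")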

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": BIG_CONTEXT,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "What is the PTO policy?"}],
)

Step 3: Verify the cache is working

Read the usage object. On the first call you pay a write, shown as cache_creation_input_tokens. On the second identical call you should see cache_read_input_tokens populated and input_tokens drop to just the uncached question.

verify.py

print("created:", resp.usage.cache_creation_input_tokens)
print("read:   ", resp.usage.cache_read_input_tokens)
print("uncached:", resp.usage.input_tokens)

Terminal — cache hit on second call

# first call
created: 12044
read:    0
uncached: 9

# second call (same prefix)
created: 0
read:    12044
uncached: 9

read is populated and uncached input is tiny: the cache is working.
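To check this in logs or tests rather than by eye, the three usage fields can be folded into a single label. This is a sketch: cache_status is a hypothetical helper, and SimpleNamespace stands in for the real usage object so the example runs without an API call.

```python
from types import SimpleNamespace


def cache_status(usage) -> str:
    """Classify a response's usage as 'hit', 'write', or 'miss'."""
    # Treat a missing or None field as zero.
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read > 0:
        return "hit"    # some of the prefix was served from the cache
    if created > 0:
        return "write"  # nothing was read, but a cache entry was created
    return "miss"       # caching did not apply to this request


# Stand-ins mirroring the two calls shown above.
first_call = SimpleNamespace(
    cache_creation_input_tokens=12044, cache_read_input_tokens=0, input_tokens=9
)
second_call = SimpleNamespace(
    cache_creation_input_tokens=0, cache_read_input_tokens=12044, input_tokens=9
)

print(cache_status(first_call))   # write
print(cache_status(second_call))  # hit
```

In real code, pass resp.usage instead of the stand-ins. A steady stream of "write" or "miss" labels on requests that should share a prefix is the signal to investigate.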

Step 4: Know the break-even

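A cache write costs about 1.25 times the normal input price for the default five-minute window, and a read costs about a tenth. With the five-minute window you come out ahead from the second request: one write plus one read costs about 1.35 times a single uncached pass, against 2 times without caching. The one-hour window costs about 2 times to write, so it needs at least two reads (three requests in total) to pay off, since 2 + 0.1 + 0.1 = 2.2 against 3, but it survives longer gaps in traffic.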
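The arithmetic can be checked with a short sketch. It uses the approximate multipliers quoted above and assumes one write followed by a cache hit on every later request within the window. The function names are illustrative.

```python
def cached_cost(requests: int, write_multiplier: float, read_multiplier: float = 0.1) -> float:
    """Prefix cost in units of one uncached pass: one write, then reads."""
    return write_multiplier + read_multiplier * (requests - 1)


def break_even_requests(write_multiplier: float, read_multiplier: float = 0.1) -> int:
    """Smallest number of requests at which caching is cheaper than not caching."""
    n = 1
    # Without caching, n requests cost n passes over the prefix.
    while cached_cost(n, write_multiplier, read_multiplier) >= n:
        n += 1
    return n


print(break_even_requests(1.25))  # 2 (five-minute window)
print(break_even_requests(2.0))   # 3 (one-hour window: one write plus two reads)
```

If traffic gaps let the entry expire, each expiry adds another write, so the real break-even point moves later than this idealized count.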

If read stays at zero

Diff the rendered bytes of two requests. A single differing character in the prefix means no read. The usual culprit is a dynamic value sneaking into the system prompt or tool list.
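A rough way to do that diff locally, assuming you can capture the arguments of two requests as dicts. The render function below is only an approximation of the prefix (tools, then system, serialized as JSON), not the server's actual rendering, and both helper names are hypothetical.

```python
import json
from typing import Optional


def render(request: dict) -> bytes:
    """Approximate the cacheable prefix in render order: tools, then system."""
    prefix = {"tools": request.get("tools", []), "system": request.get("system", [])}
    # Keys are deliberately not sorted, so ordering differences stay visible.
    return json.dumps(prefix, ensure_ascii=False).encode("utf-8")


def first_difference(a: bytes, b: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one input is a strict prefix of the other
    return None


# Two captured requests whose system prompt embeds a timestamp (illustrative data).
req_a = {"system": [{"type": "text", "text": "Handbook v3. Generated 2025-01-06T09:00"}]}
req_b = {"system": [{"type": "text", "text": "Handbook v3. Generated 2025-01-06T09:05"}]}

a, b = render(req_a), render(req_b)
offset = first_difference(a, b)
if offset is not None:
    print("first difference at byte", offset)
    print(a[max(0, offset - 20):offset + 10])  # context around the divergence
    print(b[max(0, offset - 20):offset + 10])
```

The printed context usually makes the culprit obvious, here the minutes field of an embedded timestamp.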

Result: the 12,044 token handbook moved from full price on every call to a cache read on every call after the first. Across a day of repeated policy questions the input bill dropped by roughly 85 percent on the cached portion.