How to cut Claude API costs with prompt caching
Cache a large stable prefix so repeated requests pay roughly a tenth of the input price for the cached part.
If many of your requests share a large fixed chunk, such as a long system prompt, a document, or a set of examples, you are paying full input price for the same tokens over and over. Prompt caching stores that prefix so later requests read it at roughly a tenth of the cost. This guide shows how to add caching correctly and verify it is actually hitting.
- The Anthropic SDK and an API key
- A request with a large prefix that repeats across calls
- A prefix above the cacheable minimum (4096 tokens on Opus, 2048 on Sonnet and Fable)
Step 1: Understand the one rule
Caching is a prefix match. Any byte change anywhere before a breakpoint invalidates the cache from that point on. Render order is tools, then system, then messages. So put stable content first and volatile content, like timestamps or the user's question, last.
Step 2: Add a cache breakpoint
Mark the last block of the stable section with cache_control. Here the large system prompt is cached, and the per-request question stays uncached at the end.
from anthropic import Anthropic
client = Anthropic()
BIG_CONTEXT = open("handbook.txt").read() # the same every request
resp = client.messages.create(
model="claude-opus-4-8",
max_tokens=1024,
system=[{
"type": "text",
"text": BIG_CONTEXT,
"cache_control": {"type": "ephemeral"},
}],
messages=[{"role": "user", "content": "What is the PTO policy?"}],
)Step 3: Verify the cache is working
Read the usage object. On the first call you pay a write, shown as cache_creation_input_tokens. On the second identical call you should see cache_read_input_tokens populated and input_tokens drop to just the uncached question.
print("created:", resp.usage.cache_creation_input_tokens)
print("read: ", resp.usage.cache_read_input_tokens)
print("uncached:", resp.usage.input_tokens)Step 4: Know the break-even
A cache write costs about 1.25 times normal input for the default five minute window, and a read costs about a tenth. With the five minute window you break even at two requests. The one hour window costs about two times to write, so it needs at least three reads to pay off, but it survives longer gaps in traffic.
Result: the 12,044 token handbook moved from full price on every call to a cache read on every call after the first. Across a day of repeated policy questions the input bill dropped by roughly 85 percent on the cached portion.
Watch related tutorials
1:42:18
28:14
41:09
9:47
8:23
52:31