How to Cache Long Context in the Gemini API to Cut Costs
Set up explicit context caching so repeated questions about the same large document cost a fraction of the price.
If you ask many questions about the same large document, re-sending it every time wastes tokens and money. Gemini's context caching lets you process a big input once, store it for a set time, and reuse it across requests at a reduced rate. This guide creates a cache, queries it, and cleans it up.
What you need
- A Gemini API key and the google-genai SDK
- A large input you will reuse: a long PDF, transcript, or codebase dump
- A model that supports caching, such as gemini-2.5-flash or gemini-2.5-pro
Step 1: Upload the document
Use the File API to upload the large file, exactly as you would for a normal long-context request. You will reference this handle when creating the cache.
import os
from google import genai
from google.genai import types
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
doc = client.files.upload(file="handbook.pdf")
Step 2: Create the cache
Create a cached content object that holds the document plus a system instruction. Set a time-to-live (TTL) so it expires automatically. You are billed to create the cache and a small amount to store it, but each query that uses it is cheaper.
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction="You are a precise assistant. Answer only from the handbook.",
        contents=[doc],
        ttl="3600s",  # keep for one hour
    ),
)
print("cache name:", cache.name)
Step 3: Query against the cache
Now send questions that reference the cache instead of re-uploading the document. The cached tokens are billed at the lower cached rate, and you only pay full price for the new question and the answer.
resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What is the refund window described in the handbook?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(resp.text)
print("cached tokens:", resp.usage_metadata.cached_content_token_count)
Step 4: Delete the cache when done
Caches expire on their own at the TTL, but you can delete one early to stop paying storage. Always clean up caches your script no longer needs.
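The cleanup described here could be sketched as a small helper. The function name cleanup and its parameters are illustrative, not part of the SDK; it assumes a client object like the one created in Step 1, exposing caches.delete and files.delete, each taking a name keyword.

```python
def cleanup(client, cache_name, file_name=None):
    """Delete a cache and, optionally, the uploaded file behind it.

    `client` is assumed to be a google-genai Client like the one from Step 1.
    `cleanup` itself is a hypothetical helper, not an SDK function.
    """
    # Remove the cached content so storage billing stops before the TTL expires.
    client.caches.delete(name=cache_name)

    # Optionally remove the uploaded file too, once no other cache or
    # request still needs it.
    if file_name is not None:
        client.files.delete(name=file_name)
```

With the objects from the earlier steps, a call might look like cleanup(client, cache.name, doc.name). Passing only the cache name leaves the uploaded file in place, which is useful if you plan to build a fresh cache from it later.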
Result
Repeated questions about the same big document now reuse the cached tokens, often cutting the per-query input cost substantially. The usage metadata's cached_content_token_count confirms the cache is actually being hit.
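To put a rough number on the saving for a single response, the usage metadata can be combined with your own price figures. The sketch below is illustrative only: input_cost is a hypothetical helper, the prices are placeholders rather than real Gemini rates, and it assumes prompt_token_count is the total input size including the cached tokens. Storage charges are not included.

```python
from types import SimpleNamespace


def input_cost(usage, price_per_million, cached_price_per_million):
    """Return (cost_with_cache, cost_without_cache) for one response's input.

    Assumes usage.prompt_token_count already includes the cached tokens.
    Prices are per one million tokens and must be supplied by the caller.
    """
    cached = usage.cached_content_token_count or 0
    total = usage.prompt_token_count or 0
    fresh = total - cached  # tokens billed at the normal input rate

    with_cache = (fresh * price_per_million
                  + cached * cached_price_per_million) / 1_000_000
    without_cache = total * price_per_million / 1_000_000
    return with_cache, without_cache


# Stand-in for resp.usage_metadata, with made-up token counts.
example_usage = SimpleNamespace(
    prompt_token_count=100_000,
    cached_content_token_count=99_000,
)

# Placeholder prices, NOT actual Gemini pricing.
with_cache, without_cache = input_cost(example_usage, 1.00, 0.25)
print(f"with cache: {with_cache:.5f}  without cache: {without_cache:.5f}")
```

In a real script you would pass resp.usage_metadata from Step 3 instead of the stand-in object, and take the two rates from the current pricing page for your model. If cached_content_token_count comes back as zero or None, both figures are equal, which is a quick sign the cache was not used.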