
How to Cache Long Context in the Gemini API to Cut Costs

Set up explicit context caching so repeated questions about the same large document cost a fraction of the price.

10 min, Advanced

If you ask many questions about the same large document, re-sending it every time wastes tokens and money. Gemini's context caching lets you process a big input once, store it for a set time, and reuse it across requests at a reduced rate. This guide creates a cache, queries it, and cleans it up.

What you need

  • A Gemini API key and the google-genai SDK
  • A large input you will reuse: a long PDF, transcript, or codebase dump
  • A model that supports caching, such as gemini-2.5-flash or gemini-2.5-pro

Step 1: Upload the document

Use the File API to upload the large file, exactly as you would for a normal long-context request. The upload returns a file handle that you reference when creating the cache.

cache.py
import os
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
doc = client.files.upload(file="handbook.pdf")

Step 2: Create the cache

Create a cached content object that holds the document plus a system instruction. Set a time-to-live (TTL) so it expires automatically. You are billed once to create the cache and a small amount to store it while it exists; in return, each query that reuses it pays the lower cached rate for those tokens.

cache.py
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction="You are a precise assistant. Answer only from the handbook.",
        contents=[doc],
        ttl="3600s",  # keep for one hour
    ),
)
print("cache name:", cache.name)

Terminal - cache created
$ python cache.py
cache name: cachedContents/abc123xyz

The create call returns a cache object; pass its name to later calls.
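
The TTL in Step 2 is passed as a string of seconds. A small helper can build that string from a timedelta so durations stay readable in code. This is only a sketch: ttl_seconds_string is a hypothetical name for illustration, not part of the SDK.

```python
from datetime import timedelta


def ttl_seconds_string(duration: timedelta) -> str:
    """Format a duration as a whole-seconds TTL string such as "3600s"."""
    seconds = int(duration.total_seconds())
    if seconds <= 0:
        raise ValueError("TTL must be a positive duration")
    return f"{seconds}s"


# One hour, in the same form as the ttl value used in Step 2.
print(ttl_seconds_string(timedelta(hours=1)))
```

The result can be passed as the ttl argument in place of the hard-coded "3600s".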

Step 3: Query against the cache

Now send questions that reference the cache instead of re-uploading the document. The cached tokens are billed at the lower cached rate, and you only pay full price for the new question and the answer.

cache.py
resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What is the refund window described in the handbook?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(resp.text)
print("cached tokens:", resp.usage_metadata.cached_content_token_count)
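
To check how much of a request was served from the cache, the cached count can be compared with the total prompt size. The sketch below assumes the usage metadata also exposes prompt_token_count and that it includes the cached tokens. It uses a stand-in object with made-up numbers so it runs without an API call; cached_share is a hypothetical helper.

```python
from types import SimpleNamespace


def cached_share(usage) -> float:
    """Return the fraction of prompt tokens served from the cache (0.0 to 1.0)."""
    cached = usage.cached_content_token_count or 0  # treat a missing count as zero
    total = usage.prompt_token_count or 0
    return cached / total if total else 0.0


# Stand-in for resp.usage_metadata; the numbers are invented for illustration.
usage = SimpleNamespace(prompt_token_count=50_000, cached_content_token_count=49_900)
print(f"{cached_share(usage):.1%} of the prompt came from the cache")
```

A share near zero on a request that was supposed to use the cache suggests the cache expired or the wrong name was passed.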

Step 4: Delete the cache when done

Caches expire on their own at the TTL, but you can delete one early to stop paying storage. Always clean up caches your script no longer needs.

python - cleanup
>>> _ = client.caches.delete(name=cache.name)  # discard the returned response object
>>> print("deleted", cache.name)
deleted cachedContents/abc123xyz
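
One way to make sure cleanup always happens, even when a query raises, is to wrap the cache in a context manager. This is a minimal sketch, not an SDK feature: temporary_cache is a hypothetical helper that relies only on the caches.create and caches.delete calls used in this guide.

```python
from contextlib import contextmanager


@contextmanager
def temporary_cache(client, **create_kwargs):
    """Create a cache, yield it, and delete it even if the body raises."""
    cache = client.caches.create(**create_kwargs)
    try:
        yield cache
    finally:
        # Runs on normal exit and on exceptions, so storage billing stops.
        client.caches.delete(name=cache.name)


# Usage with the client from Step 1 (cache_config stands for the
# CreateCachedContentConfig built in Step 2):
# with temporary_cache(client, model="gemini-2.5-flash", config=cache_config) as cache:
#     ...  # run queries that pass cached_content=cache.name
```
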
Caching pays off above a threshold
Explicit caching requires a minimum number of input tokens, which varies by model, and storage is billed for as long as the cache exists, in proportion to its size. Caching is worth it when the cached content is large and you will query it several times. For one or two questions, a plain request is cheaper.
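
The break-even point can be estimated with simple arithmetic. The sketch below assumes cache creation is billed once at the normal input price and ignores question and answer tokens, which cost the same either way. The document size cancels out, so only per-token prices matter. The prices are placeholders, not real rates, and break_even_queries is a hypothetical helper.

```python
import math


def break_even_queries(input_price, cached_price, storage_price_per_hour, hours):
    """Smallest number of queries at which caching is strictly cheaper.

    All prices are per token, in any consistent unit.
    """
    if cached_price >= input_price:
        raise ValueError("caching cannot pay off unless cached tokens are cheaper")
    fixed = input_price + storage_price_per_hour * hours  # one-off cost per cached token
    saving = input_price - cached_price                   # saved per cached token, per query
    return math.floor(fixed / saving) + 1


# Placeholder prices: 1.00 input, 0.25 cached, 1.00 per hour of storage.
print(break_even_queries(1.00, 0.25, 1.00, hours=1))
```

Substitute the current rates for your model and the TTL you plan to use to see how many queries justify a cache.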

Result

Repeated questions about the same big document now reuse the cached tokens, often cutting the per-query input cost substantially. The usage metadata's cached_content_token_count confirms the cache is actually being hit.

Tags
#long-context #caching #api #cost