How to fix a context length exceeded error in the Claude API
Diagnose and resolve a request that overflows the model's context window by counting tokens and trimming the prompt.
When a request packs more tokens into the prompt than the model can hold, the call fails or the response stops early. The fix is not to guess at sizes. You measure the real token count with the count_tokens endpoint, then trim or restructure until the request fits with room for the reply. This guide walks through that loop for Claude.
- Python or Node with the official Anthropic SDK installed
- An ANTHROPIC_API_KEY set in your environment
- The exact prompt or file that triggered the failure
- Five minutes to run two or three count_tokens calls
Step 1: Read the actual error
A request that is simply too big returns a 413 request_too_large error before the model even runs. A request that fits on the way in but exhausts the window during generation comes back with stop_reason set to model_context_window_exceeded. These are different problems with different fixes. The first means shrink the input. The second means leave more headroom for the output.
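The distinction above can be sketched as a small decision function. This is a minimal illustration, not part of the Anthropic SDK: `diagnose` and its return labels are hypothetical names, and it assumes the only signals of interest are the 413 status code and the stop reason named in this step.

```python
from typing import Optional


def diagnose(status_code: Optional[int], stop_reason: Optional[str]) -> str:
    """Map a failure signal to the remedy described in Step 1.

    Hypothetical helper: pass the HTTP status of a failed call, or the
    stop_reason of a completed one, and leave the other as None.
    """
    if status_code == 413:
        # Rejected before the model ran: the input itself is too big.
        return "shrink_input"
    if stop_reason == "model_context_window_exceeded":
        # The prompt fit, but generation ran out of window.
        return "leave_output_headroom"
    return "not_a_context_problem"
```

With the Python SDK, an HTTP failure surfaces as `anthropic.APIStatusError`, whose `status_code` attribute can be passed in, and a completed call exposes `stop_reason` on the returned message.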
Step 2: Count the tokens for real
Do not estimate with a character count or a third-party tokenizer. Claude has its own tokenizer, so the only accurate number comes from count_tokens with the same model id you plan to call. Run it against the prompt that failed.
```python
from anthropic import Anthropic

client = Anthropic()
text = open("big_report.txt").read()

resp = client.messages.count_tokens(
    model="claude-opus-4-8",
    messages=[{"role": "user", "content": text}],
)
print(resp.input_tokens)
```
Step 3: Confirm the window for your model
Claude Opus 4.8, Opus 4.7, Opus 4.6, Sonnet 4.6, and Fable 5 all have a 1M token context window. Haiku 4.5 has 200K. So a 250K token prompt is fine on Opus but will overflow Haiku. Pick the model whose window is comfortably larger than your measured count plus the reply.
| Model | Context window | Max output |
|---|---|---|
| claude-opus-4-8 | 1M | 128K |
| claude-sonnet-4-6 | 1M | 64K |
| claude-haiku-4-5 | 200K | 64K |
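Checking "measured count plus the reply" against the window is simple arithmetic, sketched below. The limits are copied from the table above, with "1M" and "K" assumed to mean decimal millions and thousands. `LIMITS` and `fits` are hypothetical names for illustration.

```python
# (context window, max output) in tokens, copied from the table above.
# Assumption: "1M" = 1,000,000 and "K" = 1,000.
LIMITS = {
    "claude-opus-4-8": (1_000_000, 128_000),
    "claude-sonnet-4-6": (1_000_000, 64_000),
    "claude-haiku-4-5": (200_000, 64_000),
}


def fits(model: str, input_tokens: int, max_tokens: int) -> bool:
    """Return True if the prompt plus the requested reply fit the model."""
    window, max_output = LIMITS[model]
    if max_tokens > max_output:
        # The reply budget alone exceeds what the model can emit.
        return False
    # The prompt and the reply share the same context window.
    return input_tokens + max_tokens <= window
```

For the 250K token prompt mentioned above with a 4,000 token reply, `fits("claude-opus-4-8", 250_000, 4_000)` is true and `fits("claude-haiku-4-5", 250_000, 4_000)` is false.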
Step 4: Trim, split, or move to a bigger window
If you are on Haiku and overflowing, switching to Opus 4.8 with its 1M window may be the whole fix. If you are already on a 1M model and still over, the document is genuinely huge. Split it into chunks, summarize each chunk in its own request, then send the summaries in a final pass. Never silently cut the middle out of a document, since that quietly drops information the user expected.
```python
# Reuses `client` and `text` from the count_tokens example in Step 2.
def chunk(text, size_chars=400_000):
    """Yield consecutive slices of at most size_chars characters."""
    for i in range(0, len(text), size_chars):
        yield text[i:i + size_chars]

# Summarize each chunk in its own request.
partials = []
for piece in chunk(text):
    r = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=2000,
        messages=[{"role": "user",
                   "content": f"Summarize this section:\n{piece}"}],
    )
    partials.append(r.content[0].text)

# Final pass: combine the per-chunk summaries into one.
final = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=4000,
    messages=[{"role": "user",
               "content": "Combine these section summaries:\n"
                          + "\n\n".join(partials)}],
)
print(final.content[0].text)
```
Result: count_tokens reported 612,000 tokens for the report. On Haiku 4.5 that overflowed instantly. Moving to Opus 4.8 cleared the 413. For a 1.4M token archive, the chunk-and-combine pass brought every individual request under the window and produced a single clean summary.
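The chunking in Step 4 slices by character count, which only approximates token size. One way to tie the split to a measured count, in the spirit of Step 2, is to halve the text until every piece fits a token budget. This is a rough sketch under assumptions: `split_to_budget` is a hypothetical helper, and `count` stands in for any function that returns a token count for a string.

```python
from typing import Callable, List


def split_to_budget(text: str, budget: int,
                    count: Callable[[str], int]) -> List[str]:
    """Recursively halve text until each piece is within budget tokens.

    Hypothetical helper. A single character is never split further,
    even if the counter reports it as over budget.
    """
    if len(text) <= 1 or count(text) <= budget:
        return [text]
    mid = len(text) // 2
    # Prefer a paragraph break shortly before the midpoint, if one exists.
    cut = text.rfind("\n\n", mid // 2, mid)
    if cut <= 0:
        cut = mid
    return (split_to_budget(text[:cut], budget, count)
            + split_to_budget(text[cut:], budget, count))
```

In practice `count` could wrap the call from Step 2, for example a function that returns `client.messages.count_tokens(...).input_tokens` for the piece. Each measurement is then a network request, so a real implementation would likely start from a coarse character-based split and only measure the pieces.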