How to fix a context length exceeded error in the Claude API

Diagnose and resolve a request that overflows the model's context window by counting tokens and trimming the prompt.

8 min · Intermediate

When a request packs more tokens into the prompt than the model can hold, the call fails or the response stops early. The fix is not to guess at sizes. You measure the real token count with the count_tokens endpoint, then trim or restructure until the request fits with room for the reply. This guide walks through that loop for Claude.

What you need:
Python or Node with the official Anthropic SDK installed
An ANTHROPIC_API_KEY set in your environment
The exact prompt or file that triggered the failure
Five minutes to run two or three count_tokens calls

Step 1: Read the actual error

A request that is simply too big returns a 413 request_too_large before the model even runs. A request that fits on the way in but exhausts the window during generation comes back with stop_reason equal to model_context_window_exceeded. These are different problems. The first means shrink the input. The second means leave more headroom for output.

Terminal — request rejected

$ python summarize.py big_report.txt
anthropic.APIStatusError: 413 request_too_large
message: prompt is too large for the model context window
request_id: req_011CSHoEeqs5C35K2UUqR7Fy

A 413 means the input never fit; the model did not run.
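The two cases from this step can be told apart in code. The sketch below is a hypothetical helper, not part of the SDK: the name `diagnose` and its return strings are illustrative, and it assumes you pass in either the HTTP status of a failed call or the `stop_reason` of a completed one, using the values named above.

```python
from typing import Optional


def diagnose(status_code: Optional[int] = None,
             stop_reason: Optional[str] = None) -> str:
    """Map the failure signals from Step 1 to a next action.

    Hypothetical helper: the name and return strings are illustrative.
    """
    # Rejected up front: the input itself never fit.
    if status_code == 413:
        return "shrink the input"
    # Accepted, but generation ran out of window.
    if stop_reason == "model_context_window_exceeded":
        return "leave more headroom for output"
    return "not a context length problem"


print(diagnose(status_code=413))
print(diagnose(stop_reason="model_context_window_exceeded"))
```

With the Python SDK, the status is available as the `status_code` attribute on a caught `anthropic.APIStatusError`, and `stop_reason` is a field on the returned message.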

Step 2: Count the tokens for real

Do not estimate with a character count or a third-party tokenizer. Claude has its own tokenizer, so the only accurate number comes from count_tokens with the same model id you plan to call. Run it against the prompt that failed.

count.py

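from pathlib import Path

from anthropic import Anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = Anthropic()
text = Path("big_report.txt").read_text(encoding="utf-8")

# Count with the same model id you plan to call.
resp = client.messages.count_tokens(
    model="claude-opus-4-8",
    messages=[{"role": "user", "content": text}],
)
print(resp.input_tokens)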

Budget for the answer too

The context window holds input plus output. If you ask for max_tokens of 8000, leave at least that many tokens of headroom below the window limit, plus a small margin for the system prompt and thinking.
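The budget rule is plain arithmetic, so it can be checked before sending anything. A minimal sketch follows; the function name `fits_in_window` and the default margin of 1,000 tokens are assumptions chosen for illustration, not values from the API.

```python
def fits_in_window(input_tokens: int, max_tokens: int,
                   window: int, margin: int = 1_000) -> bool:
    """Return True if input, reply budget, and margin all fit in the window.

    The 1,000-token default margin is an illustrative assumption.
    """
    return input_tokens + max_tokens + margin <= window


# A 200K window with an 8000-token reply budget:
print(fits_in_window(190_000, 8_000, window=200_000))  # True  (199,000 <= 200,000)
print(fits_in_window(195_000, 8_000, window=200_000))  # False (204,000 >  200,000)
```

Feed it the `input_tokens` value reported by count_tokens rather than an estimate.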

Step 3: Confirm the window for your model

Claude Opus 4.8, Opus 4.7, Opus 4.6, Sonnet 4.6, and Fable 5 all have a 1M token context window. Haiku 4.5 has 200K. So a 250K token prompt is fine on Opus but will overflow Haiku. Pick the model whose window is comfortably larger than your measured count plus the reply.

Model	Context window	Max output
claude-opus-4-8	1M	128K
claude-sonnet-4-6	1M	64K
claude-haiku-4-5	200K	64K
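The table can be turned into a quick lookup. This is a sketch under stated assumptions: the limits are copied from the table above, "K" and "M" are read as decimal thousands and millions, and the helper name `models_that_fit` is hypothetical.

```python
# Limits copied from the table above (assumed decimal: 1M = 1,000,000).
LIMITS = {
    "claude-opus-4-8":   {"window": 1_000_000, "max_output": 128_000},
    "claude-sonnet-4-6": {"window": 1_000_000, "max_output": 64_000},
    "claude-haiku-4-5":  {"window": 200_000,   "max_output": 64_000},
}


def models_that_fit(input_tokens: int, max_tokens: int) -> list[str]:
    """List models whose window holds input plus reply and whose
    output cap allows the requested max_tokens."""
    return [
        name for name, lim in LIMITS.items()
        if input_tokens + max_tokens <= lim["window"]
        and max_tokens <= lim["max_output"]
    ]


# The 250K-token prompt from the text, with an 8000-token reply:
print(models_that_fit(250_000, 8_000))
```

For the 250K example this leaves the two 1M models and drops Haiku, matching the reasoning in this step.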

Step 4: Trim, split, or move to a bigger window

If you are on Haiku and overflowing, switching to Opus 4.8 with its 1M window may be the whole fix. If you are already on a 1M model and still over, the document is genuinely huge. Split it into chunks, summarize each chunk in its own request, then send the summaries in a final pass. Never silently cut the middle out of a document, since that quietly drops information the user expected.

chunk.py

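from pathlib import Path

from anthropic import Anthropic

client = Anthropic()
text = Path("big_report.txt").read_text(encoding="utf-8")


def chunk(text, size_chars=400_000):
    """Yield consecutive slices of at most size_chars characters."""
    for i in range(0, len(text), size_chars):
        yield text[i:i + size_chars]


# First pass: summarize each chunk in its own request.
partials = []
for piece in chunk(text):
    r = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=2000,
        messages=[{"role": "user",
                   "content": f"Summarize this section:\n{piece}"}],
    )
    partials.append(r.content[0].text)

# Final pass: combine the per-chunk summaries into one answer.
final = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=4000,
    messages=[{"role": "user",
               "content": "Combine these section summaries:\n"
                          + "\n\n".join(partials)}],
)
print(final.content[0].text)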
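Fixed-width slicing can cut a sentence in half at a chunk boundary. One possible variation, sketched here as an assumption rather than the guide's method, packs whole paragraphs into each chunk and only hard-splits a paragraph that is itself longer than the limit. The name `chunk_by_paragraph` is hypothetical, and it assumes paragraphs are separated by blank lines.

```python
from typing import Iterator


def chunk_by_paragraph(text: str, size_chars: int = 400_000) -> Iterator[str]:
    """Yield chunks of at most size_chars, breaking on blank lines where possible."""
    current = ""
    for para in text.split("\n\n"):
        # A single oversized paragraph is hard-split, like plain slicing.
        while len(para) > size_chars:
            if current:
                yield current
                current = ""
            yield para[:size_chars]
            para = para[size_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) > size_chars:
            # Adding this paragraph would overflow: close the current chunk.
            yield current
            current = para
        else:
            current = candidate
    if current:
        yield current


pieces = list(chunk_by_paragraph("aaaa\n\nbbbb\n\ncc", size_chars=10))
print(pieces)  # ['aaaa\n\nbbbb', 'cc']
```

Character limits are still only a proxy for tokens, so it is worth running count_tokens on the largest chunk before sending the batch.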

Result: count_tokens reported 612,000 tokens for the report. On Haiku 4.5 that overflowed instantly. Moving to Opus 4.8 cleared the 413. For a 1.4M token archive, the chunk-and-combine pass brought every individual request under the window and produced a single clean summary.