TroubleshootingIntermediate

How to Fix 'Context Length Exceeded' Token Limit Errors

Resolve maximum context length errors by counting tokens, trimming input, and chunking long documents.

10 minIntermediate

A context_length_exceeded error means your prompt plus the requested output is larger than the model's context window. Every model has a fixed ceiling measured in tokens, not characters. The fix is to measure how many tokens you are sending, trim what you can, and split anything too big into chunks.

Your model's context window size, from the provider docs
A token counting library such as tiktoken
The long input that triggered the error

Step 1: Read what the error tells you

The message usually states the model maximum and how many tokens you tried to send. That gap is exactly how much you need to cut, plus headroom for the response.

zsh - api

{ "error": { "code": "context_length_exceeded",

"message": "This model maximum context length is 128000 tokens. However, your messages resulted in 131420 tokens." } }

Step 2: Count tokens before you send

Stop guessing. Count tokens locally so you can reject or trim oversized input before it ever hits the API. A rough rule is about four characters per token in English, but a real counter is exact.

count.py

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
text = open("doc.txt").read()
n = len(enc.encode(text))
print(f"{n} tokens")
if n > 120000:
    print("Too big, must chunk")

Step 3: Trim what you do not need

Before chunking, remove fat. Drop old chat history, strip boilerplate, and reserve room for the answer by leaving the max_tokens output budget out of your input total.

Token budget

Context window: 128000 tokens

System prompt: 420

Document: 118000

Reserved for answer: 4000

Total used: 122420 OK

Always subtract your output budget from the window before filling it with input.

Step 4: Chunk long documents

If the input is genuinely larger than the window, split it into overlapping chunks, process each, then combine the results. Overlap a few hundred tokens between chunks so you do not cut a sentence or idea in half.

chunk.py

def chunk(tokens, size=20000, overlap=500):
    out = []
    start = 0
    while start < len(tokens):
        out.append(tokens[start:start + size])
        start += size - overlap
    return out

A bigger model is not always the fix

Jumping to a larger context window costs more per call and can lose detail in the middle of very long inputs. Trimming and chunking often gives better answers for less money.

Result

After counting tokens and splitting a 200-page PDF into overlapping 20k-token chunks, a summarizer that always failed now processes the whole document and merges the chunk summaries into one clean output.