
How to cap spend on a Claude agent with a task budget

Give an agentic loop a token budget so the model paces itself and finishes gracefully instead of running away.


An agent that loops through many tool calls can burn far more tokens than you expected before it finishes. A task budget tells the model how many tokens it has for the whole loop. The model sees a running countdown and self-moderates, wrapping up gracefully rather than being cut off. This is different from max_tokens, which is a hard per-response ceiling the model is not aware of.

Before you start, you need:

  • The Anthropic Python SDK and an API key
  • A model that supports task budgets (Opus 4.7, Opus 4.8, or Fable 5)
  • An agentic loop with tools

Step 1: Add task_budget to the request

Task budgets are in beta. Use the beta streaming endpoint (client.beta.messages.stream), pass the task budget beta header through betas, and set task_budget inside output_config. The total must be at least 20,000 tokens. Stream the response so that a large max_tokens value does not trigger an HTTP timeout.
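Because the total must be at least 20,000 tokens, it can help to validate the budget before sending a request. The sketch below builds the output_config dict used in this step and raises an error for totals below the minimum. make_output_config and MIN_TASK_BUDGET are hypothetical names for illustration, not part of the Anthropic SDK.

```python
# Hypothetical helper (not part of the Anthropic SDK): build the output_config
# dict for a budgeted request and reject totals below the 20,000-token minimum.
MIN_TASK_BUDGET = 20_000


def make_output_config(total: int, effort: str = "high") -> dict:
    """Return an output_config dict carrying a token-based task budget."""
    if total < MIN_TASK_BUDGET:
        raise ValueError(
            f"task_budget total must be at least {MIN_TASK_BUDGET} tokens, got {total}"
        )
    return {
        "effort": effort,
        "task_budget": {"type": "tokens", "total": total},
    }


config = make_output_config(64_000)
print(config)
```

The returned dict can be passed as output_config= in the request below.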

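budgeted_agent.py
```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical example tool; replace with your agent's real tool definitions.
tools = [
    {
        "name": "read_file",
        "description": "Read a file from the repository.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    }
]

with client.beta.messages.stream(
    model="claude-opus-4-8",
    max_tokens=128000,  # hard per-response ceiling the model cannot see
    betas=["task-budgets-2026-03-13"],
    output_config={
        "effort": "high",
        # soft total for the whole loop; the model sees the countdown
        "task_budget": {"type": "tokens", "total": 64000},
    },
    messages=[{"role": "user", "content": "Audit the repo and list issues."}],
    tools=tools,
) as stream:
    response = stream.get_final_message()
```

Budget vs max_tokens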
max_tokens is an enforced ceiling per response that the model cannot see. task_budget is a softer total for the whole loop that the model does see, so it can prioritize and finish cleanly. Use both: max_tokens as a hard cap, task_budget for graceful pacing.
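One practical way to tell the two limits apart is the response's stop_reason. When the hard max_tokens ceiling is hit, the response ends with stop_reason "max_tokens" and the output is truncated; a loop that paces itself to its task budget should instead finish with "end_turn". "tool_use" marks an ordinary mid-loop turn where the model wants a tool result. The helper below is a hypothetical sketch for logging which case occurred, not an SDK function.

```python
# Hypothetical helper: explain why a turn of a budgeted agent loop ended,
# based on the stop_reason value returned on Messages API responses.
def describe_stop(stop_reason: str) -> str:
    if stop_reason == "end_turn":
        return "finished cleanly"
    if stop_reason == "tool_use":
        return "wants a tool call; run the tool and continue the loop"
    if stop_reason == "max_tokens":
        return "hit the hard max_tokens ceiling; output was cut off"
    # Any other value (for example "stop_sequence") is reported as-is.
    return f"stopped for another reason: {stop_reason}"


for reason in ["tool_use", "end_turn", "max_tokens"]:
    print(reason, "->", describe_stop(reason))
```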

Step 2: Track spend across the loop

To report progress, add up usage.output_tokens across iterations, plus the tokens of any tool results you feed back in. In a normal loop, leave the budget's remaining field unset; the server tracks the countdown itself.

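track.py
```python
# Before the loop: running total of output tokens across all iterations.
spent = 0

# Inside the loop, after each response = stream.get_final_message():
spent += response.usage.output_tokens
print(f"output so far: {spent} tokens")
```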

Terminal: agent pacing to budget
```text
iter 1 output: 8,400 (budget 64,000)
iter 2 output: 19,100
iter 3 output: 41,700
iter 4 output: 58,900 <- wrapping up, near budget
done stop_reason: end_turn
```
The model spends most of the budget, then writes its summary before running out.
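The bookkeeping in this step can be wrapped in a small counter that also produces log lines like the ones above. This is a minimal sketch, assuming a hypothetical BudgetTracker class and a hypothetical 90% threshold for the "near budget" note. It only mirrors spend for display; the server still tracks the real countdown. The per-iteration output counts are chosen so the running totals match the terminal example, and tool-result tokens default to zero because how you count them depends on your setup.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace


@dataclass
class BudgetTracker:
    """Display-only counter for a task budget (hypothetical helper)."""

    total: int
    near_fraction: float = 0.9  # hypothetical threshold for the "near budget" note
    spent: int = 0
    history: list = field(default_factory=list)

    def record(self, usage, tool_result_tokens: int = 0) -> str:
        # Add this iteration's output tokens plus any tool-result tokens fed back in.
        self.spent += usage.output_tokens + tool_result_tokens
        self.history.append(self.spent)
        line = f"iter {len(self.history)} spent: {self.spent:,} (budget {self.total:,})"
        if self.spent >= self.near_fraction * self.total:
            line += " <- near budget"
        return line


tracker = BudgetTracker(total=64_000)
# Stand-in usage objects; in a real loop, pass response.usage from each response.
for out in [8_400, 10_700, 22_600, 17_200]:
    print(tracker.record(SimpleNamespace(output_tokens=out)))
```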

Step 3: Tune the budget per task type

Set a generous budget for open-ended exploration and a tight one for latency-sensitive jobs. If the budget is too small for the task, the model may finish less thoroughly and say it ran out of budget. Measure a few real runs, then set the total a bit above the median.
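The "measure, then set a bit above the median" rule is easy to automate. In this sketch, suggest_budget is a hypothetical helper, the 20% headroom is an assumed default rather than a documented value, and the result is clamped to the 20,000-token minimum.

```python
import math
import statistics

MIN_TASK_BUDGET = 20_000  # minimum task_budget total


def suggest_budget(measured_totals: list[int], headroom_pct: int = 20) -> int:
    """Suggest a task_budget total a bit above the median of real runs.

    headroom_pct is a hypothetical default; tune it for your workload.
    """
    median = statistics.median(measured_totals)
    suggested = math.ceil(median * (100 + headroom_pct) / 100)
    return max(MIN_TASK_BUDGET, suggested)


# Token totals from a few hypothetical measured runs of the same task type.
print(suggest_budget([40_000, 52_000, 61_000]))  # 62400
print(suggest_budget([5_000, 6_000, 7_000]))  # 20000 (clamped to the minimum)
```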

Pair with effort
Effort controls how deeply the model thinks per step; task budget caps the cumulative spend across the whole loop. They are complementary. Lower effort with a fixed budget keeps cost predictable on routine agent runs.
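A simple way to combine the two controls is a per-task-type profile that sets both effort and task_budget. The profile names, budget numbers, and the "low" effort value below are assumptions for illustration, not recommendations; check which effort levels your model accepts.

```python
# Hypothetical per-task-type profiles: effort sets per-step depth,
# task_budget caps cumulative spend. Numbers are placeholders.
PROFILES = {
    "exploration": {"effort": "high", "budget": 96_000},
    "routine": {"effort": "low", "budget": 32_000},
}


def output_config_for(task_type: str) -> dict:
    """Build an output_config dict from a named profile."""
    profile = PROFILES[task_type]
    return {
        "effort": profile["effort"],
        "task_budget": {"type": "tokens", "total": profile["budget"]},
    }


print(output_config_for("routine"))
```

Keeping the profiles in one table makes it easy to adjust a budget after measuring real runs without touching the loop code.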

Result: the repo audit agent that previously ran past 100,000 output tokens before stopping now paces itself to a 64,000-token budget, delivering a complete issue list and a clean end_turn instead of an abrupt cutoff.
