Observability, Cost, and Evals
Once an AI automation handles real volume, two questions decide whether it survives: is it still correct, and is it bankrupting you? You answer both by instrumenting the system. You cannot improve what you cannot see, and AI steps fail in quiet, drifting ways that classic monitoring misses.
Log every model call
For each AI step, record the input, the output, the model used, token counts, latency, and cost. Tools like Langfuse or Helicone capture this for you through an SDK or a proxy, or you can log to your own table. This trace is what lets you debug a bad answer a week later.
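If you go the "own table" route, a minimal sketch might look like the following. It assumes a local SQLite table and a `call_model` wrapper that returns the output text plus token counts; the table name, the `example-small` model name, and the prices are all placeholders for illustration, not real rates.

```python
import sqlite3
import time
from datetime import datetime, timezone

# Hypothetical (input, output) prices in USD per million tokens.
# Replace with your provider's actual rates.
PRICES = {"example-small": (1.00, 5.00)}


def init_db(conn):
    """Create the trace table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS model_calls (
               ts TEXT, model TEXT, step TEXT, input TEXT, output TEXT,
               input_tokens INTEGER, output_tokens INTEGER,
               cost_usd REAL, latency_ms INTEGER)"""
    )


def estimate_cost(model, input_tokens, output_tokens):
    """Convert token counts to dollars using the price table above."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def logged_call(conn, step, model, prompt, call_model):
    """Run one model call and record a trace row for it.

    call_model is an assumed wrapper around your client; it must return
    (output_text, input_tokens, output_tokens).
    """
    start = time.perf_counter()
    output, in_tok, out_tok = call_model(model, prompt)
    latency_ms = int((time.perf_counter() - start) * 1000)
    conn.execute(
        "INSERT INTO model_calls VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), model, step, prompt, output,
         in_tok, out_tok, estimate_cost(model, in_tok, out_tok), latency_ms),
    )
    conn.commit()
    return output


def fake_model(model, prompt):
    """Stand-in for a real API call so the sketch runs offline."""
    return "Billing", 412, 6


conn = sqlite3.connect(":memory:")
init_db(conn)
result = logged_call(conn, "classify_ticket", "example-small",
                     "I was charged twice this month", fake_model)
```

Wrapping the call in one function means every step is logged the same way, and a later query over `model_calls` can total cost per step or per model.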
{
  "ts": "2026-06-21T09:14:02Z",
  "model": "claude-sonnet-4-6",
  "step": "classify_ticket",
  "input_tokens": 412,
  "output_tokens": 6,
  "cost_usd": 0.0021,
  "latency_ms": 740,
  "output": "Billing"
}
Cut cost without cutting quality
- Route simple cases to a small model and only escalate hard ones to a big model.
- Cache repeated prompts and reuse stable context to avoid paying for the same tokens twice.
- Trim the prompt: less retrieved context and shorter system text cost less and often give sharper answers.
- Filter junk before the AI step so you never pay to classify spam.
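Three of the tactics above (filtering, caching, and routing) can be combined in a few lines. This is a sketch under stated assumptions: the model names are hypothetical, `call_model` is an assumed wrapper that returns a label and a confidence score, the junk rule is a placeholder, and the cache here is a simple in-process response cache rather than a provider's prompt-caching feature.

```python
import hashlib

SMALL_MODEL = "example-small"   # hypothetical model names
LARGE_MODEL = "example-large"

_cache = {}  # normalized-input hash -> (label, model_used)


def is_junk(text):
    """Cheap pre-filter (placeholder rule): empty or link-only messages."""
    stripped = text.strip()
    return not stripped or stripped.lower().startswith("http")


def classify(text, call_model, min_confidence=0.8):
    """Filter, then cache, then route small-model-first.

    call_model(model, text) -> (label, confidence) is an assumed interface.
    """
    if is_junk(text):
        return "junk", None              # no model call, no cost
    key = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
    if key in _cache:
        return _cache[key]               # repeated input, nothing to pay
    label, confidence = call_model(SMALL_MODEL, text)
    model_used = SMALL_MODEL
    if confidence < min_confidence:      # escalate only the hard cases
        label, confidence = call_model(LARGE_MODEL, text)
        model_used = LARGE_MODEL
    _cache[key] = (label, model_used)
    return _cache[key]


calls = []  # records which model was actually invoked


def fake_model(model, text):
    """Stand-in model: the small model is unsure about refund requests."""
    calls.append(model)
    if model == SMALL_MODEL and "refund" in text:
        return "Billing", 0.55
    return "Billing", 0.95


first = classify("Where is my invoice?", fake_model)
second = classify("Where is my invoice?", fake_model)   # served from cache
hard = classify("I want a refund for a weird charge", fake_model)
spam = classify("http://spam.example", fake_model)      # never reaches a model
```

After these four requests, `calls` holds only three model invocations, and only one of them went to the large model. How you obtain a confidence score depends on your setup; a self-reported score from the model is only a rough signal.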
Evals: catch regressions
Keep a small set of real inputs with known-correct outputs. Whenever you change a prompt or swap a model, run that set and compare the results against the previous run. This eval set is your seatbelt: it tells you when a tweak that looked better actually broke ten other cases.
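A minimal sketch of such a check, assuming a classification step with exact-match grading. The eval cases, the labels, and the two stand-in classifiers are hypothetical; in practice each classifier would be your real pipeline run with the old and the new prompt.

```python
# Hypothetical eval cases: real inputs paired with known-correct labels.
EVAL_SET = [
    {"input": "I was charged twice this month", "expected": "Billing"},
    {"input": "The app crashes when I log in", "expected": "Technical"},
    {"input": "How do I change my email address?", "expected": "Account"},
]


def run_evals(classify, eval_set):
    """Run every case and return (accuracy, list of failing cases)."""
    failures = []
    for case in eval_set:
        actual = classify(case["input"])
        if actual != case["expected"]:
            failures.append({**case, "actual": actual})
    accuracy = 1 - len(failures) / len(eval_set)
    return accuracy, failures


def old_prompt(text):
    """Stand-in for the pipeline with the current prompt."""
    if "charged" in text:
        return "Billing"
    if "crash" in text:
        return "Technical"
    return "Account"


def new_prompt(text):
    """Stand-in for the pipeline after a tweak that breaks one case."""
    if "charged" in text:
        return "Billing"
    if "email" in text:
        return "Account"
    return "Other"


old_acc, old_fail = run_evals(old_prompt, EVAL_SET)
new_acc, new_fail = run_evals(new_prompt, EVAL_SET)

# Flag the change if accuracy dropped, and show exactly which cases broke.
if new_acc < old_acc:
    for case in new_fail:
        print(f"REGRESSION: {case['input']!r} "
              f"expected {case['expected']}, got {case['actual']}")
```

Exact match works for short labels; free-text outputs need a looser grader, such as a keyword check or a second model acting as judge.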