Generative AI

AI agent observability: how to see what your agent is actually doing

Konstantin Semenenko

July 3, 2026

AI agent observability is the practice of logging every step an agent takes (model, tokens, tool calls, decisions, and cost) so that failures are diagnosable instead of mysterious. It matters because agent failures are silent: the agent returns a confident, well-formatted answer that is wrong, with nothing logged by default. Without step-level tracing, you cannot tell where a multi-step run went wrong or why the bill is what it is. Observability is what turns "the agent is acting weird" into "step 7 called the wrong tool."

AI agent observability means instrumenting an agent so you can see every step it takes: which model ran, how many tokens it used, which tools it called with what arguments, what it decided, and what each step cost. It matters because agent failures are quiet by nature: the agent completes a task, returns a confident and well-formatted output, and is wrong, with nothing in standard monitoring to flag it. Without step-level visibility, a failed multi-step run is a black box. You know the output is wrong but not which step corrupted it, and you cannot tell why the token bill looks the way it does. Observability is what converts "the agent is behaving strangely" into "step 7 passed a malformed argument to the wrong tool."

We run agents in production, so this is a practical guide to what to log, why the usual monitoring misses agent failures, and how observability ties directly to both reliability and cost.

Why normal monitoring misses agent failures

Traditional software fails loudly. A database query throws an error, an API returns a 500, and your monitoring catches it. Agent failures are different: they are semantic, not syntactic. A hallucinated fact, a misread instruction, or a tool call with a subtly wrong argument does not throw an error. Each individual step looks coherent, so standard monitoring sees a successful run that produced a wrong result.

This is the core problem. An agent can misread an instruction at step two and silently propagate that error across twenty downstream steps, and because every step "succeeded," nothing fires. The corruption is upstream of any visible failure. Worse, the human factor compounds it: 84% of CIOs report no formal process for tracking AI accuracy, so cost overruns, accuracy drift, and anomalies accumulate invisibly until the damage is done. You cannot fix what you cannot see, and agents are built in a way that hides exactly the things you need to see.

What to log: the minimum viable trace

Observability starts with logging every LLM call and every tool call with enough detail to reconstruct what happened. The minimum useful trace captures, per step:

Which model ran, so you can tell whether a cheap model or an escalated expensive one handled a step.
Input tokens, output tokens, and cost per call, so spend is attributable to specific steps rather than a single blended number.
The tool called, its arguments, and its result, so you can catch a wrong tool choice or a malformed argument at the boundary where it happened.
The decision the agent made (the reasoning or plan step), so a wrong turn is traceable to where it was taken.
Success or failure of each step, including tool errors that the agent might otherwise swallow and continue past.

The principle is that any step should be reconstructable after the fact. If you cannot answer "what did the agent do at step 7 and why," you do not have observability; you have a log of the final answer.
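As a rough illustration of the per-step fields listed above, here is a minimal Python sketch of a trace record. The class names (`StepTrace`, `RunTrace`), the field names, the model name, and the tool name are all hypothetical, chosen for this example rather than taken from any particular tracing library.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class StepTrace:
    """One agent step, with enough detail to reconstruct it later."""
    step: int
    model: str                      # which model handled the step
    input_tokens: int
    output_tokens: int
    cost_usd: float
    decision: str                   # the plan or reasoning summary for the step
    tool: Optional[str] = None      # tool called, if any
    tool_args: dict[str, Any] = field(default_factory=dict)
    tool_result: Any = None
    ok: bool = True                 # success or failure of the step
    error: Optional[str] = None     # tool error text, even if the agent continued


class RunTrace:
    """All steps of one run, in the order they were recorded."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.steps: list[StepTrace] = []

    def record(self, step: StepTrace) -> None:
        self.steps.append(step)

    def at(self, step_number: int) -> Optional[StepTrace]:
        # Answers "what did the agent do at step N"; None if not recorded.
        return next((s for s in self.steps if s.step == step_number), None)

    def to_jsonl(self) -> str:
        # One JSON object per line, suitable for appending to a log file.
        return "\n".join(json.dumps(asdict(s), default=str) for s in self.steps)


# Hypothetical example: step 7 called a tool with a malformed argument.
trace = RunTrace("run-001")
trace.record(StepTrace(
    step=7,
    model="small-model",
    input_tokens=1200,
    output_tokens=80,
    cost_usd=0.0004,
    decision="look up the customer's most recent order",
    tool="search_orders",
    tool_args={"customer_id": None},
    ok=False,
    error="customer_id is required",
))

step7 = trace.at(7)
print(step7.tool, step7.tool_args, step7.error)
```

With a record like this per step, the question "what happened at step 7 and why" becomes a lookup instead of a guess.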

‍

Observability is a reliability control

Logging is not passive record-keeping; it is how you catch the failure modes that break agents. Step-level tracing is what lets you catch a corrupted step at step 2 instead of discovering the bad output at step 20. It is how you detect an agent silently swallowing a tool error and carrying a wrong result forward. It is how you notice context drift, where the agent is acting on stale assumptions, because you can see the exact context each step used.
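One way to make a swallowed tool error visible is to record the failure at the tool boundary before the agent gets a chance to recover from it. The sketch below assumes a hypothetical `traced_tool_call` helper and a toy `get_price` tool; neither comes from a real framework.

```python
from typing import Any, Callable


def traced_tool_call(
    log: list[dict],
    step: int,
    name: str,
    fn: Callable[..., Any],
    args: dict[str, Any],
) -> Any:
    """Call a tool and log the outcome, whether it succeeds or raises."""
    entry = {"step": step, "tool": name, "args": args,
             "ok": True, "result": None, "error": None}
    try:
        entry["result"] = fn(**args)
        return entry["result"]
    except Exception as exc:
        entry["ok"] = False
        entry["error"] = f"{type(exc).__name__}: {exc}"
        raise  # the caller still decides how to handle the failure
    finally:
        log.append(entry)  # runs on both the success and failure paths


def get_price(sku: str) -> float:
    """Toy tool: raises KeyError for an unknown SKU."""
    prices = {"A-100": 19.99}
    return prices[sku]


log: list[dict] = []
try:
    # The SKU has a letter O where a zero belongs: a subtly wrong argument.
    price = traced_tool_call(log, 2, "get_price", get_price, {"sku": "A-1OO"})
except KeyError:
    price = 0.0  # the agent "recovers" and carries a wrong value forward

failed_steps = [e["step"] for e in log if not e["ok"]]
print(failed_steps, log[0]["error"])
```

The run continues with `price = 0.0`, exactly the kind of silent carry-forward described above, but the trace now shows that step 2 failed and with which argument.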

This is why observability pairs with verification rather than replacing it. Quality gates decide whether work ships; observability tells you why it failed when it does not. Together they turn agent failures from mysteries into diagnosable, fixable events, which is the whole point of the failure catalog we wrote in "21 ways AI agents fail in production." Nearly every failure mode there is easier to catch with real step-level observability underneath, and some are catchable only with it.

Observability is also a cost control

The same trace that catches failures also explains the bill. Agent cost is notoriously hard to reason about because a single user request can trigger many internal LLM calls, and the expensive spend often hides in a thin slice of escalated or looping traffic. Without per-step token and cost logging, you see a large invoice and no explanation.

With it, you can answer the questions that actually reduce cost: which steps escalated to the expensive model, where retries and loops burned tokens, and how much went to re-reading context. That is what made the analysis in "what a $303,030 AI bill taught us" possible: every call was logged with model, tokens, and cost, so the spend could be split into buckets and the expensive slice found. You cannot optimize a bill you cannot itemize, and observability is the itemization.
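To make the itemization concrete, here is a small sketch that prices each logged call and groups spend by model and by bucket. The prices, model names, token counts, and the `kind` labels are all made up for illustration; real prices depend on your provider.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; not real provider pricing.
PRICES = {
    "small-model": {"in": 0.50, "out": 1.50},
    "large-model": {"in": 5.00, "out": 15.00},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one LLM call in USD under the price table above."""
    p = PRICES[model]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000


def itemize(calls: list[dict], key: str) -> dict[str, float]:
    """Total cost grouped by any logged field, e.g. 'model' or 'kind'."""
    totals: dict[str, float] = defaultdict(float)
    for c in calls:
        totals[c[key]] += call_cost(c["model"], c["input_tokens"], c["output_tokens"])
    return dict(totals)


# Three logged calls from one hypothetical run.
calls = [
    {"step": 1, "model": "small-model", "input_tokens": 2000, "output_tokens": 200, "kind": "first_attempt"},
    {"step": 2, "model": "small-model", "input_tokens": 2500, "output_tokens": 300, "kind": "retry"},
    {"step": 3, "model": "large-model", "input_tokens": 4000, "output_tokens": 500, "kind": "escalation"},
]

by_model = itemize(calls, "model")
by_kind = itemize(calls, "kind")
total = sum(by_model.values())
large_share = by_model["large-model"] / total

print(by_model)
print(by_kind)
print(f"large-model share of spend: {large_share:.0%}")
```

In this toy run, one escalated call out of three accounts for roughly 90% of the spend, which is the "thin slice" pattern that a single blended number hides.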

Making it real without drowning in data

The risk with observability is logging so much that no one looks at it. The fix is to instrument the things that predict failure and cost, not everything. Per-step model, tokens, cost, tool calls, and decisions are the high-value signals. On top of that, a few aggregate metrics earn their place: escalation rate (what share of traffic hit the expensive model), loop and retry counts (where cost and failures hide), and per-step error rates. Alert on the aggregates, and keep the traces for when you need to debug a specific run.
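The aggregates named above can be computed directly from step records. The following is a sketch under assumed field names (`name`, `model`, `attempt`, `ok`) and arbitrary alert thresholds; pick thresholds that fit your own traffic.

```python
from collections import Counter


def aggregate(steps: list[dict], expensive_models: set[str]) -> dict:
    """Escalation rate, retry count, and error rate per step name."""
    total = len(steps)
    escalated = sum(1 for s in steps if s["model"] in expensive_models)
    retries = sum(1 for s in steps if s["attempt"] > 1)
    calls_by_name = Counter(s["name"] for s in steps)
    errors_by_name = Counter(s["name"] for s in steps if not s["ok"])
    return {
        "escalation_rate": escalated / total if total else 0.0,
        "retry_count": retries,
        "error_rate_by_step": {
            name: errors_by_name[name] / count
            for name, count in calls_by_name.items()
        },
    }


def alerts(metrics: dict, max_escalation_rate: float = 0.25,
           max_error_rate: float = 0.10) -> list[str]:
    """Return a message for each aggregate above its (arbitrary) threshold."""
    fired = []
    if metrics["escalation_rate"] > max_escalation_rate:
        fired.append(f"escalation rate is {metrics['escalation_rate']:.0%}")
    for name, rate in metrics["error_rate_by_step"].items():
        if rate > max_error_rate:
            fired.append(f"error rate for step '{name}' is {rate:.0%}")
    return fired


# Hypothetical step records from one run.
steps = [
    {"name": "plan", "model": "small-model", "attempt": 1, "ok": True},
    {"name": "search", "model": "small-model", "attempt": 1, "ok": False},
    {"name": "search", "model": "small-model", "attempt": 2, "ok": True},
    {"name": "summarize", "model": "large-model", "attempt": 1, "ok": True},
    {"name": "plan", "model": "small-model", "attempt": 1, "ok": True},
]

metrics = aggregate(steps, expensive_models={"large-model"})
fired = alerts(metrics)
print(metrics)
print(fired)
```

Only the `search` step trips an alert here; the full trace for that run is what you would open next to see why the first attempt failed.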

The goal is not maximal logging; it is being able to answer, quickly, "where did this run go wrong, and where is the money going." If your instrumentation answers those two questions, it is doing its job.

The takeaway

AI agent observability is logging every step (model, tokens, tool calls, decisions, and cost) so agent behavior is diagnosable instead of mysterious. It matters because agent failures are silent and semantic: confident wrong answers that standard monitoring never flags. Real step-level tracing is both a reliability control, catching corrupted steps and swallowed errors before they propagate, and a cost control, itemizing the spend so you can find the expensive slice. Instrument the signals that predict failure and cost, alert on the aggregates, and keep the traces for debugging.

If you are running agents in production and cannot see what they are doing or why they cost what they do, building that observability in is part of how our AI Dev Team work ships reliable systems. For the framework that pairs observability with verification, see inside MCAF.

FAQ

What is AI agent observability? The practice of instrumenting an agent so you can see every step it takes: which model ran, tokens and cost, tool calls and their results, and the decisions made. The result is that failures and spend are diagnosable rather than a black box.

Why don't normal monitoring tools catch agent failures? Because agent failures are semantic, not syntactic. A hallucination, a misread instruction, or a subtly wrong tool argument does not throw an error, so each step looks successful while producing a wrong result. Standard monitoring sees a passing run.

What should you log for an AI agent? Per step: which model ran, input and output tokens and cost, the tool called with its arguments and result, the decision or plan the agent made, and whether the step succeeded, including swallowed tool errors. Enough to reconstruct what happened at any step.

How does observability reduce AI agent cost? It itemizes spend by step, so you can see which steps escalated to expensive models and where retries and loops burned tokens. Without per-step cost logging you see a large bill and no explanation; with it, you can target the expensive slice.

Is observability the same as verification? No, they pair. Verification (tests and quality gates) decides whether work ships. Observability tells you why it failed when it does not, and where the cost went. Together they turn agent failures from mysteries into fixable events.
