Your AI agent works in development. It passes tests. You deploy it. Then a user reports: "It gave me a completely wrong answer." Now what?
Without observability, debugging an AI agent is like debugging a web app with no logs — impossible. You can't see which tools it called, what the LLM returned at each step, why it chose one path over another, or where the reasoning broke down.
This guide covers everything you need to make your AI agent observable: what to trace, how to structure logs, which tools to use, and how to build dashboards that actually help you debug production issues.
Why Agent Observability Is Different
Traditional application monitoring tracks request/response pairs. AI agent observability needs to track multi-step reasoning chains where each step involves an LLM call, a tool invocation, or a decision point.
| Traditional App | AI Agent |
|---|---|
| Deterministic flow | Non-deterministic (LLM decides the path) |
| Fixed number of steps | Variable steps (1 to 50+) |
| Errors are clear | Errors can be subtle (correct format, wrong content) |
| Latency is predictable | Latency varies 10x based on reasoning path |
| Cost is fixed per request | Cost varies based on tokens consumed |
| One service call | Multiple LLM + tool calls per request |
You need three pillars of observability for agents: traces (the full execution path), logs (what happened at each step), and metrics (aggregate performance data).
Pillar 1: Distributed Tracing for Agents
A trace captures the full lifecycle of a single agent request — every LLM call, tool invocation, and decision point.
Trace Structure
A typical agent trace looks like this:
Request: "What were our sales last quarter?"
│
├── [Span] LLM Decision (420ms, 850 tokens)
│ └── Decision: Call tool "query_database"
│
├── [Span] Tool: query_database (180ms)
│ ├── Input: SELECT SUM(amount) FROM sales WHERE quarter='Q4-2025'
│ └── Output: {"total": 1247500}
│
├── [Span] LLM Decision (380ms, 620 tokens)
│ └── Decision: Call tool "format_currency"
│
├── [Span] Tool: format_currency (2ms)
│ └── Output: "$1,247,500"
│
└── [Span] LLM Response (290ms, 430 tokens)
└── "Your sales last quarter were $1,247,500..."
Total: 1,272ms | 1,900 tokens | $0.008 | 5 spans
Implementation with OpenTelemetry
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Setup
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-agent")
class TracedAgent:
def run(self, user_input: str) -> str:
with tracer.start_as_current_span("agent_request") as span:
span.set_attribute("user.input", user_input[:200])
span.set_attribute("agent.version", "1.2.0")
steps = 0
total_tokens = 0
while True:
steps += 1
# Trace LLM decision
with tracer.start_as_current_span("llm_decision") as llm_span:
decision = self.llm.decide(user_input, self.context)
llm_span.set_attribute("llm.model", "gpt-4o")
llm_span.set_attribute("llm.tokens.input", decision.input_tokens)
llm_span.set_attribute("llm.tokens.output", decision.output_tokens)
llm_span.set_attribute("llm.decision_type", decision.type)
total_tokens += decision.total_tokens
if decision.type == "respond":
span.set_attribute("agent.steps", steps)
span.set_attribute("agent.total_tokens", total_tokens)
span.set_attribute("agent.cost_usd", total_tokens * 0.000003)
return decision.content
# Trace tool execution
with tracer.start_as_current_span("tool_call") as tool_span:
tool_span.set_attribute("tool.name", decision.tool)
tool_span.set_attribute("tool.input", str(decision.args)[:500])
result = self.tools.execute(decision.tool, decision.args)
tool_span.set_attribute("tool.output", str(result)[:500])
tool_span.set_attribute("tool.success", result.success)
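The flat 0.000003-per-token rate in the cost attribute above is a simplification: input and output tokens are priced differently. A small helper keeps the arithmetic in one place (the rates below are illustrative placeholders, not current pricing):

```python
# Illustrative per-token USD rates; real pricing varies by model and changes often.
RATES = {
    "gpt-4o": {"input": 0.0000025, "output": 0.00001},
}

def estimate_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Estimate request cost in USD from token counts, using assumed rates."""
    rate = RATES[model]
    return round(tokens_in * rate["input"] + tokens_out * rate["output"], 6)
```

Record the result as both a span attribute and a log field so traces and logs agree on cost.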
Pillar 2: Structured Logging
Traces show the flow. Logs capture the details. For AI agents, structured JSON logs are essential — you'll need to filter, aggregate, and search them programmatically.
What to Log at Each Step
| Event | Required Fields | Optional Fields |
|---|---|---|
| Request received | request_id, user_id, input (truncated) | session_id, source |
| LLM call | model, tokens_in, tokens_out, latency_ms, decision | temperature, prompt_hash |
| Tool call | tool_name, input, output, success, latency_ms | retry_count, error_type |
| Guardrail triggered | guardrail_name, reason, action_taken | input_that_triggered, severity |
| Response sent | request_id, latency_total_ms, total_tokens, cost_usd | user_satisfaction |
| Error | error_type, error_message, step, stack_trace | recovery_action |
Implementation
import json
import logging
import time
from uuid import uuid4
class AgentLogger:
def __init__(self):
self.logger = logging.getLogger("agent")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(message)s"))
self.logger.addHandler(handler)
self.logger.setLevel(logging.INFO)
def _log(self, event: str, **kwargs):
entry = {
"timestamp": time.time(),
"event": event,
**kwargs
}
self.logger.info(json.dumps(entry))
def request_start(self, request_id: str, user_input: str):
self._log("request_start",
request_id=request_id,
input_preview=user_input[:200],
input_length=len(user_input))
def llm_call(self, request_id: str, model: str, tokens_in: int,
tokens_out: int, latency_ms: int, decision: str):
self._log("llm_call",
request_id=request_id,
model=model,
tokens_in=tokens_in,
tokens_out=tokens_out,
latency_ms=latency_ms,
decision_type=decision,
cost_usd=round((tokens_in * 0.000003 + tokens_out * 0.000015), 6))
def tool_call(self, request_id: str, tool: str, success: bool,
latency_ms: int, error: str = None):
self._log("tool_call",
request_id=request_id,
tool=tool,
success=success,
latency_ms=latency_ms,
error=error)
def request_end(self, request_id: str, total_ms: int,
total_tokens: int, steps: int, cost_usd: float):
self._log("request_end",
request_id=request_id,
total_ms=total_ms,
total_tokens=total_tokens,
steps=steps,
cost_usd=cost_usd)
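The payoff of JSON logs is that you can query them programmatically. A sketch that totals per-request spend from a stream of log lines (field names match the logger above):

```python
import json
from collections import defaultdict

def cost_by_request(lines):
    """Sum cost_usd per request_id across llm_call log entries."""
    totals = defaultdict(float)
    for line in lines:
        entry = json.loads(line)
        if entry.get("event") == "llm_call":
            totals[entry["request_id"]] += entry.get("cost_usd", 0.0)
    return dict(totals)

logs = [
    '{"event": "llm_call", "request_id": "r1", "cost_usd": 0.004}',
    '{"event": "llm_call", "request_id": "r1", "cost_usd": 0.002}',
    '{"event": "tool_call", "request_id": "r1", "success": true}',
]
print(cost_by_request(logs))
```

The same pattern works for latency, steps, or error counts, whether you run it as a script or translate it into your log platform's query language.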
Pillar 3: Metrics and Dashboards
Metrics give you the bird's-eye view. While traces help you debug individual requests, metrics tell you how your agent is performing overall.
Essential Agent Metrics
| Metric | Type | Alert Threshold |
|---|---|---|
| Request latency (p50, p95, p99) | Histogram | p95 > 30s |
| Tokens per request | Histogram | p99 > 10,000 |
| Cost per request | Histogram | p99 > $0.50 |
| Steps per request | Histogram | Mean > 8 |
| Tool success rate | Counter | < 95% |
| LLM error rate | Counter | > 2% |
| Guardrail trigger rate | Counter | > 10% |
| Daily cost | Gauge | > budget * 0.8 |
| Requests per minute | Counter | Spike > 3x baseline |
Prometheus Metrics Example
from prometheus_client import Histogram, Counter, Gauge
# Latency
agent_latency = Histogram(
"agent_request_duration_seconds",
"Time to complete an agent request",
buckets=[0.5, 1, 2, 5, 10, 20, 30, 60]
)
# Cost
agent_cost = Histogram(
"agent_request_cost_usd",
"Cost per agent request in USD",
buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
)
# Token usage
agent_tokens = Histogram(
"agent_tokens_total",
"Total tokens per request",
["model"],
buckets=[100, 500, 1000, 2000, 5000, 10000]
)
# Tool calls
tool_calls = Counter(
"agent_tool_calls_total",
"Total tool calls",
["tool_name", "status"] # status: success, failure, blocked
)
# Daily spend
daily_spend = Gauge(
"agent_daily_spend_usd",
"Running total spend for today"
)
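Prometheus computes percentiles from the histograms above; for quick offline checks against raw latency samples, a nearest-rank percentile is enough. A dependency-free sketch:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: fine for spot checks, not a histogram replacement."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

latencies = [0.8, 1.2, 0.9, 4.5, 1.1, 0.7, 12.0, 1.0, 0.9, 1.3]  # seconds
```

Comparing p50 against p95 on the same sample quickly shows whether slowness is systemic or a long tail.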
Observability Tools Compared
You don't have to build everything from scratch. Here are the main tools for agent observability in 2026:
| Tool | Best For | Price | Key Feature |
|---|---|---|---|
| LangSmith | LangChain agents | Free tier + $39/mo | Full trace visualization, playground replay |
| Langfuse | Any framework (open source) | Free (self-host) / $59/mo | Open source, prompt management |
| Arize Phoenix | LLM evaluation | Free (open source) | Embedding visualization, eval workflows |
| OpenTelemetry + Jaeger | Custom agents | Free (open source) | Standard protocol, any backend |
| Helicone | LLM API proxy | Free tier + $20/mo | Zero-code setup, cost tracking |
| Braintrust | Evals + observability | Free tier + usage | Eval-first, CI integration |
| Datadog LLM Observability | Enterprise | Contact sales | Full APM integration, RBAC |
Quick Setup: Langfuse (Open Source)
# pip install langfuse
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
langfuse = Langfuse(
public_key="pk-...",
secret_key="sk-...",
host="https://cloud.langfuse.com" # or self-hosted
)
@observe() # Automatically traces this function
def run_agent(user_input: str) -> str:
langfuse_context.update_current_observation(
input=user_input,
metadata={"version": "1.2.0"}
)
# LLM call — automatically captured
decision = call_llm(user_input)
# Tool call: nested @observe() functions show up as child spans
@observe(name="tool_call")
def traced_tool(tool, args):
    langfuse_context.update_current_observation(metadata={"tool": tool})
    return execute_tool(tool, args)

result = traced_tool(decision.tool, decision.args)
response = generate_response(result)
langfuse_context.update_current_observation(
output=response,
usage={"total_tokens": 1500}
)
return response
Debugging Common Agent Failures
Here are the most common production issues and how observability helps you diagnose them:
1. Wrong Answer (Hallucination)
Symptom: Agent returns confident but incorrect information.
Debug with traces: Check the tool call outputs. Did the tool return correct data? If yes, the LLM misinterpreted it. If no, the tool query was wrong (often a hallucinated SQL query or wrong API parameter).
# Look for in your logs:
# 1. Tool output vs final response — do they match?
# 2. Did the LLM call the right tool?
# 3. Were the tool arguments correct?
# Common fix: Add output verification step
verification = llm.generate(f"""
Given this tool output: {tool_result}
And this response draft: {agent_response}
Does the response accurately reflect the data? YES/NO + explanation
""")
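Before spending an LLM call on verification, a cheap heuristic can catch obvious mismatches: check that the numbers in the tool output actually appear in the drafted response. A sketch (it only checks one direction and ignores formatting beyond thousands separators):

```python
import re

def numbers_match(tool_output: str, response: str) -> bool:
    """True if every number in the tool output also appears in the response."""
    numbers = re.findall(r"\d[\d,.]*", tool_output)
    normalized = response.replace(",", "")
    return all(n.replace(",", "") in normalized for n in numbers)
```

When this returns False, escalate to the LLM verification step; when True, you may still want to sample-verify, since matching digits don't guarantee correct interpretation.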
2. Infinite Loop
Symptom: Request never completes, high token usage.
Debug with traces: Look at the steps count. If it's near your max, check the last 5 actions — you'll usually see a pattern: the agent tries tool A, gets an error, tries tool A again with slightly different args, gets the same error, etc.
# Prevention: keep an ordered list of actions and check for repeats
if len(actions) >= 10 and actions[-5:] == actions[-10:-5]:  # same 5 actions twice in a row
    logger.warning("Loop detected", pattern=actions[-5:])
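That window comparison generalizes to a small helper you can unit-test (a sketch; the action list would hold tool names or (tool, args) tuples in practice):

```python
def is_looping(actions, window=5):
    """True when the last `window` actions exactly repeat the `window` before them."""
    if len(actions) < 2 * window:
        return False
    return actions[-window:] == actions[-2 * window:-window]
```

When it fires, either abort the request or inject a "respond with what you have" instruction rather than burning more tokens.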
3. Slow Responses
Symptom: Latency spikes to 30s+.
Debug with traces: Look at the span durations. Is one LLM call taking 15s (model congestion)? Is a tool call timing out? Is the agent taking too many steps?
The fix depends on what's slow:
- LLM slow: Route simple steps to a faster model
- Tool slow: Add timeouts and cache repeated results
- Too many steps: Tighten the system prompt, add few-shot examples
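For the tool-timeout fix, a wrapper that caps how long the agent waits is a few lines of standard library (a sketch; the timeout and fallback values are policy decisions):

```python
import concurrent.futures
import time

def call_with_timeout(fn, *args, timeout_s=5.0, fallback=None):
    """Run a tool call with a hard timeout so one hung tool can't stall the agent."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback  # note: the worker thread keeps running in the background
    finally:
        pool.shutdown(wait=False)

slow_result = call_with_timeout(time.sleep, 0.3, timeout_s=0.05, fallback="timed out")
```

Log every timeout with the tool name so the dashboard's tool panel surfaces chronically slow tools.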
4. Unexpected Tool Usage
Symptom: Agent calls tools it shouldn't, or calls them with wrong arguments.
Debug with traces: Check the LLM decision span. What was in the context when it decided to call that tool? Usually it's a prompt injection, ambiguous user input, or missing tool description.
Building an Agent Debug Dashboard
Here's what an effective agent dashboard should show:
Overview Panel
- Requests per minute (last 24h graph)
- p50/p95 latency (last 24h graph)
- Error rate (target: < 2%)
- Daily cost (running total vs budget)
Deep Dive Panel
- Slowest requests (clickable to trace view)
- Most expensive requests (token usage breakdown)
- Failed requests (error categorization)
- Guardrail triggers (which guardrails fire most)
Tool Performance Panel
- Success rate per tool
- Average latency per tool
- Most called tools (shows agent behavior patterns)
- Tool errors by type
Cost Panel
- Cost per request histogram
- Cost breakdown by model
- Daily/weekly/monthly spend trends
- Cost per user (identify expensive patterns)
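Several of these panels reduce to simple aggregations over the structured logs. A sketch computing per-tool success rate, assuming the tool_call fields from the logging section:

```python
from collections import defaultdict

def tool_success_rates(entries):
    """Per-tool success rate from parsed tool_call log entries."""
    stats = defaultdict(lambda: [0, 0])  # tool -> [successes, total]
    for entry in entries:
        if entry.get("event") != "tool_call":
            continue
        counts = stats[entry["tool"]]
        counts[1] += 1
        if entry.get("success"):
            counts[0] += 1
    return {tool: ok / total for tool, (ok, total) in stats.items()}
```

The same shape, grouped by user_id and summing cost_usd, gives the cost-per-user panel.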
Advanced: Replay and Time Travel Debugging
The killer feature of good agent observability: the ability to replay a past request with identical context.
class TraceRecorder:
"""Record everything needed to replay an agent execution."""
def record(self, request_id: str):
return {
"request_id": request_id,
"timestamp": time.time(),
"user_input": self.user_input,
"system_prompt": self.system_prompt,
"tool_results": self.tool_results, # Ordered list
"llm_responses": self.llm_responses, # Each step
"context_at_each_step": self.contexts,
"final_response": self.response,
}
class TraceReplayer:
"""Replay a past request, optionally with modifications."""
def replay(self, trace: dict, modifications: dict = None):
"""Replay with same inputs. Optionally change system prompt,
tool behavior, or model to test alternatives."""
config = {**trace}
if modifications:
config.update(modifications)
# Re-run with original tool outputs (deterministic replay)
# or with live tools (test if fix works)
return self.agent.run(
config["user_input"],
system_prompt=config.get("system_prompt"),
mock_tools=config.get("tool_results") if not modifications else None
)
Replay debugging lets you answer "would my fix have prevented this bug?" without waiting for the same user input to happen again.
Observability Anti-Patterns
1. Logging Everything
Full prompt/response logging for every request will blow up your storage costs and create a PII liability. Log metadata by default, full payloads only for sampled requests or errors.
2. No Sampling
At scale, trace 100% of errors but sample successful requests. 10% sampling for successful requests gives you enough data without the cost.
# Sampling strategy
import random

def should_trace_full(request) -> bool:
if request.is_error:
return True # Always trace errors
if request.cost_usd > 0.10:
return True # Always trace expensive requests
if request.latency_ms > 10000:
return True # Always trace slow requests
return random.random() < 0.10 # 10% sample for normal requests
3. Alerting on Symptoms, Not Causes
"Latency is high" is a symptom. "Model API latency p95 exceeds 10s" is a cause. Alert on specific, actionable metrics.
4. No Baseline
You can't spot anomalies without knowing what's normal. Run your agent for a week, establish baselines for latency, cost, steps, and token usage, then set alerts relative to those baselines.
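Once you have a week of samples, a baseline-relative threshold is one line of statistics. A sketch using mean plus N standard deviations (percentile-based thresholds are more robust to outliers):

```python
import statistics

def baseline_threshold(samples, sigmas=3.0):
    """Alert threshold: baseline mean plus `sigmas` standard deviations."""
    return statistics.fmean(samples) + sigmas * statistics.pstdev(samples)

week_of_p95_latencies = [1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1]  # seconds, illustrative
alert_above = baseline_threshold(week_of_p95_latencies)
```

Recompute the baseline periodically; an agent whose prompt or toolset changed last week has a new "normal."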
Checklist: Agent Observability Essentials
- Tracing: Every request gets a trace with spans for LLM calls, tool calls, and decisions
- Logging: Structured JSON logs with request_id, costs, tokens, latency
- Metrics: Latency (p50/p95/p99), cost per request, tokens per request, error rate, tool success rate
- Cost tracking: Per-request, daily, and monthly cost with budget alerts
- Error categorization: LLM errors vs tool errors vs guardrail blocks vs user errors
- Dashboard: Overview + deep dive + tools + cost panels
- Alerts: Latency spikes, cost overruns, error rate increase, guardrail trigger rate
- Replay: Ability to reproduce past requests for debugging
- Sampling: Full traces for errors, sampled for success
Want to stay current on agent observability tools and practices? AI Agents Weekly covers production patterns, new tools, and real-world debugging stories 3x/week.
Conclusion
Observability is the difference between "the agent seems to work" and "the agent provably works." Without traces, you're guessing. Without structured logs, you're grepping. Without metrics, you're reacting instead of preventing.
Start with the basics: structured logging with request IDs and cost tracking. Add tracing when you need to debug specific failures. Build dashboards when you have enough data to establish baselines. The investment pays off the first time you diagnose a production issue in minutes instead of hours.
Your agent's reliability is only as good as your ability to see what it's doing.