Your AI agent works in development. It passes tests. You deploy it. Then a user reports: "It gave me a completely wrong answer." Now what?
Without observability, debugging an AI agent is like debugging a web app with no logs — impossible. You can't see which tools it called, what the LLM returned at each step, why it chose one path over another, or where the reasoning broke down.
This guide covers everything you need to make your AI agent observable: what to trace, how to structure logs, which tools to use, and how to build dashboards that actually help you debug production issues.
Why Agent Observability Is Different
Traditional application monitoring tracks request/response pairs. AI agent observability needs to track multi-step reasoning chains where each step involves an LLM call, a tool invocation, or a decision point.
| Traditional App | AI Agent |
|---|---|
| Deterministic flow | Non-deterministic (LLM decides the path) |
| Fixed number of steps | Variable steps (1 to 50+) |
| Errors are clear | Errors can be subtle (correct format, wrong content) |
| Latency is predictable | Latency varies 10x based on reasoning path |
| Cost is fixed per request | Cost varies based on tokens consumed |
| One service call | Multiple LLM + tool calls per request |
You need three pillars of observability for agents: traces (the full execution path), logs (what happened at each step), and metrics (aggregate performance data).
Pillar 1: Distributed Tracing for Agents
A trace captures the full lifecycle of a single agent request — every LLM call, tool invocation, and decision point.
Trace Structure
A typical agent trace looks like this:
Request: "What were our sales last quarter?"
│
├── [Span] LLM Decision (420ms, 850 tokens)
│ └── Decision: Call tool "query_database"
│
├── [Span] Tool: query_database (180ms)
│ ├── Input: SELECT SUM(amount) FROM sales WHERE quarter='Q4-2025'
│ └── Output: {"total": 1247500}
│
├── [Span] LLM Decision (380ms, 620 tokens)
│ └── Decision: Call tool "format_currency"
│
├── [Span] Tool: format_currency (2ms)
│ └── Output: "$1,247,500"
│
└── [Span] LLM Response (290ms, 430 tokens)
└── "Your sales last quarter were $1,247,500..."
Total: 1,272ms | 1,900 tokens | $0.008 | 5 spans
Implementation with OpenTelemetry
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Setup
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-agent")
class TracedAgent:
def run(self, user_input: str) -> str:
with tracer.start_as_current_span("agent_request") as span:
span.set_attribute("user.input", user_input[:200])
span.set_attribute("agent.version", "1.2.0")
steps = 0
total_tokens = 0
while True:
steps += 1
# Trace LLM decision
with tracer.start_as_current_span("llm_decision") as llm_span:
decision = self.llm.decide(user_input, self.context)
llm_span.set_attribute("llm.model", "gpt-4o")
llm_span.set_attribute("llm.tokens.input", decision.input_tokens)
llm_span.set_attribute("llm.tokens.output", decision.output_tokens)
llm_span.set_attribute("llm.decision_type", decision.type)
total_tokens += decision.total_tokens
if decision.type == "respond":
span.set_attribute("agent.steps", steps)
span.set_attribute("agent.total_tokens", total_tokens)
span.set_attribute("agent.cost_usd", total_tokens * 0.000003)
return decision.content
# Trace tool execution
with tracer.start_as_current_span("tool_call") as tool_span:
tool_span.set_attribute("tool.name", decision.tool)
tool_span.set_attribute("tool.input", str(decision.args)[:500])
result = self.tools.execute(decision.tool, decision.args)
tool_span.set_attribute("tool.output", str(result)[:500])
tool_span.set_attribute("tool.success", result.success)
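The flat 0.000003-per-token rate in the cost attribute above is a simplification: input and output tokens are priced differently. A small helper keeps the arithmetic in one place (the rates below are illustrative placeholders, not current pricing):

```python
# Illustrative per-token USD rates; real pricing varies by model and changes often.
RATES = {
    "gpt-4o": {"input": 0.0000025, "output": 0.00001},
}

def estimate_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Estimate request cost in USD from token counts, using assumed rates."""
    rate = RATES[model]
    return round(tokens_in * rate["input"] + tokens_out * rate["output"], 6)
```

Record the result as both a span attribute and a log field so traces and logs agree on cost.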
Pillar 2: Structured Logging
Traces show the flow. Logs capture the details. For AI agents, structured JSON logs are essential — you'll need to filter, aggregate, and search them programmatically.
What to Log at Each Step
| Event | Required Fields | Optional Fields |
|---|---|---|
| Request received | request_id, user_id, input (truncated) | session_id, source |
| LLM call | model, tokens_in, tokens_out, latency_ms, decision | temperature, prompt_hash |
| Tool call | tool_name, input, output, success, latency_ms | retry_count, error_type |
| Guardrail triggered | guardrail_name, reason, action_taken | input_that_triggered, severity |
| Response sent | request_id, latency_total_ms, total_tokens, cost_usd | user_satisfaction |
| Error | error_type, error_message, step, stack_trace | recovery_action |
Implementation
import json
import logging
import time
from uuid import uuid4
class AgentLogger:
def __init__(self):
self.logger = logging.getLogger("agent")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(message)s"))
self.logger.addHandler(handler)
self.logger.setLevel(logging.INFO)
def _log(self, event: str, **kwargs):
entry = {
"timestamp": time.time(),
"event": event,
**kwargs
}
self.logger.info(json.dumps(entry))
def request_start(self, request_id: str, user_input: str):
self._log("request_start",
request_id=request_id,
input_preview=user_input[:200],
input_length=len(user_input))
def llm_call(self, request_id: str, model: str, tokens_in: int,
tokens_out: int, latency_ms: int, decision: str):
self._log("llm_call",
request_id=request_id,
model=model,
tokens_in=tokens_in,
tokens_out=tokens_out,
latency_ms=latency_ms,
decision_type=decision,
cost_usd=round((tokens_in * 0.000003 + tokens_out * 0.000015), 6))
def tool_call(self, request_id: str, tool: str, success: bool,
latency_ms: int, error: str = None):
self._log("tool_call",
request_id=request_id,
tool=tool,
success=success,
latency_ms=latency_ms,
error=error)
def request_end(self, request_id: str, total_ms: int,
total_tokens: int, steps: int, cost_usd: float):
self._log("request_end",
request_id=request_id,
total_ms=total_ms,
total_tokens=total_tokens,
steps=steps,
cost_usd=cost_usd)
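The payoff of JSON logs is that you can query them programmatically. A sketch that totals per-request spend from a stream of log lines (field names match the logger above):

```python
import json
from collections import defaultdict

def cost_by_request(lines):
    """Sum cost_usd per request_id across llm_call log entries."""
    totals = defaultdict(float)
    for line in lines:
        entry = json.loads(line)
        if entry.get("event") == "llm_call":
            totals[entry["request_id"]] += entry.get("cost_usd", 0.0)
    return dict(totals)

logs = [
    '{"event": "llm_call", "request_id": "r1", "cost_usd": 0.004}',
    '{"event": "llm_call", "request_id": "r1", "cost_usd": 0.002}',
    '{"event": "tool_call", "request_id": "r1", "success": true}',
]
print(cost_by_request(logs))
```

The same pattern works for latency, steps, or error counts, whether you run it as a script or translate it into your log platform's query language.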
Pillar 3: Metrics and Dashboards
Metrics give you the bird's-eye view. While traces help you debug individual requests, metrics tell you how your agent is performing overall.
Essential Agent Metrics
| Metric | Type | Alert Threshold |
|---|---|---|
| Request latency (p50, p95, p99) | Histogram | p95 > 30s |
| Tokens per request | Histogram | p99 > 10,000 |
| Cost per request | Histogram | p99 > $0.50 |
| Steps per request | Histogram | Mean > 8 |
| Tool success rate | Counter | < 95% |
| LLM error rate | Counter | > 2% |
| Guardrail trigger rate | Counter | > 10% |
| Daily cost | Gauge | > budget * 0.8 |
| Requests per minute | Counter | Spike > 3x baseline |
Prometheus Metrics Example
from prometheus_client import Histogram, Counter, Gauge
# Latency
agent_latency = Histogram(
"agent_request_duration_seconds",
"Time to complete an agent request",
buckets=[0.5, 1, 2, 5, 10, 20, 30, 60]
)
# Cost
agent_cost = Histogram(
"agent_request_cost_usd",
"Cost per agent request in USD",
buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
)
# Token usage
agent_tokens = Histogram(
"agent_tokens_total",
"Total tokens per request",
["model"],
buckets=[100, 500, 1000, 2000, 5000, 10000]
)
# Tool calls
tool_calls = Counter(
"agent_tool_calls_total",
"Total tool calls",
["tool_name", "status"] # status: success, failure, blocked
)
# Daily spend
daily_spend = Gauge(
"agent_daily_spend_usd",
"Running total spend for today"
)
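Prometheus computes percentiles from the histograms above; for quick offline checks against raw latency samples, a nearest-rank percentile is enough. A dependency-free sketch:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: fine for spot checks, not a histogram replacement."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

latencies = [0.8, 1.2, 0.9, 4.5, 1.1, 0.7, 12.0, 1.0, 0.9, 1.3]  # seconds
```

Comparing p50 against p95 on the same sample quickly shows whether slowness is systemic or a long tail.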
Observability Tools Compared
You don't have to build everything from scratch. Here are the main tools for agent observability in 2026:
| Tool | Best For | Price | Key Feature |
|---|---|---|---|
| LangSmith | LangChain agents | Free tier + $39/mo | Full trace visualization, playground replay |
| Langfuse | Any framework (open source) | Free (self-host) / $59/mo | Open source, prompt management |
| Arize Phoenix | LLM evaluation | Free (open source) | Embedding visualization, eval workflows |
| OpenTelemetry + Jaeger | Custom agents | Free (open source) | Standard protocol, any backend |
| Helicone | LLM API proxy | Free tier + $20/mo | Zero-code setup, cost tracking |
| Braintrust | Evals + observability | Free tier + usage | Eval-first, CI integration |
| Datadog LLM Observability | Enterprise | Contact sales | Full APM integration, RBAC |
Quick Setup: Langfuse (Open Source)
# pip install langfuse
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
langfuse = Langfuse(
public_key="pk-...",
secret_key="sk-...",
host="https://cloud.langfuse.com" # or self-hosted
)
@observe() # Automatically traces this function
def run_agent(user_input: str) -> str:
langfuse_context.update_current_observation(
input=user_input,
metadata={"version": "1.2.0"}
)
# LLM call — automatically captured
decision = call_llm(user_input)
# Tool call: nested @observe() functions show up as child spans
@observe(name="tool_call")
def traced_tool(tool, args):
    langfuse_context.update_current_observation(metadata={"tool": tool})
    return execute_tool(tool, args)

result = traced_tool(decision.tool, decision.args)
response = generate_response(result)
langfuse_context.update_current_observation(
output=response,
usage={"total_tokens": 1500}
)
return response
Debugging Common Agent Failures
Here are the most common production issues and how observability helps you diagnose them:
1. Wrong Answer (Hallucination)
Symptom: Agent returns confident but incorrect information.
Debug with traces: Check the tool call outputs. Did the tool return correct data? If yes, the LLM misinterpreted it. If no, the tool query was wrong (often a hallucinated SQL query or wrong API parameter).
# Look for in your logs:
# 1. Tool output vs final response — do they match?
# 2. Did the LLM call the right tool?
# 3. Were the tool arguments correct?
# Common fix: Add output verification step
verification = llm.generate(f"""
Given this tool output: {tool_result}
And this response draft: {agent_response}
Does the response accurately reflect the data? YES/NO + explanation
""")
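Before spending an LLM call on verification, a cheap heuristic can catch obvious mismatches: check that the numbers in the tool output actually appear in the drafted response. A sketch (it only checks one direction and ignores formatting beyond thousands separators):

```python
import re

def numbers_match(tool_output: str, response: str) -> bool:
    """True if every number in the tool output also appears in the response."""
    numbers = re.findall(r"\d[\d,.]*", tool_output)
    normalized = response.replace(",", "")
    return all(n.replace(",", "") in normalized for n in numbers)
```

When this returns False, escalate to the LLM verification step; when True, you may still want to sample-verify, since matching digits don't guarantee correct interpretation.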
2. Infinite Loop
Symptom: Request never completes, high token usage.
Debug with traces: Look at the steps count. If it's near your max, check the last 5 actions — you'll usually see a pattern: the agent tries tool A, gets an error, tries tool A again with slightly different args, gets the same error, etc.
# Prevention: keep an ordered list of actions and check for repeats
if len(actions) >= 10 and actions[-5:] == actions[-10:-5]:  # same 5 actions twice in a row
    logger.warning("Loop detected", pattern=actions[-5:])
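That window comparison generalizes to a small helper you can unit-test (a sketch; the action list would hold tool names or (tool, args) tuples in practice):

```python
def is_looping(actions, window=5):
    """True when the last `window` actions exactly repeat the `window` before them."""
    if len(actions) < 2 * window:
        return False
    return actions[-window:] == actions[-2 * window:-window]
```

When it fires, either abort the request or inject a "respond with what you have" instruction rather than burning more tokens.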
3. Slow Responses
Symptom: Latency spikes to 30s+.
Debug with traces: Look at the span durations. Is one LLM call taking 15s (model congestion)? Is a tool call timing out? Is the agent taking too many steps?
The fix depends on what's slow:
- LLM slow: Route simple steps to a faster model
- Tool slow: Add timeouts and cache repeated results
- Too many steps: Tighten the system prompt, add few-shot examples
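For the tool-timeout fix, a wrapper that caps how long the agent waits is a few lines of standard library (a sketch; the timeout and fallback values are policy decisions):

```python
import concurrent.futures
import time

def call_with_timeout(fn, *args, timeout_s=5.0, fallback=None):
    """Run a tool call with a hard timeout so one hung tool can't stall the agent."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback  # note: the worker thread keeps running in the background
    finally:
        pool.shutdown(wait=False)

slow_result = call_with_timeout(time.sleep, 0.3, timeout_s=0.05, fallback="timed out")
```

Log every timeout with the tool name so the dashboard's tool panel surfaces chronically slow tools.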
4. Unexpected Tool Usage
Symptom: Agent calls tools it shouldn't, or calls them with wrong arguments.
Debug with traces: Check the LLM decision span. What was in the context when it decided to call that tool? Usually it's a prompt injection, ambiguous user input, or missing tool description.
Building an Agent Debug Dashboard
Here's what an effective agent dashboard should show:
Overview Panel
- Requests per minute (last 24h graph)
- p50/p95 latency (last 24h graph)
- Error rate (target: < 2%)
- Daily cost (running total vs budget)
Deep Dive Panel
- Slowest requests (clickable to trace view)
- Most expensive requests (token usage breakdown)
- Failed requests (error categorization)
- Guardrail triggers (which guardrails fire most)
Tool Performance Panel
- Success rate per tool
- Average latency per tool
- Most called tools (shows agent behavior patterns)
- Tool errors by type
Cost Panel
- Cost per request histogram
- Cost breakdown by model
- Daily/weekly/monthly spend trends
- Cost per user (identify expensive patterns)
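Several of these panels reduce to simple aggregations over the structured logs. A sketch computing per-tool success rate, assuming the tool_call fields from the logging section:

```python
from collections import defaultdict

def tool_success_rates(entries):
    """Per-tool success rate from parsed tool_call log entries."""
    stats = defaultdict(lambda: [0, 0])  # tool -> [successes, total]
    for entry in entries:
        if entry.get("event") != "tool_call":
            continue
        counts = stats[entry["tool"]]
        counts[1] += 1
        if entry.get("success"):
            counts[0] += 1
    return {tool: ok / total for tool, (ok, total) in stats.items()}
```

The same shape, grouped by user_id and summing cost_usd, gives the cost-per-user panel.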
Advanced: Replay and Time Travel Debugging
The killer feature of good agent observability: the ability to replay a past request with identical context.
class TraceRecorder:
"""Record everything needed to replay an agent execution."""
def record(self, request_id: str):
return {
"request_id": request_id,
"timestamp": time.time(),
"user_input": self.user_input,
"system_prompt": self.system_prompt,
"tool_results": self.tool_results, # Ordered list
"llm_responses": self.llm_responses, # Each step
"context_at_each_step": self.contexts,
"final_response": self.response,
}
class TraceReplayer:
"""Replay a past request, optionally with modifications."""
def replay(self, trace: dict, modifications: dict = None):
"""Replay with same inputs. Optionally change system prompt,
tool behavior, or model to test alternatives."""
config = {**trace}
if modifications:
config.update(modifications)
# Re-run with original tool outputs (deterministic replay)
# or with live tools (test if fix works)
return self.agent.run(
config["user_input"],
system_prompt=config.get("system_prompt"),
mock_tools=config.get("tool_results") if not modifications else None
)
Replay debugging lets you answer "would my fix have prevented this bug?" without waiting for the same user input to happen again.
Observability Anti-Patterns
1. Logging Everything
Full prompt/response logging for every request will blow up your storage costs and create a PII liability. Log metadata by default, full payloads only for sampled requests or errors.
2. No Sampling
At scale, trace 100% of errors but sample successful requests. 10% sampling for successful requests gives you enough data without the cost.
# Sampling strategy
import random

def should_trace_full(request) -> bool:
if request.is_error:
return True # Always trace errors
if request.cost_usd > 0.10:
return True # Always trace expensive requests
if request.latency_ms > 10000:
return True # Always trace slow requests
return random.random() < 0.10 # 10% sample for normal requests
3. Alerting on Symptoms, Not Causes
"Latency is high" is a symptom. "Model API latency p95 exceeds 10s" is a cause. Alert on specific, actionable metrics.
4. No Baseline
You can't spot anomalies without knowing what's normal. Run your agent for a week, establish baselines for latency, cost, steps, and token usage, then set alerts relative to those baselines.
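Once you have a week of samples, a baseline-relative threshold is one line of statistics. A sketch using mean plus N standard deviations (percentile-based thresholds are more robust to outliers):

```python
import statistics

def baseline_threshold(samples, sigmas=3.0):
    """Alert threshold: baseline mean plus `sigmas` standard deviations."""
    return statistics.fmean(samples) + sigmas * statistics.pstdev(samples)

week_of_p95_latencies = [1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1]  # seconds, illustrative
alert_above = baseline_threshold(week_of_p95_latencies)
```

Recompute the baseline periodically; an agent whose prompt or toolset changed last week has a new "normal."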
Checklist: Agent Observability Essentials
- Tracing: Every request gets a trace with spans for LLM calls, tool calls, and decisions
- Logging: Structured JSON logs with request_id, costs, tokens, latency
- Metrics: Latency (p50/p95/p99), cost per request, tokens per request, error rate, tool success rate
- Cost tracking: Per-request, daily, and monthly cost with budget alerts
- Error categorization: LLM errors vs tool errors vs guardrail blocks vs user errors
- Dashboard: Overview + deep dive + tools + cost panels
- Alerts: Latency spikes, cost overruns, error rate increase, guardrail trigger rate
- Replay: Ability to reproduce past requests for debugging
- Sampling: Full traces for errors, sampled for success
Want to stay current on agent observability tools and practices? AI Agents Weekly covers production patterns, new tools, and real-world debugging stories 3x/week.
Conclusion
Observability is the difference between "the agent seems to work" and "the agent provably works." Without traces, you're guessing. Without structured logs, you're grepping. Without metrics, you're reacting instead of preventing.
Start with the basics: structured logging with request IDs and cost tracking. Add tracing when you need to debug specific failures. Build dashboards when you have enough data to establish baselines. The investment pays off the first time you diagnose a production issue in minutes instead of hours.
Your agent's reliability is only as good as your ability to see what it's doing.