AI Agent Observability: Tracing, Logging & Debugging in Production (2026 Guide)

Mar 27, 2026 • 14 min read • By Paxrel

Your AI agent works in development. It passes tests. You deploy it. Then a user reports: "It gave me a completely wrong answer." Now what?

Without observability, debugging an AI agent is like debugging a web app with no logs — impossible. You can't see which tools it called, what the LLM returned at each step, why it chose one path over another, or where the reasoning broke down.

This guide covers everything you need to make your AI agent observable: what to trace, how to structure logs, which tools to use, and how to build dashboards that actually help you debug production issues.

Why Agent Observability Is Different

Traditional application monitoring tracks request/response pairs. AI agent observability needs to track multi-step reasoning chains where each step involves an LLM call, a tool invocation, or a decision point.

Traditional App | AI Agent
Deterministic flow | Non-deterministic (LLM decides the path)
Fixed number of steps | Variable steps (1 to 50+)
Errors are clear | Errors can be subtle (correct format, wrong content)
Latency is predictable | Latency varies 10x based on reasoning path
Cost is fixed per request | Cost varies based on tokens consumed
One service call | Multiple LLM + tool calls per request

You need three pillars of observability for agents: traces (the full execution path), logs (what happened at each step), and metrics (aggregate performance data).

Pillar 1: Distributed Tracing for Agents

A trace captures the full lifecycle of a single agent request — every LLM call, tool invocation, and decision point.

Trace Structure

A typical agent trace looks like this:

Request: "What were our sales last quarter?"
│
├── [Span] LLM Decision (420ms, 850 tokens)
│   └── Decision: Call tool "query_database"
│
├── [Span] Tool: query_database (180ms)
│   ├── Input: SELECT SUM(amount) FROM sales WHERE quarter='Q4-2025'
│   └── Output: {"total": 1247500}
│
├── [Span] LLM Decision (380ms, 620 tokens)
│   └── Decision: Call tool "format_currency"
│
├── [Span] Tool: format_currency (2ms)
│   └── Output: "$1,247,500"
│
└── [Span] LLM Response (290ms, 430 tokens)
    └── "Your sales last quarter were $1,247,500..."

Total: 1,272ms | 1,900 tokens | $0.008 | 5 spans
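The cost line in that summary is just token counts times per-token rates. A minimal helper, assuming illustrative GPT-4o-class pricing of $3 per million input tokens and $15 per million output tokens (check your provider's current price sheet):

```python
def estimate_cost_usd(tokens_in: int, tokens_out: int,
                      in_rate: float = 3.0, out_rate: float = 15.0) -> float:
    """Estimate request cost from token counts; rates are USD per million tokens."""
    return round(tokens_in * in_rate / 1e6 + tokens_out * out_rate / 1e6, 6)

cost = estimate_cost_usd(tokens_in=1500, tokens_out=400)  # ~ $0.0105
```

Attach the result to the root span, as the implementation below does, so cost is filterable per trace.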

Implementation with OpenTelemetry

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))  # plaintext gRPC for a local collector
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-agent")

class TracedAgent:
    def run(self, user_input: str) -> str:
        with tracer.start_as_current_span("agent_request") as span:
            span.set_attribute("user.input", user_input[:200])
            span.set_attribute("agent.version", "1.2.0")

            steps = 0
            total_tokens = 0

            while True:
                steps += 1

                # Trace LLM decision
                with tracer.start_as_current_span("llm_decision") as llm_span:
                    decision = self.llm.decide(user_input, self.context)
                    llm_span.set_attribute("llm.model", "gpt-4o")
                    llm_span.set_attribute("llm.tokens.input", decision.input_tokens)
                    llm_span.set_attribute("llm.tokens.output", decision.output_tokens)
                    llm_span.set_attribute("llm.decision_type", decision.type)
                    total_tokens += decision.total_tokens

                if decision.type == "respond":
                    span.set_attribute("agent.steps", steps)
                    span.set_attribute("agent.total_tokens", total_tokens)
                    span.set_attribute("agent.cost_usd", total_tokens * 0.000003)  # rough estimate at a blended per-token rate
                    return decision.content

                # Trace tool execution
                with tracer.start_as_current_span("tool_call") as tool_span:
                    tool_span.set_attribute("tool.name", decision.tool)
                    tool_span.set_attribute("tool.input", str(decision.args)[:500])
                    result = self.tools.execute(decision.tool, decision.args)
                    tool_span.set_attribute("tool.output", str(result)[:500])
                    tool_span.set_attribute("tool.success", result.success)

                # Feed the tool result back so the next LLM decision can use it
                self.context.append({"tool": decision.tool, "result": result})

Tip: Truncate inputs and outputs in span attributes to prevent trace storage from exploding. 200-500 chars is usually enough for debugging. Store full payloads only when needed for replay.

Pillar 2: Structured Logging

Traces show the flow. Logs capture the details. For AI agents, structured JSON logs are essential — you'll need to filter, aggregate, and search them programmatically.

What to Log at Each Step

Event | Required Fields | Optional Fields
Request received | request_id, user_id, input (truncated) | session_id, source
LLM call | model, tokens_in, tokens_out, latency_ms, decision | temperature, prompt_hash
Tool call | tool_name, input, output, success, latency_ms | retry_count, error_type
Guardrail triggered | guardrail_name, reason, action_taken | input_that_triggered, severity
Response sent | request_id, latency_total_ms, total_tokens, cost_usd | user_satisfaction
Error | error_type, error_message, step, stack_trace | recovery_action
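For example, an llm_call event following this schema might be serialized as (all values illustrative):

```json
{"timestamp": 1774569600.123, "event": "llm_call", "request_id": "req-8f3a", "model": "gpt-4o", "tokens_in": 850, "tokens_out": 120, "latency_ms": 420, "decision": "tool_call", "temperature": 0.2, "prompt_hash": "a1b2c3"}
```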

Implementation

import json
import logging
import time
from uuid import uuid4

class AgentLogger:
    def __init__(self):
        self.logger = logging.getLogger("agent")
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(message)s"))
        self.logger.addHandler(handler)
        self.logger.setLevel(logging.INFO)

    def _log(self, event: str, **kwargs):
        entry = {
            "timestamp": time.time(),
            "event": event,
            **kwargs
        }
        self.logger.info(json.dumps(entry))

    def request_start(self, request_id: str, user_input: str):
        self._log("request_start",
                  request_id=request_id,
                  input_preview=user_input[:200],
                  input_length=len(user_input))

    def llm_call(self, request_id: str, model: str, tokens_in: int,
                 tokens_out: int, latency_ms: int, decision: str):
        self._log("llm_call",
                  request_id=request_id,
                  model=model,
                  tokens_in=tokens_in,
                  tokens_out=tokens_out,
                  latency_ms=latency_ms,
                  decision_type=decision,
                  cost_usd=round((tokens_in * 0.000003 + tokens_out * 0.000015), 6))

    def tool_call(self, request_id: str, tool: str, success: bool,
                  latency_ms: int, error: str = None):
        self._log("tool_call",
                  request_id=request_id,
                  tool=tool,
                  success=success,
                  latency_ms=latency_ms,
                  error=error)

    def request_end(self, request_id: str, total_ms: int,
                    total_tokens: int, steps: int, cost_usd: float):
        self._log("request_end",
                  request_id=request_id,
                  total_ms=total_ms,
                  total_tokens=total_tokens,
                  steps=steps,
                  cost_usd=cost_usd)
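Because every event is a single JSON object per line, post-hoc analysis stays simple. A sketch that totals LLM cost per request from captured log lines (the sample records are illustrative, matching the llm_call event emitted above):

```python
import json
from collections import defaultdict

sample_logs = """
{"event": "llm_call", "request_id": "req-1", "cost_usd": 0.004}
{"event": "tool_call", "request_id": "req-1", "tool": "query_database"}
{"event": "llm_call", "request_id": "req-1", "cost_usd": 0.002}
{"event": "llm_call", "request_id": "req-2", "cost_usd": 0.009}
""".strip()

def cost_per_request(lines: str) -> dict:
    """Sum cost_usd from llm_call events, grouped by request_id."""
    totals = defaultdict(float)
    for line in lines.splitlines():
        entry = json.loads(line)
        if entry["event"] == "llm_call":
            totals[entry["request_id"]] += entry["cost_usd"]
    return {k: round(v, 6) for k, v in totals.items()}

print(cost_per_request(sample_logs))  # → {'req-1': 0.006, 'req-2': 0.009}
```

The same filter-and-aggregate pattern works in any log backend that understands JSON fields.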

Pillar 3: Metrics and Dashboards

Metrics give you the bird's-eye view. While traces help you debug individual requests, metrics tell you how your agent is performing overall.

Essential Agent Metrics

Metric | Type | Alert Threshold
Request latency (p50, p95, p99) | Histogram | p95 > 30s
Tokens per request | Histogram | p99 > 10,000
Cost per request | Histogram | p99 > $0.50
Steps per request | Histogram | Mean > 8
Tool success rate | Counter | < 95%
LLM error rate | Counter | > 2%
Guardrail trigger rate | Counter | > 10%
Daily cost | Gauge | > budget * 0.8
Requests per minute | Counter | Spike > 3x baseline

Prometheus Metrics Example

from prometheus_client import Histogram, Counter, Gauge

# Latency
agent_latency = Histogram(
    "agent_request_duration_seconds",
    "Time to complete an agent request",
    buckets=[0.5, 1, 2, 5, 10, 20, 30, 60]
)

# Cost
agent_cost = Histogram(
    "agent_request_cost_usd",
    "Cost per agent request in USD",
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
)

# Token usage
agent_tokens = Histogram(
    "agent_tokens_total",
    "Total tokens per request",
    ["model"],
    buckets=[100, 500, 1000, 2000, 5000, 10000]
)

# Tool calls
tool_calls = Counter(
    "agent_tool_calls_total",
    "Total tool calls",
    ["tool_name", "status"]  # status: success, failure, blocked
)

# Daily spend
daily_spend = Gauge(
    "agent_daily_spend_usd",
    "Running total spend for today"
)

Observability Tools Compared

You don't have to build everything from scratch. Here are the main tools for agent observability in 2026:

Tool | Best For | Price | Key Feature
LangSmith | LangChain agents | Free tier + $39/mo | Full trace visualization, playground replay
Langfuse | Any framework (open source) | Free (self-host) / $59/mo | Open source, prompt management
Arize Phoenix | LLM evaluation | Free (open source) | Embedding visualization, eval workflows
OpenTelemetry + Jaeger | Custom agents | Free (open source) | Standard protocol, any backend
Helicone | LLM API proxy | Free tier + $20/mo | Zero-code setup, cost tracking
Braintrust | Evals + observability | Free tier + usage | Eval-first, CI integration
Datadog LLM Observability | Enterprise | Contact sales | Full APM integration, RBAC

Quick Setup: Langfuse (Open Source)

# pip install langfuse

# The decorator SDK reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY,
# and LANGFUSE_HOST (cloud or self-hosted) from the environment.
from langfuse.decorators import observe, langfuse_context

@observe()  # Automatically traces this function
def run_agent(user_input: str) -> str:
    langfuse_context.update_current_observation(
        input=user_input,
        metadata={"version": "1.2.0"}
    )

    # LLM call; captured as a nested generation if call_llm is
    # decorated with @observe(as_type="generation")
    decision = call_llm(user_input)

    # Tool call; @observe on the helper creates a child span
    result = run_tool(decision)

    response = generate_response(result)
    langfuse_context.update_current_observation(output=response)
    return response

@observe()  # Nesting is inferred from the call stack
def run_tool(decision):
    langfuse_context.update_current_observation(
        input=decision.args,
        metadata={"tool": decision.tool}
    )
    result = execute_tool(decision.tool, decision.args)
    langfuse_context.update_current_observation(output=result)
    return result

Debugging Common Agent Failures

Here are the most common production issues and how observability helps you diagnose them:

1. Wrong Answer (Hallucination)

Symptom: Agent returns confident but incorrect information.

Debug with traces: Check the tool call outputs. Did the tool return correct data? If yes, the LLM misinterpreted it. If no, the tool query was wrong (often a hallucinated SQL query or wrong API parameter).

# Look for in your logs:
# 1. Tool output vs final response — do they match?
# 2. Did the LLM call the right tool?
# 3. Were the tool arguments correct?

# Common fix: Add output verification step
verification = llm.generate(f"""
Given this tool output: {tool_result}
And this response draft: {agent_response}
Does the response accurately reflect the data? YES/NO + explanation
""")
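The verifier's free-text reply still has to become a decision. A minimal parser, assuming the reply honors the YES/NO convention the prompt asks for (function name is my own):

```python
def verification_passed(verdict: str) -> bool:
    """True if the verifier's reply starts with YES, per the YES/NO + explanation format."""
    return verdict.strip().upper().startswith("YES")

verification_passed("YES - the response matches the tool output")  # True
verification_passed("No. The total is off by a factor of 10")      # False
```

If the check fails, log a guardrail_triggered event and re-run the response step rather than returning the draft.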

2. Infinite Loop

Symptom: Request never completes, high token usage.

Debug with traces: Look at the steps count. If it's near your max, check the last 5 actions — you'll usually see a pattern: the agent tries tool A, gets an error, tries tool A again with slightly different args, gets the same error, etc.

# Prevention: track recent actions (tool name + args) and compare windows
if len(actions) >= 10 and actions[-5:] == actions[-10:-5]:  # same 5 actions repeated
    logger.warning("Loop detected: %s", actions[-5:])
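The same window comparison as a self-contained check, generalized to any window size (function name and string actions are illustrative; real entries would be (tool, args) pairs):

```python
def detect_loop(actions: list, window: int = 5) -> bool:
    """True if the last `window` actions exactly repeat the `window` before them."""
    if len(actions) < 2 * window:
        return False
    return actions[-window:] == actions[-2 * window:-window]

history = ["search", "error"] * 6
print(detect_loop(history, window=2))  # → True
```

Run it after every step and abort (or force a "respond" decision) when it fires, before the token budget burns down.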

3. Slow Responses

Symptom: Latency spikes to 30s+.

Debug with traces: Look at the span durations. Is one LLM call taking 15s (model congestion)? Is a tool call timing out? Is the agent taking too many steps?

The fix depends on what's slow: a 15-second LLM call points to provider congestion (retry with backoff, fall back to a faster model, or stream partial output); a hanging tool call needs a timeout and possibly caching; too many steps calls for a lower step cap or clearer tool descriptions.

4. Unexpected Tool Usage

Symptom: Agent calls tools it shouldn't, or calls them with wrong arguments.

Debug with traces: Check the LLM decision span. What was in the context when it decided to call that tool? Usually it's a prompt injection, ambiguous user input, or missing tool description.
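A cheap guard this kind of incident usually motivates: validate the LLM's tool choice against an allowlist before executing, so bad decisions are logged and blocked instead of run. A sketch with hypothetical names:

```python
ALLOWED_TOOLS = {"query_database", "format_currency"}

def validate_tool_choice(tool_name: str, args) -> tuple[bool, str]:
    """Reject unknown tools before execution; the reason string goes into the logs."""
    if tool_name not in ALLOWED_TOOLS:
        return False, f"unknown tool: {tool_name}"
    if not isinstance(args, dict):
        return False, "tool args must be a dict"
    return True, "ok"

print(validate_tool_choice("delete_records", {}))  # → (False, 'unknown tool: delete_records')
```

Record every rejection as a guardrail_triggered event; a spike in that metric is often the first sign of prompt injection.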

Building an Agent Debug Dashboard

Here's what an effective agent dashboard should show:

Overview Panel

Requests per minute, error rate, p95 latency, and running daily cost: the at-a-glance health check.

Deep Dive Panel

Individual traces searchable by request_id, user, error type, and cost, with the full span tree for each request.

Tool Performance Panel

Per-tool success rate, latency, and call volume, so one failing tool stands out immediately.

Cost Panel

Daily spend against budget, cost-per-request distribution, and token usage by model.

Advanced: Replay and Time Travel Debugging

The killer feature of good agent observability: the ability to replay a past request with identical context.

class TraceRecorder:
    """Record everything needed to replay an agent execution."""

    def record(self, request_id: str):
        return {
            "request_id": request_id,
            "timestamp": time.time(),
            "user_input": self.user_input,
            "system_prompt": self.system_prompt,
            "tool_results": self.tool_results,  # Ordered list
            "llm_responses": self.llm_responses,  # Each step
            "context_at_each_step": self.contexts,
            "final_response": self.response,
        }

class TraceReplayer:
    """Replay a past request, optionally with modifications."""

    def replay(self, trace: dict, modifications: dict = None):
        """Replay with same inputs. Optionally change system prompt,
        tool behavior, or model to test alternatives."""

        config = {**trace}
        if modifications:
            config.update(modifications)

        # Re-run with original tool outputs (deterministic replay)
        # or with live tools (test if fix works)
        return self.agent.run(
            config["user_input"],
            system_prompt=config.get("system_prompt"),
            mock_tools=config.get("tool_results") if not modifications else None
        )

Replay debugging lets you answer "would my fix have prevented this bug?" without waiting for the same user input to happen again.
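The deterministic-replay branch can be implemented by serving recorded tool outputs back in their original order. A sketch of that mock layer (class name is my own):

```python
class MockTools:
    """Replays recorded tool outputs in order instead of hitting live tools."""

    def __init__(self, recorded_results: list):
        self._remaining = list(recorded_results)

    def execute(self, tool_name: str, args: dict):
        if not self._remaining:
            raise RuntimeError("replay exhausted: agent made more tool calls than the trace")
        return self._remaining.pop(0)

mock = MockTools([{"total": 1247500}, "$1,247,500"])
print(mock.execute("query_database", {}))  # → {'total': 1247500}
```

Raising on exhaustion matters: if the replayed agent asks for more tool calls than the original trace recorded, its behavior has diverged and the replay is no longer deterministic.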

Observability Anti-Patterns

1. Logging Everything

Full prompt/response logging for every request will blow up your storage costs and create a PII liability. Log metadata by default, full payloads only for sampled requests or errors.

2. No Sampling

At scale, trace 100% of errors but sample successful requests. 10% sampling for successful requests gives you enough data without the cost.

import random

# Sampling strategy
def should_trace_full(request) -> bool:
    if request.is_error:
        return True  # Always trace errors
    if request.cost_usd > 0.10:
        return True  # Always trace expensive requests
    if request.latency_ms > 10000:
        return True  # Always trace slow requests
    return random.random() < 0.10  # 10% sample for normal requests

3. Alerting on Symptoms, Not Causes

"Latency is high" is a symptom. "Model API latency p95 exceeds 10s" is a cause. Alert on specific, actionable metrics.

4. No Baseline

You can't spot anomalies without knowing what's normal. Run your agent for a week, establish baselines for latency, cost, steps, and token usage, then set alerts relative to those baselines.
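A baseline can be as simple as mean and standard deviation over a week of samples, with alerts a few sigmas out. A stdlib sketch (the 3-sigma default is a common starting point, not a rule):

```python
import statistics

def baseline_threshold(samples: list, sigmas: float = 3.0) -> float:
    """Alert threshold = mean + N standard deviations of the baseline window."""
    return statistics.mean(samples) + sigmas * statistics.stdev(samples)

week_of_latencies_ms = [1200, 1350, 1100, 1280, 1420, 1250, 1310]
threshold = baseline_threshold(week_of_latencies_ms)
is_anomalous = 1500 > threshold  # compare each new request against the baseline
```

Recompute the baseline on a rolling window so gradual drift (a new model, longer prompts) doesn't trigger permanent alerts.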

Checklist: Agent Observability Essentials

Before calling your agent production-ready, confirm you have:

- Structured JSON logs with a request_id on every event
- Traces covering every LLM call and tool invocation
- Token and cost tracking per request
- Latency, step-count, and tool success rate metrics with alert thresholds
- Sampling: 100% of errors and slow/expensive requests, ~10% of the rest
- Recorded traces you can replay, with or without modifications
- A week of baseline data before setting anomaly alerts

Want to stay current on agent observability tools and practices? AI Agents Weekly covers production patterns, new tools, and real-world debugging stories 3x/week.

Conclusion

Observability is the difference between "the agent seems to work" and "the agent provably works." Without traces, you're guessing. Without structured logs, you're grepping. Without metrics, you're reacting instead of preventing.

Start with the basics: structured logging with request IDs and cost tracking. Add tracing when you need to debug specific failures. Build dashboards when you have enough data to establish baselines. The investment pays off the first time you diagnose a production issue in minutes instead of hours.

Your agent's reliability is only as good as your ability to see what it's doing.