March 26, 2026 · 13 min read

How to Test AI Agents: A Practical Guide to Evals, Benchmarks & CI (2026)

You've built an AI agent. It works in your demo. But how do you know it'll work tomorrow? Or after you change the prompt? Or when OpenAI updates GPT-4o and your carefully tuned behavior shifts?

Testing AI agents is fundamentally different from testing traditional software. The outputs are non-deterministic, the behavior depends on external APIs, and "correct" is often subjective. But that doesn't mean you can't test them rigorously. Here's how.

Why Agent Testing Is Different

Traditional software testing relies on determinism: given input X, expect output Y. AI agents break this assumption in three ways:

  1. Non-deterministic outputs. The same prompt can produce different responses. Even with temperature=0, model updates can change behavior.
  2. Multi-step execution. Agents don't just return a response — they take actions, use tools, and make decisions across multiple steps. A bug might only appear at step 7 of a 10-step workflow.
  3. External dependencies. Agents call APIs, browse the web, execute code. Your test environment needs to handle these without hitting production systems (or racking up API bills).

The testing paradox: the more autonomous your agent, the harder it is to test. A chatbot that answers questions has a small behavior space. An agent that can write code, call APIs, and make decisions has an almost infinite one. You can't test every path; you need to test the right paths.

The 5 Levels of Agent Testing

Level 1: Unit Tests (Component Level)

Test individual components in isolation: parsers, formatters, tool handlers, prompt templates. These are deterministic and fast.

# Test your tool handlers independently
def test_search_tool_parses_results():
    raw_response = {"results": [{"title": "AI News", "url": "https://example.com"}]}
    parsed = parse_search_results(raw_response)
    assert len(parsed) == 1
    assert parsed[0]["title"] == "AI News"

def test_prompt_template_includes_context():
    template = build_prompt(
        task="Write a summary",
        context="Article about AI agents",
        constraints=["Max 200 words", "Include sources"]
    )
    assert "Article about AI agents" in template
    assert "Max 200 words" in template

What to test: Input parsing, output formatting, tool wrappers, error handling, prompt construction.

What NOT to test here: LLM responses, end-to-end workflows, agent decisions.

Level 2: Eval Tests (LLM Output Quality)

Evals are the core of agent testing. They assess whether the LLM's outputs meet your quality criteria. There are three approaches:

Exact match: For structured outputs (JSON, specific formats).

import json

def test_agent_returns_valid_json():
    response = agent.run("List the top 3 AI frameworks")
    data = json.loads(response)
    assert isinstance(data, list)
    assert len(data) == 3
    assert all("name" in item for item in data)

Rubric-based (LLM-as-judge): Use a second LLM to evaluate the first one's output.

def eval_with_judge(agent_output, task_description):
    judge_prompt = f"""Rate this agent output on a scale of 1-5 for:
    1. Accuracy: Does it correctly address the task?
    2. Completeness: Does it cover all aspects?
    3. Clarity: Is it well-organized and clear?

    Task: {task_description}
    Output: {agent_output}

    Return JSON: {{"accuracy": N, "completeness": N, "clarity": N}}"""

    scores = llm.call(judge_prompt)
    return json.loads(scores)

# In your test
result = agent.run("Explain how RAG works")
scores = eval_with_judge(result, "Explain how RAG works")
assert scores["accuracy"] >= 4
assert scores["completeness"] >= 3

Human eval: For subjective quality (tone, creativity, persuasiveness). Expensive but sometimes necessary.
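Human eval gets cheaper if you review a sample rather than every run. A minimal sketch of that workflow (the `runs` dict shape and CSV columns are assumptions for illustration, not from any framework): dump a random sample of agent runs to a CSV that reviewers score by hand.

```python
import csv
import random

def sample_for_human_review(runs, k=5, path="human_review.csv", seed=0):
    """Write a random sample of agent runs to a CSV for manual scoring.

    `runs` is a list of {"input": ..., "output": ...} dicts (hypothetical
    shape; adapt it to however you log agent runs).
    """
    random.seed(seed)  # deterministic sample so reviewers see a stable set
    sample = random.sample(runs, min(k, len(runs)))
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["input", "output", "tone_1_to_5", "notes"]
        )
        writer.writeheader()
        for run in sample:
            writer.writerow({
                "input": run["input"],
                "output": run["output"],
                "tone_1_to_5": "",  # reviewer fills these in
                "notes": "",
            })
    return sample
```

Reviewing 5-10 sampled outputs per release is usually enough to catch tone or quality drift without paying for full human evaluation.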

Level 3: Trajectory Tests (Multi-Step Behavior)

Agents don't just produce outputs — they take sequences of actions. Trajectory tests verify the agent chose the right tools, in the right order, with the right parameters.

def test_research_agent_trajectory():
    agent = ResearchAgent(tools=[search, scrape, summarize])
    result = agent.run("What's new in AI agents this week?")

    # Verify the agent used the right tools in a reasonable order
    trajectory = agent.get_trajectory()

    # Should search first
    assert trajectory[0]["tool"] == "search"
    assert "AI agents" in trajectory[0]["input"]

    # Should scrape at least 2 results
    scrape_steps = [s for s in trajectory if s["tool"] == "scrape"]
    assert len(scrape_steps) >= 2

    # Should summarize at the end
    assert trajectory[-1]["tool"] == "summarize"

    # Should complete in reasonable number of steps
    assert len(trajectory) <= 15

Key insight: Don't test for exact trajectories (too brittle). Test for properties: "used search before summarize", "scraped at least 2 sources", "completed in under 15 steps".
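These property checks tend to repeat across tests, so it can pay to factor them into small helpers. A sketch, assuming the trajectory format used above (a list of dicts with a "tool" key); the helper names are my own, not from a library:

```python
def assert_tool_order(trajectory, before, after):
    """Assert that `before` is used at least once before the first `after`."""
    first_after = next(
        (i for i, step in enumerate(trajectory) if step["tool"] == after), None
    )
    assert first_after is not None, f"agent never called {after}"
    assert any(
        step["tool"] == before for step in trajectory[:first_after]
    ), f"{before} was not called before {after}"

def assert_min_tool_uses(trajectory, tool, n):
    """Assert the agent used `tool` at least `n` times."""
    uses = sum(1 for step in trajectory if step["tool"] == tool)
    assert uses >= n, f"{tool} used {uses} times, expected >= {n}"
```

With these, the research-agent test above collapses to a few readable one-liners, and the same assertions can be reused across every multi-step workflow you test.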

Level 4: Regression Tests (Behavior Stability)

When you change a prompt, update a model, or modify a tool — does the agent still work? Regression tests catch silent breakages.

# Save golden outputs for critical scenarios
GOLDEN_TESTS = [
    {
        "input": "Summarize this article about GPT-5",
        "expected_properties": {
            "mentions_gpt5": True,
            "word_count_range": (100, 300),
            "contains_key_points": ["capabilities", "pricing", "availability"],
            "tone": "professional"
        }
    },
    {
        "input": "Draft a tweet about our latest blog post",
        "expected_properties": {
            "char_count_max": 280,
            "contains_link": True,
            "tone": "engaging"
        }
    }
]

def test_regression_suite():
    for test in GOLDEN_TESTS:
        output = agent.run(test["input"])
        props = test["expected_properties"]

        if "mentions_gpt5" in props:
            assert ("gpt-5" in output.lower()) == props["mentions_gpt5"]

        if "word_count_range" in props:
            wc = len(output.split())
            low, high = props["word_count_range"]
            assert low <= wc <= high, f"Word count {wc} outside {low}-{high}"

        if "char_count_max" in props:
            assert len(output) <= props["char_count_max"]

        if "contains_link" in props:
            assert ("http" in output) == props["contains_link"]

        if "contains_key_points" in props:
            for point in props["contains_key_points"]:
                assert point.lower() in output.lower(), f"Missing: {point}"

        # "tone" is subjective: score it with an LLM judge (Level 2)
        # rather than a string check.

Level 5: Integration Tests (End-to-End)

Full pipeline tests with real (or sandboxed) external services. These are slow and expensive, so run them sparingly.

# End-to-end test of a newsletter pipeline
def test_newsletter_pipeline_e2e():
    # Use test API keys / sandbox environment
    pipeline = NewsletterPipeline(
        scraper=RSSScraper(feeds=TEST_FEEDS),
        scorer=RelevanceScorer(model="deepseek-chat"),
        writer=NewsletterWriter(model="claude-haiku"),
        publisher=ButtondownPublisher(api_key=TEST_API_KEY, draft=True)
    )

    result = pipeline.run()

    assert result["articles_scraped"] > 0
    assert result["articles_selected"] >= 5
    assert result["newsletter_word_count"] > 500
    assert result["published"] is True  # draft mode
    assert result["cost_usd"] < 0.50  # cost guard
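
One common way to keep these slow, expensive tests out of the default run is a pytest marker (assumption: you use pytest; the marker name is arbitrary):

```ini
# pytest.ini
[pytest]
markers =
    e2e: slow, expensive end-to-end tests that hit sandboxed services
```

Then decorate the E2E tests with `@pytest.mark.e2e`, run `pytest -m "not e2e"` on every commit, and reserve `pytest -m e2e` for release branches or a weekly schedule.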

Testing Tools & Frameworks

Tool            | Type                  | Best For                                      | Cost
promptfoo       | Eval framework        | Prompt testing, LLM comparison, CI            | Free / open-source
Braintrust      | Eval platform         | Team eval workflows, logging                  | Free tier, then $50+/mo
LangSmith       | Observability + evals | LangChain agents, tracing                     | Free tier, then $39/mo
Inspect AI      | Eval framework        | Multi-step agent evals, by the UK AI Safety Institute | Free / open-source
pytest + custom | Test framework        | Unit + integration tests                      | Free
DeepEval        | Eval framework        | RAG evals, hallucination detection            | Free / open-source
Our pick: Start with promptfoo for eval testing (YAML config, easy CI integration, supports all major LLMs) and plain pytest for unit/integration tests. Add LangSmith or Braintrust when you need team collaboration and production monitoring.

Setting Up promptfoo for Agent Evals

# promptfoo.yaml
providers:
  - id: openai:gpt-4o
  - id: anthropic:claude-sonnet-4-6
  - id: deepseek:deepseek-chat

prompts:
  - "You are a research agent. {{task}}"

tests:
  - vars:
      task: "Find the top 3 AI agent frameworks in 2026"
    assert:
      - type: contains
        value: "CrewAI"
      - type: contains
        value: "LangGraph"
      - type: llm-rubric
        value: "Output lists exactly 3 frameworks with brief descriptions"
      - type: cost
        threshold: 0.05  # max $0.05 per test

  - vars:
      task: "Summarize recent news about autonomous AI agents"
    assert:
      - type: llm-rubric
        value: "Summary is factual, mentions specific products or companies, and is under 300 words"
      - type: javascript
        value: "output.split(' ').length <= 300"

# Run evals
npx promptfoo eval
npx promptfoo view  # Opens web UI with results

Cost-Aware Testing Strategy

Running evals costs real money. A full eval suite hitting GPT-4o for 100 test cases can cost $10-50 per run. Here's how to keep costs down:

Test Type                   | Run Frequency    | Cost per Run | Model
Unit tests                  | Every commit     | $0 (no LLM)  | N/A
Quick evals (10 cases)      | Every PR         | $0.50-2      | Haiku / DeepSeek
Full eval suite (100 cases) | Daily / release  | $5-20        | Mix of models
E2E integration             | Weekly / release | $10-50       | Production model

Pro tip: Use cheap models (Haiku, DeepSeek) for frequent eval runs to catch obvious regressions. Reserve expensive models (Opus, GPT-4o) for pre-release full suites.
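One way to implement that tiering is to key the eval/judge model off the pipeline stage. A sketch; the stage names and model identifiers below are illustrative assumptions, not tied to any particular SDK:

```python
import os

# Hypothetical stage -> model mapping; substitute your own identifiers.
EVAL_TIERS = {
    "pr": "claude-haiku",        # cheap, runs on every pull request
    "nightly": "deepseek-chat",  # cheap, full nightly suite
    "release": "gpt-4o",         # expensive, pre-release only
}

def judge_model():
    """Pick the eval model from the EVAL_STAGE environment variable."""
    stage = os.environ.get("EVAL_STAGE", "pr")
    return EVAL_TIERS.get(stage, EVAL_TIERS["pr"])
```

CI then sets `EVAL_STAGE=pr` on pull requests and `EVAL_STAGE=release` on release branches, so the same eval suite runs cheap by default and thorough when it matters.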

CI/CD Integration

Add agent evals to your CI pipeline so regressions are caught before deployment:

# .github/workflows/agent-tests.yml
name: Agent Tests
on: [pull_request]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest tests/unit/ -v

  quick-evals:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - run: npx promptfoo eval --config promptfoo-quick.yaml --output results.json
      - name: Check pass rate
        run: |
          PASS_RATE=$(cat results.json | jq '.results.stats.successes / .results.stats.total')
          if (( $(echo "$PASS_RATE < 0.8" | bc -l) )); then
            echo "Eval pass rate $PASS_RATE below 80% threshold"
            exit 1
          fi
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Testing Anti-Patterns to Avoid

1. Testing Exact String Matches

Bad: assert output == "The top 3 AI frameworks are CrewAI, LangGraph, and AutoGen."

Good: assert "CrewAI" in output and len(output.split()) < 100

LLM outputs vary in phrasing. Test for properties and key content, not exact strings.

2. No Cost Guards

An agent stuck in a loop can burn $100 in API calls during a test run. Always set:

  - A hard cap on spend per test run (fail the run, don't just warn)
  - A maximum step count per agent task
  - Timeouts on every tool and LLM call
  - Cheap models as the default for routine eval runs
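A cost guard can be as simple as a small budget tracker threaded through your test harness. A minimal sketch (hypothetical class; computing per-call cost from token counts is up to your provider integration):

```python
class BudgetExceeded(Exception):
    pass

class CostGuard:
    """Abort a test run once cumulative spend or step count crosses a cap."""

    def __init__(self, max_usd=1.00, max_steps=20):
        self.max_usd = max_usd
        self.max_steps = max_steps
        self.spent = 0.0
        self.steps = 0

    def charge(self, cost_usd):
        """Record one LLM/tool call; raise if either budget is blown."""
        self.spent += cost_usd
        self.steps += 1
        if self.spent > self.max_usd:
            raise BudgetExceeded(f"spent ${self.spent:.2f} > ${self.max_usd:.2f} cap")
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"{self.steps} steps > {self.max_steps} cap")
```

Call `charge()` after every model call inside the agent loop; a runaway agent then fails fast with `BudgetExceeded` instead of silently draining your API account.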

3. Testing in Production

Don't test with real customer data or production APIs. Use sandboxed environments, test API keys, and synthetic data.

4. Ignoring Flaky Tests

LLM non-determinism means some tests will be flaky. Don't ignore them; handle them:

  - Rerun flaky eval cases several times and require a majority to pass
  - Track pass rates over time instead of treating one failure as definitive
  - Tighten the prompt or the assertion when a test flakes the same way repeatedly
  - Quarantine persistently flaky cases so they don't block CI while you investigate
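One common mitigation is majority voting over repeated runs. A sketch (hypothetical helper; `check` is any zero-argument callable wrapping one eval case):

```python
def passes_majority(check, runs=3, threshold=2):
    """Run a non-deterministic check several times; pass if a majority pass.

    This trades extra LLM calls for stability, so reserve it for genuinely
    flaky assertions rather than using it to mask real regressions.
    """
    passes = sum(1 for _ in range(runs) if check())
    return passes >= threshold
```

In a test you'd write `assert passes_majority(lambda: eval_case(...))`, which converts an occasional spurious failure into a stable signal while still failing when quality actually drops.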

5. Only Testing Happy Paths

Test what happens when things go wrong:

  - A tool call fails or times out
  - An API returns malformed or empty results
  - The agent is given an ambiguous or impossible task
  - The agent hits a rate limit mid-workflow
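A failure-injection test can be as simple as swapping in a tool that always fails and asserting the agent degrades gracefully instead of crashing. A self-contained sketch (all names hypothetical):

```python
class SearchUnavailable(Exception):
    pass

def failing_search(query):
    # Injected failure: simulates the search API being down.
    raise SearchUnavailable("503 from search backend")

def run_with_fallback(search, query):
    """Hypothetical error path: fall back to a cached answer when the
    search tool fails, instead of dying mid-workflow."""
    try:
        return {"source": "live", "results": search(query)}
    except SearchUnavailable:
        return {"source": "cache", "results": []}

def test_agent_survives_search_outage():
    result = run_with_fallback(failing_search, "AI agents")
    assert result["source"] == "cache"  # degraded, but did not crash
```

The same pattern extends to rate limits and malformed responses: inject the failure at the tool boundary and assert on the agent's recovery behavior, not on the error itself.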

Real-World Testing Checklist

  - Unit tests for every parser, formatter, and tool wrapper
  - A small eval set (10-20 cases) that runs on every PR
  - Trajectory assertions for your most important multi-step workflows
  - Golden regression tests rerun before any prompt or model change
  - Cost guards (spend caps, step limits, timeouts) on every test run
  - A sandboxed environment with test API keys for integration tests
  - Eval pass-rate thresholds wired into CI

Key Takeaways

  - Agent testing is layered: unit tests, evals, trajectory tests, regression tests, and E2E integration each catch different failures
  - Test for properties, not exact strings; LLM outputs vary in phrasing
  - Use cheap models for frequent eval runs and expensive models before releases
  - Put cost guards on everything; a looping agent can burn real money in CI
  - Wire evals into CI so regressions fail the build before they reach production

Ship Agents With Confidence

Our AI Agent Playbook includes eval templates, CI configs, and testing checklists for production agents.

Get the Playbook — $29

Stay Updated on AI Agents

Testing frameworks, new eval tools, and agent best practices. 3x/week, no spam.

Subscribe to AI Agents Weekly