You've built an AI agent. It works in your demo. But how do you know it'll work tomorrow? Or after you change the prompt? Or when OpenAI updates GPT-4o and your carefully-tuned behavior shifts?
Testing AI agents is fundamentally different from testing traditional software. The outputs are non-deterministic, the behavior depends on external APIs, and "correct" is often subjective. But that doesn't mean you can't test them rigorously. Here's how.
Traditional software testing relies on determinism: given input X, expect output Y. AI agents break this assumption in three ways: outputs are non-deterministic (even at temperature=0, model updates can change behavior), behavior depends on external APIs you don't control, and "correct" is often subjective rather than binary.

Unit tests still apply, though. Test individual components in isolation: parsers, formatters, tool handlers, prompt templates. These are deterministic and fast.
```python
# Test your tool handlers independently
def test_search_tool_parses_results():
    raw_response = {"results": [{"title": "AI News", "url": "https://example.com"}]}
    parsed = parse_search_results(raw_response)
    assert len(parsed) == 1
    assert parsed[0]["title"] == "AI News"

def test_prompt_template_includes_context():
    template = build_prompt(
        task="Write a summary",
        context="Article about AI agents",
        constraints=["Max 200 words", "Include sources"],
    )
    assert "Article about AI agents" in template
    assert "Max 200 words" in template
```
What to test: Input parsing, output formatting, tool wrappers, error handling, prompt construction.
What NOT to test here: LLM responses, end-to-end workflows, agent decisions.
Evals are the core of agent testing. They assess whether the LLM's outputs meet your quality criteria. There are three approaches:
Exact match: For structured outputs (JSON, specific formats).
```python
import json

def test_agent_returns_valid_json():
    response = agent.run("List the top 3 AI frameworks")
    data = json.loads(response)
    assert isinstance(data, list)
    assert len(data) == 3
    assert all("name" in item for item in data)
```
Rubric-based (LLM-as-judge): Use a second LLM to evaluate the first one's output.
```python
import json

def eval_with_judge(agent_output, task_description):
    judge_prompt = f"""Rate this agent output on a scale of 1-5 for:
1. Accuracy: Does it correctly address the task?
2. Completeness: Does it cover all aspects?
3. Clarity: Is it well-organized and clear?

Task: {task_description}
Output: {agent_output}

Return JSON: {{"accuracy": N, "completeness": N, "clarity": N}}"""
    scores = llm.call(judge_prompt)
    return json.loads(scores)

# In your test
result = agent.run("Explain how RAG works")
scores = eval_with_judge(result, "Explain how RAG works")
assert scores["accuracy"] >= 4
assert scores["completeness"] >= 3
```
Human eval: For subjective quality (tone, creativity, persuasiveness). Expensive but sometimes necessary.
Agents don't just produce outputs — they take sequences of actions. Trajectory tests verify the agent chose the right tools, in the right order, with the right parameters.
```python
def test_research_agent_trajectory():
    agent = ResearchAgent(tools=[search, scrape, summarize])
    result = agent.run("What's new in AI agents this week?")

    # Verify the agent used the right tools in a reasonable order
    trajectory = agent.get_trajectory()

    # Should search first
    assert trajectory[0]["tool"] == "search"
    assert "AI agents" in trajectory[0]["input"]

    # Should scrape at least 2 results
    scrape_steps = [s for s in trajectory if s["tool"] == "scrape"]
    assert len(scrape_steps) >= 2

    # Should summarize at the end
    assert trajectory[-1]["tool"] == "summarize"

    # Should complete in a reasonable number of steps
    assert len(trajectory) <= 15
```
Key insight: Don't test for exact trajectories (too brittle). Test for properties: "used search before summarize", "scraped at least 2 sources", "completed in under 15 steps".
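These property checks are easy to factor into small reusable helpers. A minimal sketch, assuming each trajectory step is a dict with `"tool"` and `"input"` keys as in the example above:

```python
def tool_order(trajectory, first, second):
    """True if every use of `first` happens before every use of `second`."""
    first_idxs = [i for i, s in enumerate(trajectory) if s["tool"] == first]
    second_idxs = [i for i, s in enumerate(trajectory) if s["tool"] == second]
    return bool(first_idxs) and bool(second_idxs) and max(first_idxs) < min(second_idxs)

def tool_count(trajectory, tool):
    """How many times the agent used a given tool."""
    return sum(1 for s in trajectory if s["tool"] == tool)

# Example trajectory in the same shape the agent would record
steps = [
    {"tool": "search", "input": "AI agents"},
    {"tool": "scrape", "input": "https://example.com/a"},
    {"tool": "scrape", "input": "https://example.com/b"},
    {"tool": "summarize", "input": "..."},
]
assert tool_order(steps, "search", "summarize")  # search before summarize
assert tool_count(steps, "scrape") >= 2          # at least 2 sources scraped
assert len(steps) <= 15                          # bounded step count
```

Helpers like these keep the tests readable and make it trivial to apply the same properties across many scenarios.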
When you change a prompt, update a model, or modify a tool — does the agent still work? Regression tests catch silent breakages.
```python
# Save golden outputs for critical scenarios
GOLDEN_TESTS = [
    {
        "input": "Summarize this article about GPT-5",
        "expected_properties": {
            "mentions_gpt5": True,
            "word_count_range": (100, 300),
            "contains_key_points": ["capabilities", "pricing", "availability"],
            "tone": "professional",
        },
    },
    {
        "input": "Draft a tweet about our latest blog post",
        "expected_properties": {
            "char_count_max": 280,
            "contains_link": True,
            "tone": "engaging",
        },
    },
]

def test_regression_suite():
    for test in GOLDEN_TESTS:
        output = agent.run(test["input"])
        props = test["expected_properties"]

        if "word_count_range" in props:
            wc = len(output.split())
            low, high = props["word_count_range"]
            assert low <= wc <= high, f"Word count {wc} outside {low}-{high}"

        if "char_count_max" in props:
            assert len(output) <= props["char_count_max"]

        if "contains_key_points" in props:
            for point in props["contains_key_points"]:
                assert point.lower() in output.lower(), f"Missing: {point}"
```
Full pipeline tests with real (or sandboxed) external services. These are slow and expensive, so run them sparingly.
```python
# End-to-end test of a newsletter pipeline
def test_newsletter_pipeline_e2e():
    # Use test API keys / sandbox environment
    pipeline = NewsletterPipeline(
        scraper=RSSScraper(feeds=TEST_FEEDS),
        scorer=RelevanceScorer(model="deepseek-chat"),
        writer=NewsletterWriter(model="claude-haiku"),
        publisher=ButtondownPublisher(api_key=TEST_API_KEY, draft=True),
    )
    result = pipeline.run()

    assert result["articles_scraped"] > 0
    assert result["articles_selected"] >= 5
    assert result["newsletter_word_count"] > 500
    assert result["published"] is True  # draft mode
    assert result["cost_usd"] < 0.50    # cost guard
```
| Tool | Type | Best For | Cost |
|---|---|---|---|
| promptfoo | Eval framework | Prompt testing, LLM comparison, CI | Free / open-source |
| Braintrust | Eval platform | Team eval workflows, logging | Free tier, then $50+/mo |
| LangSmith | Observability + evals | LangChain agents, tracing | Free tier, then $39/mo |
| Inspect AI | Eval framework | Multi-step agent evals, by Anthropic & AISI | Free / open-source |
| pytest + custom | Test framework | Unit + integration tests | Free |
| DeepEval | Eval framework | RAG evals, hallucination detection | Free / open-source |
A solid starting stack: promptfoo for eval testing (YAML config, easy CI integration, supports all major LLMs) plus plain pytest for unit and integration tests. Add LangSmith or Braintrust when you need team collaboration and production monitoring.
```yaml
# promptfoo.yaml
providers:
  - id: openai:gpt-4o
  - id: anthropic:claude-sonnet-4-6
  - id: deepseek:deepseek-chat

prompts:
  - "You are a research agent. {{task}}"

tests:
  - vars:
      task: "Find the top 3 AI agent frameworks in 2026"
    assert:
      - type: contains
        value: "CrewAI"
      - type: contains
        value: "LangGraph"
      - type: llm-rubric
        value: "Output lists exactly 3 frameworks with brief descriptions"
      - type: cost
        threshold: 0.05  # max $0.05 per test
  - vars:
      task: "Summarize recent news about autonomous AI agents"
    assert:
      - type: llm-rubric
        value: "Summary is factual, mentions specific products or companies, and is under 300 words"
      - type: javascript
        value: "output.split(' ').length <= 300"
```
```bash
# Run evals
npx promptfoo eval
npx promptfoo view  # Opens web UI with results
```
Running evals costs real money. A full eval suite hitting GPT-4o for 100 test cases can cost $10-50 per run. Here's how to keep costs down:
| Test Type | Run Frequency | Cost per Run | Model |
|---|---|---|---|
| Unit tests | Every commit | $0 (no LLM) | N/A |
| Quick evals (10 cases) | Every PR | $0.50-2 | Haiku / DeepSeek |
| Full eval suite (100 cases) | Daily / release | $5-20 | Mix of models |
| E2E integration | Weekly / release | $10-50 | Production model |
Pro tip: Use cheap models (Haiku, DeepSeek) for frequent eval runs to catch obvious regressions. Reserve expensive models (Opus, GPT-4o) for pre-release full suites.
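One way to wire that tiering into your test code is to pick the judge model from an environment variable, so PR runs stay cheap and release runs get the stronger judge. A sketch; the `EVAL_TIER` variable name and model IDs here are placeholders, not fixed conventions:

```python
import os

# Cheap judge for frequent CI runs, stronger judge for release gates.
JUDGE_MODELS = {
    "quick": "deepseek-chat",  # every PR
    "full": "gpt-4o",          # pre-release suite
}

def judge_model():
    """Select the eval judge based on the EVAL_TIER env var (default: quick)."""
    tier = os.getenv("EVAL_TIER", "quick")
    return JUDGE_MODELS.get(tier, JUDGE_MODELS["quick"])
```

Your CI config then just sets `EVAL_TIER=full` on the release job and leaves it unset everywhere else.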
Add agent evals to your CI pipeline so regressions are caught before deployment:
```yaml
# .github/workflows/agent-tests.yml
name: Agent Tests
on: [pull_request]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest tests/unit/ -v

  quick-evals:
    runs-on: ubuntu-latest
    needs: unit-tests
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - run: npx promptfoo eval --config promptfoo-quick.yaml --output results.json
      - name: Check pass rate
        run: |
          PASS_RATE=$(cat results.json | jq '.results.stats.successes / .results.stats.total')
          if (( $(echo "$PASS_RATE < 0.8" | bc -l) )); then
            echo "Eval pass rate $PASS_RATE below 80% threshold"
            exit 1
          fi
```
Bad: `assert output == "The top 3 AI frameworks are CrewAI, LangGraph, and AutoGen."`
Good: `assert "CrewAI" in output and len(output.split()) < 100`
LLM outputs vary in phrasing. Test for properties and key content, not exact strings.
An agent stuck in a loop can burn $100 in API calls during a test run. Always set a hard cap on steps or iterations, a wall-clock timeout, and a per-run cost budget.
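A minimal guard wrapper can enforce both caps at once. This is a sketch: the `agent.step()` single-step API and the per-step `cost_usd` attribute are hypothetical, so adapt them to whatever your framework exposes:

```python
class BudgetExceeded(Exception):
    pass

def run_with_guards(agent, task, max_steps=15, max_cost_usd=0.50):
    """Run an agent loop with hard caps on step count and spend."""
    total_cost = 0.0
    for step_num in range(max_steps):
        step = agent.step(task)        # hypothetical single-step API
        total_cost += step.cost_usd    # hypothetical per-step cost accounting
        if total_cost > max_cost_usd:
            raise BudgetExceeded(f"Spent ${total_cost:.2f} after {step_num + 1} steps")
        if step.done:
            return step.output
    raise BudgetExceeded(f"No result within {max_steps} steps")
```

Wrap every test-suite agent invocation in a guard like this so a runaway loop fails fast instead of draining your API budget.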
Don't test with real customer data or production APIs. Use sandboxed environments, test API keys, and synthetic data.
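In practice that often means swapping real tools for stubs that return canned, synthetic data. A sketch; the tool interface here is illustrative rather than from any specific framework:

```python
class FakeSearchTool:
    """Stub search tool returning synthetic results: no network, no real data."""
    name = "search"

    def __init__(self, canned_results):
        self.canned_results = canned_results
        self.calls = []  # record inputs for trajectory-style assertions

    def run(self, query):
        self.calls.append(query)
        return {"results": self.canned_results}

fake_search = FakeSearchTool([
    {"title": "Synthetic AI News", "url": "https://example.com/fake"},
])
response = fake_search.run("AI agents this week")
assert response["results"][0]["title"] == "Synthetic AI News"
assert fake_search.calls == ["AI agents this week"]
```

Recording the calls also lets you reuse the trajectory-style assertions from earlier without ever hitting a live API.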
LLM non-determinism means some tests will be flaky. Don't ignore them — handle them: set temperature=0 where possible to reduce variance, and for evals that are still non-deterministic, run them multiple times and assert on the pass rate rather than a single result.

Test what happens when things go wrong: tool calls that fail, APIs that time out or rate-limit, and LLM outputs that don't parse. Error handling is where agents break in production.
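For evals that can't be made deterministic, a pass-rate check is more robust than a single pass/fail. A sketch, where `run_eval` stands in for any callable that returns True on a passing run:

```python
def pass_rate(run_eval, n=5):
    """Run a non-deterministic eval n times, return the fraction that passed."""
    passes = sum(1 for _ in range(n) if run_eval())
    return passes / n

# Deterministic stand-in for a flaky eval: 4 of 5 runs pass
outcomes = iter([True, True, False, True, True])
rate = pass_rate(lambda: next(outcomes), n=5)
assert rate >= 0.8
```

An 80% threshold like the one in the CI gate tolerates occasional variance while still catching real regressions.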