You've built an AI agent. It works in your demo. But how do you know it'll work tomorrow? Or after you change the prompt? Or when OpenAI updates GPT-4o and your carefully tuned behavior shifts?
Testing AI agents is fundamentally different from testing traditional software. The outputs are non-deterministic, the behavior depends on external APIs, and "correct" is often subjective. But that doesn't mean you can't test them rigorously. Here's how.
Traditional software testing relies on determinism: given input X, expect output Y. AI agents break this assumption in three ways:

1. Non-deterministic outputs: the same prompt can produce different responses, and even at temperature=0, model updates can change behavior.
2. External dependencies: behavior depends on model APIs and tools you don't control.
3. Subjective correctness: "correct" is often a judgment call, not an exact string.

Start with unit tests. Test individual components in isolation: parsers, formatters, tool handlers, prompt templates. These are deterministic and fast.
```python
# Test your tool handlers independently
def test_search_tool_parses_results():
    raw_response = {"results": [{"title": "AI News", "url": "https://example.com"}]}
    parsed = parse_search_results(raw_response)
    assert len(parsed) == 1
    assert parsed[0]["title"] == "AI News"

def test_prompt_template_includes_context():
    template = build_prompt(
        task="Write a summary",
        context="Article about AI agents",
        constraints=["Max 200 words", "Include sources"],
    )
    assert "Article about AI agents" in template
    assert "Max 200 words" in template
```
What to test: Input parsing, output formatting, tool wrappers, error handling, prompt construction.
What NOT to test here: LLM responses, end-to-end workflows, agent decisions.
Evals are the core of agent testing. They assess whether the LLM's outputs meet your quality criteria. There are three approaches:
Exact match: For structured outputs (JSON, specific formats).
```python
import json

def test_agent_returns_valid_json():
    response = agent.run("List the top 3 AI frameworks")
    data = json.loads(response)
    assert isinstance(data, list)
    assert len(data) == 3
    assert all("name" in item for item in data)
```
Rubric-based (LLM-as-judge): Use a second LLM to evaluate the first one's output.
```python
import json

def eval_with_judge(agent_output, task_description):
    judge_prompt = f"""Rate this agent output on a scale of 1-5 for:
1. Accuracy: Does it correctly address the task?
2. Completeness: Does it cover all aspects?
3. Clarity: Is it well-organized and clear?

Task: {task_description}
Output: {agent_output}

Return JSON: {{"accuracy": N, "completeness": N, "clarity": N}}"""
    scores = llm.call(judge_prompt)
    return json.loads(scores)

# In your test
result = agent.run("Explain how RAG works")
scores = eval_with_judge(result, "Explain how RAG works")
assert scores["accuracy"] >= 4
assert scores["completeness"] >= 3
```
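Judge scores are themselves LLM outputs, so they vary between runs. A common mitigation is to call the judge several times and average per criterion; a minimal sketch (`average_judge_scores` and the `judge_fn` parameter are illustrative, wrapping a judge like `eval_with_judge` above):

```python
from statistics import mean

def average_judge_scores(judge_fn, agent_output, task, runs=3):
    """Run an LLM judge several times and average each criterion's score.

    judge_fn is any callable returning a dict of criterion -> score,
    e.g. an eval_with_judge-style helper.
    """
    all_scores = [judge_fn(agent_output, task) for _ in range(runs)]
    return {c: mean(s[c] for s in all_scores) for c in all_scores[0]}
```

Averaged scores make thresholds like `accuracy >= 4` noticeably more stable, at the cost of extra judge calls.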
Human eval: For subjective quality (tone, creativity, persuasiveness). Expensive but sometimes necessary.
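A lightweight way to run human evals is to batch agent outputs into a review file that annotators fill in. A minimal sketch (the file name, field names, and JSONL format are illustrative choices):

```python
import json
from pathlib import Path

def export_for_human_review(cases, path="human_eval_batch.jsonl"):
    """Write (task, output) pairs to a JSONL file so human reviewers
    can fill in the 'rating' field in any annotation tool."""
    with Path(path).open("w") as f:
        for case in cases:
            record = {"task": case["task"], "output": case["output"], "rating": None}
            f.write(json.dumps(record) + "\n")
```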
Agents don't just produce outputs — they take sequences of actions. Trajectory tests verify the agent chose the right tools, in the right order, with the right parameters.
```python
def test_research_agent_trajectory():
    agent = ResearchAgent(tools=[search, scrape, summarize])
    result = agent.run("What's new in AI agents this week?")

    # Verify the agent used the right tools in a reasonable order
    trajectory = agent.get_trajectory()

    # Should search first
    assert trajectory[0]["tool"] == "search"
    assert "AI agents" in trajectory[0]["input"]

    # Should scrape at least 2 results
    scrape_steps = [s for s in trajectory if s["tool"] == "scrape"]
    assert len(scrape_steps) >= 2

    # Should summarize at the end
    assert trajectory[-1]["tool"] == "summarize"

    # Should complete in reasonable number of steps
    assert len(trajectory) <= 15
```
Key insight: Don't test for exact trajectories (too brittle). Test for properties: "used search before summarize", "scraped at least 2 sources", "completed in under 15 steps".
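These ordering and count properties can be factored into small helpers so every trajectory test reads the same way. A sketch, assuming each step is a dict with a "tool" key as in the test above (the helper names are illustrative):

```python
def tool_order(trajectory, first, later):
    """True if some use of `first` precedes every use of `later`."""
    tools = [step["tool"] for step in trajectory]
    if first not in tools or later not in tools:
        return False
    return tools.index(first) < min(i for i, t in enumerate(tools) if t == later)

def tool_count(trajectory, tool):
    """Number of steps that used the given tool."""
    return sum(1 for step in trajectory if step["tool"] == tool)
```

Then the assertions become declarative: `assert tool_order(traj, "search", "summarize")`, `assert tool_count(traj, "scrape") >= 2`.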
When you change a prompt, update a model, or modify a tool — does the agent still work? Regression tests catch silent breakages.
```python
# Save golden outputs for critical scenarios
GOLDEN_TESTS = [
    {
        "input": "Summarize this article about GPT-5",
        "expected_properties": {
            "mentions_gpt5": True,
            "word_count_range": (100, 300),
            "contains_key_points": ["capabilities", "pricing", "availability"],
            "tone": "professional",
        },
    },
    {
        "input": "Draft a tweet about our latest blog post",
        "expected_properties": {
            "char_count_max": 280,
            "contains_link": True,
            "tone": "engaging",
        },
    },
]

def test_regression_suite():
    for test in GOLDEN_TESTS:
        output = agent.run(test["input"])
        props = test["expected_properties"]

        if "word_count_range" in props:
            wc = len(output.split())
            low, high = props["word_count_range"]
            assert low <= wc <= high, f"Word count {wc} outside {low}-{high}"

        if "char_count_max" in props:
            assert len(output) <= props["char_count_max"]

        if "contains_key_points" in props:
            for point in props["contains_key_points"]:
                assert point.lower() in output.lower(), f"Missing: {point}"
```
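The remaining golden properties need different checks: a boolean flag like `contains_link` is a simple pattern test, while `tone` needs an LLM judge. A sketch of the link check (the regex is an assumption about what counts as a link):

```python
import re

def check_contains_link(output):
    """True if the output contains an http(s) URL."""
    return re.search(r"https?://\S+", output) is not None
```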
End-to-end tests run the full pipeline against real (or sandboxed) external services. These are slow and expensive, so run them sparingly.
```python
# End-to-end test of a newsletter pipeline
def test_newsletter_pipeline_e2e():
    # Use test API keys / sandbox environment
    pipeline = NewsletterPipeline(
        scraper=RSSScraper(feeds=TEST_FEEDS),
        scorer=RelevanceScorer(model="deepseek-chat"),
        writer=NewsletterWriter(model="claude-haiku"),
        publisher=ButtondownPublisher(api_key=TEST_API_KEY, draft=True),
    )
    result = pipeline.run()

    assert result["articles_scraped"] > 0
    assert result["articles_selected"] >= 5
    assert result["newsletter_word_count"] > 500
    assert result["published"] is True  # draft mode
    assert result["cost_usd"] < 0.50  # cost guard
```
| Tool | Type | Best For | Cost |
|---|---|---|---|
| promptfoo | Eval framework | Prompt testing, LLM comparison, CI | Free / open-source |
| Braintrust | Eval platform | Team eval workflows, logging | Free tier, then $50+/mo |
| LangSmith | Observability + evals | LangChain agents, tracing | Free tier, then $39/mo |
| Inspect AI | Eval framework | Multi-step agent evals, by the UK AI Safety Institute (AISI) | Free / open-source |
| pytest + custom | Test framework | Unit + integration tests | Free |
| DeepEval | Eval framework | RAG evals, hallucination detection | Free / open-source |
My recommendation: promptfoo for eval testing (YAML config, easy CI integration, supports all major LLMs) plus plain pytest for unit and integration tests. Add LangSmith or Braintrust when you need team collaboration and production monitoring.
```yaml
# promptfoo.yaml
providers:
  - id: openai:gpt-4o
  - id: anthropic:claude-sonnet-4-6
  - id: deepseek:deepseek-chat

prompts:
  - "You are a research agent. {{task}}"

tests:
  - vars:
      task: "Find the top 3 AI agent frameworks in 2026"
    assert:
      - type: contains
        value: "CrewAI"
      - type: contains
        value: "LangGraph"
      - type: llm-rubric
        value: "Output lists exactly 3 frameworks with brief descriptions"
      - type: cost
        threshold: 0.05  # max $0.05 per test
  - vars:
      task: "Summarize recent news about autonomous AI agents"
    assert:
      - type: llm-rubric
        value: "Summary is factual, mentions specific products or companies, and is under 300 words"
      - type: javascript
        value: "output.split(' ').length <= 300"
```
```bash
# Run evals
npx promptfoo eval
npx promptfoo view  # Opens web UI with results
```
Running evals costs real money. A full eval suite hitting GPT-4o for 100 test cases can cost $10-50 per run. Here's how to keep costs down:
| Test Type | Run Frequency | Cost per Run | Model |
|---|---|---|---|
| Unit tests | Every commit | $0 (no LLM) | N/A |
| Quick evals (10 cases) | Every PR | $0.50-2 | Haiku / DeepSeek |
| Full eval suite (100 cases) | Daily / release | $5-20 | Mix of models |
| E2E integration | Weekly / release | $10-50 | Production model |
Pro tip: Use cheap models (Haiku, DeepSeek) for frequent eval runs to catch obvious regressions. Reserve expensive models (Opus, GPT-4o) for pre-release full suites.
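One way to implement this tiering is to pick the eval model from an environment variable set by CI. A sketch (the tier names, model names, and `EVAL_TIER` variable are illustrative):

```python
import os

# Map eval tiers to models: cheap for per-PR runs, expensive for releases.
TIER_MODELS = {
    "quick": "claude-haiku",  # every PR
    "full": "gpt-4o",         # daily / pre-release
}

def eval_model():
    """Choose the eval model from the EVAL_TIER env var (default: quick)."""
    return TIER_MODELS.get(os.environ.get("EVAL_TIER", "quick"), TIER_MODELS["quick"])
```

CI would export `EVAL_TIER=quick` on pull requests and `EVAL_TIER=full` on the release workflow.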
Add agent evals to your CI pipeline so regressions are caught before deployment:
```yaml
# .github/workflows/agent-tests.yml
name: Agent Tests
on: [pull_request]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest tests/unit/ -v

  quick-evals:
    runs-on: ubuntu-latest
    needs: unit-tests
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - run: npx promptfoo eval --config promptfoo-quick.yaml --output results.json
      - name: Check pass rate
        run: |
          PASS_RATE=$(jq '.results.stats.successes / .results.stats.total' results.json)
          if (( $(echo "$PASS_RATE < 0.8" | bc -l) )); then
            echo "Eval pass rate $PASS_RATE below 80% threshold"
            exit 1
          fi
```
Bad: `assert output == "The top 3 AI frameworks are CrewAI, LangGraph, and AutoGen."`

Good: `assert "CrewAI" in output and len(output.split()) < 100`
LLM outputs vary in phrasing. Test for properties and key content, not exact strings.
An agent stuck in a loop can burn $100 in API calls during a test run. Always set:

- A maximum step count (the trajectory test above caps runs at 15 steps)
- A per-run cost budget (like the `cost_usd` guard in the E2E test)
- A timeout on every agent test
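A minimal guard inside the agent loop might look like this (a sketch; the limit values, `agent_step`, and `is_done` callables are illustrative):

```python
MAX_STEPS = 15
MAX_COST_USD = 1.00

def run_guarded(agent_step, is_done):
    """Drive an agent loop with hard step and cost limits.

    agent_step() performs one step and returns its cost in USD;
    is_done() reports whether the agent has finished.
    Returns the number of steps taken, or raises on a blown budget.
    """
    total_cost = 0.0
    for step in range(MAX_STEPS):
        total_cost += agent_step()
        if total_cost > MAX_COST_USD:
            raise RuntimeError(f"Cost budget exceeded: ${total_cost:.2f}")
        if is_done():
            return step + 1
    raise RuntimeError(f"Agent did not finish within {MAX_STEPS} steps")
```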
Don't test with real customer data or production APIs. Use sandboxed environments, test API keys, and synthetic data.
LLM non-determinism means some tests will be flaky. Don't ignore them — handle them:

- Set temperature=0 where possible to reduce output variance
- Rerun flaky evals and gate on a pass rate (e.g. 80%) instead of requiring every case to pass
- Quarantine and track persistently flaky cases instead of deleting them

Finally, test failure modes. Test what happens when things go wrong: API timeouts, rate limits, malformed tool responses, and empty search results.
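For example, a test can force a tool failure and check that the agent degrades gracefully instead of crashing mid-run. A sketch (the `call_tool_safely` wrapper and its retry/fallback behavior are illustrative, not an API from the article):

```python
def call_tool_safely(tool_fn, *args, fallback=None, retries=2):
    """Call a tool, retrying on TimeoutError and returning a fallback
    value instead of crashing the whole agent run."""
    for attempt in range(retries + 1):
        try:
            return tool_fn(*args)
        except TimeoutError:
            if attempt == retries:
                return fallback
    return fallback

def test_tool_timeout_degrades_gracefully():
    def flaky_search(query):
        raise TimeoutError("search API timed out")
    result = call_tool_safely(flaky_search, "AI agents", fallback=[])
    assert result == []  # agent gets an empty result, not a crash
```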
Our AI Agent Playbook includes eval templates, CI configs, and testing checklists for production agents. Get the Playbook ($29).

Testing frameworks, new eval tools, and agent best practices. 3x/week, no spam. Subscribe to AI Agents Weekly.