Running AI agents in production is expensive. A single autonomous agent making 50 API calls per task at $15/MTok can burn through $100/day without breaking a sweat. But here's what most people miss: 80% of those calls don't need a frontier model.
This guide covers the exact strategies we use at Paxrel to run autonomous agents for under $3/month — down from an estimated $90/month if we used GPT-4o for everything.
Before optimizing, you need to understand where the money goes. Start with what each model actually charges:
| Model | Input $/MTok | Output $/MTok | Best For |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Complex reasoning, multi-step planning |
| Claude Sonnet 4 | $3.00 | $15.00 | Coding, analysis, long-form writing |
| Claude Haiku 4.5 | $0.80 | $4.00 | Fast tasks, classification, extraction |
| GPT-4o-mini | $0.15 | $0.60 | Simple tasks, high volume |
| DeepSeek V3 | $0.27 | $1.10 | General purpose, great value |
| Gemini 2.5 Flash | $0.15 | $0.60 | High volume, long context |
| Llama 3.3 70B (self-hosted) | ~$0.05 | ~$0.05 | Predictable costs, data privacy |
The price difference between GPT-4o output ($10/MTok) and GPT-4o-mini output ($0.60/MTok) is more than 16x. That's why model routing is the single highest-impact optimization.
The idea is simple: use expensive models only when you need them. Route simple tasks to cheap models, complex tasks to frontier models.
```python
# Smart model router
def select_model(task_type, complexity=None):
    """Route tasks to the cheapest adequate model."""
    # Tier 1: Frontier models ($10-15/MTok output)
    # Only for tasks that genuinely need them
    FRONTIER_TASKS = ["complex_planning", "novel_code_generation",
                      "multi_step_reasoning", "creative_writing"]
    # Tier 2: Mid-range models ($1-4/MTok output)
    MID_TASKS = ["code_review", "summarization", "analysis",
                 "document_qa", "translation"]
    # Tier 3: Cheap models ($0.15-0.60/MTok output)
    CHEAP_TASKS = ["classification", "extraction", "formatting",
                   "simple_qa", "scoring", "tagging"]

    if task_type in FRONTIER_TASKS:
        return "claude-sonnet-4-20250514"
    elif task_type in MID_TASKS:
        return "deepseek-chat"
    else:
        return "gpt-4o-mini"

# In practice:
model = select_model("scoring")           # → gpt-4o-mini ($0.60/MTok)
model = select_model("summarization")     # → deepseek ($1.10/MTok)
model = select_model("complex_planning")  # → claude-sonnet ($15/MTok)
```
Every token in your prompt costs money. Most prompts are bloated with unnecessary context, verbose instructions, and redundant examples.
```python
# Before: 847 tokens
"""You are a helpful AI assistant that specializes in analyzing
news articles about artificial intelligence. Your task is to
read the following article and determine how relevant it is to
the topic of AI agents. Please consider factors such as whether
the article discusses autonomous agents, AI automation, agent
frameworks, or related topics. Rate the relevance on a scale
of 1 to 10, where 1 means not relevant at all and 10 means
highly relevant. Also provide a brief explanation of your rating.

Article: {article_text}

Please provide your response in JSON format with the keys
"score" and "reason"."""

# After: 127 tokens
"""Rate this article's relevance to AI agents (1-10).
Return JSON: {"score": N, "reason": "one line"}

{article_text}"""
```
That's an 85% token reduction. At 120 articles per run, that's ~86,400 fewer input tokens per pipeline execution.
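The arithmetic behind those numbers, as a quick check:

```python
# Token-savings arithmetic for the compression example above.
# 847 and 127 are the token counts quoted in the text;
# 120 is the number of articles scored per pipeline run.
before_tokens = 847
after_tokens = 127
articles_per_run = 120

reduction = (before_tokens - after_tokens) / before_tokens
tokens_saved_per_run = (before_tokens - after_tokens) * articles_per_run

print(f"{reduction:.0%} reduction")            # → 85% reduction
print(f"{tokens_saved_per_run:,} tokens/run")  # → 86,400 tokens/run
```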
A related compression tip: request structured output (`Return JSON: {format}`) instead of paragraphs explaining the format.

If your agent asks the same question twice, you're paying twice for the same answer. Caching is the easiest optimization with the highest ROI.
```python
import hashlib
import json
from pathlib import Path

class LLMCache:
    def __init__(self, cache_dir="cache/llm"):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.hits = 0
        self.misses = 0

    def _key(self, model, messages, **kwargs):
        """Deterministic cache key from request params."""
        data = json.dumps({"model": model, "messages": messages,
                           **kwargs}, sort_keys=True)
        return hashlib.sha256(data.encode()).hexdigest()[:16]

    def get(self, model, messages, **kwargs):
        key = self._key(model, messages, **kwargs)
        path = self.dir / f"{key}.json"
        if path.exists():
            self.hits += 1
            return json.loads(path.read_text())
        self.misses += 1
        return None

    def set(self, model, messages, response, **kwargs):
        key = self._key(model, messages, **kwargs)
        path = self.dir / f"{key}.json"
        path.write_text(json.dumps(response))

# Usage
cache = LLMCache()
result = cache.get(model, messages)
if result is None:
    result = call_llm(model, messages)
    cache.set(model, messages, result)

# After a week: "Cache hit rate: 34% — saved $12.50"
```
Anthropic and OpenAI prompt caching: Both providers now offer automatic prompt caching. If your system prompt is the same across calls, you get up to 90% discount on cached input tokens. This is free — just structure your prompts so the static parts come first.
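A minimal sketch of structuring a request for Anthropic-style prompt caching: the static system prompt comes first and is marked with `cache_control` so repeat calls reuse the cached prefix. The payload shape follows Anthropic's Messages API; the model id here is illustrative, and the field names should be verified against current provider docs.

```python
# Sketch: put the static part first and mark it cacheable (Anthropic-style).
# Only the article text varies per call, and it comes after the cached prefix.
STATIC_SYSTEM = "You score news articles for relevance to AI agents."

def build_request(article_text: str) -> dict:
    return {
        "model": "claude-haiku-4-5",  # illustrative model id
        "max_tokens": 100,
        "system": [{
            "type": "text",
            "text": STATIC_SYSTEM,
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }],
        "messages": [{"role": "user", "content": article_text}],
    }

req = build_request("Sample article text")
```

Because caching matches on the prompt prefix, anything dynamic (timestamps, article text, user data) must come after the static block, or every call misses the cache.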
Instead of scoring articles one at a time, batch them. One API call with 10 articles is cheaper than 10 calls with 1 article each, because you amortize the system prompt and reduce per-request overhead.
```python
# Bad: 120 API calls for 120 articles
for article in articles:
    score = call_llm(f"Score this article: {article['title']}")

# Good: 12 API calls for 120 articles (batches of 10)
def chunks(items, n):
    """Yield successive n-sized slices of a list."""
    for i in range(0, len(items), n):
        yield items[i:i + n]

for batch in chunks(articles, 10):
    titles = "\n".join(f"{i+1}. {a['title']}" for i, a in enumerate(batch))
    scores = call_llm(f"Score these articles 1-10:\n{titles}\nReturn JSON array.")

# Savings: ~60% fewer input tokens (system prompt sent 12x instead of 120x)
```
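Batched scoring needs one extra step: parsing the returned array and mapping scores back onto articles by position. A minimal sketch, assuming the model returns a plain JSON array of numbers (a length check catches the occasional malformed batch):

```python
import json

def parse_batch_scores(response_text: str, batch: list) -> list:
    """Map a JSON array of scores back onto the batch by position."""
    scores = json.loads(response_text)
    if len(scores) != len(batch):
        raise ValueError(f"expected {len(batch)} scores, got {len(scores)}")
    return [{**article, "score": score}
            for article, score in zip(batch, scores)]

# Usage
batch = [{"title": "Agents in prod"}, {"title": "New JS framework"}]
scored = parse_batch_scores("[9, 2]", batch)
# → [{'title': 'Agents in prod', 'score': 9}, {'title': 'New JS framework', 'score': 2}]
```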
Not every agent task needs the full pipeline. Build tiers of execution complexity:
```python
def handle_task(task):
    """Progressive complexity: try cheap first, escalate if needed."""
    # Tier 1: Pattern matching (free)
    if regex_match := check_patterns(task):
        return regex_match

    # Tier 2: Cheap model ($0.001)
    # (assumes call_llm returns a result carrying a confidence score)
    result = call_llm("gpt-4o-mini", task, max_tokens=100)
    if result.confidence > 0.9:
        return result

    # Tier 3: Mid model ($0.01)
    result = call_llm("deepseek-chat", task, max_tokens=500)
    if result.confidence > 0.8:
        return result

    # Tier 4: Frontier model ($0.05)
    return call_llm("claude-sonnet-4-20250514", task, max_tokens=2000)
```
Most tasks resolve at Tier 1 or 2. You only pay frontier prices for genuinely hard problems.
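To see why tiering pays off, here's the expected-cost arithmetic under an illustrative distribution. The tier shares below are assumptions for the sake of the example, not measurements; the per-tier costs come from the handler above.

```python
# Expected cost per task under an assumed tier distribution:
# (share of tasks, cost per task) for each tier.
tiers = [
    (0.60, 0.0),    # Tier 1: pattern match, free
    (0.30, 0.001),  # Tier 2: cheap model
    (0.08, 0.01),   # Tier 3: mid model
    (0.02, 0.05),   # Tier 4: frontier model
]

expected = sum(share * cost for share, cost in tiers)
all_frontier = 0.05  # what every task costs if sent straight to Tier 4

print(f"tiered: ${expected:.4f}/task vs frontier-only: ${all_frontier:.4f}/task")
print(f"~{all_frontier / expected:.0f}x cheaper")  # → ~24x cheaper
```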
Output tokens are 3-5x more expensive than input tokens. Yet most agents let models ramble. Control output length aggressively:
- Set `max_tokens` explicitly: don't let a scoring task generate 500 tokens when 20 will do.
- Use stop sequences: `"stop": ["\n\n"]` prevents unnecessary continuation.

```python
# Bad: no output control, model writes 200 tokens
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"Summarize: {article}"}]
)
# Cost: ~$0.002

# Good: controlled output
response = client.chat.completions.create(
    model="gpt-4o-mini",  # cheaper model
    messages=[{"role": "user",
               "content": f"Summarize in 1 sentence:\n{article}"}],
    max_tokens=60,
    stop=["\n"]
)
# Cost: ~$0.00005 (40x cheaper)
```
Retries are hidden cost multipliers. An agent that retries 3 times on every API error is 4x more expensive than one that fails gracefully.
```python
import time
from openai import APIError, RateLimitError  # provider SDK exceptions

def call_with_budget(model, messages, max_cost=0.01, max_retries=2):
    """API call with cost awareness."""
    for attempt in range(max_retries + 1):
        try:
            response = call_llm(model, messages)
            cost = estimate_cost(model, messages, response)
            if cost > max_cost:
                print(f"Warning: call cost ${cost:.4f} exceeds budget ${max_cost}")
            return response
        except RateLimitError:
            wait = 2 ** attempt  # exponential backoff
            time.sleep(wait)
        except (APIError, TimeoutError):
            if attempt == max_retries:
                # Fallback to cheaper model instead of retrying expensive one
                return call_llm("gpt-4o-mini", messages)
            time.sleep(1)
    return None
```
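The `estimate_cost` helper above is left undefined; here is a minimal sketch using per-MTok pricing and the token counts in the response's usage data. The `usage.prompt_tokens` / `usage.completion_tokens` attribute names follow the OpenAI chat completions response; adapt them for other SDKs.

```python
from types import SimpleNamespace

PRICING = {  # $/MTok (input, output), from the pricing table above
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-sonnet-4-20250514": (3.00, 15.00),
    "deepseek-chat": (0.27, 1.10),
}

def estimate_cost(model, messages, response):
    """Dollar cost of one call, from OpenAI-style response.usage counts."""
    in_price, out_price = PRICING.get(model, (1.0, 5.0))  # conservative default
    u = response.usage
    return (u.prompt_tokens * in_price +
            u.completion_tokens * out_price) / 1_000_000

# Usage with a stand-in response object
fake = SimpleNamespace(usage=SimpleNamespace(prompt_tokens=1_000,
                                             completion_tokens=200))
cost = estimate_cost("gpt-4o-mini", [], fake)
# 1,000 * $0.15/MTok + 200 * $0.60/MTok = $0.00027
```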
Here's the actual cost breakdown for Paxrel's autonomous newsletter agent running 3x/week:
| Pipeline Step | Model | Calls/Run | Cost/Run | Monthly |
|---|---|---|---|---|
| Scraping | None (RSS) | 11 feeds | $0.00 | $0.00 |
| Scoring (120 articles) | DeepSeek V3 | 12 batches | $0.06 | $0.72 |
| Newsletter writing | DeepSeek V3 | 1 call | $0.03 | $0.36 |
| Tweet generation | DeepSeek V3 | 1 call | $0.01 | $0.12 |
| Subject line | DeepSeek V3 | 1 call | $0.005 | $0.06 |
| Publishing | None (API) | 1 call | $0.00 | $0.00 |
| Total | | | $0.105 | $1.26 |
Add Reddit karma building (~$0.50/month), tweet scheduling (~$0.30/month), and SEO content scoring (~$0.40/month), and we're at approximately $2.50/month for a fully autonomous business agent.
You can't optimize what you don't measure. Build a simple cost tracker:
```python
import json
from datetime import datetime
from pathlib import Path

class CostTracker:
    PRICING = {  # $/MTok
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "claude-sonnet-4": {"input": 3.00, "output": 15.00},
        "deepseek-chat": {"input": 0.27, "output": 1.10},
    }

    def __init__(self, log_file="costs.jsonl"):
        self.log_file = Path(log_file)

    def log(self, model, input_tokens, output_tokens, task=""):
        pricing = self.PRICING.get(model, {"input": 1.0, "output": 5.0})
        cost = (input_tokens * pricing["input"] +
                output_tokens * pricing["output"]) / 1_000_000
        entry = {
            "timestamp": datetime.now().isoformat(),
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost_usd": round(cost, 6),
            "task": task
        }
        with open(self.log_file, "a") as f:
            f.write(json.dumps(entry) + "\n")
        return cost

    def daily_total(self):
        today = datetime.now().strftime("%Y-%m-%d")
        total = 0
        for line in open(self.log_file):
            entry = json.loads(line)
            if entry["timestamp"].startswith(today):
                total += entry["cost_usd"]
        return total
```
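Once the log exists, spotting the expensive step takes a few lines of stdlib. A sketch that aggregates the JSONL log by task label, assuming the entry format written by the tracker above (the sample filename is illustrative):

```python
import json
import collections
from pathlib import Path

def cost_by_task(log_file="costs.jsonl"):
    """Sum logged cost per task so the expensive step stands out."""
    totals = collections.defaultdict(float)
    for line in Path(log_file).read_text().splitlines():
        entry = json.loads(line)
        totals[entry.get("task", "")] += entry["cost_usd"]
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))

# Usage with a sample log file
Path("costs_sample.jsonl").write_text(
    '{"task": "scoring", "cost_usd": 0.06}\n'
    '{"task": "newsletter", "cost_usd": 0.03}\n'
    '{"task": "scoring", "cost_usd": 0.06}\n'
)
print(cost_by_task("costs_sample.jsonl"))
# → {'scoring': 0.12, 'newsletter': 0.03}
```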
- Set `max_tokens` on every call. No exceptions.
- Control output length with `max_tokens`, structured output, and brevity instructions.

The AI Agent Playbook includes cost tracking templates, model routing configs, and optimization checklists for production agents.
Get the Playbook — $29

Real cost numbers, optimization tips, and agent infrastructure insights. 3x/week.

Subscribe Free