It's 3 AM. PagerDuty fires. Your API latency is spiking. The on-call engineer wakes up, opens their laptop, checks Grafana, reads the alert, SSHs into the server, checks logs, identifies the root cause (a memory leak from the latest deploy), rolls back the deployment, verifies the fix, and goes back to sleep. Total time: 45 minutes of groggy debugging.
An AI DevOps agent does the same thing in 3 minutes. It receives the alert, correlates it with recent deployments, checks relevant logs, identifies the root cause, executes the rollback runbook, verifies the fix, and pages the human only if it can't resolve the issue automatically.
This isn't science fiction — teams are running these agents in production today. Here's how to build one.
What an AI DevOps Agent Can Handle
| Task | Manual Time | Agent Time | Automation Level |
|---|---|---|---|
| Alert triage & correlation | 10-30 min | 30 sec | Fully auto |
| Log analysis & root cause | 15-60 min | 1-2 min | Fully auto |
| Runbook execution | 10-20 min | 2-3 min | Auto with approval |
| Deployment rollback | 5-15 min | 1 min | Auto with approval |
| Scaling decisions | 5-10 min | 30 sec | Auto within limits |
| Post-incident report | 1-2 hours | 5 min | Fully auto |
| Security alert response | 30-60 min | 2-5 min | Triage auto, response manual |
Architecture: The AI SRE
Alert (PagerDuty/Grafana/Datadog)
         │
         ▼
┌──────────────────┐
│ Alert Classifier │ → Severity, category, affected service
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Context Gatherer │ → Recent deploys, related alerts, metrics, logs
└────────┬─────────┘
         │
         ▼
┌─────────────────────┐
│ Root Cause Analyzer │ → Correlate signals, identify probable cause
└────────┬────────────┘
         │
         ▼
┌──────────────────┐
│ Runbook Selector │ → Match to known resolution playbook
└────────┬─────────┘
         │
         ▼
┌─────────────────┐
│ Action Executor │ → Run remediation (with approval if needed)
└────────┬────────┘
         │
         ▼
┌──────────────┐
│ Verification │ → Confirm fix, update status, generate report
└──────────────┘
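In code, the pipeline is a straight chain: each stage enriches the incident context and hands it to the next. Here is a minimal orchestrator sketch wiring together the classes built in the steps below; `select_runbook` and `page_human` are illustrative placeholders for runbook matching and escalation, not part of any library.

```python
from datetime import timedelta

class IncidentPipeline:
    def __init__(self, classifier, log_analyzer, correlator, executor, reporter):
        self.classifier = classifier
        self.log_analyzer = log_analyzer
        self.correlator = correlator
        self.executor = executor
        self.reporter = reporter

    async def handle(self, raw_alert: dict) -> dict:
        # Stages 1-2: classify, then analyze logs around the alert window
        alert = await self.classifier.classify(raw_alert)
        window = (alert["triggered_at"] - timedelta(minutes=30), alert["triggered_at"])
        alert["log_analysis"] = await self.log_analyzer.analyze(
            alert["affected_service"], window, alert)

        # Stage 3: correlate with recent deployments
        alert["deploy_correlation"] = await self.correlator.correlate(
            alert, alert["recent_deploys"])

        # Stages 4-5: pick a runbook and execute; escalate if nothing matches
        runbook = select_runbook(alert)      # placeholder: match trigger conditions
        if runbook is None:
            return await page_human(alert)   # placeholder: escalate to on-call
        result = await self.executor.execute(runbook, alert)

        # Stage 6: verification happens inside the runbook; report regardless
        result["report"] = await self.reporter.generate_report({**alert, **result})
        return result
```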
Step 1: Alert Classification and Enrichment
Raw alerts are noisy. Your agent's first job is to classify, deduplicate, and enrich them with context.
class AlertClassifier:
    CATEGORIES = {
        "performance": ["latency", "slow", "timeout", "p99", "response time"],
        "availability": ["down", "unreachable", "5xx", "health check", "connection refused"],
        "resource": ["cpu", "memory", "disk", "oom", "out of memory", "storage"],
        "deployment": ["deploy", "rollout", "version", "release", "canary"],
        "security": ["unauthorized", "403", "brute force", "suspicious", "CVE"],
    }

    async def classify(self, alert: dict) -> dict:
        # Quick keyword classification
        text = f"{alert['title']} {alert['description']}".lower()
        category = "unknown"
        for cat, keywords in self.CATEGORIES.items():
            if any(kw in text for kw in keywords):
                category = cat
                break

        # Enrich with context
        service = self.extract_service(alert)
        enriched = {
            **alert,
            "category": category,
            "affected_service": service,
            "recent_deploys": await self.get_recent_deploys(service, hours=6),
            "related_alerts": await self.get_correlated_alerts(alert, minutes=30),
            "current_metrics": await self.get_service_metrics(service),
        }

        # Severity adjustment
        if len(enriched["related_alerts"]) > 3:
            enriched["severity"] = "critical"  # Multiple correlated alerts = serious
        return enriched

    async def get_recent_deploys(self, service: str, hours: int) -> list:
        """Check CI/CD for recent deployments to this service."""
        deploys = await github.get_deployments(service, since=f"{hours}h")
        return [{"sha": d.sha, "author": d.author, "time": d.time,
                 "message": d.message} for d in deploys]
Step 2: Intelligent Log Analysis
Logs hold the answer to most incidents, but sifting through thousands of log lines at 3 AM is where humans make mistakes. Your agent doesn't get tired.
class LogAnalyzer:
    async def analyze(self, service: str, time_range: tuple, alert: dict) -> dict:
        # Fetch relevant logs
        logs = await self.fetch_logs(
            service=service,
            start=time_range[0],
            end=time_range[1],
            level=["ERROR", "WARN", "FATAL"]
        )

        # Pattern detection: find error spikes
        error_timeline = self.build_error_timeline(logs, bucket_minutes=5)
        spike_time = self.find_spike(error_timeline)

        # Extract unique error messages
        unique_errors = self.deduplicate_errors(logs)

        # Use LLM to analyze patterns
        analysis = await self.llm.generate(f"""Analyze these error logs from service '{service}'.

Alert: {alert['title']}
Error spike detected at: {spike_time}
Recent deployments: {alert.get('recent_deploys', 'None')}

Unique errors (count, message):
{self.format_errors(unique_errors[:20])}

Questions to answer:
1. What is the most likely root cause?
2. Did this start after a deployment?
3. Is this a new error or a recurring pattern?
4. What is the blast radius (which users/features affected)?
5. Suggested remediation steps.

Be specific. Reference actual error messages and timestamps.""")

        return {
            "spike_time": spike_time,
            "unique_errors": unique_errors[:10],
            "analysis": analysis,
            "log_count": len(logs)
        }
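The `build_error_timeline` and `find_spike` helpers referenced above are plain bucketing and outlier detection; nothing model-driven is needed. A minimal sketch, assuming each log record carries a `timestamp` field holding a `datetime`:

```python
from collections import Counter
from datetime import datetime, timedelta

def build_error_timeline(logs: list[dict], bucket_minutes: int = 5) -> dict:
    """Count errors per fixed-size time bucket."""
    buckets: Counter = Counter()
    for log in logs:
        ts: datetime = log["timestamp"]
        # Round down to the start of the bucket
        bucket = ts - timedelta(
            minutes=ts.minute % bucket_minutes,
            seconds=ts.second,
            microseconds=ts.microsecond,
        )
        buckets[bucket] += 1
    return dict(sorted(buckets.items()))

def find_spike(timeline: dict, factor: float = 3.0):
    """Return the first bucket whose count exceeds `factor` times the mean."""
    if not timeline:
        return None
    mean = sum(timeline.values()) / len(timeline)
    for bucket, count in timeline.items():
        if count > factor * mean:
            return bucket
    return None
```

The 3x-over-mean threshold is deliberately crude; tune `factor` against your own alert history.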
Step 3: Automated Runbook Execution
Runbooks are step-by-step procedures for handling known incidents. Most teams have them in Confluence or Notion, gathering dust. An AI agent can execute them automatically.
class RunbookExecutor:
    def __init__(self):
        self.runbooks = self.load_runbooks()

    def load_runbooks(self) -> dict:
        return {
            "high_latency_api": {
                "trigger": {"category": "performance", "service_type": "api"},
                "steps": [
                    {"action": "check_metrics", "params": {"metric": "request_rate"}},
                    {"action": "check_metrics", "params": {"metric": "error_rate"}},
                    {"action": "check_recent_deploys", "params": {}},
                    {"action": "check_downstream_health", "params": {}},
                    {"decision": "if_recent_deploy_and_error_spike",
                     "true": "rollback_deployment",
                     "false": "scale_horizontally"},
                    {"action": "verify_recovery", "params": {"wait_seconds": 120}},
                ],
                "approval_required": False,  # Auto-execute for latency
            },
            "oom_kill": {
                "trigger": {"category": "resource", "error_pattern": "OOM"},
                "steps": [
                    {"action": "identify_pod", "params": {}},
                    {"action": "capture_heap_dump", "params": {}},
                    {"action": "restart_pod", "params": {}},
                    {"action": "increase_memory_limit", "params": {"factor": 1.5}},
                    {"action": "verify_recovery", "params": {"wait_seconds": 60}},
                ],
                "approval_required": True,  # Memory changes need approval
            },
            "deployment_rollback": {
                "trigger": {"category": "deployment"},
                "steps": [
                    {"action": "identify_bad_deploy", "params": {}},
                    {"action": "rollback_to_previous", "params": {}},
                    {"action": "verify_recovery", "params": {"wait_seconds": 180}},
                    {"action": "notify_deployer", "params": {}},
                    {"action": "create_incident_ticket", "params": {}},
                ],
                "approval_required": True,
            },
        }

    async def execute(self, runbook_name: str, context: dict) -> dict:
        runbook = self.runbooks[runbook_name]
        results = []
        for step in runbook["steps"]:
            if "decision" in step:
                # Evaluate the condition, then run the chosen branch action
                branch = step["true"] if self.evaluate(step["decision"], context) else step["false"]
                result = await self.execute_action(branch, {}, context)
            else:
                result = await self.execute_action(step["action"], step["params"], context)
            results.append(result)
            if not result["success"]:
                return {"status": "failed", "failed_at": step, "results": results}
        return {"status": "resolved", "results": results}
Step 4: Deployment Intelligence
Most incidents trace back to a recent change, so your agent should automatically check every alert against recent deployments.
import json

class DeploymentCorrelator:
    async def correlate(self, alert: dict, deploys: list) -> dict:
        if not deploys:
            return {"deployment_related": False}

        # Find deploys in the hour *before* the alert fired
        alert_time = alert["triggered_at"]
        suspect_deploys = [
            d for d in deploys
            if 0 <= (alert_time - d["time"]).total_seconds() < 3600
        ]
        if not suspect_deploys:
            return {"deployment_related": False}

        # Get the diff for suspect deploys
        for deploy in suspect_deploys:
            deploy["changes"] = await github.get_commit_diff(deploy["sha"])
            deploy["files_changed"] = len(deploy["changes"])

        # Ask LLM to assess correlation
        assessment = await self.llm.generate(f"""
Alert: {alert['title']} on {alert['service']}
Error pattern: {alert.get('root_cause', 'unknown')}

Recent deployments to this service:
{self.format_deploys(suspect_deploys)}

Could any of these deployments have caused this alert?
Assess each deployment's likelihood (high/medium/low/none) and explain why.
Output JSON: {{"most_likely_deploy": "sha or null", "confidence": "high/medium/low", "reason": "..."}}""")

        return {
            "deployment_related": True,
            "suspect_deploys": suspect_deploys,
            "assessment": json.loads(assessment)
        }
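One practical wrinkle: `json.loads(assessment)` will raise if the model wraps its JSON in prose or a markdown fence, which models routinely do. A defensive parser, as a sketch:

```python
import json
import re

def parse_llm_json(text: str) -> dict:
    """Extract the first JSON object from an LLM response, tolerating
    surrounding prose or markdown code fences."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return {"most_likely_deploy": None, "confidence": "low",
                "reason": "unparseable model output"}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {"most_likely_deploy": None, "confidence": "low",
                "reason": "malformed JSON in model output"}
```

Swapping `json.loads(assessment)` for `parse_llm_json(assessment)` keeps a chatty model from failing the whole correlation step.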
Step 5: Post-Incident Automation
After resolution, the agent generates the incident report — the task everyone hates doing.
class IncidentReporter:
    async def generate_report(self, incident: dict) -> str:
        report = await self.llm.generate(f"""Generate a post-incident report.

Incident data:
- Alert: {incident['alert']['title']}
- Severity: {incident['severity']}
- Triggered: {incident['started_at']}
- Resolved: {incident['resolved_at']}
- Duration: {incident['duration_minutes']} minutes
- Root cause: {incident['root_cause']}
- Resolution: {incident['resolution_steps']}
- Affected services: {incident['affected_services']}
- User impact: {incident.get('user_impact', 'Unknown')}
- Related deployment: {incident.get('suspect_deploy', 'None')}

Format as a standard post-incident report with sections:
1. Summary (2-3 sentences)
2. Timeline (key events with timestamps)
3. Root Cause Analysis
4. Resolution
5. Impact Assessment
6. Action Items (preventive measures)

Be factual and specific. Include actual timestamps and metrics.""")
        return report
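The report is only useful if it lands where people look. A sketch of delivery via the official `slack_sdk` client; the channel name and token handling are assumptions about your setup:

```python
import os
from slack_sdk import WebClient

def post_incident_report(report: str, incident_id: str) -> None:
    """Post the generated report to the team's incident channel."""
    client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
    client.chat_postMessage(
        channel="#incidents",  # assumed channel name
        text=f"*Post-incident report: {incident_id}*\n\n{report}",
    )
```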
Tools Your Agent Needs
| Category | Tools | Purpose |
|---|---|---|
| Monitoring | Grafana API, Datadog API, Prometheus | Read metrics, check dashboards |
| Logging | Elasticsearch, Loki, CloudWatch Logs | Search and analyze logs |
| Alerting | PagerDuty, OpsGenie, Slack | Receive alerts, update status |
| CI/CD | GitHub Actions, ArgoCD, Jenkins | Check deploys, trigger rollbacks |
| Infrastructure | Kubernetes API, AWS API, Terraform | Scale, restart, modify resources |
| Communication | Slack, Teams, email | Notify teams, request approval |
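Most of these integrations start life as read-only HTTP calls, which is exactly where the guardrails below say to start. For example, a p99 latency check against Prometheus's standard query API; the `PROMETHEUS_URL` variable and the histogram metric name are assumptions about your environment:

```python
import os
import requests

def query_p99_latency(service: str) -> float | None:
    """Read-only PromQL query via Prometheus's HTTP API."""
    query = (
        f'histogram_quantile(0.99, '
        f'rate(http_request_duration_seconds_bucket{{service="{service}"}}[5m]))'
    )
    resp = requests.get(
        f"{os.environ['PROMETHEUS_URL']}/api/v1/query",
        params={"query": query},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else None
```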
Platform Comparison: AIOps Tools
| Platform | Best For | Price | Key Feature |
|---|---|---|---|
| Shoreline.io | Automated remediation | Custom | Op-based runbook automation |
| BigPanda | Alert correlation | Custom | ML-powered alert grouping |
| Moogsoft | Noise reduction | $15/host/mo | AI alert correlation |
| PagerDuty AIOps | Existing PD users | Add-on | Intelligent triage, similar incidents |
| Custom (this guide) | Full control | $100-300/mo | Your runbooks, your rules |
Safety Guardrails for DevOps Agents
DevOps agents have access to production infrastructure. The guardrails must be strict:
- Read-first, write-later: Start with read-only access. Add write actions one at a time after proving reliability
- Blast radius limits: Agent can restart 1 pod, not the entire deployment. Scale by 2x, not 10x
- Approval gates: Rollbacks, scaling changes, and config modifications require human approval via Slack/PagerDuty
- Dry-run mode: Agent shows what it would do, human approves, then it executes
- Kill switch: One command to disable all agent write actions immediately
- Audit trail: Every action logged with timestamp, reason, and outcome
- Time boundaries: No infrastructure changes during business hours without approval
from datetime import datetime

class DevOpsGuardrails:
    MAX_SCALE_FACTOR = 2.0         # Never scale more than 2x
    MAX_RESTARTS_PER_HOUR = 5      # Prevent restart loops
    CHANGE_FREEZE_HOURS = (9, 17)  # Risky changes need a human during business hours
    REQUIRE_APPROVAL = [
        "rollback_deployment",
        "scale_service",
        "modify_config",
        "restart_service",  # After initial testing, this can be auto
    ]

    def can_execute(self, action: str, params: dict) -> tuple[bool, str]:
        # Business hours check
        hour = datetime.now().hour
        if self.CHANGE_FREEZE_HOURS[0] <= hour < self.CHANGE_FREEZE_HOURS[1]:
            if action in self.REQUIRE_APPROVAL:
                return False, "Change freeze during business hours"

        # Scale limit
        if action == "scale_service":
            if params.get("factor", 1) > self.MAX_SCALE_FACTOR:
                return False, f"Scale factor {params['factor']} exceeds max {self.MAX_SCALE_FACTOR}"

        return True, "OK"
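The kill switch and audit trail from the bullet list aren't shown in the class above. A minimal sketch of both; the flag-file path and log destination are assumptions:

```python
import json
import os
from datetime import datetime, timezone

KILL_SWITCH_FILE = "/etc/agent/disable_writes"  # assumed path

def writes_enabled() -> bool:
    """One command disables all write actions: `touch /etc/agent/disable_writes`."""
    return not os.path.exists(KILL_SWITCH_FILE)

def audit_log(action: str, params: dict, outcome: dict, reason: str) -> None:
    """Append-only JSON lines: every action with timestamp, reason, and outcome."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "params": params,
        "reason": reason,
        "outcome": outcome,
    }
    with open("/var/log/agent/audit.jsonl", "a") as f:  # assumed destination
        f.write(json.dumps(entry) + "\n")
```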
Implementation Roadmap
Don't try to build the full AI SRE in one sprint. Follow this progression:
- Week 1-2: Observer — Alert classification, log analysis, context gathering. No write actions. Agent produces reports in Slack.
- Week 3-4: Advisor — Root cause analysis, runbook recommendation. Agent suggests actions, human executes.
- Week 5-6: Assistant — Agent executes simple runbooks (restart pod, clear cache) with approval. Human handles complex cases.
- Week 7-8: Responder — Auto-execute known runbooks for recurring incidents. Approval required for new patterns.
- Month 3+: Autonomous — Full auto-remediation for known incident types. Human-in-the-loop for novel issues.
Building AI agents for DevOps and SRE? AI Agents Weekly covers infrastructure automation, AIOps tools, and production deployment patterns 3x/week. Join free.
Conclusion
The best on-call engineer is one who never has to wake up. An AI DevOps agent won't replace your SRE team, but it will handle the repetitive incidents that drain their energy — the 3 AM OOM kills, the deployment rollbacks, the "disk is 90% full" alerts that have a known fix.
Start as an observer. Prove the agent can correctly identify root causes. Then gradually give it the keys to execute. Every incident it handles autonomously is a night of sleep your team gets back. That's not just efficiency; it's quality of life.