AI Agent for Data Analysis: Automate Reports, Dashboards & Insights (2026)

Mar 27, 2026 • 13 min read • By Paxrel

Every company is drowning in data and starving for insights. The data team has a 3-week backlog. Business users wait days for a simple report. Dashboards exist but nobody updates them. Sound familiar?

An AI data analysis agent changes this equation. Instead of filing a Jira ticket and waiting, a product manager types "Show me revenue by region for Q1, compared to last year" and gets the answer in seconds — complete with a chart and the SQL query that generated it.

This guide shows you how to build one: from natural language to SQL, automated reporting, anomaly detection, and the guardrails that prevent your agent from accidentally dropping a table.

What a Data Analysis Agent Can Do

| Capability | Example | Traditional Alternative |
|---|---|---|
| Natural language to SQL | "Top 10 customers by revenue" | Write SQL manually or ask analyst |
| Automated reports | Weekly sales report every Monday 8am | Analyst builds in Excel/Sheets |
| Anomaly detection | "Revenue dropped 15% vs last week" | Someone notices in a dashboard |
| Ad-hoc exploration | "Why did churn spike in March?" | Multi-day analysis project |
| Data validation | "Are there duplicate records in orders?" | Data quality scripts |
| Chart generation | "Bar chart of MRR trend last 12 months" | Copy data to Excel, make chart |

Architecture: How It Works

User: "What was our churn rate last month?"
       │
       ▼
┌─────────────────┐
│ Schema Retrieval │ → Load relevant table schemas + column descriptions
└────────┬────────┘
         │
         ▼
┌────────────────┐
│ SQL Generation  │ → LLM generates: SELECT COUNT(CASE WHEN canceled...)
└────────┬───────┘
         │
         ▼
┌────────────────┐
│ SQL Validation  │ → Syntax check, permissions check, read-only enforcement
└────────┬───────┘
         │
         ▼
┌────────────────┐
│ Query Execution │ → Run against database (with timeout + row limit)
└────────┬───────┘
         │
         ▼
┌────────────────┐
│ Result Analysis │ → LLM interprets results, generates insight
└────────┬───────┘
         │
         ▼
  "Your churn rate was 4.2% last month, up from 3.8% the month before.
   The increase was driven by the Enterprise segment (+1.2pp)."

Step 1: Schema Context — Teaching Your Agent the Database

The agent needs to understand your database structure. Without schema context, it'll hallucinate table and column names.

Option A: Full Schema in System Prompt

For small databases (< 20 tables), include the full schema:

SCHEMA_CONTEXT = """
Database: analytics_prod (PostgreSQL 15)

Table: customers (Customer accounts)
  - id: INTEGER PRIMARY KEY
  - email: VARCHAR(255) UNIQUE
  - name: VARCHAR(255)
  - plan: ENUM('free', 'starter', 'pro', 'enterprise')
  - created_at: TIMESTAMP
  - canceled_at: TIMESTAMP NULL (NULL = active)
  - mrr_cents: INTEGER (monthly recurring revenue in cents)
  - region: VARCHAR(50) ('NA', 'EU', 'APAC', 'LATAM')

Table: orders (Purchase transactions)
  - id: INTEGER PRIMARY KEY
  - customer_id: INTEGER REFERENCES customers(id)
  - amount_cents: INTEGER
  - currency: CHAR(3) DEFAULT 'USD'
  - status: ENUM('pending', 'completed', 'refunded', 'failed')
  - created_at: TIMESTAMP

Table: events (Product usage events)
  - id: BIGINT PRIMARY KEY
  - customer_id: INTEGER REFERENCES customers(id)
  - event_type: VARCHAR(100)
  - properties: JSONB
  - created_at: TIMESTAMP
  INDEX: (customer_id, created_at), (event_type, created_at)

Common joins:
  - customers.id = orders.customer_id
  - customers.id = events.customer_id

Notes:
  - All monetary values are in cents (divide by 100 for dollars)
  - Timestamps are UTC
  - 'canceled_at IS NULL' means customer is active
"""

Option B: RAG-Based Schema Retrieval

For large databases (50+ tables), retrieve only relevant schemas:

class SchemaRetriever:
    def __init__(self, vector_store):
        self.vs = vector_store

    def index_schema(self, connection):
        """Index all tables with their descriptions and column info."""
        for table in connection.get_tables():
            doc = f"Table: {table.name}\n"
            doc += f"Description: {table.comment or 'No description'}\n"
            doc += f"Columns:\n"
            for col in table.columns:
                doc += f"  - {col.name}: {col.type} ({col.comment or ''})\n"
            doc += f"Row count: ~{table.approximate_row_count}\n"
            doc += f"Common queries: {table.frequent_query_patterns}\n"

            self.vs.add(text=doc, metadata={"table": table.name})

    def get_relevant_schema(self, question: str, top_k: int = 5) -> str:
        """Retrieve only the tables relevant to the question."""
        results = self.vs.search(question, top_k=top_k)
        return "\n\n".join([r.text for r in results])

Tip: Always include column descriptions in your schema context. "mrr_cents: INTEGER" is ambiguous — "mrr_cents: INTEGER (monthly recurring revenue in cents, divide by 100 for dollars)" prevents calculation errors.

Step 2: Natural Language to SQL

The core capability. The LLM translates a user's question into a SQL query using the schema context.

class TextToSQL:
    SYSTEM_PROMPT = """You are a SQL expert. Generate PostgreSQL queries from natural language questions.

Rules:
1. ONLY generate SELECT queries. Never INSERT, UPDATE, DELETE, DROP, or ALTER.
2. Always use table aliases for readability.
3. Include LIMIT 1000 unless the user asks for all results.
4. Use CTEs for complex queries instead of nested subqueries.
5. Format monetary values: amount_cents / 100.0 AS amount_usd
6. Use explicit date ranges, never relative dates without anchoring to CURRENT_DATE.
7. Add comments explaining non-obvious logic.

{schema}

Output ONLY the SQL query, no explanation."""

    async def generate(self, question: str, schema: str) -> str:
        response = await self.llm.generate(
            self.SYSTEM_PROMPT.format(schema=schema),
            user_message=question,
            model="gpt-4o"  # Strong model for SQL accuracy
        )
        return self.extract_sql(response)

    def extract_sql(self, text: str) -> str:
        """Extract SQL from markdown code blocks or plain text."""
        if "```sql" in text:
            return text.split("```sql")[1].split("```")[0].strip()
        if "```" in text:
            return text.split("```")[1].split("```")[0].strip()
        return text.strip()
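The extraction helper is easy to sanity-check in isolation. A standalone copy of the method, with the three response shapes it handles:

```python
def extract_sql(text: str) -> str:
    """Extract SQL from markdown code blocks or plain text."""
    if "```sql" in text:
        return text.split("```sql")[1].split("```")[0].strip()
    if "```" in text:
        return text.split("```")[1].split("```")[0].strip()
    return text.strip()

# Fenced with a language tag, fenced without one, and bare text all work:
print(extract_sql("```sql\nSELECT 1;\n```"))  # SELECT 1;
print(extract_sql("```\nSELECT 2;\n```"))     # SELECT 2;
print(extract_sql("SELECT 3;"))               # SELECT 3;
```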

Handling Ambiguity

Users are imprecise. "Show me revenue" could mean total revenue, MRR, ARR, or revenue by product. Build clarification into your agent:

CLARIFICATION_PROMPT = """Given this question and database schema, identify any ambiguities.

Question: {question}
Schema: {schema}

If the question is clear, output: {{"clear": true}}
If ambiguous, output: {{"clear": false, "ambiguities": [
  {{"issue": "description", "options": ["option1", "option2"], "default": "option1"}}
]}}"""

async def handle_ambiguity(self, question, schema):
    result = await self.llm.generate(CLARIFICATION_PROMPT.format(
        question=question, schema=schema
    ))
    parsed = json.loads(result)  # stdlib json, imported at module top

    if parsed["clear"]:
        return question  # Proceed as-is

    # Use defaults for minor ambiguities, ask the user about major ones
    for amb in parsed["ambiguities"]:
        if amb["issue"] in self.known_defaults:
            question += f" (using {amb['default']})"
        else:
            return {"needs_clarification": True, "options": amb["options"]}

    return question  # All ambiguities resolved with known defaults

Step 3: SQL Validation & Safety

Never trust LLM-generated SQL blindly. Validate before execution.

import re
import sqlparse

class SQLValidator:
    BLOCKED_KEYWORDS = [
        'INSERT', 'UPDATE', 'DELETE', 'DROP', 'ALTER', 'CREATE',
        'TRUNCATE', 'GRANT', 'REVOKE', 'EXEC', 'EXECUTE',
        'INTO OUTFILE', 'LOAD_FILE', 'PG_SLEEP'
    ]

    def validate(self, sql: str) -> tuple[bool, str]:
        # Parse the SQL
        parsed = sqlparse.parse(sql)
        if not parsed:
            return False, "Could not parse SQL"

        # Check statement type
        for statement in parsed:
            if statement.get_type() != 'SELECT':
                return False, f"Only SELECT queries allowed, got {statement.get_type()}"

        # Check for blocked keywords. Use word-boundary matching so that
        # 'CREATE' doesn't flag a column like 'created_at'
        upper_sql = sql.upper()
        for keyword in self.BLOCKED_KEYWORDS:
            if re.search(rf"\b{re.escape(keyword)}\b", upper_sql):
                return False, f"Blocked keyword detected: {keyword}"

        # Check for subquery writes (SELECT ... INTO creates a table)
        if re.search(r"\bINTO\b", upper_sql):
            return False, "SELECT INTO not allowed"

        # Ensure LIMIT exists
        if 'LIMIT' not in upper_sql:
            sql += '\nLIMIT 1000'

        return True, sql

    def add_safety_wrapper(self, sql: str, timeout_ms: int = 30000) -> str:
        """Prefix the query with a timeout and read-only mode.

        These SET statements apply to the current session, so send them
        on the same connection that runs the query.
        """
        return f"""
SET statement_timeout = {timeout_ms};
SET default_transaction_read_only = on;
{sql}
"""

Warning: Always use a read-only database user for your data agent. Even with SQL validation, defense in depth matters. Create a dedicated role: CREATE ROLE data_agent WITH LOGIN PASSWORD '...' CONNECTION LIMIT 5; then GRANT SELECT ON ALL TABLES IN SCHEMA public TO data_agent; and ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO data_agent; so tables created later are covered too.

Step 4: Query Execution & Result Formatting

import asyncpg

class QueryExecutor:
    def __init__(self, connection_string: str):
        self.conn_str = connection_string
        self.max_rows = 1000
        self.timeout_seconds = 30

    async def execute(self, sql: str) -> dict:
        conn = await asyncpg.connect(self.conn_str)
        try:
            # Set read-only mode for the whole session. (Plain
            # 'SET transaction_read_only' only lasts one transaction,
            # so it wouldn't carry over to the query below.)
            await conn.execute("SET default_transaction_read_only = on")
            await conn.execute(f"SET statement_timeout = '{self.timeout_seconds}s'")

            # Execute query
            rows = await conn.fetch(sql)

            # Format results
            if not rows:
                return {"data": [], "columns": [], "row_count": 0}

            columns = list(rows[0].keys())
            data = [dict(row) for row in rows[:self.max_rows]]

            return {
                "data": data,
                "columns": columns,
                "row_count": len(data),
                "truncated": len(rows) > self.max_rows
            }
        except asyncpg.QueryCanceledError:
            return {"error": "Query timed out (>30s). Try a more specific query."}
        except Exception as e:
            return {"error": str(e)}
        finally:
            await conn.close()

Step 5: Insight Generation

Raw query results aren't insights. The agent should interpret the data and highlight what matters.

INSIGHT_PROMPT = """You are a data analyst. Given a user's question and query results, provide a clear, actionable insight.

Question: {question}
SQL Query: {sql}
Results: {results}

Guidelines:
1. Lead with the answer to the question
2. Highlight notable trends or anomalies
3. Compare to relevant benchmarks if obvious (e.g., month-over-month)
4. Suggest follow-up questions if the data raises new questions
5. If the data is insufficient to fully answer, say what's missing
6. Format numbers: use commas for thousands, 1 decimal for percentages, $ for currency

Keep it concise — 2-4 sentences for simple queries, up to a paragraph for complex analysis."""

async def generate_insight(self, question, sql, results):
    # Format results as a readable table (first 20 rows)
    formatted = self.format_as_table(results["data"][:20], results["columns"])

    return await self.llm.generate(
        INSIGHT_PROMPT.format(
            question=question,
            sql=sql,
            results=formatted
        )
    )
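format_as_table isn't shown above; a minimal sketch, assuming rows are dicts keyed by column name, as the executor returns them:

```python
def format_as_table(rows: list[dict], columns: list[str]) -> str:
    """Render query rows as a pipe-separated text table for the prompt."""
    header = " | ".join(columns)
    lines = [header, "-" * len(header)]
    for row in rows:
        # str() keeps this safe for numbers, dates, and NULLs (None) alike
        lines.append(" | ".join(str(row.get(col, "")) for col in columns))
    return "\n".join(lines)

print(format_as_table(
    [{"region": "NA", "revenue": 1200.5}, {"region": "EU", "revenue": 980.0}],
    ["region", "revenue"],
))
```

Plain text tables work fine here; the LLM only needs to read the values, not render them.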

Step 6: Automated Reports

The most valuable feature for business teams. Schedule reports that run automatically and deliver results via email or Slack.

class ScheduledReport:
    def __init__(self, name: str, queries: list, schedule: str, recipients: list):
        self.name = name
        self.queries = queries          # List of {question, sql} pairs
        self.schedule = schedule         # Cron expression
        self.recipients = recipients     # Email or Slack channel

    async def run(self):
        results = []
        for query in self.queries:
            data = await executor.execute(query["sql"])
            insight = await generate_insight(query["question"], query["sql"], data)
            results.append({
                "question": query["question"],
                "data": data,
                "insight": insight,
                "chart": await self.generate_chart(data) if data["row_count"] > 1 else None
            })

        report = self.format_report(results)
        await self.deliver(report)

# Example: Weekly Sales Report
weekly_sales = ScheduledReport(
    name="Weekly Sales Report",
    queries=[
        {
            "question": "Total revenue this week vs last week",
            "sql": """SELECT
                SUM(CASE WHEN created_at >= CURRENT_DATE - 7 THEN amount_cents END) / 100.0 AS this_week,
                SUM(CASE WHEN created_at >= CURRENT_DATE - 14
                    AND created_at < CURRENT_DATE - 7 THEN amount_cents END) / 100.0 AS last_week
            FROM orders WHERE status = 'completed'"""
        },
        {
            "question": "Top 5 products by revenue this week",
            "sql": "SELECT product_name, SUM(amount_cents)/100.0 AS revenue..."
        },
        {
            "question": "New customers this week by region",
            "sql": "SELECT region, COUNT(*) AS new_customers..."
        }
    ],
    schedule="0 8 * * 1",  # Every Monday at 8am
    recipients=["[email protected]", "#sales-channel"]
)
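In production you'd hand the cron expression to a real scheduler (cron, Airflow, or a library like croniter). For a self-contained service, the next "Monday 8am" run time can also be computed directly; a minimal sketch equivalent to the '0 8 * * 1' expression above:

```python
from datetime import datetime, timedelta

def next_weekly_run(now: datetime, weekday: int = 0, hour: int = 8) -> datetime:
    """Next occurrence of the given weekday/hour (weekday 0 = Monday)."""
    days_ahead = (weekday - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=hour, minute=0, second=0, microsecond=0
    )
    if candidate <= now:  # already past this week's slot
        candidate += timedelta(days=7)
    return candidate

# From Friday 2026-03-27 12:00, the next run is Monday 2026-03-30 08:00
print(next_weekly_run(datetime(2026, 3, 27, 12, 0)))
```

Sleep until that timestamp in an asyncio task, run the report, then recompute.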

Step 7: Anomaly Detection

Proactive data agents don't just answer questions — they alert you when something looks wrong.

class AnomalyDetector:
    def __init__(self, executor, alert_service):
        self.executor = executor
        self.alert = alert_service

    async def check_metric(self, metric_name: str, sql: str,
                           threshold_pct: float = 20):
        """Compare current value to historical average. Alert if deviation > threshold."""
        result = await self.executor.execute(sql)
        current = result["data"][0]["current_value"]
        historical_avg = result["data"][0]["avg_value"]

        if historical_avg == 0:
            return

        deviation_pct = ((current - historical_avg) / historical_avg) * 100

        if abs(deviation_pct) > threshold_pct:
            direction = "up" if deviation_pct > 0 else "down"
            await self.alert.send(
                f"Anomaly: {metric_name} is {direction} {abs(deviation_pct):.1f}% "
                f"vs historical average. Current: {current}, Avg: {historical_avg}"
            )

# Daily anomaly checks
ANOMALY_CHECKS = [
    {
        "name": "Daily Revenue",
        "sql": """SELECT
            (SELECT SUM(amount_cents)/100.0 FROM orders
             WHERE created_at >= CURRENT_DATE AND status='completed') AS current_value,
            (SELECT AVG(daily_rev) FROM (
                SELECT DATE(created_at), SUM(amount_cents)/100.0 AS daily_rev
                FROM orders WHERE created_at >= CURRENT_DATE - 30
                AND status='completed' GROUP BY DATE(created_at)
            ) t) AS avg_value""",
        "threshold_pct": 25
    },
    {
        "name": "Signup Rate",
        "sql": "...",
        "threshold_pct": 30
    },
    {
        "name": "Error Rate",
        "sql": "...",
        "threshold_pct": 50
    }
]

Platform Comparison: Build vs Buy

| Platform | Best For | Price | Key Feature |
|---|---|---|---|
| Julius AI | Non-technical users | $20/mo | Upload CSV, ask questions |
| Defog | Text-to-SQL | Free (open source) | Fine-tuned SQL models, self-host |
| Databricks AI/BI | Enterprise analytics | Part of Databricks | Genie dashboards, Unity Catalog |
| MindsDB | In-database AI | Free (open source) | SQL-native ML predictions |
| Custom (this guide) | Full control | $50-200/mo | Your schema, your rules |
| ChatGPT + Code Interpreter | One-off analysis | $20/mo | Upload files, Python analysis |

Common Mistakes

1. No Schema Context

Without schema context, the LLM will guess table and column names. It'll generate syntactically valid SQL that references tables that don't exist. Always provide schema.

2. Trusting LLM Math

LLMs are bad at arithmetic. Never let the LLM calculate numbers from raw data — always generate SQL that does the calculation in the database. "What's the average?" should produce SELECT AVG(amount), not the LLM averaging numbers it read.
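The same principle in miniature, using an in-memory SQLite table as a stand-in for your real warehouse: the database computes the average, and the LLM only narrates the single number it gets back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount_cents INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(1000,), (2500,), (4000,)])

# Right: the database does the arithmetic, the LLM just reports it
avg_usd = conn.execute(
    "SELECT AVG(amount_cents) / 100.0 FROM orders"
).fetchone()[0]
print(avg_usd)  # 25.0

# Wrong: fetching all rows and asking the LLM to average them in-context
```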

3. No Row Limits

A user asks "Show me all orders" on a table with 50 million rows. Without LIMIT, your agent crashes or runs a query that costs $200 in compute. Always enforce limits.

4. Exposing Raw Errors

Database errors can leak schema information, connection strings, or internal details. Catch errors and return user-friendly messages: "I couldn't run that query. Could you rephrase?" not "ERROR: relation 'internal_audit_log' does not exist".
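A light-touch way to do this is to map known exception types to safe messages and swallow everything else. A sketch; the class names here follow asyncpg's exception hierarchy, but the stand-in class below is defined locally for illustration, so check the names against your actual driver:

```python
SAFE_MESSAGES = {
    "QueryCanceledError": "That query took too long. Try narrowing the date range.",
    "UndefinedTableError": "I couldn't find that data. Could you rephrase the question?",
    "InsufficientPrivilegeError": "I'm not allowed to read that data.",
}

def sanitize_error(exc: Exception) -> str:
    """Map a database exception to a user-safe message, never echoing details."""
    return SAFE_MESSAGES.get(
        type(exc).__name__,
        "I couldn't run that query. Could you rephrase?",
    )

class UndefinedTableError(Exception):  # local stand-in for the driver's class
    pass

print(sanitize_error(UndefinedTableError("relation 'internal_audit_log' does not exist")))
# I couldn't find that data. Could you rephrase the question?
```

Log the raw exception server-side for debugging; only the sanitized message goes to the user.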

5. No Query Explanation

Users need to trust the results. Always show the SQL query alongside the answer so analysts can verify. Transparency builds trust — hiding the query makes people suspicious.

Quick Start: MVP in a Day

  1. Morning: Create a read-only database user. Extract schema with descriptions.
  2. Midday: Build the text-to-SQL pipeline: schema context → LLM → SQL validation → execution → insight generation.
  3. Afternoon: Add a Slack bot or web interface. Test with 20 real questions from your team.
  4. Evening: Add 2-3 scheduled reports for the most common weekly queries.

This MVP handles 70% of ad-hoc data questions. Then iterate: improve schema descriptions based on query failures, add chart generation, build anomaly checks.

Building AI agents for data and analytics? AI Agents Weekly covers the latest tools, patterns, and real-world deployments 3x/week. Join free.

Conclusion

A data analysis agent pays for itself in the first week by eliminating the "can you pull this data for me?" Slack messages that consume analyst time. The key is getting the fundamentals right: good schema context, SQL validation, read-only access, and clear insight generation.

Start with text-to-SQL for your most-queried tables. Add scheduled reports for the questions people ask every week. Then add anomaly detection so your agent finds problems before anyone asks. The goal isn't to replace your data team — it's to free them from repetitive queries so they can do the deep analysis that actually moves the business.