How I Built an Autonomous Incident Investigation Agent That Reduced MTTR by 65%

wpnews.pro

It's 2:47 AM. Your phone buzzes and it's a P1 alert. You open your laptop, bleary-eyed, and begin the familiar ritual:

This process takes 15–45 minutes for an experienced engineer. For a junior on-call? Sometimes hours. And the cognitive overhead of context-switching between 3-4 tools while sleep-deprived leads to missed signals, false conclusions, and longer outages.

I asked myself:

So I built one. It's been running in production for months, investigating real incidents on a platform serving 30+ million end users across multiple AWS regions. We call it FRIDAY.

When a PagerDuty alert fires, FRIDAY:

The entire investigation takes under 2 minutes. The on-call engineer wakes up to a complete analysis instead of a raw alert.

┌──────────────┐     ┌────────────────┐     ┌─────────────────────┐
│  PagerDuty   │────▶│  API Gateway   │────▶│  Lambda (Sync)      │
│  Webhook     │     │  (Validate)    │     │  Parse + Self-Invoke│
└──────────────┘     └────────────────┘     └─────────┬───────────┘
                                                       │ Async
                                                       ▼
                                            ┌─────────────────────┐
                                            │  Lambda (Async)      │
                                            │  Investigation Agent │
                                            │                      │
                                            │  ┌────────────────┐ │
                                            │  │ Amazon Bedrock  │ │
                                            │  │ Claude Opus     │ │
                                            │  │ (Tool-Use Loop) │ │
                                            │  └───────┬────────┘ │
                                            │          │          │
                                            │    ┌─────┼─────┐   │
                                            │    ▼     ▼     ▼   │
                                            │ GitHub Datadog  S3  │
                                            └─────────┬───────────┘
                                                      │
                                                      ▼
                                            ┌─────────────────────┐
                                            │  Microsoft Teams    │
                                            │  (Adaptive Card)    │
                                            └─────────────────────┘

API Gateway has a 30-second hard timeout. A thorough AI investigation takes 60–180 seconds. The solution: the sync Lambda validates the webhook, parses the alert, and immediately self-invokes asynchronously returning 200 OK

to PagerDuty within 2 seconds.

lambda_client.invoke(
    FunctionName=context.function_name,
    InvocationType="Event",  # Fire and forget
    Payload=json.dumps({
        "_async_investigate": True,
        "alert_payload": alert_payload,
    }),
)
return {"statusCode": 200, "body": "Investigation started"}

The async Lambda runs the full investigation without timeout pressure.

This is counterintuitive. Most engineers and most AI systems jump straight to observability data when an alert fires. But in my experience, 80%+ of acute incidents are caused by a preceding change: a deployment, a config update, a replica count change, a memory limit modification.

FRIDAY is instructed to check GitHub before touching Datadog:

MANDATORY FIRST STEP — GitHub (Step 0):
Before touching Datadog, you MUST run these calls in parallel:
1. github_search_repos — find the repo for the alerted service
2. github_list_commits — find commits in the 2 hours before 
   the alert fired

A deployment or config change is the most likely root cause.

Our platform spans multiple AWS regions. A naive agent querying "all 5xx errors" would mix signals from healthy and unhealthy regions, producing confused analysis.

FRIDAY's first action is always to lock a target region from the alert metadata:

🌍 Region: Description — resolved from alert hostname

Every subsequent Datadog query includes:
kube_cluster_name:region-az-* (scoped to affected region only)

This eliminated an entire class of false-positive findings where the AI would cite errors from an unrelated region.

FRIDAY's output isn't freeform text. It follows a strict section contract that the Teams integration parses into visual containers:

## EXECUTIVE SUMMARY
[2-3 sentences — what happened, who's affected, what changed]

## KEY FINDINGS
[Bulleted evidence from GitHub + Datadog]

## WHAT CHANGED
[Specific commit/PR with timestamp and author]

## ERROR BREAKDOWN
[Service-by-service error counts with affected tenants]

## ROOT CAUSE
[Confirmed / Suspected / Unknown — with evidence chain]

## CUSTOMER IMPACT
[Affected tenants, operations, scope]

## RECOMMENDED ACTIONS
[Specific next steps for the on-call engineer]

The on-call engineer can glance at the Teams card and immediately know: what happened, who's affected, what likely caused it, and what to do next without reading a wall of text.

FRIDAY uses Claude's tool-use capability in a multi-round loop. The AI doesn't execute a fixed script — it reasons about each alert independently, deciding which tools to call based on what it's learned so far.

for round_num in range(MAX_TOOL_ROUNDS):  # Max 25 rounds
    response = bedrock_client.converse(
        modelId="anthropic.claude-opus",
        messages=messages,
        toolConfig={"tools": TOOL_DEFINITIONS},
    )

    if stop_reason == "tool_use":
        for tool_call in content_blocks:
            result = execute_tool(
                tool_call["name"], 
                tool_call["input"]
            )
            tool_results.append(result)
        messages.append(tool_results)

    elif stop_reason == "end_turn":
        return extract_final_report(content_blocks)

Tool	Purpose
`github_search_repos`
Find which repo owns a service
`github_list_commits`
What changed before the alert
`github_get_file`
Read actual deployment configs
`github_search_code`
Find all producers/consumers of a queue
`datadog_log_search`
Find specific error messages
`datadog_log_aggregate`
Count errors by backend/tenant/path
`datadog_query_metrics`
Queue depth, CPU, memory, latency
`datadog_get_monitor`
Understand what threshold triggered

The AI typically uses 8–15 tool calls per investigation, batching parallel calls when possible to minimize round-trip time.

A cold investigation — where the AI knows nothing about your infrastructure — is slow and imprecise. FRIDAY includes a deterministic training mode that pre-builds architectural knowledge:

def train():
    """
    Deterministic training:
    ~13 targeted API calls, then one Bedrock synthesis call.

    Collects: cluster-service maps, HAProxy backends, 
    chronic error baselines, recent planned work.
    """
    collected = {}
    for key, tool_name, tool_input in TRAINING_CALLS:
        collected[key] = execute_tool(tool_name, tool_input)

    knowledge_doc = synthesize_knowledge(collected)

    save_to_s3(knowledge_doc)

The knowledge document contains:

One of the hardest problems: distinguishing planned maintenance from real outages. During a Kubernetes cluster migration, you expect 5xx errors as traffic drains. FRIDAY handles this through:

"This alert coincides with planned cluster decommission. Errors are expected during traffic drain. No incident action required."

What happens when an investigation is complex and approaching the 25-round tool limit? FRIDAY has a graceful degradation mechanism:

if rounds_remaining <= 3:
    user_content.append({
        "text": (
            "STOP CALLING TOOLS. Write your FINAL report "
            "NOW using all data collected so far. Mark "
            "uncertain findings as 'Suspected' rather "
            "than skipping them."
        )
    })

This ensures every investigation produces a report — even if incomplete — rather than timing out silently.

PagerDuty retries webhooks. FRIDAY handles this at two levels:

After running in production for several months:

Metric	Before FRIDAY	After FRIDAY	Improvement
Mean Time to First Analysis	15–45 min	90 sec–3 min	~90% faster
MTTR (overall)	~60 min	~15 min	65% reduction
AI tool adoption (team)	20%	85%	4x increase
Alert noise (false escalations)	High	Minimal	~80% reduction
Auto-generated postmortems	0%	100% of P1/P2	Eliminated manual RCA drafts

The system prompt is the most important file in the codebase. It's not instructions — it's the agent's operating manual. Ours is ~5,000 words covering:

Invest in your prompt like you invest in your architecture docs.

Before this rule, the AI would spend 10+ rounds querying Datadog, building elaborate theories about traffic patterns — then discover a config change was merged 5 minutes before the alert. Now it finds the root cause in rounds 1-2 for ~80% of change-induced incidents.

FRIDAY is explicitly told it does NOT take remediation actions. It investigates, analyzes, and reports. A human validates and acts. This is not a limitation — it's a design choice that builds trust. When on-call engineers trust the AI's analysis, they act on it faster.

The two-Lambda pattern (sync for webhook receipt, async for investigation) is essential. Don't let API Gateway timeouts dictate your AI agent's investigation depth.

We're extending this pattern to autonomous security remediation — an agent that ingests vulnerability findings, generates IaC fixes, deploys through GitOps, verifies no impact, and requests human approval before proceeding. Same tool-use architecture, different domain.

The pattern is reproducible with:

The hard part isn't the code — it's the system prompt. That's where your SRE expertise lives. The AI is the execution engine; your knowledge of your infrastructure is what makes it useful.

The name also works as a backronym: F irst R esponder for I ncident D iagnostics and A nalY sis — but honestly, we just thought the Marvel reference was cooler.

I'm Vinothsingh Elumalai, a Platform Engineering leader building AI-native operations at enterprise scale. I lead the Platform team for a global IAM/SSO platform serving 30M+ users. Currently exploring how agentic AI transforms SRE from reactive firefighting to autonomous, closed-loop operations.

This is Part 1 of my AI-Native SRE series. Part 2 will cover JARVIS — an autonomous vulnerability remediation agent that fixes security findings through GitOps with human approval gates.

source & further reading

dev.to — original article Scaling AI Beyond the Monolith: Multi-Agent Coordination via Federated MCP Servers Databricks Lakebase: Give Your Agent a Branch, Not Your Production Database Dollars and rupees without Stripe: what building Skill Exchange's checkout taught me (PayPal + UPI)

How I Built an Autonomous Incident Investigation Agent That Reduced MTTR by 65%

Run your AI side-project on zahid.host