# How I Built an Autonomous Incident Investigation Agent That Reduced MTTR by 65%

> Source: <https://dev.to/velumal09/how-i-built-an-autonomous-incident-investigation-agent-that-reduced-mttr-by-65-42ae>
> Published: 2026-06-18 04:22:12+00:00

It's 2:47 AM. Your phone buzzes and it's a P1 alert. You open your laptop, bleary-eyed, and begin the familiar ritual:

This process takes **15–45 minutes** for an experienced engineer. For a junior on-call? Sometimes hours. And the cognitive overhead of context-switching between 3-4 tools while sleep-deprived leads to missed signals, false conclusions, and longer outages.

I asked myself:

So I built one. It's been running in production for months, investigating real incidents on a platform serving **30+ million end users** across multiple AWS regions. We call it **FRIDAY**.

When a PagerDuty alert fires, FRIDAY:

The entire investigation takes **under 2 minutes**. The on-call engineer wakes up to a complete analysis instead of a raw alert.

```
┌──────────────┐     ┌────────────────┐     ┌─────────────────────┐
│  PagerDuty   │────▶│  API Gateway   │────▶│  Lambda (Sync)      │
│  Webhook     │     │  (Validate)    │     │  Parse + Self-Invoke│
└──────────────┘     └────────────────┘     └─────────┬───────────┘
                                                       │ Async
                                                       ▼
                                            ┌─────────────────────┐
                                            │  Lambda (Async)      │
                                            │  Investigation Agent │
                                            │                      │
                                            │  ┌────────────────┐ │
                                            │  │ Amazon Bedrock  │ │
                                            │  │ Claude Opus     │ │
                                            │  │ (Tool-Use Loop) │ │
                                            │  └───────┬────────┘ │
                                            │          │          │
                                            │    ┌─────┼─────┐   │
                                            │    ▼     ▼     ▼   │
                                            │ GitHub Datadog  S3  │
                                            └─────────┬───────────┘
                                                      │
                                                      ▼
                                            ┌─────────────────────┐
                                            │  Microsoft Teams    │
                                            │  (Adaptive Card)    │
                                            └─────────────────────┘
```

API Gateway has a **30-second hard timeout**. A thorough AI investigation takes 60–180 seconds. The solution: the sync Lambda validates the webhook, parses the alert, and immediately self-invokes asynchronously returning `200 OK`

to PagerDuty within 2 seconds.

```
# Sync handler: validate, parse, self-invoke, return immediately
lambda_client.invoke(
    FunctionName=context.function_name,
    InvocationType="Event",  # Fire and forget
    Payload=json.dumps({
        "_async_investigate": True,
        "alert_payload": alert_payload,
    }),
)
return {"statusCode": 200, "body": "Investigation started"}
```

The async Lambda runs the full investigation without timeout pressure.

This is counterintuitive. Most engineers and most AI systems jump straight to observability data when an alert fires. But in my experience, **80%+ of acute incidents are caused by a preceding change**: a deployment, a config update, a replica count change, a memory limit modification.

FRIDAY is instructed to check GitHub *before* touching Datadog:

```
MANDATORY FIRST STEP — GitHub (Step 0):
Before touching Datadog, you MUST run these calls in parallel:
1. github_search_repos — find the repo for the alerted service
2. github_list_commits — find commits in the 2 hours before 
   the alert fired

A deployment or config change is the most likely root cause.
```

Our platform spans multiple AWS regions. A naive agent querying "all 5xx errors" would mix signals from healthy and unhealthy regions, producing confused analysis.

FRIDAY's first action is always to **lock a target region** from the alert metadata:

```
🌍 Region: Description — resolved from alert hostname

Every subsequent Datadog query includes:
kube_cluster_name:region-az-* (scoped to affected region only)
```

This eliminated an entire class of false-positive findings where the AI would cite errors from an unrelated region.

FRIDAY's output isn't freeform text. It follows a strict section contract that the Teams integration parses into visual containers:

```
## EXECUTIVE SUMMARY
[2-3 sentences — what happened, who's affected, what changed]

## KEY FINDINGS
[Bulleted evidence from GitHub + Datadog]

## WHAT CHANGED
[Specific commit/PR with timestamp and author]

## ERROR BREAKDOWN
[Service-by-service error counts with affected tenants]

## ROOT CAUSE
[Confirmed / Suspected / Unknown — with evidence chain]

## CUSTOMER IMPACT
[Affected tenants, operations, scope]

## RECOMMENDED ACTIONS
[Specific next steps for the on-call engineer]
```

The on-call engineer can glance at the Teams card and immediately know: *what happened, who's affected, what likely caused it, and what to do next* without reading a wall of text.

FRIDAY uses Claude's tool-use capability in a multi-round loop. The AI doesn't execute a fixed script — it **reasons** about each alert independently, deciding which tools to call based on what it's learned so far.

```
for round_num in range(MAX_TOOL_ROUNDS):  # Max 25 rounds
    response = bedrock_client.converse(
        modelId="anthropic.claude-opus",
        messages=messages,
        toolConfig={"tools": TOOL_DEFINITIONS},
    )

    if stop_reason == "tool_use":
        # Execute tools, append results, continue reasoning
        for tool_call in content_blocks:
            result = execute_tool(
                tool_call["name"], 
                tool_call["input"]
            )
            tool_results.append(result)
        messages.append(tool_results)

    elif stop_reason == "end_turn":
        # AI has concluded — extract findings
        return extract_final_report(content_blocks)
```

| Tool | Purpose |
|---|---|
`github_search_repos` |
Find which repo owns a service |
`github_list_commits` |
What changed before the alert |
`github_get_file` |
Read actual deployment configs |
`github_search_code` |
Find all producers/consumers of a queue |
`datadog_log_search` |
Find specific error messages |
`datadog_log_aggregate` |
Count errors by backend/tenant/path |
`datadog_query_metrics` |
Queue depth, CPU, memory, latency |
`datadog_get_monitor` |
Understand what threshold triggered |

The AI typically uses **8–15 tool calls per investigation**, batching parallel calls when possible to minimize round-trip time.

A cold investigation — where the AI knows nothing about your infrastructure — is slow and imprecise. FRIDAY includes a **deterministic training mode** that pre-builds architectural knowledge:

``` python
def train():
    """
    Deterministic training:
    ~13 targeted API calls, then one Bedrock synthesis call.

    Collects: cluster-service maps, HAProxy backends, 
    chronic error baselines, recent planned work.
    """
    # Phase 1: Targeted data collection (no AI — pure API calls)
    collected = {}
    for key, tool_name, tool_input in TRAINING_CALLS:
        collected[key] = execute_tool(tool_name, tool_input)

    # Phase 2: Single AI synthesis call
    knowledge_doc = synthesize_knowledge(collected)

    # Phase 3: Save to S3 — injected into system prompt
    save_to_s3(knowledge_doc)
```

The knowledge document contains:

One of the hardest problems: distinguishing planned maintenance from real outages. During a Kubernetes cluster migration, you *expect* 5xx errors as traffic drains. FRIDAY handles this through:

"This alert coincides with planned cluster decommission. Errors are expected during traffic drain. No incident action required."

What happens when an investigation is complex and approaching the 25-round tool limit? FRIDAY has a **graceful degradation** mechanism:

```
if rounds_remaining <= 3:
    user_content.append({
        "text": (
            "STOP CALLING TOOLS. Write your FINAL report "
            "NOW using all data collected so far. Mark "
            "uncertain findings as 'Suspected' rather "
            "than skipping them."
        )
    })
```

This ensures every investigation produces a report — even if incomplete — rather than timing out silently.

PagerDuty retries webhooks. FRIDAY handles this at two levels:

After running in production for several months:

| Metric | Before FRIDAY | After FRIDAY | Improvement |
|---|---|---|---|
| Mean Time to First Analysis | 15–45 min | 90 sec–3 min | ~90% faster |
| MTTR (overall) | ~60 min | ~15 min | 65% reduction |
| AI tool adoption (team) | 20% | 85% | 4x increase |
| Alert noise (false escalations) | High | Minimal | ~80% reduction |
| Auto-generated postmortems | 0% | 100% of P1/P2 | Eliminated manual RCA drafts |

The system prompt is the most important file in the codebase. It's not instructions — it's the agent's **operating manual**. Ours is ~5,000 words covering:

**Invest in your prompt like you invest in your architecture docs.**

Before this rule, the AI would spend 10+ rounds querying Datadog, building elaborate theories about traffic patterns — then discover a config change was merged 5 minutes before the alert. Now it finds the root cause in rounds 1-2 for ~80% of change-induced incidents.

FRIDAY is explicitly told it **does NOT take remediation actions**. It investigates, analyzes, and reports. A human validates and acts. This is not a limitation — it's a **design choice that builds trust**. When on-call engineers trust the AI's analysis, they act on it faster.

The two-Lambda pattern (sync for webhook receipt, async for investigation) is essential. Don't let API Gateway timeouts dictate your AI agent's investigation depth.

We're extending this pattern to **autonomous security remediation** — an agent that ingests vulnerability findings, generates IaC fixes, deploys through GitOps, verifies no impact, and requests human approval before proceeding. Same tool-use architecture, different domain.

The pattern is reproducible with:

The hard part isn't the code — it's the **system prompt**. That's where your SRE expertise lives. The AI is the execution engine; your knowledge of your infrastructure is what makes it useful.

The name also works as a backronym: **F** irst **R** esponder for **I** ncident **D** iagnostics and **A** nal**Y** sis — but honestly, we just thought the Marvel reference was cooler.

*I'm Vinothsingh Elumalai, a Platform Engineering leader building AI-native operations at enterprise scale. I lead the Platform team for a global IAM/SSO platform serving 30M+ users. Currently exploring how agentic AI transforms SRE from reactive firefighting to autonomous, closed-loop operations.*

*This is Part 1 of my AI-Native SRE series. Part 2 will cover JARVIS — an autonomous vulnerability remediation agent that fixes security findings through GitOps with human approval gates.*
