Build a Self-Healing App on AWS: A Beginner's Guide

Source: dev.to · Published: 2026-05-27T14:53Z · Topic: mlops

An AWS developer built a self-healing application on AWS that automatically detects a web server crash and restarts the service without human intervention. The solution uses Amazon CloudWatch Agent to monitor the Nginx process on an EC2 instance, then triggers an AWS Lambda function via EventBridge when the process fails, which sends an SSM command to restart the web server. The automation eliminates the need for manual, reactive server restarts by detecting failures and initiating recovery before a human operator would even be notified.

It's midnight. An incident call comes in on your phone: your application's web server has crashed, and users are seeing the dreaded 500 Internal Server Error. You stumble to your laptop, sleepy-eyed, to type the restart command or run your restart script.

This is the "Old Way." It's manual, it's reactive, and it ruins your sleep.

In the world of DevOps, we don't just fix things; we build things that fix themselves (self-healing). In this article, we're going to build a simple automation that detects a web server crash and restarts the service before you even roll over in bed.

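To build this, we need five simple AWS resources: an EC2 instance running Nginx, the CloudWatch Agent feeding a CloudWatch alarm, an EventBridge rule, a Lambda function, and Systems Manager (SSM) Run Command.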

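The flow looks like this: Nginx stops → the CloudWatch Agent reports the missing process → the CloudWatch alarm enters the ALARM state → an EventBridge rule matches the state change → Lambda sends an SSM command → Nginx restarts.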

First, we need a server to monitor.

sudo dnf update -y
sudo dnf install nginx -y
sudo systemctl enable --now nginx

Open your browser to the instance's public IP. You should see the "Welcome to Nginx" page.

The default EC2 metrics can't see inside your server. We need the agent to monitor the Nginx process.

sudo dnf install amazon-cloudwatch-agent -y

sudo nano /opt/aws/amazon-cloudwatch-agent/bin/config.json

Paste this configuration:

{
  "agent": {
    "metrics_collection_interval": 60
  },
  "metrics": {
    "namespace": "CWAgent",
    "metrics_collected": {
      "procstat": [
        {
          "exe": "nginx",
          "measurement": ["pid_count"]
        }
      ]
    }
  }
}

Save the file, then start the agent and tell it to load the configuration:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s

The agent is now running. Verify with:

sudo systemctl status amazon-cloudwatch-agent

After a minute or two, you should see a procstat_lookup_pid_count metric appear in CloudWatch under the CWAgent namespace.
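To check for the metric from a script instead of the console, a minimal sketch along these lines could help. It assumes boto3 is installed and credentials are configured; the helper name `find_nginx_metric` is hypothetical, and only the first page of `list_metrics` results is read. The dimensions it returns are worth noting, because an alarm on this metric has to use exactly the same set.

```python
def find_nginx_metric(cloudwatch, namespace="CWAgent",
                      metric_name="procstat_lookup_pid_count"):
    """Return the dimension sets recorded for the procstat metric.

    `cloudwatch` is any object exposing boto3's CloudWatch `list_metrics` call.
    """
    response = cloudwatch.list_metrics(Namespace=namespace, MetricName=metric_name)
    # Each entry carries the exact dimensions the agent published.
    return [
        {d["Name"]: d["Value"] for d in metric.get("Dimensions", [])}
        for metric in response.get("Metrics", [])
    ]


# Usage against a real account (requires boto3 and AWS credentials):
#   import boto3
#   for dims in find_nginx_metric(boto3.client("cloudwatch")):
#       print(dims)
```

An empty list most likely means the agent has not published a data point yet, or the script is looking in a different region than the instance.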

We need a tiny function that tells AWS: "Hey, go to Instance X and restart the web server." Create a new Lambda function (Python 3.12+) and paste this code:

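import os
import boto3

def lambda_handler(event, context):
    ssm = boto3.client('ssm')

    # Target instance: set INSTANCE_ID as a Lambda environment variable,
    # or replace the placeholder default below.
    instance_id = os.environ.get('INSTANCE_ID', 'i-YOUR_INSTANCE_ID_HERE')

    # EventBridge delivers the alarm name under detail.alarmName.
    alarm_name = event.get('detail', {}).get('alarmName', 'unknown')
    print(f"Alarm '{alarm_name}' triggered. Restarting Nginx on {instance_id}...")

    # Ask SSM Run Command to restart Nginx on the instance.
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={'commands': ['sudo systemctl restart nginx']}
    )
    # Log the command ID so the run can be looked up in SSM later.
    print(f"SSM command ID: {response['Command']['CommandId']}")
    return {"status": "Restart command sent"}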
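Sending the command only confirms that SSM accepted it, not that Nginx actually restarted. As a rough sketch of how the outcome could be checked, the helper below polls boto3's `get_command_invocation`. The name `wait_for_command` and its defaults are hypothetical, and the execution role would additionally need the `ssm:GetCommandInvocation` permission, which the policy in this article does not grant.

```python
import time

# Statuses after which the invocation will not change any further.
TERMINAL_STATES = {"Success", "Failed", "Cancelled", "TimedOut"}


def wait_for_command(ssm, command_id, instance_id,
                     attempts=10, delay=2.0, sleep=time.sleep):
    """Poll an SSM Run Command invocation until it reaches a terminal status.

    Returns the final status string, or "Unknown" if it never settled.
    """
    for _ in range(attempts):
        # Wait first: the invocation may not be queryable right after send_command.
        sleep(delay)
        result = ssm.get_command_invocation(
            CommandId=command_id, InstanceId=instance_id
        )
        if result["Status"] in TERMINAL_STATES:
            return result["Status"]
    return "Unknown"
```

If this ran inside the Lambda itself, the function timeout would need raising from the 3-second default; running it from a laptop while testing avoids that.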

IAM permissions for the Lambda. In the Lambda console, open your function's Configuration → Permissions tab and click the execution role name to open it in IAM.

Confirm AWSLambdaBasicExecutionRole is attached (it usually is by default). Then click Add permissions → Create inline policy, switch to the JSON tab, and paste:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "ssm:SendCommand",
    "Resource": [
      "arn:aws:ec2:REGION:ACCOUNT_ID:instance/i-YOUR_INSTANCE_ID",
      "arn:aws:ssm:REGION::document/AWS-RunShellScript"
    ]
  }]
}

Replace REGION, ACCOUNT_ID, and the instance ID with your values.
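To avoid hand-editing the placeholders, the same policy can be generated with a few lines of Python. This is only a convenience sketch: `build_ssm_policy` is a hypothetical helper, and the region, account ID, and instance ID in the example call are made-up values.

```python
import json


def build_ssm_policy(region, account_id, instance_id):
    """Build the inline policy document shown above with placeholders filled in."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "ssm:SendCommand",
            "Resource": [
                f"arn:aws:ec2:{region}:{account_id}:instance/{instance_id}",
                # AWS-owned documents have no account ID in their ARN.
                f"arn:aws:ssm:{region}::document/AWS-RunShellScript",
            ],
        }],
    }


# Example values are placeholders, not real identifiers.
policy = build_ssm_policy("us-east-1", "123456789012", "i-0123456789abcdef0")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted straight into the inline policy editor.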

Now we tell AWS when to trigger that code.

Go to CloudWatch > Alarms and click "Create Alarm."

Give the alarm a memorable name like nginx-down-alarm. You'll reference it in the next step.
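For readers who prefer to script the alarm, here is one way the settings might be expressed as arguments to boto3's `put_metric_alarm`. The values are assumptions rather than the article's exact console choices: a threshold of fewer than 1 process over a single 60-second period, and missing data treated as breaching so that a silent agent also raises the alarm. `build_alarm_params` is a hypothetical helper, and the dimensions must be copied from the metric as it appears in your account.

```python
def build_alarm_params(dimensions, alarm_name="nginx-down-alarm"):
    """Build keyword arguments for boto3's CloudWatch put_metric_alarm call.

    `dimensions` maps dimension names to values and must match the
    published metric exactly, or the alarm will never see any data.
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": "CWAgent",
        "MetricName": "procstat_lookup_pid_count",
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": "Minimum",
        "Period": 60,                 # seconds; matches the agent's collection interval
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",  # fires when pid_count < 1
        "TreatMissingData": "breaching",            # assumption: no data counts as down
    }


# Usage against a real account (requires boto3 and AWS credentials):
#   import boto3
#   params = build_alarm_params({"exe": "nginx"})  # assumed dimensions; check yours
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameters in a plain dictionary makes them easy to review before anything is created in the account.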

Here's where EventBridge shines. CloudWatch Alarms publish state-change events to the default event bus automatically; we just need a rule that catches the ones we care about and sends them to our Lambda.

Go to EventBridge > Rules and click "Create rule."

{
  "source": ["aws.cloudwatch"],
  "detail-type": ["CloudWatch Alarm State Change"],
  "detail": {
    "alarmName": ["nginx-down-alarm"],
    "state": {
      "value": ["ALARM"]
    }
  }
}

This pattern says: "Match only when this specific alarm enters the ALARM state." No filtering logic is needed inside the Lambda; EventBridge handles it.
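To build intuition for what the pattern accepts and rejects, the sketch below imitates the matching locally. It is a deliberately simplified stand-in, not EventBridge's real matcher: it handles only nested objects and lists of exact values, which is all this pattern uses. The `matches` and `sample_event` helpers are hypothetical, and the sample events are trimmed to the fields the pattern inspects.

```python
def matches(pattern, event):
    """Simplified check of an event against an EventBridge-style pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested object: recurse into the matching part of the event.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: the event value must equal one of the listed values.
            return False
    return True


PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
        "alarmName": ["nginx-down-alarm"],
        "state": {"value": ["ALARM"]},
    },
}


def sample_event(alarm_name, state):
    """Build a trimmed-down alarm state-change event for local testing."""
    return {
        "source": "aws.cloudwatch",
        "detail-type": "CloudWatch Alarm State Change",
        "detail": {"alarmName": alarm_name, "state": {"value": state}},
    }


print(matches(PATTERN, sample_event("nginx-down-alarm", "ALARM")))  # True
print(matches(PATTERN, sample_event("nginx-down-alarm", "OK")))     # False
print(matches(PATTERN, sample_event("other-alarm", "ALARM")))       # False
```

The second case is the important one: when the alarm returns to OK after the restart, the rule stays quiet, so the Lambda is not invoked a second time.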

When you select the Lambda function as the rule's target in the console, EventBridge automatically adds the invoke permission to your function. No subscriptions, no topic management.

We're going to kill Nginx and watch it come back.

sudo systemctl stop nginx
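Recovery is not instant: the agent has to report the missing process and the alarm has to evaluate it, so expect to wait a minute or more. Rather than refreshing the browser, a small polling script run from your own machine can tell you when the page is back. This is a sketch under assumptions: the helper names are hypothetical, and the URL in the usage comment is a placeholder for your instance's public IP.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=3):
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_for_recovery(check, attempts=30, delay=10, sleep=time.sleep):
    """Call `check` until it returns True.

    Returns the number of attempts used, or None if it never succeeded.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        sleep(delay)
    return None


# Usage (replace the placeholder with your instance's public IP):
#   tries = wait_for_recovery(lambda: http_ok("http://YOUR_INSTANCE_PUBLIC_IP/"))
#   print("recovered" if tries else "still down", tries)
```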

If something doesn't fire, check the alarm history in CloudWatch first, then your Lambda's CloudWatch Logs. EventBridge also has a "Monitoring" tab on each rule showing invocation counts and failures, which is handy for debugging.

If you can walk an interviewer through this project, you're demonstrating skills that hiring managers genuinely look for in entry-level DevOps roles. You aren't just "using AWS"; you are demonstrating:

Most of what we used in this article fits inside the AWS Free Tier: the EC2 t3.micro, Lambda invocations, and EventBridge events from AWS services (like CloudWatch alarm state changes) are all free. One caveat: CloudWatch custom metrics (which is what procstat produces) are free only for the first 10 metrics, so a single procstat metric is fine, but the cost can grow if you expand this pattern broadly.

When you're done experimenting, terminate the EC2 instance, delete the CloudWatch alarm, and remove the EventBridge rule to avoid any surprise bills.
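The cleanup can also be scripted. The sketch below covers only the three resources named above and takes boto3 clients as arguments. `tear_down` is a hypothetical helper, and the rule name is whatever you chose when creating the rule. Note that `terminate_instances` is irreversible.

```python
def tear_down(ec2, cloudwatch, events, instance_id, alarm_name, rule_name):
    """Remove the demo resources; each client argument is the matching boto3 client."""
    # A rule cannot be deleted while it still has targets attached.
    targets = events.list_targets_by_rule(Rule=rule_name).get("Targets", [])
    if targets:
        events.remove_targets(Rule=rule_name, Ids=[t["Id"] for t in targets])
    events.delete_rule(Name=rule_name)

    cloudwatch.delete_alarms(AlarmNames=[alarm_name])

    # Terminating the instance cannot be undone.
    ec2.terminate_instances(InstanceIds=[instance_id])


# Usage against a real account (requires boto3 and AWS credentials):
#   import boto3
#   tear_down(boto3.client("ec2"), boto3.client("cloudwatch"),
#             boto3.client("events"), "i-YOUR_INSTANCE_ID",
#             "nginx-down-alarm", "YOUR_RULE_NAME")
```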


Read original on dev.to → dev.to/esthernnolum/build-a-self-healing-app-on-…
