It's midnight. You get an incident call on your phone: your application's web server has crashed, and users are seeing the dreaded 500 Internal Server Error. You stumble to your laptop, sleepy-eyed, to run the restart command or your restart script.
This is the "Old Way." It's manual, it's reactive, and it ruins your sleep.
In the world of DevOps, we don't just fix things; we build things that fix themselves (self-healing). In this article, we're going to build a simple automation that detects a web server crash and restarts the service before you even roll over in bed.
To build this, we need five simple AWS resources:
The flow looks like this:
First, we need a server to monitor.
sudo dnf update -y
sudo dnf install nginx -y
sudo systemctl enable --now nginx
Open your browser to the instance's public IP. You should see the "Welcome to Nginx" page.
The default EC2 metrics can't see inside your server. We need the agent to monitor the Nginx process.
sudo dnf install amazon-cloudwatch-agent -y
sudo nano /opt/aws/amazon-cloudwatch-agent/bin/config.json
Paste this configuration:
{
  "agent": {
    "metrics_collection_interval": 60
  },
  "metrics": {
    "namespace": "CWAgent",
    "metrics_collected": {
      "procstat": [
        {
          "exe": "nginx",
          "measurement": ["pid_count"]
        }
      ]
    }
  }
}
Save the file, then tell the agent to load it and start:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
The agent is now running. Verify with:
sudo systemctl status amazon-cloudwatch-agent
After a minute or two, you should see a procstat_lookup_pid_count metric appear in CloudWatch under the CWAgent namespace.
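If you'd rather confirm this from code than from the console, here is a minimal sketch. It assumes the metric is published as procstat_lookup_pid_count (the name the agent uses for pid_count on Linux), and the helper name find_procstat_metrics is ours, not part of any AWS SDK. The CloudWatch client is passed in so the function can be tried without AWS access.

```python
def find_procstat_metrics(cloudwatch, namespace="CWAgent",
                          metric_name="procstat_lookup_pid_count"):
    """Return one dict of dimensions per matching metric CloudWatch knows about."""
    found = []
    # list_metrics is paginated, so walk every page.
    paginator = cloudwatch.get_paginator("list_metrics")
    for page in paginator.paginate(Namespace=namespace, MetricName=metric_name):
        for metric in page["Metrics"]:
            # Flatten [{"Name": ..., "Value": ...}, ...] into a plain dict.
            found.append({d["Name"]: d["Value"] for d in metric["Dimensions"]})
    return found
```

With credentials configured, calling find_procstat_metrics(boto3.client("cloudwatch")) should return an empty list until the agent has reported at least once. The dimensions it returns are the ones your alarm has to match exactly.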
We need a tiny function that tells AWS: "Hey, go to Instance X and restart the web server." Create a new Lambda function (Python 3.12+) and paste this code:
import os

import boto3


def lambda_handler(event, context):
    ssm = boto3.client('ssm')
    # Read the target instance from the INSTANCE_ID environment variable.
    instance_id = os.environ.get('INSTANCE_ID', 'i-YOUR_INSTANCE_ID_HERE')
    # EventBridge delivers the alarm name under detail.alarmName.
    alarm_name = event.get('detail', {}).get('alarmName', 'unknown')
    print(f"Alarm '{alarm_name}' triggered. Restarting Nginx on {instance_id}...")

    # Ask Systems Manager to run the restart command on the instance.
    ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={'commands': ['sudo systemctl restart nginx']}
    )
    return {"status": "Restart command sent"}
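Before wiring anything up, it can help to exercise this logic locally. The sketch below is not the deployed handler: restart_nginx is a hypothetical variant that takes the SSM client as an argument, StubSSM is a made-up stand-in that records calls instead of contacting AWS, and the sample event is a trimmed-down approximation of a CloudWatch alarm state-change event with placeholder names and IDs.

```python
import os


def restart_nginx(event, ssm, instance_id=None):
    """Same steps as the Lambda handler, but with the SSM client passed in."""
    instance_id = instance_id or os.environ.get("INSTANCE_ID", "i-YOUR_INSTANCE_ID_HERE")
    alarm_name = event.get("detail", {}).get("alarmName", "unknown")
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["sudo systemctl restart nginx"]},
    )
    # send_command returns the new command's ID under Command.CommandId.
    return {"alarm": alarm_name, "command_id": response["Command"]["CommandId"]}


class StubSSM:
    """Hypothetical stand-in for the SSM client; records calls, talks to nothing."""

    def __init__(self):
        self.calls = []

    def send_command(self, **kwargs):
        self.calls.append(kwargs)
        return {"Command": {"CommandId": "stub-command-id"}}


# Trimmed-down sample event; real events carry more fields.
SAMPLE_EVENT = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "nginx-down-alarm", "state": {"value": "ALARM"}},
}

stub = StubSSM()
result = restart_nginx(SAMPLE_EVENT, stub, instance_id="i-0123456789abcdef0")
print(result)
print(stub.calls[0]["Parameters"]["commands"])
```

Passing the client in keeps the experiment free of credentials and network calls. Returning the command ID is a small addition that makes it easier to look the run up in Systems Manager later.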
IAM permissions for the Lambda. In the Lambda console, open your function's Configuration → Permissions tab and click the execution role name to open it in IAM.
Confirm AWSLambdaBasicExecutionRole is attached (it usually is by default). Then click Add permissions → Create inline policy, switch to the JSON tab, and paste:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "ssm:SendCommand",
    "Resource": [
      "arn:aws:ec2:REGION:ACCOUNT_ID:instance/i-YOUR_INSTANCE_ID",
      "arn:aws:ssm:REGION::document/AWS-RunShellScript"
    ]
  }]
}
Replace REGION, ACCOUNT_ID, and the instance ID with your values.
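If hand-editing ARNs feels error-prone, a few lines of Python can fill in the placeholders for you. This is only a convenience sketch: build_send_command_policy is a hypothetical helper, and the region, account ID, and instance ID below are dummy values.

```python
import json


def build_send_command_policy(region, account_id, instance_id):
    """Render the inline policy above with real values substituted in."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "ssm:SendCommand",
            "Resource": [
                f"arn:aws:ec2:{region}:{account_id}:instance/{instance_id}",
                # AWS-owned documents have no account ID in their ARN.
                f"arn:aws:ssm:{region}::document/AWS-RunShellScript",
            ],
        }],
    }


# Dummy values for illustration only.
policy = build_send_command_policy("us-east-1", "123456789012", "i-0123456789abcdef0")
print(json.dumps(policy, indent=2))
```

Paste the printed JSON into the inline policy editor in place of the template.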
Now we tell AWS when to trigger that code.
Go to CloudWatch > Alarms and click "Create Alarm."
Give the alarm a memorable name like nginx-down-alarm. You'll reference it in the next step.
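For readers who prefer code to console clicks, roughly the same alarm could be described as parameters for boto3's put_metric_alarm. Treat every setting here as an assumption rather than the article's exact configuration: the Linux metric name procstat_lookup_pid_count, the Minimum statistic, the one-minute period, the "fewer than 1 process" threshold, and treating missing data as breaching. The build_alarm_params helper is hypothetical.

```python
def build_alarm_params(alarm_name="nginx-down-alarm", dimensions=None):
    """Build keyword arguments for CloudWatch put_metric_alarm (assumed settings)."""
    dimensions = dimensions or {}
    return {
        "AlarmName": alarm_name,
        "Namespace": "CWAgent",
        "MetricName": "procstat_lookup_pid_count",  # assumed Linux metric name
        # Must match the dimensions the agent actually publishes.
        "Dimensions": [{"Name": k, "Value": v} for k, v in sorted(dimensions.items())],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",  # alarm when pid_count < 1
        "TreatMissingData": "breaching",  # no data at all also counts as down
    }


params = build_alarm_params(dimensions={"exe": "nginx"})
print(params["AlarmName"], params["ComparisonOperator"], params["Threshold"])
```

With credentials in place, boto3.client("cloudwatch").put_metric_alarm(**params) would create the alarm. The dimensions are the part most likely to need adjusting: copy them exactly from the metric as it appears in CloudWatch, because an alarm on a dimension set that doesn't exist will never see any data.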
Here's where EventBridge shines. CloudWatch Alarms publish state-change events to the default event bus automatically; we just need a rule that catches the ones we care about and sends them to our Lambda.
Go to EventBridge > Rules and click "Create rule."
{
  "source": ["aws.cloudwatch"],
  "detail-type": ["CloudWatch Alarm State Change"],
  "detail": {
    "alarmName": ["nginx-down-alarm"],
    "state": {
      "value": ["ALARM"]
    }
  }
}
This pattern says: "Match only when this specific alarm enters the ALARM state." No filtering logic is needed inside the Lambda; EventBridge handles it.
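To build some intuition for how that pattern behaves, here is a deliberately simplified matcher. It is not EventBridge's implementation and covers only the exact-value matching used above (each list means "any of these values", nested objects are matched recursively). The matches and alarm_event functions are hypothetical, and the sample events are trimmed down.

```python
def matches(pattern, event):
    """Simplified exact-match check: every key in the pattern must match the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the matching part of the event.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # A list in the pattern means "any one of these values".
            return False
    return True


PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
        "alarmName": ["nginx-down-alarm"],
        "state": {"value": ["ALARM"]},
    },
}


def alarm_event(name, state):
    """Build a trimmed-down alarm state-change event for experiments."""
    return {
        "source": "aws.cloudwatch",
        "detail-type": "CloudWatch Alarm State Change",
        "detail": {"alarmName": name, "state": {"value": state}},
    }


print(matches(PATTERN, alarm_event("nginx-down-alarm", "ALARM")))  # True
print(matches(PATTERN, alarm_event("nginx-down-alarm", "OK")))     # False
print(matches(PATTERN, alarm_event("other-alarm", "ALARM")))       # False
```

The second case is the important one: when the alarm recovers and returns to OK, the rule stays quiet, so the Lambda doesn't restart a healthy server. For checking a pattern against EventBridge itself, the service also exposes a TestEventPattern API.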
EventBridge will automatically add the invoke permission on your function. No subscriptions, no topic management.
We're going to kill Nginx and watch it come back.
sudo systemctl stop nginx
If something doesn't fire, check the alarm history in CloudWatch first, then your Lambda's CloudWatch Logs. EventBridge also has a "Monitoring" tab on each rule showing invocation counts and failures, which is handy for debugging.
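If you want a number for how long the self-healing takes, a small polling script run from your laptop can time it. This is a sketch under assumptions: is_up and wait_for_recovery are hypothetical helpers, the default 5-second interval and 10-minute cap are arbitrary, and the probe simply treats an HTTP 200 from the server as "up".

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=3):
    """Return True if the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_recovery(probe, interval=5, max_wait=600,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll probe() until it returns True; return seconds waited, or None on timeout."""
    start = clock()
    while clock() - start <= max_wait:
        if probe():
            return clock() - start
        sleep(interval)
    return None
```

Right after stopping Nginx, call wait_for_recovery(lambda: is_up("http://YOUR_PUBLIC_IP/")), substituting your instance's address. Expect the result to be dominated by the metric collection interval and the alarm's evaluation period rather than by the restart itself.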
If you can walk an interviewer through this project, you're showing skills that hiring managers genuinely look for in entry-level DevOps roles. You aren't just "using AWS"; you are demonstrating:
Most of what we used in this article fits inside the AWS Free Tier: the EC2 t3.micro, Lambda invocations, and EventBridge events from AWS services (like CloudWatch alarm state changes) are all free. One caveat: CloudWatch custom metrics (which is what procstat produces) are only free for the first 10 metrics, so a single procstat metric is fine, but the cost can scale up if you expand this pattern broadly.
When you're done experimenting, terminate the EC2 instance, delete the CloudWatch alarm, and remove the EventBridge rule to avoid any surprise bills.