{"slug": "build-a-self-healing-app-on-aws-a-beginner-s-guide", "title": "Build a Self-Healing App on AWS: A Beginner's Guide", "summary": "An AWS developer built a self-healing application on AWS that automatically detects a web server crash and restarts the service without human intervention. The solution uses Amazon CloudWatch Agent to monitor the Nginx process on an EC2 instance, then triggers an AWS Lambda function via EventBridge when the process fails, which sends an SSM command to restart the web server. The automation eliminates the need for manual, reactive server restarts by detecting failures and initiating recovery before a human operator would even be notified.", "body_md": "It's midnight. You get an incident call on your phone that your application's web server has crashed, and users are seeing the dreaded 500 Internal Server Error. You stumble to your laptop, sleepy-eyed, to run the restart command or your restart script.\n\nThis is the **\"Old Way.\"** It's manual, it's reactive, and it ruins your sleep.\n\nIn the world of DevOps, we don't just fix things; we build things that fix themselves (self-healing). In this article, we're going to build a simple automation that detects a web server crash and restarts the service before you even roll over in bed.\n\nTo build this, we need five simple AWS resources:\n\nThe flow looks like this:\n\nFirst, we need a server to monitor.\n\n```\nsudo dnf update -y\nsudo dnf install nginx -y\nsudo systemctl enable --now nginx\n```\n\nOpen your browser to the instance's public IP. You should see the \"Welcome to Nginx\" page.\n\nThe default EC2 metrics can't see inside your server. 
We need the CloudWatch agent to monitor the Nginx process.\n\n`sudo dnf install amazon-cloudwatch-agent -y`\n\n`sudo nano /opt/aws/amazon-cloudwatch-agent/bin/config.json`\n\nPaste this configuration:\n\n```\n{\n  \"agent\": {\n    \"metrics_collection_interval\": 60\n  },\n  \"metrics\": {\n    \"namespace\": \"CWAgent\",\n    \"metrics_collected\": {\n      \"procstat\": [\n        {\n          \"exe\": \"nginx\",\n          \"measurement\": [\"pid_count\"]\n        }\n      ]\n    }\n  }\n}\n```\n\nSave the file, then start the agent and tell it to load the configuration:\n\n`sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s`\n\nThe agent should now be running. Verify with:\n\n`sudo systemctl status amazon-cloudwatch-agent`\n\nAfter a minute or two, you should see a `procstat_lookup_pid_count` metric appear in CloudWatch under the CWAgent namespace.\n\nWe need a tiny function that tells AWS: \"Hey, go to Instance X and restart the web server.\" Create a new Lambda function (Python 3.12+) and paste this code:\n\n``` python\nimport os\nimport boto3\n\ndef lambda_handler(event, context):\n    ssm = boto3.client('ssm')\n\n    # Read the instance ID from the INSTANCE_ID environment variable.\n    # The placeholder fallback is for this lab only:\n    instance_id = os.environ.get('INSTANCE_ID', 'i-YOUR_INSTANCE_ID_HERE')\n\n    alarm_name = event.get('detail', {}).get('alarmName', 'unknown')\n    print(f\"Alarm '{alarm_name}' triggered. 
Restarting Nginx on {instance_id}...\")\n\n    ssm.send_command(\n        InstanceIds=[instance_id],\n        DocumentName=\"AWS-RunShellScript\",\n        Parameters={'commands': ['sudo systemctl restart nginx']}\n    )\n    return {\"status\": \"Restart command sent\"}\n```\n\n**IAM permissions for the Lambda.** In the Lambda console, open your function's Configuration → Permissions tab and click the execution role name to open it in IAM.\n\nConfirm AWSLambdaBasicExecutionRole is attached (it usually is by default). Then click Add permissions → Create inline policy, switch to the JSON tab, and paste:\n\n```\n{\n  \"Version\": \"2012-10-17\",\n  \"Statement\": [{\n    \"Effect\": \"Allow\",\n    \"Action\": \"ssm:SendCommand\",\n    \"Resource\": [\n      \"arn:aws:ec2:REGION:ACCOUNT_ID:instance/i-YOUR_INSTANCE_ID\",\n      \"arn:aws:ssm:REGION::document/AWS-RunShellScript\"\n    ]\n  }]\n}\n```\n\nReplace REGION, ACCOUNT_ID, and the instance ID with your values.\n\nNow we tell AWS when to trigger that code.\n\nGo to CloudWatch > Alarms and click \"Create Alarm.\"\n\nGive the alarm a memorable name like nginx-down-alarm. You'll reference it in the next step.\n\nHere's where EventBridge shines. CloudWatch Alarms publish state-change events to the default event bus automatically; we just need a rule that catches the ones we care about and sends them to our Lambda.\n\nGo to EventBridge > Rules and click \"Create rule.\"\n\n```\n{\n  \"source\": [\"aws.cloudwatch\"],\n  \"detail-type\": [\"CloudWatch Alarm State Change\"],\n  \"detail\": {\n    \"alarmName\": [\"nginx-down-alarm\"],\n    \"state\": {\n      \"value\": [\"ALARM\"]\n    }\n  }\n}\n```\n\nThis pattern says: \"Match only when this specific alarm enters the ALARM state.\" No filtering logic is needed inside the Lambda; EventBridge handles it.\n\nWhen you add the function as the rule's target in the console, EventBridge automatically adds the invoke permission to it. 
No subscriptions, no topic management.\n\nWe're going to kill Nginx and watch it come back.\n\n`sudo systemctl stop nginx`\n\nIf something doesn't fire, check the alarm history in CloudWatch first, then your Lambda's CloudWatch Logs. EventBridge also has a \"Monitoring\" tab on each rule showing invocation counts and failures, which is handy for debugging.\n\nIf you can walk an interviewer through this project, you're demonstrating skills that hiring managers genuinely look for in entry-level DevOps roles. You aren't just \"using AWS\"; you are demonstrating:\n\nMost of what we used in this article fits inside the AWS Free Tier: the EC2 t3.micro, Lambda invocations, and EventBridge events from AWS services (like CloudWatch alarm state changes) are all free. One caveat: CloudWatch custom metrics (which is what procstat produces) are only free for the first 10 metrics, so a single procstat metric is fine, but the cost can scale up if you expand this pattern broadly.\n\nWhen you're done experimenting, terminate the EC2 instance, delete the CloudWatch alarm, and remove the EventBridge rule to avoid any surprise bills.", "url": "https://wpnews.pro/news/build-a-self-healing-app-on-aws-a-beginner-s-guide", "canonical_source": "https://dev.to/esthernnolum/build-a-self-healing-app-on-aws-a-beginners-guide-1cle", "published_at": "2026-05-27 14:53:48+00:00", "updated_at": "2026-05-27 15:12:17.753990+00:00", "lang": "en", "topics": ["mlops"], "entities": ["AWS", "Nginx", "CloudWatch"], "alternates": {"html": "https://wpnews.pro/news/build-a-self-healing-app-on-aws-a-beginner-s-guide", "markdown": "https://wpnews.pro/news/build-a-self-healing-app-on-aws-a-beginner-s-guide.md", "text": "https://wpnews.pro/news/build-a-self-healing-app-on-aws-a-beginner-s-guide.txt", "jsonld": "https://wpnews.pro/news/build-a-self-healing-app-on-aws-a-beginner-s-guide.jsonld"}}