Zero-Downtime Blue-Green and IP-Based Canary Deployments on ECS Fargate

This article describes a Terraform-driven, zero-downtime deployment workflow for ECS Fargate that uses ALB listener rules based on source IP addresses to route internal team traffic to a GREEN environment while public users continue hitting the BLUE environment. This approach allows QA and internal teams to validate new releases on the actual production infrastructure and URL before public rollout, without relying on CodeDeploy, DNS switching, or duplicate infrastructure. The deployment is controlled by simple boolean variables (`enable_canary`, `activate_canary`, `promote_to_all`) that manage the lifecycle of the BLUE and GREEN environments.

Most ECS blue-green deployment tutorials eventually lead to the same stack, with CodeDeploy at its center. And while CodeDeploy works, I kept running into one practical limitation during real deployments: I couldn’t let my internal team validate a new release on the actual production URL before exposing it to customers.

That became the entire motivation behind this setup. I didn’t want DNS switching or duplicate infrastructure. I wanted something much simpler: internal users validating the new release on the production URL while everyone else stays on the current version. So I built a Terraform-driven deployment workflow using ECS Fargate, ALB listener rules, and source-IP routing, without using CodeDeploy. After running this setup in practice, I ended up preferring it for many ECS workloads.

Both BLUE and GREEN environments run behind the same ALB. Internal office/VPN IPs get routed to GREEN first. Everyone else continues hitting BLUE. That means QA and internal teams can validate the new release directly on the real production infrastructure before public rollout begins. Same infrastructure, same URL. No “staging surprises” later. A lot of deployment issues only appear on the real production routing path.

Internal users open:

https://nginx.jayakrishnayadav.cloud

…and immediately see the GREEN version. Meanwhile, public users continue seeing BLUE. No DNS switching. No duplicate infrastructure. Just ALB listener routing.

The deployment flow looks like this:

```
                   ┌────────────────────┐
                   │   Application LB   │
                   └─────────┬──────────┘
                             │
            ┌────────────────┴────────────────┐
            │                                 │
 Internal Office/VPN IPs                Public Users
            │                                 │
            ▼                                 ▼
   GREEN Target Group                 BLUE Target Group
            │                                 │
     ECS GREEN Tasks                    ECS BLUE Tasks
```

The canary routing rule gets evaluated first. If the request source IP matches internal CIDRs, traffic goes to GREEN. Everything else falls back to BLUE.

I kept the Terraform layout modular so it could be reused across multiple services.

```
.
├── main.tf
├── variables.tf
├── outputs.tf
├── env/
│   ├── backend.hcl
│   └── terraform.tfvars
├── modules/
│   ├── vpc/
│   ├── iam/
│   ├── alb/
│   ├── ecs-cluster/
│   └── ecs-blue-green-service/
└── scripts/
    └── zero-downtime-test.sh
```

Each ECS service gets its own BLUE and GREEN target groups and tasks, as in the diagram above. The entire deployment behavior depends on ALB listener priorities.
The canary listener rule gets evaluated first. If the request source IP matches internal CIDRs, traffic gets forwarded to GREEN.

```hcl
resource "aws_lb_listener_rule" "canary" {
  count    = var.activate_canary ? 1 : 0
  priority = 99
  # listener_arn is required but not shown in this excerpt

  condition {
    source_ip {
      values = var.canary_source_ips
    }
  }

  condition {
    host_header {
      values = ["nginx.jayakrishnayadav.cloud"]
    }
  }

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.green.arn
  }
}
```

The production rule remains below it:

```hcl
resource "aws_lb_listener_rule" "production" {
  priority = 100
  # listener_arn is required but not shown in this excerpt

  condition {
    host_header {
      values = ["nginx.jayakrishnayadav.cloud"]
    }
  }

  action {
    type             = "forward"
    target_group_arn = local.active_target_group
  }
}
```

That’s it. No weighted routing. No lifecycle hooks. Just listener priorities.

This wasn’t built as a theoretical architecture exercise. I tested the rollout flow directly from Terraform while continuously validating traffic behavior against live ECS Fargate services.

Terraform initialization:

```bash
terraform init -backend-config=env/backend.hcl
```

Deployment apply:

```bash
terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve
```

During canary validation, I continuously verified my public IP:

```bash
curl ifconfig.me
```

That mattered because the ALB source-IP rule decides whether traffic reaches BLUE or GREEN. Once my IP matched the configured canary CIDRs, traffic immediately started routing to GREEN.

The nice part about this setup is that everything becomes variable-driven.

In the normal state, BLUE handles all production traffic. GREEN remains scaled down.

```hcl
enable_canary   = false
activate_canary = false
promote_to_all  = false
```

Apply:

```bash
terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve
```

Now we start the GREEN environment.

```hcl
enable_canary   = true
activate_canary = false
promote_to_all  = false
```

Apply again:

```bash
terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve
```

At this stage, GREEN tasks start up, but the canary rule does not exist yet (`activate_canary` is still `false`), so users never hit partially starting containers.

Now we enable canary routing.
```hcl
enable_canary   = true
activate_canary = true
promote_to_all  = false
```

Apply again:

```bash
terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve
```

Now internal IPs are routed to GREEN while everyone else stays on BLUE. This became the most valuable phase of the deployment workflow, because now internal teams can validate the release on the real production URL while customers remain completely unaffected.

In the ALB listener rules view with canary routing enabled, the priority 99 rule matches internal source IPs and forwards them to GREEN, while everyone else continues hitting BLUE.

Once validation looks good:

```hcl
enable_canary   = true
activate_canary = false
promote_to_all  = true
```

Apply again:

```bash
terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve
```

Now all traffic is forwarded to GREEN. No downtime occurs. Traffic simply moves from one target group to another.

I didn’t want to assume the deployment was safe. I wanted to verify it continuously during rollout. So I used a simple curl-based validation script that continuously hit both applications while traffic shifted between BLUE and GREEN.

```bash
for i in {1..100}
do
  for url in \
    "https://nginx.jayakrishnayadav.cloud/" \
    "https://apache.jayakrishnayadav.cloud/"
  do
    # Append the HTTP status to the body so one request yields both.
    response=$(curl -k -s -w " HTTPSTATUS:%{http_code}" "$url")
    body=${response% HTTPSTATUS:*}
    status=${response##*HTTPSTATUS:}

    # Each environment identifies itself in the response body.
    if [[ $body == *"BLUE - v"* ]]; then
      color="BLUE"
    elif [[ $body == *"GREEN - v"* ]]; then
      color="GREEN"
    else
      color="UNKNOWN"
    fi

    echo "Run: $i | URL: $url | Status: $status | Version: $color"
  done
done
```

The output during deployment showed traffic shifting between BLUE and GREEN without interruption. That confirmed the deployment was genuinely zero downtime. After promotion, GREEN serves all traffic. Clean and simple.

Rollback became extremely simple. I just reverted the Terraform variables:

```hcl
enable_canary   = false
activate_canary = false
promote_to_all  = false
```

Apply Terraform again:

```bash
terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve
```

ALB immediately routes traffic back to BLUE. The rollback process stays predictable because traffic switching is entirely controlled through ALB listener rules.

The ALB uses ACM certificates for HTTPS.
Access to the test listener is restricted to an allow-list of CIDRs. Example:

```hcl
test_listener_allowed_cidrs = ["160.30.39.198/32"]
```

That keeps internal preview traffic private while still using the same production infrastructure.

One thing I specifically wanted to avoid was permanently doubling infrastructure cost. In the normal state, GREEN remains scaled down. BLUE and GREEN run side by side only during the deployment window. So infrastructure cost only increases briefly during deployments.

This project started because I wanted a very practical deployment workflow: internal users should validate the new version on the actual production URL before customers ever see it. Once I implemented that using ALB listener priorities and source IP routing, I realized I no longer really needed CodeDeploy for this workflow.

The end result became a deployment workflow with no CodeDeploy, no DNS switching, and no duplicate infrastructure. And because everything is Terraform-driven, the deployment process stays reproducible and predictable.

Full Terraform implementation: https://github.com/jayakrishnayadav24/ecs-blue-green-deployment/tree/canary
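The repository holds the real wiring. As a rough illustration of how the three flags could drive the listener rules shown earlier, here is a minimal sketch. It is an assumption, not the repository's code: the variable types, defaults, and descriptions, the `desired_count` variable, the `green_desired_count` local, and a target group named `aws_lb_target_group.blue` are all hypothetical, and the sketch expects the `blue` and `green` target groups to be defined elsewhere in the same module.

```hcl
# Hypothetical declarations. The names match the article; types and defaults are assumed.
variable "enable_canary" {
  description = "Start GREEN tasks alongside BLUE."
  type        = bool
  default     = false
}

variable "activate_canary" {
  description = "Create the priority 99 source-IP rule that forwards to GREEN."
  type        = bool
  default     = false
}

variable "promote_to_all" {
  description = "Point the production rule at GREEN."
  type        = bool
  default     = false
}

variable "canary_source_ips" {
  description = "Internal office/VPN CIDRs matched by the canary rule."
  type        = list(string)
  default     = []
}

# Hypothetical: task count used for an environment that is running.
variable "desired_count" {
  type    = number
  default = 2
}

locals {
  # Assumed wiring: the production rule forwards to GREEN only after promotion.
  active_target_group = (
    var.promote_to_all
    ? aws_lb_target_group.green.arn
    : aws_lb_target_group.blue.arn
  )

  # Assumed: GREEN runs tasks only during a rollout or once promoted.
  green_desired_count = (var.enable_canary || var.promote_to_all) ? var.desired_count : 0
}
```

Under these assumptions, setting all three flags to `false` removes the canary rule (`count = 0`), resolves `local.active_target_group` to BLUE, and scales GREEN to zero, which matches the rollback state described above.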