# Zero-Downtime Blue-Green and IP-Based Canary Deployments on ECS Fargate

> Source: <https://dev.to/aws-builders/zero-downtime-blue-green-and-ip-based-canary-deployments-on-ecs-fargate-4ea8>
> Published: 2026-05-23 08:36:49+00:00

Most ECS blue-green deployment tutorials eventually lead to the same stack built around AWS CodeDeploy.
And while CodeDeploy works, I kept running into one practical limitation during real deployments:
I couldn’t let my internal team validate a new release on the actual production URL before exposing it to customers.
That became the entire motivation behind this setup.
I didn’t want:
I wanted something much simpler:
So I built a Terraform-driven deployment workflow using ECS Fargate, one ALB with two target groups, and source-IP listener rules,
without using CodeDeploy.
After running this setup in practice, I ended up preferring it for many ECS workloads.
Both BLUE and GREEN environments run behind the same ALB.
Internal office/VPN IPs get routed to GREEN first.
Everyone else continues hitting BLUE.
That means QA and internal teams can validate the new release directly on the real production infrastructure before public rollout begins.
Same:
No “staging surprises” later.
A lot of deployment issues only appear on the real production routing path.
Internal users open:
https://nginx.jayakrishnayadav.cloud
…and immediately see the GREEN version.
Meanwhile, public users continue seeing BLUE.
No DNS switching.
No duplicate infrastructure.
Just ALB listener routing.
The deployment flow looks like this:
```
                   ┌────────────────────┐
                   │   Application LB   │
                   └─────────┬──────────┘
                             │
            ┌────────────────┴────────────────┐
            │                                 │
 Internal Office/VPN IPs                Public Users
            │                                 │
            ▼                                 ▼
   GREEN Target Group                 BLUE Target Group
            │                                 │
     ECS GREEN Tasks                   ECS BLUE Tasks
```
The canary routing rule gets evaluated first.
If the request source IP matches internal CIDRs, traffic goes to GREEN.
Everything else falls back to BLUE.
I kept the Terraform layout modular so it could be reused across multiple services.
```
.
├── main.tf
├── variables.tf
├── outputs.tf
├── env/
│   ├── backend.hcl
│   └── terraform.tfvars
├── modules/
│   ├── vpc/
│   ├── iam/
│   ├── alb/
│   ├── ecs-cluster/
│   └── ecs-blue-green-service/
└── scripts/
    └── zero-downtime-test.sh
```
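Each service deployed through the `ecs-blue-green-service` module gets a BLUE and a GREEN ECS service, one target group per color, and its own canary and production listener rules.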
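The entire deployment behavior depends on ALB listener priorities.
The ALB evaluates rules from the lowest priority number upward, so the canary rule (priority 99) is checked before the production rule (priority 100).
If the request's source IP matches one of the internal CIDRs, the canary rule forwards it to GREEN.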
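```hcl
resource "aws_lb_listener_rule" "canary" {
  # The rule only exists while canary routing is switched on.
  count = var.activate_canary ? 1 : 0

  # listener_arn is required; var.listener_arn is an assumed variable name.
  listener_arn = var.listener_arn
  priority     = 99

  # All conditions must match: an internal source IP AND the production host.
  condition {
    source_ip {
      values = var.canary_source_ips
    }
  }

  condition {
    host_header {
      values = ["nginx.jayakrishnayadav.cloud"]
    }
  }

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.green.arn
  }
}
```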
The production rule remains below it:
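```hcl
resource "aws_lb_listener_rule" "production" {
  # listener_arn is required; var.listener_arn is an assumed variable name.
  listener_arn = var.listener_arn
  priority     = 100

  condition {
    host_header {
      values = ["nginx.jayakrishnayadav.cloud"]
    }
  }

  action {
    type = "forward"
    # BLUE normally, GREEN once promote_to_all = true
    # (local.active_target_group is defined elsewhere in the module).
    target_group_arn = local.active_target_group
  }
}
```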
That’s it.
No weighted routing.
No lifecycle hooks.
Just listener priorities.
This wasn’t built as a theoretical architecture exercise.
I tested the rollout flow directly from Terraform while continuously validating traffic behavior against live ECS Fargate services.
Terraform initialization:
```bash
terraform init -backend-config=env/backend.hcl
```
Deployment apply:
```bash
terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve
```
During canary validation, I continuously verified my public IP:
```bash
curl ifconfig.me
```
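That mattered because the ALB source-IP rule decides whether a request reaches GREEN or BLUE. The rule matches the address of the client connecting to the ALB, not the `X-Forwarded-For` header, so that public IP is exactly what gets compared against the canary CIDRs.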
Once my IP matched the configured canary CIDRs, traffic immediately started routing to GREEN.
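If you want to predict the routing before touching the ALB, the priority 99 source-IP check can be mimicked locally. The sketch below is hypothetical: `CANARY_CIDRS` holds placeholder documentation ranges (replace them with your own `canary_source_ips`), it only handles IPv4, and it assumes the canary rule is active while the production rule still points at BLUE.

```bash
#!/usr/bin/env bash
# Sketch: predict which target group the canary rule would pick for an IPv4 address.
# CANARY_CIDRS is a placeholder (RFC 5737 documentation ranges); use your canary_source_ips.
CANARY_CIDRS=("203.0.113.0/24" "198.51.100.7/32")

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeed if IPv4 address $1 falls inside CIDR block $2.
ip_in_cidr() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( bits == 0 ? 0 : (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( (ip & mask) == (net & mask) ))
}

# Mirror the rule order: priority 99 (source IP match -> GREEN), then priority 100 (-> BLUE).
route_for_ip() {
  local cidr
  for cidr in "${CANARY_CIDRS[@]}"; do
    if ip_in_cidr "$1" "$cidr"; then
      echo "GREEN"
      return 0
    fi
  done
  echo "BLUE"
}
```

For example, `route_for_ip "$(curl -s -4 ifconfig.me)"` prints GREEN or BLUE for the machine you are on; `-4` keeps ifconfig.me from answering with an IPv6 address that this helper cannot parse.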
The nice part about this setup is that everything becomes variable-driven.
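Because the whole rollout is driven by three booleans, a small pre-apply check can reject combinations that make no sense. This is a hypothetical guard, not part of the repository, and it rests on an assumption drawn from the phases below: GREEN tasks only run while `enable_canary = true`, so `activate_canary` or `promote_to_all` without it would send traffic to an empty target group.

```bash
#!/usr/bin/env bash
# Hypothetical pre-apply guard for the rollout flags in a tfvars file.
# Assumption: GREEN only has running tasks while enable_canary = true.

# Print the last true/false value assigned to flag $2 in file $1 (defaults to false).
read_flag() {
  local value
  value=$(sed -nE "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*(true|false).*/\1/p" "$1" | tail -n 1)
  echo "${value:-false}"
}

# Fail if any flag would route traffic to GREEN while GREEN is scaled down.
check_flags() {
  local enable activate promote
  enable=$(read_flag "$1" enable_canary)
  activate=$(read_flag "$1" activate_canary)
  promote=$(read_flag "$1" promote_to_all)
  if [[ "$enable" != "true" && ( "$activate" == "true" || "$promote" == "true" ) ]]; then
    echo "refusing: traffic would go to GREEN while enable_canary = false" >&2
    return 1
  fi
  echo "ok: enable_canary=$enable activate_canary=$activate promote_to_all=$promote"
}
```

Usage: `check_flags env/terraform.tfvars && terraform apply -var-file=env/terraform.tfvars`.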
BLUE handles all production traffic.
GREEN remains scaled down.
```hcl
enable_canary   = false
activate_canary = false
promote_to_all  = false
```
Apply:
```bash
terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve
```
Result:
Now we start the GREEN environment.
```hcl
enable_canary   = true
activate_canary = false
promote_to_all  = false
```
Apply again:
```bash
terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve
```
At this stage GREEN tasks start and register with their target group, but no listener rule forwards to GREEN yet.
Users never hit partially starting containers.
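Before flipping `activate_canary`, it helps to confirm GREEN is actually ready rather than assuming it. A minimal sketch, assuming the AWS CLI is configured; the cluster name, service name, and target group ARN are placeholders, and `wait_for_green` is a hypothetical helper, not part of the repository.

```bash
#!/usr/bin/env bash
# Hypothetical pre-canary gate: wait until the GREEN ECS service is stable and every
# target in the GREEN target group reports "healthy". All arguments are placeholders.

# Succeed only if at least one state was returned and all of them are "healthy".
all_healthy() {
  local s
  [[ -n "$1" ]] || return 1
  for s in $1; do
    [[ "$s" == "healthy" ]] || return 1
  done
}

wait_for_green() {
  local cluster="$1" service="$2" tg_arn="$3" states
  # Built-in waiter: polls the service until its deployment has settled.
  aws ecs wait services-stable --cluster "$cluster" --services "$service" || return 1
  for _ in {1..30}; do
    states=$(aws elbv2 describe-target-health --target-group-arn "$tg_arn" \
      --query 'TargetHealthDescriptions[].TargetHealth.State' --output text)
    if all_healthy "$states"; then
      echo "GREEN healthy: $states"
      return 0
    fi
    sleep 10
  done
  echo "GREEN targets never became healthy" >&2
  return 1
}
```

Usage: `wait_for_green my-cluster my-green-service "$GREEN_TG_ARN" && echo "safe to set activate_canary = true"`.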
Now we enable canary routing.
```hcl
enable_canary   = true
activate_canary = true
promote_to_all  = false
```
Apply again:
```bash
terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve
```
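Now requests from the internal CIDRs reach GREEN, while everyone else keeps hitting BLUE.
This became the most valuable phase of the deployment workflow, because internal teams can validate GREEN on the real production URL while customers remain completely unaffected.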
This is the ALB listener rules view while canary routing is enabled.
The priority 99 rule matches internal source IPs and forwards them to GREEN, while everyone else continues hitting BLUE.
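The same rule list can also be pulled from the AWS CLI, which is handy for confirming that the priority 99 rule exists after each apply. A minimal sketch: `show_rules` is a hypothetical wrapper, and the listener ARN is a placeholder you would take from the console or your Terraform outputs.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: list every rule on a listener with its priority, the
# condition fields it matches on, and the target group its first action forwards to.
show_rules() {
  aws elbv2 describe-rules \
    --listener-arn "$1" \
    --query 'Rules[].[Priority, join(`","`, Conditions[].Field), Actions[0].TargetGroupArn]' \
    --output table
}
```

Usage: `show_rules "$LISTENER_ARN"`. With the canary active you would expect a priority 99 row matching on `source-ip` and `host-header` that points at the GREEN target group, and a priority 100 row pointing at BLUE.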
Once validation looks good:
```hcl
enable_canary   = true
activate_canary = false
promote_to_all  = true
```
Apply again:
```bash
terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve
```
Now the canary rule is removed and the production rule forwards all traffic to GREEN.
No downtime occurs.
Traffic simply moves from one target group to another.
I didn’t want to assume the deployment was safe.
I wanted to verify it continuously during rollout.
So I used a simple curl-based validation script that continuously hit both applications while traffic shifted between BLUE and GREEN.
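```bash
#!/usr/bin/env bash
# Hit both services 100 times and report the HTTP status plus which color answered.
# Each app is expected to return a body containing "BLUE - v..." or "GREEN - v...".
for i in {1..100}
do
  for url in \
    "https://nginx.jayakrishnayadav.cloud/" \
    "https://apache.jayakrishnayadav.cloud/"
  do
    # -k skips TLS verification (kept from the original; the ACM certificate should make it unnecessary).
    # --max-time stops one hung connection from stalling the whole loop (status becomes 000).
    response=$(curl -k -s --max-time 5 -w " HTTPSTATUS:%{http_code}" "$url")
    body=${response% HTTPSTATUS:*}
    status=${response##*HTTPSTATUS:}
    if [[ $body == *"BLUE - v"* ]]; then
      color="BLUE"
    elif [[ $body == *"GREEN - v"* ]]; then
      color="GREEN"
    else
      color="UNKNOWN"
    fi
    echo "Run: $i | URL: $url | Status: $status | Version: $color"
  done
done
```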
Output during deployment:
You can clearly see:
That confirmed the deployment was genuinely zero downtime.
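To make that check mechanical rather than visual, the loop's output can be piped through a small summary. This is a sketch that assumes the exact line format printed by the validation loop (`Run: <n> | URL: <url> | Status: <code> | Version: <color>`); `summarize_run` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical summary of the validation loop's output.
# Exits non-zero if any response was not HTTP 200 (including curl failures, logged as 000).
summarize_run() {
  awk -F' [|] ' '
    {
      status = $3; sub(/^Status: /, "", status)
      version = $4; sub(/^Version: /, "", version)
      total++
      seen[version]++
      if (status != "200") failures++
    }
    END {
      printf "requests=%d non200=%d blue=%d green=%d unknown=%d\n",
        total, failures, seen["BLUE"], seen["GREEN"], seen["UNKNOWN"]
      exit (failures > 0)
    }'
}
```

For example, if the loop is saved as `scripts/zero-downtime-test.sh`, running `bash scripts/zero-downtime-test.sh | summarize_run` prints one line of counts and fails the pipeline on any non-200 response.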
After promotion:
Clean and simple.
Rollback became extremely simple.
I just reverted the Terraform variables:
```hcl
enable_canary   = false
activate_canary = false
promote_to_all  = false
```
Apply Terraform again:
```bash
terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve
```
The ALB immediately routes traffic back to BLUE.
The rollback process stays predictable because traffic switching is entirely controlled through ALB listener rules.
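If editing the tfvars file first feels too slow during an incident, the same rollback can be forced from the command line. A hypothetical wrapper, relying on documented Terraform behavior: `-var` and `-var-file` options are processed in the order given, so the later `-var` values override the file.

```bash
#!/usr/bin/env bash
# Hypothetical emergency rollback: force all three flags to false without editing
# env/terraform.tfvars first. Later -var options override the -var-file values.
# Extra arguments (for example -lock=false, as used above) are passed through.
rollback_to_blue() {
  terraform apply \
    -var-file=env/terraform.tfvars \
    -var 'enable_canary=false' \
    -var 'activate_canary=false' \
    -var 'promote_to_all=false' \
    -auto-approve \
    "$@"
}
```

Usage: `rollback_to_blue -lock=false`. Afterwards, set the same values in `env/terraform.tfvars`; otherwise the next plain apply re-applies whatever the file still says.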
The ALB uses ACM certificates for HTTPS.
Listeners:
Example:
```hcl
test_listener_allowed_cidrs = [
  "160.30.39.198/32"
]
```
That keeps internal preview traffic private while still using the same production infrastructure.
One thing I specifically wanted to avoid was permanently doubling infrastructure cost.
Normal state:
Deployment window:
After promotion:
So infrastructure cost only increases briefly during deployments.
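The temporary overhead is easy to estimate: Fargate is priced per vCPU-hour and per GB-hour, so the extra cost of a deployment window is roughly GREEN task count × window length × (vCPU × vCPU rate + memory × GB rate). A small sketch; every input, including both rates, is yours to supply from current Fargate pricing for your region, and `green_window_cost` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Rough extra cost of running GREEN next to BLUE during a deployment window.
# No prices are hard-coded; all six inputs come from the caller.
# Args: tasks vcpu_per_task mem_gb_per_task hours usd_per_vcpu_hour usd_per_gb_hour
green_window_cost() {
  awk -v n="$1" -v cpu="$2" -v mem="$3" -v h="$4" -v rc="$5" -v rm="$6" \
    'BEGIN { printf "%.4f\n", n * h * (cpu * rc + mem * rm) }'
}
```

Usage: `green_window_cost 2 0.25 0.5 1 <vcpu-hour-rate> <gb-hour-rate>` for two 0.25 vCPU / 0.5 GB tasks running for one hour.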
This project started because I wanted a very practical deployment workflow:
Internal users should validate the new version on the actual production URL before customers ever see it.
Once I implemented that using ALB listener priorities and source IP routing, I realized I no longer really needed CodeDeploy for this workflow.
The end result became:
And because everything is Terraform-driven, the deployment process stays reproducible and predictable.
Full Terraform implementation:
https://github.com/jayakrishnayadav24/ecs-blue-green-deployment/tree/canary
