[ARTICLE · art-8026] src=dev.to ↗ pub= topic=cloud-computing verified=true sentiment=↓ negative

I Inherited 47,000 Lines of Terraform Spaghetti — Here's How I Untangled It Without Burning Production

A DevOps engineer describes inheriting a 47,000-line Terraform codebase with a single state file, no modules, and poorly named variables. The author explains how they split the monolithic state into five separate state files based on blast radius and change frequency, reducing `terraform plan` time from 14 minutes to 45 seconds. The piece also covers four environment directories (dev, staging, prod, dr) with significant configuration drift, and recommends the Strangler Fig pattern for gradual refactoring rather than a "big bang" approach.

Read: 11 min · Views: 6 · Published: May 22, 2026

The Slack Message That Ruined My Monday #

"Hey, the previous platform team left. Here's the repo. Good luck 🫡"

I stared at the Git repository. 47,000 lines of Terraform. One state file. Zero modules. Variables named `x`, `temp2`, and my personal favorite, `DO_NOT_TOUCH_ask_raj`. Raj had left the company two years ago.

If you've been a Senior DevOps Engineer for more than a year, you've inherited something like this. Maybe not 47K lines, but you've opened a `main.tf` that made you question your career choices.

This isn't a "Terraform best practices" article. Those are written by people who've never had to run `terraform plan` on a 3,000-resource state file at 2 AM while the VP of Engineering watches.

This is a survival guide.

Anti-Pattern #1: The Monolith State File (aka "The Single Point of Career Failure") #

What I Found


resource "aws_vpc" "main" { ... }
resource "aws_instance" "api_server_1" { ... }
resource "aws_instance" "api_server_2" { ... }
resource "aws_db_instance" "prod_db" { ... }
resource "aws_iam_role" "god_mode" { ... }  # yes, really

A single `terraform apply` touched everything. Networking, databases, compute, DNS: all entangled like Christmas lights in January. One typo in a security group rule? Congratulations, your `plan` just showed 847 resources to evaluate, and Terraform decided your RDS instance needs replacing.

The Real Danger

This isn't just messy — it's operationally catastrophic. Here's what happens:

  • `terraform plan` takes 14 minutes. Developers stop running it.
  • State file locking means only one person can work at a time.
  • Blast radius of any mistake = the entire infrastructure.
  • New team members are terrified to touch anything (rightfully so).

How I Fixed It (Without Downtime)

Step 1: State Surgery with terraform state mv

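# Map the dependency graph first so you know which resources must move together
terraform graph | dot -Tsvg > infra-dependency-map.svg

# Syntax: terraform state mv [options] SOURCE DESTINATION (options come before the addresses).
# -state-out is a legacy option that only applies to local state files.
terraform state mv -state-out=networking/terraform.tfstate 'aws_vpc.main' 'aws_vpc.main'
terraform state mv -state-out=networking/terraform.tfstate 'aws_subnet.public[0]' 'aws_subnet.public[0]'
terraform state mv -state-out=networking/terraform.tfstate 'aws_subnet.public[1]' 'aws_subnet.public[1]'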

Step 2: Introduce State Boundaries by Blast Radius

I split into five state files based on change frequency and blast radius:

| Layer | Contents | Change Frequency | Blast Radius |
|---|---|---|---|
| foundation | VPC, Subnets, Route Tables | Monthly | Critical |
| security | IAM, KMS, Security Groups | Weekly | Critical |
| data | RDS, ElastiCache, S3 | Rare | Catastrophic |
| compute | ECS/EKS, ASGs, ALBs | Daily | High |
| edge | CloudFront, Route53, WAF | Weekly | Medium |

Step 3: Wire Them Together with Remote State Data Sources

data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "company-terraform-state"
    key    = "foundation/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_ecs_service" "api" {
  network_configuration {
    subnets = data.terraform_remote_state.networking.outputs.private_subnet_ids
  }
}
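
For the remote state lookup above to resolve, the producing layer has to publish the value as a root-level output. Here is a minimal sketch of what that might look like in the foundation layer. The file name and the `aws_subnet.private` resource are assumptions; the article only shows `aws_subnet.public`.

```hcl
# foundation/outputs.tf (hypothetical file name)
# terraform_remote_state can only read root-module outputs, so anything
# another layer needs must be exported explicitly.
output "private_subnet_ids" {
  description = "IDs of the private subnets consumed by the compute layer"
  value       = aws_subnet.private[*].id # assumes aws_subnet.private is declared with count
}

output "vpc_id" {
  description = "ID of the main VPC"
  value       = aws_vpc.main.id
}
```

Keeping this output list small is a design choice in itself: every output becomes a contract between layers, so renaming one later means coordinating changes across state files.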

Result: `terraform plan` went from 14 minutes to 45 seconds. Team velocity tripled. I stopped getting 2 AM pages about state locks.

Anti-Pattern #2: The Copy-Paste Empire (aka "Modules at Home") #

What I Found

environments/
├── dev/
│   └── main.tf      # 1,200 lines
├── staging/
│   └── main.tf      # 1,200 lines (95% identical to dev)
├── prod/
│   └── main.tf      # 1,200 lines (90% identical... with 47 "hotfixes")
└── dr/
    └── main.tf      # 1,200 lines (copied from prod 8 months ago, never updated)

Four copies of the same infrastructure with subtle drift. Staging had a security group rule that prod didn't. DR was missing three services entirely. Nobody knew which differences were intentional.

Why This Kills Senior Engineers

You can't `diff` your way out of this. The files have diverged in ways that are both intentional (prod has larger instances) and accidental (someone fixed a bug in dev but forgot to propagate it). You have no source of truth.

The Refactoring Strategy That Actually Works

Don't try to unify everything at once. I learned this the hard way after a failed "big bang" refactor that took 3 sprints and broke staging for a week.

Instead, use the Strangler Fig pattern:

variable "environment" {
  type = string
  validation {
    condition     = contains(["dev", "staging", "prod", "dr"], var.environment)
    error_message = "Environment must be dev, staging, prod, or dr."
  }
}

variable "config" {
  type = object({
    instance_type    = string
    min_capacity     = number
    max_capacity     = number
    enable_waf       = bool
    multi_az         = bool
    backup_retention = number
  })
}

locals {
  env_config = {
    dev = {
      instance_type    = "t3.medium"
      min_capacity     = 1
      max_capacity     = 2
      enable_waf       = false
      multi_az         = false
      backup_retention = 1
    }
    prod = {
      instance_type    = "m5.xlarge"
      min_capacity     = 3
      max_capacity     = 20
      enable_waf       = true
      multi_az         = true
      backup_retention = 35
    }
  }
}

The key insight: Every environment difference should be documented in code as a conscious decision, not hidden in a 1,200-line file as an accidental divergence.
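
One way to consume that map is sketched below, assuming it lives in the same module as the `env_config` local and the `environment` variable. The names `selected` and `effective_config` are illustrative, not from the original codebase.

```hcl
locals {
  # Indexing the map (rather than lookup() with a default) makes a missing
  # environment a hard error at plan time instead of a silent fallback.
  selected = local.env_config[var.environment]
}

# Surfacing the resolved settings makes them visible in plan and apply output.
output "effective_config" {
  description = "Resolved settings for the current environment"
  value       = local.selected
}
```

Note that the map as shown only defines `dev` and `prod`. With this approach, `staging` and `dr` would fail at plan time until their entries are added, which is arguably the point: each environment's settings have to be written down on purpose.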

Anti-Pattern #3: The terraform apply -auto-approve YOLO Pipeline #

What I Found in .gitlab-ci.yml

deploy_prod:
  stage: deploy
  script:
    - terraform init
    - terraform apply -auto-approve  # 🚨 WHAT
  only:
    - main

No plan artifact. No approval gate. No diff review. Push to main → infrastructure changes in production. The commit history told the horror story:

fix: revert the revert of the fix
fix: actually fix prod this time
fix: ok THIS one fixes it
revert: revert everything from today

What Senior Engineers Actually Need

name: "Terraform"

on:
  pull_request:
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Terraform Plan
        id: plan
        run: |
          terraform init
          terraform plan -no-color -out=tfplan \
            -detailed-exitcode 2>&1 | tee plan_output.txt
        continue-on-error: true

      - name: Comment Plan on PR
        uses: actions/github-script@v7
        if: github.event_name == 'pull_request'
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('plan_output.txt', 'utf8');
            const truncated = plan.length > 60000 
              ? plan.substring(0, 60000) + '\n\n... truncated ...' 
              : plan;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Terraform Plan Output\n\`\`\`\n${truncated}\n\`\`\``
            });

      - name: Upload Plan Artifact
        uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: tfplan

  apply:
    needs: plan
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment: production  # Requires manual approval
    steps:
      - uses: actions/checkout@v4

      - name: Download Plan
        uses: actions/download-artifact@v4
        with:
          name: tfplan

      - name: Terraform Apply
        run: |
          # A fresh runner has no .terraform directory, so initialize before applying
          terraform init
          terraform apply tfplan  # Apply ONLY the reviewed plan

The non-negotiable rules:

  • Plans are generated on PR and attached as artifacts.
  • Humans review the diff before any production apply.
  • Apply uses the exact plan that was reviewed (not a new plan).
  • The `production` environment requires manual approval from a senior engineer.

Anti-Pattern #4: Secrets in State (The Ticking Compliance Bomb) #

What I Found

resource "aws_db_instance" "prod" {
  engine               = "postgres"
  instance_class       = "db.r5.2xlarge"
  username             = "admin"
  password             = "Pr0d_P@ssw0rd_2022!"  # I wish I was joking
  publicly_accessible  = true                    # I really wish I was joking
}

The password was in the `.tf` file, the state file, the plan output, and the Git history. Four places to leak from. And `publicly_accessible = true` was the cherry on this dumpster fire sundae.

The Fix (That Also Passes Audit)

data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/rds/master-password"
}

resource "aws_db_instance" "prod" {
  engine              = "postgres"
  instance_class      = "db.r5.2xlarge"
  username            = "admin"
  password            = data.aws_secretsmanager_secret_version.db_password.secret_string
  publicly_accessible = false

  lifecycle {
    ignore_changes = [password]
  }
}

But that's not enough. The state file still contains sensitive values. The complete solution:

terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/data/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                          # SSE-KMS encryption
    kms_key_id     = "arn:aws:kms:us-east-1:xxx:key/yyy"
    dynamodb_table = "terraform-state-lock"
  }
}

Plus strict S3 bucket policies, access logging, and never giving developers direct state file access. Use `terraform output` instead.
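
The bucket-side controls can also live in Terraform. This is a minimal sketch, assuming the state bucket is managed from a separate bootstrap configuration. That split, the log bucket name, and the versioning resource are assumptions on my part; the article does not describe them.

```hcl
# Block every form of public access to the state bucket.
resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = "company-terraform-state"
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Keep previous state versions so a bad write can be rolled back.
resource "aws_s3_bucket_versioning" "state" {
  bucket = "company-terraform-state"

  versioning_configuration {
    status = "Enabled"
  }
}

# Record who read or wrote state objects.
resource "aws_s3_bucket_logging" "state" {
  bucket        = "company-terraform-state"
  target_bucket = "company-access-logs" # hypothetical log bucket
  target_prefix = "terraform-state/"
}
```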

Anti-Pattern #5: The "God Resource" With 200 Lines of Nested Blocks #

What I Found

resource "aws_ecs_task_definition" "api" {
  family                   = "api"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = 1024
  memory                   = 2048
  execution_role_arn       = aws_iam_role.ecs_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([
    {
      name  = "api"
      image = "company/api:latest"  # 🚨 LATEST TAG IN PROD
      portMappings = [{ containerPort = 8080 }]
      environment = [
        { name = "DB_HOST", value = "prod-db.cluster-xxx.us-east-1.rds.amazonaws.com" },
        { name = "DB_NAME", value = "production" },
        { name = "REDIS_URL", value = "prod-redis.xxx.cache.amazonaws.com:6379" },
      ]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = "/ecs/api"
          "awslogs-region"        = "us-east-1"
          "awslogs-stream-prefix" = "api"
        }
      }
    }
  ])
}

The problems compound:

  • Environment variables are hardcoded (not sourced from SSM/Secrets Manager).
  • The `latest` tag means deployments are non-reproducible.
  • The `jsonencode` blob is untestable and un-diffable in PR reviews.
  • One change to any env var triggers a full task definition replacement.

The Refactored Version

resource "aws_ecs_task_definition" "api" {
  family                   = "api-${var.environment}"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = var.task_cpu
  memory                   = var.task_memory
  execution_role_arn       = aws_iam_role.ecs_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = templatefile("${path.module}/templates/api-container.json.tpl", {
    image_tag     = var.image_tag  # Pinned, passed from CI/CD
    environment   = var.environment
    db_host       = data.aws_ssm_parameter.db_host.value
    redis_url     = data.aws_ssm_parameter.redis_url.value
    log_group     = aws_cloudwatch_log_group.api.name
    aws_region    = data.aws_region.current.name
  })
}
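
The refactored task definition leans on a few declarations that are not shown. Below is a sketch of what they might look like. The SSM parameter paths are hypothetical, and the `latest` guard is an addition of mine rather than something the article prescribes.

```hcl
# Connection endpoints are read from SSM instead of being hardcoded.
# Parameter names are hypothetical; use whatever naming scheme you store them under.
data "aws_ssm_parameter" "db_host" {
  name = "/${var.environment}/api/db_host"
}

data "aws_ssm_parameter" "redis_url" {
  name = "/${var.environment}/api/redis_url"
}

data "aws_region" "current" {}

variable "image_tag" {
  type        = string
  description = "Immutable image tag supplied by CI/CD"

  # Reject the mutable tag outright so a non-reproducible deploy fails at plan time.
  validation {
    condition     = var.image_tag != "latest"
    error_message = "Pin a specific image tag; \"latest\" is not reproducible."
  }
}
```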

The Refactoring Playbook (Do This Monday) #

After untangling this mess across three months, here's the sequence that works:

Week 1: Triage and Protect

terraform plan -no-color > baseline_plan_$(date +%Y%m%d).txt

Week 2-4: Split the Monolith

terraform state list > all_resources.txt
wc -l all_resources.txt  # Mine had 2,847 resources

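# aws_route_table / aws_route. keep Route53 (edge layer) out of the networking bucket
grep "aws_vpc\|aws_subnet\|aws_route_table\|aws_route\." all_resources.txt > networking.txt
grep "aws_iam\|aws_kms" all_resources.txt > security.txt
# RDS instances are aws_db_instance in state; aws_rds covers aws_rds_cluster
grep "aws_db_\|aws_rds\|aws_elasticache\|aws_s3" all_resources.txt > data.txt
grep "aws_ecs\|aws_alb\|aws_lb\|aws_autoscaling" all_resources.txt > compute.txt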

Week 5-8: Modularize Incrementally

Move one service at a time into a module. After each move:

  • Run `terraform plan`. It should show zero changes.
  • If plan shows changes, you have a bug. Fix it before moving on.
  • Get a PR review from another senior engineer.
  • Apply and monitor for 24 hours.
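
An alternative to running `terraform state mv` by hand for each step is a `moved` block, available in Terraform 1.1 and later. It records the rename in code, so the move itself shows up in the PR diff. The sketch below swaps in that technique; it is not what the article's author describes doing, and the module path and inner resource name are hypothetical.

```hcl
module "api" {
  source      = "./modules/api" # hypothetical local module that declares aws_instance.server_1
  environment = var.environment
}

# Tells Terraform the existing instance now lives inside the module,
# so the plan reports a move instead of a destroy followed by a create.
moved {
  from = aws_instance.api_server_1
  to   = module.api.aws_instance.server_1
}
```

If the module's arguments match the old resource exactly, the plan should list the move and nothing to add, change, or destroy, which is the same zero-change check the steps above call for.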

Week 9-12: Harden the Pipeline

  • Add `terraform validate` and `tflint` to CI.
  • Add `checkov` or `tfsec` for security scanning.
  • Implement drift detection (scheduled plan that alerts on differences).
  • Add cost estimation with `infracost`.

The Drift Detection Cron That Saved Us #

This is the thing nobody talks about. Even after a perfect refactor, drift happens. Someone clicks in the console. An auto-remediation tool makes changes. A Lambda modifies a security group.

name: "Drift Detection"

on:
  schedule:
    - cron: '0 6 * * 1-5'  # Every weekday at 6 AM

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        layer: [foundation, security, data, compute, edge]
    steps:
      - uses: actions/checkout@v4

      - name: Terraform Plan (Drift Check)
        id: plan
        working-directory: infrastructure/${{ matrix.layer }}
        run: |
          terraform init
          # The default shell runs with -e, so disable it to capture exit code 2 (changes present)
          set +e
          terraform plan -detailed-exitcode -no-color > plan.txt 2>&1
          echo "exitcode=$?" >> "$GITHUB_OUTPUT"
        continue-on-error: true

      - name: Alert on Drift
        if: steps.plan.outputs.exitcode == '2'
        run: |
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H 'Content-type: application/json' \
            -d "{\"text\":\"🚨 Drift detected in *${{ matrix.layer }}* layer. Check the plan output.\"}"

We caught 3 unauthorized console changes in the first week alone.

Parting Wisdom for the Senior Engineer Who Just Inherited a Mess #

  • Don't refactor everything at once. You'll break things and lose credibility.
  • Document what you find before you fix it. Screenshot the horrors. You'll need them for the post-mortem and for your performance review.
  • Get buy-in from leadership BEFORE you start. "I need 3 sprints for tech debt" is a hard sell. "Our current setup means any infrastructure change has a 40% chance of causing an incident" gets budget approved.
  • Every `terraform state mv` should be a separate, reviewed PR. Not because it's technically necessary, but because when something breaks at step 37 of 50, you want a clean git history to bisect.
  • The goal isn't perfect Terraform. The goal is Terraform that your team can safely operate at 2 AM. If a junior engineer can't run `terraform plan` without fear, your refactor isn't done.

TL;DR for the Scrollers #

| Anti-Pattern | Fix | Priority |
|---|---|---|
| Monolith state file | Split by blast radius and change frequency | P0 |
| Copy-paste environments | Modules + environment configs | P1 |
| -auto-approve in CI | Plan artifacts + manual approval gates | P0 |
| Secrets in state/code | Secrets Manager + encrypted state + ignore_changes | P0 |
| God resources with inline JSON | templatefile + SSM parameters | P2 |
| No drift detection | Scheduled plan with alerting | P1 |

If you've ever stared at a Terraform codebase and whispered "who did this?!" into the void — you're not alone. We've all been there. The good news? It's fixable. One state move at a time.

Found this useful? Follow me for more battle-tested DevOps content. I write about the stuff that actually happens in production — not the happy path from the docs.
