{"slug": "i-inherited-47000-lines-of-terraform-spaghetti-here-s-how-i-untangled-it-without", "title": "I Inherited 47,000 Lines of Terraform Spaghetti — Here's How I Untangled It Without Burning Production", "summary": "DevOps engineer's experience inheriting a 47,000-line Terraform codebase with a single state file, no modules, and poorly named variables. The author explains how they refactored the infrastructure by splitting the monolithic state into five separate state files based on blast radius and change frequency, reducing `terraform plan` time from 14 minutes to 45 seconds. The piece also addresses the challenge of managing four environment directories (dev, staging, prod, DR) with significant configuration drift, recommending the Strangler Fig pattern for gradual refactoring rather than a \"big bang\" approach.", "body_md": "## The Slack Message That Ruined My Monday\n\n\"Hey, the previous platform team left. Here's the repo. Good luck 🫡\"\n\nI stared at the Git repository. **47,000 lines of Terraform.** One state file. Zero modules. Variables named `x`, `temp2`, and my personal favorite, `DO_NOT_TOUCH_ask_raj`. Raj had left the company two years ago.\n\nIf you've been a Senior DevOps Engineer for more than a year, you've inherited *something* like this. Maybe not 47K lines, but you've opened a `main.tf` that made you question your career choices.\n\nThis isn't a \"Terraform best practices\" article. Those are written by people who've never had to run `terraform plan` on a 3,000-resource state file at 2 AM while the VP of Engineering watches.\n\n**This is a survival guide.**\n\n## Anti-Pattern #1: The Monolith State File (aka \"The Single Point of Career Failure\")\n\n### What I Found\n\n```\n# main.tf: 8,400 lines\n# \"Managed\" networking, compute, databases, DNS, IAM, monitoring,\n# and somehow... a CloudFront distribution for a marketing site\n# that was decommissioned in 2023.\n\nresource \"aws_vpc\" \"main\" { ... 
}\nresource \"aws_instance\" \"api_server_1\" { ... }\nresource \"aws_instance\" \"api_server_2\" { ... }\n# ... 200 more instances ...\nresource \"aws_db_instance\" \"prod_db\" { ... }\nresource \"aws_iam_role\" \"god_mode\" { ... }  # yes, really\n```\n\nA single `terraform apply` touched **everything**. Networking, databases, compute, DNS: all entangled like Christmas lights in January. One typo in a security group rule? Congratulations, your `plan` just showed 847 resources to evaluate, and Terraform decided your RDS instance needs replacing.\n\n### The Real Danger\n\nThis isn't just messy; it's **operationally catastrophic**. Here's what happens:\n\n- `terraform plan` takes **14 minutes**. Developers stop running it.\n- State file locking means only one person can work at a time.\n- Blast radius of any mistake = the entire infrastructure.\n- New team members are terrified to touch anything (rightfully so).\n\n### How I Fixed It (Without Downtime)\n\n**Step 1: State Surgery with `terraform state mv`**\n\n```\n# First, I mapped resource dependencies visually\nterraform graph | dot -Tsvg > infra-dependency-map.svg\n\n# Then, split by domain boundaries.\n# Syntax: terraform state mv [options] SOURCE DESTINATION\n# (-state-out is a legacy option that only works on local state files)\nterraform state mv -state-out=networking/terraform.tfstate 'aws_vpc.main' 'aws_vpc.main'\nterraform state mv -state-out=networking/terraform.tfstate 'aws_subnet.public[0]' 'aws_subnet.public[0]'\nterraform state mv -state-out=networking/terraform.tfstate 'aws_subnet.public[1]' 'aws_subnet.public[1]'\n```\n\n**Step 2: Introduce State Boundaries by Blast Radius**\n\nI split into five state files based on *change frequency* and *blast radius*:\n\n| Layer | Contents | Change Frequency | Blast Radius |\n|---|---|---|---|\n| `foundation` | VPC, Subnets, Route Tables | Monthly | Critical |\n| `security` | IAM, KMS, Security Groups | Weekly | Critical |\n| `data` | RDS, ElastiCache, S3 | Rare | Catastrophic |\n| `compute` | ECS/EKS, ASGs, ALBs | Daily | High |\n| `edge` | CloudFront, Route53, WAF | Weekly | Medium |\n\n**Step 3: Wire Them Together with Remote State Data 
Sources**\n\n```\n# In compute/main.tf\ndata \"terraform_remote_state\" \"networking\" {\n  backend = \"s3\"\n  config = {\n    bucket = \"company-terraform-state\"\n    key    = \"foundation/terraform.tfstate\"\n    region = \"us-east-1\"\n  }\n}\n\nresource \"aws_ecs_service\" \"api\" {\n  # Reference networking outputs safely\n  network_configuration {\n    subnets = data.terraform_remote_state.networking.outputs.private_subnet_ids\n  }\n}\n```\n\n**Result:** `terraform plan` went from 14 minutes to 45 seconds. Team velocity tripled. I stopped getting 2 AM pages about state locks.\n\n## Anti-Pattern #2: The Copy-Paste Empire (aka \"Modules at Home\")\n\n### What I Found\n\n```\nenvironments/\n├── dev/\n│   └── main.tf      # 1,200 lines\n├── staging/\n│   └── main.tf      # 1,200 lines (95% identical to dev)\n├── prod/\n│   └── main.tf      # 1,200 lines (90% identical... with 47 \"hotfixes\")\n└── dr/\n    └── main.tf      # 1,200 lines (copied from prod 8 months ago, never updated)\n```\n\nFour copies of the same infrastructure with subtle drift. Staging had a security group rule that prod didn't. DR was missing three services entirely. Nobody knew which differences were intentional.\n\n### Why This Kills Senior Engineers\n\nYou can't `diff` your way out of this. The files have diverged in ways that are both intentional (prod has larger instances) and accidental (someone fixed a bug in dev but forgot to propagate it). 
You have **no source of truth**.\n\n### The Refactoring Strategy That Actually Works\n\n**Don't try to unify everything at once.** I learned this the hard way after a failed \"big bang\" refactor that took 3 sprints and broke staging for a week.\n\n**Instead, use the Strangler Fig pattern:**\n\n```\n# modules/api-platform/main.tf\nvariable \"environment\" {\n  type = string\n  validation {\n    condition     = contains([\"dev\", \"staging\", \"prod\", \"dr\"], var.environment)\n    error_message = \"Environment must be dev, staging, prod, or dr.\"\n  }\n}\n\nvariable \"config\" {\n  type = object({\n    instance_type    = string\n    min_capacity     = number\n    max_capacity     = number\n    enable_waf       = bool\n    multi_az         = bool\n    backup_retention = number\n  })\n}\n\nlocals {\n  # Environment-specific defaults that document WHY they differ\n  env_config = {\n    dev = {\n      instance_type    = \"t3.medium\"\n      min_capacity     = 1\n      max_capacity     = 2\n      enable_waf       = false\n      multi_az         = false\n      backup_retention = 1\n    }\n    prod = {\n      instance_type    = \"m5.xlarge\"\n      min_capacity     = 3\n      max_capacity     = 20\n      enable_waf       = true\n      multi_az         = true\n      backup_retention = 35\n    }\n  }\n}\n```\n\n**The key insight:** Every environment difference should be **documented in code as a conscious decision**, not hidden in a 1,200-line file as an accidental divergence.\n\n## Anti-Pattern #3: The `terraform apply -auto-approve` YOLO Pipeline\n\n### What I Found in `.gitlab-ci.yml`\n\n```\ndeploy_prod:\n  stage: deploy\n  script:\n    - terraform init\n    - terraform apply -auto-approve  # 🚨 WHAT\n  only:\n    - main\n```\n\nNo plan artifact. No approval gate. No diff review. Push to main → infrastructure changes in production. 
The commit history told the horror story:\n\n```\nfix: revert the revert of the fix\nfix: actually fix prod this time\nfix: ok THIS one fixes it\nrevert: revert everything from today\n```\n\n### What Senior Engineers Actually Need\n\n```\n# .github/workflows/terraform.yml\nname: \"Terraform\"\n\non:\n  pull_request:\n    paths: ['infrastructure/**']\n  push:\n    branches: [main]\n    paths: ['infrastructure/**']\n\njobs:\n  plan:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4\n\n      - name: Terraform Plan\n        id: plan\n        run: |\n          terraform init\n          terraform plan -no-color -out=tfplan \\\n            -detailed-exitcode 2>&1 | tee plan_output.txt\n        continue-on-error: true\n\n      - name: Comment Plan on PR\n        uses: actions/github-script@v7\n        if: github.event_name == 'pull_request'\n        with:\n          script: |\n            const fs = require('fs');\n            const plan = fs.readFileSync('plan_output.txt', 'utf8');\n            const truncated = plan.length > 60000 \n              ? plan.substring(0, 60000) + '\\n\\n... truncated ...' 
\n              : plan;\n            github.rest.issues.createComment({\n              issue_number: context.issue.number,\n              owner: context.repo.owner,\n              repo: context.repo.repo,\n              body: `## Terraform Plan Output\\n\\`\\`\\`\\n${truncated}\\n\\`\\`\\``\n            });\n\n      - name: Upload Plan Artifact\n        uses: actions/upload-artifact@v4\n        with:\n          name: tfplan\n          path: tfplan\n\n  apply:\n    needs: plan\n    runs-on: ubuntu-latest\n    if: github.ref == 'refs/heads/main' && github.event_name == 'push'\n    environment: production  # Requires manual approval\n    steps:\n      - uses: actions/checkout@v4\n\n      - name: Download Plan\n        uses: actions/download-artifact@v4\n        with:\n          name: tfplan\n\n      - name: Terraform Apply\n        run: |\n          # A fresh runner has no providers or backend config, so init first\n          terraform init\n          terraform apply tfplan  # Apply ONLY the reviewed plan\n```\n\n**The non-negotiable rules:**\n\n- Plans are generated on PR and attached as artifacts.\n- Humans review the diff before any production apply.\n- Apply uses the *exact* plan that was reviewed (not a new plan).\n- The `production` environment requires manual approval from a senior engineer.\n\n## Anti-Pattern #4: Secrets in State (The Ticking Compliance Bomb)\n\n### What I Found\n\n```\nresource \"aws_db_instance\" \"prod\" {\n  engine               = \"postgres\"\n  instance_class       = \"db.r5.2xlarge\"\n  username             = \"admin\"\n  password             = \"Pr0d_P@ssw0rd_2022!\"  # I wish I was joking\n  publicly_accessible  = true                    # I really wish I was joking\n}\n```\n\nThe password was in the `.tf` file, the state file, the plan output, *and* the Git history. Four places to leak from. 
And `publicly_accessible = true` was the cherry on this dumpster fire sundae.\n\n### The Fix (That Also Passes Audit)\n\n```\n# Use a data source to pull secrets at plan/apply time\ndata \"aws_secretsmanager_secret_version\" \"db_password\" {\n  secret_id = \"prod/rds/master-password\"\n}\n\nresource \"aws_db_instance\" \"prod\" {\n  engine              = \"postgres\"\n  instance_class      = \"db.r5.2xlarge\"\n  username            = \"admin\"\n  password            = data.aws_secretsmanager_secret_version.db_password.secret_string\n  publicly_accessible = false\n\n  # Prevent Terraform from detecting password \"drift\"\n  lifecycle {\n    ignore_changes = [password]\n  }\n}\n```\n\n**But that's not enough.** The state file *still* contains sensitive values. The complete solution:\n\n```\n# backend.tf\nterraform {\n  backend \"s3\" {\n    bucket         = \"company-terraform-state\"\n    key            = \"prod/data/terraform.tfstate\"\n    region         = \"us-east-1\"\n    encrypt        = true                          # SSE-KMS encryption\n    kms_key_id     = \"arn:aws:kms:us-east-1:xxx:key/yyy\"\n    dynamodb_table = \"terraform-state-lock\"\n  }\n}\n```\n\nPlus strict S3 bucket policies, access logging, and **never** giving developers direct state file access. 
Use `terraform output` instead.\n\n## Anti-Pattern #5: The \"God Resource\" With 200 Lines of Nested Blocks\n\n### What I Found\n\n```\nresource \"aws_ecs_task_definition\" \"api\" {\n  family                   = \"api\"\n  network_mode             = \"awsvpc\"\n  requires_compatibilities = [\"FARGATE\"]\n  cpu                      = 1024\n  memory                   = 2048\n  execution_role_arn       = aws_iam_role.ecs_execution.arn\n  task_role_arn            = aws_iam_role.ecs_task.arn\n\n  container_definitions = jsonencode([\n    {\n      name  = \"api\"\n      image = \"company/api:latest\"  # 🚨 LATEST TAG IN PROD\n      portMappings = [{ containerPort = 8080 }]\n      environment = [\n        { name = \"DB_HOST\", value = \"prod-db.cluster-xxx.us-east-1.rds.amazonaws.com\" },\n        { name = \"DB_NAME\", value = \"production\" },\n        { name = \"REDIS_URL\", value = \"prod-redis.xxx.cache.amazonaws.com:6379\" },\n        # ... 45 more environment variables hardcoded here ...\n      ]\n      logConfiguration = {\n        logDriver = \"awslogs\"\n        options = {\n          \"awslogs-group\"         = \"/ecs/api\"\n          \"awslogs-region\"        = \"us-east-1\"\n          \"awslogs-stream-prefix\" = \"api\"\n        }\n      }\n      # ... 80 more lines of health checks, mount points, ulimits ...\n    }\n  ])\n}\n```\n\n**The problems compound:**\n\n- Environment variables are hardcoded (not sourced from SSM/Secrets Manager).\n- `latest` tag means deployments are non-reproducible.\n- The `jsonencode` blob is untestable and un-diffable in PR reviews.\n
- One change to any env var triggers a full task definition replacement.\n\n### The Refactored Version\n\n```\n# Use templatefile for complex JSON — it's testable and readable\nresource \"aws_ecs_task_definition\" \"api\" {\n  family                   = \"api-${var.environment}\"\n  network_mode             = \"awsvpc\"\n  requires_compatibilities = [\"FARGATE\"]\n  cpu                      = var.task_cpu\n  memory                   = var.task_memory\n  execution_role_arn       = aws_iam_role.ecs_execution.arn\n  task_role_arn            = aws_iam_role.ecs_task.arn\n\n  container_definitions = templatefile(\"${path.module}/templates/api-container.json.tpl\", {\n    image_tag     = var.image_tag  # Pinned, passed from CI/CD\n    environment   = var.environment\n    db_host       = data.aws_ssm_parameter.db_host.value\n    redis_url     = data.aws_ssm_parameter.redis_url.value\n    log_group     = aws_cloudwatch_log_group.api.name\n    aws_region    = data.aws_region.current.name\n  })\n}\n```\n\n## The Refactoring Playbook (Do This Monday)\n\nAfter untangling this mess across three months, here's the sequence that works:\n\n### Week 1: Triage and Protect\n\n```\n# 1. Enable state file encryption and locking NOW\n# 2. Add branch protection — no direct pushes to main\n# 3. Run terraform plan and SAVE the output as your baseline\nterraform plan -no-color > baseline_plan_$(date +%Y%m%d).txt\n\n# 4. 
Enable detailed audit logging on your state bucket\n```\n\n### Week 2-4: Split the Monolith\n\n```\n# Use terraform state list to inventory everything\nterraform state list > all_resources.txt\nwc -l all_resources.txt  # Mine had 2,847 resources\n\n# Group by service domain (patterns match resource type prefixes)\ngrep \"aws_vpc\\|aws_subnet\\|aws_route\" all_resources.txt > networking.txt\ngrep \"aws_iam\\|aws_kms\\|aws_security_group\" all_resources.txt > security.txt\ngrep \"aws_db\\|aws_rds\\|aws_elasticache\\|aws_s3\" all_resources.txt > data.txt\ngrep \"aws_ecs\\|aws_alb\\|aws_lb\\|aws_autoscaling\" all_resources.txt > compute.txt\n```\n\n### Week 5-8: Modularize Incrementally\n\nMove **one service at a time** into a module. After each move:\n\n- Run `terraform plan`; it should show **zero changes**.\n- If plan shows changes, you have a bug. Fix it before moving on.\n- Get a PR review from another senior engineer.\n- Apply and monitor for 24 hours.\n\n### Week 9-12: Harden the Pipeline\n\n- Add `terraform validate` and `tflint` to CI.\n- Add `checkov` or `tfsec` for security scanning.\n- Implement drift detection (scheduled plan that alerts on differences).\n- Add cost estimation with `infracost`.\n\n## The Drift Detection Cron That Saved Us\n\nThis is the thing nobody talks about. Even after a perfect refactor, **drift happens**. Someone clicks in the console. An auto-remediation tool makes changes. 
A Lambda modifies a security group.\n\n```\n# .github/workflows/drift-detection.yml\nname: \"Drift Detection\"\n\non:\n  schedule:\n    - cron: '0 6 * * 1-5'  # Every weekday at 6 AM\n\njobs:\n  detect-drift:\n    runs-on: ubuntu-latest\n    strategy:\n      matrix:\n        layer: [foundation, security, data, compute, edge]\n    steps:\n      - uses: actions/checkout@v4\n\n      - name: Terraform Plan (Drift Check)\n        id: plan\n        working-directory: infrastructure/${{ matrix.layer }}\n        run: |\n          terraform init\n          # The default shell runs with -e; disable it so a non-zero\n          # plan exit code does not abort the step before it is recorded\n          set +e\n          terraform plan -detailed-exitcode -no-color > plan.txt 2>&1\n          echo \"exitcode=$?\" >> \"$GITHUB_OUTPUT\"\n        continue-on-error: true\n\n      - name: Alert on Drift\n        if: steps.plan.outputs.exitcode == '2'\n        run: |\n          # Exit code 2 = changes detected (drift!)\n          curl -X POST \"${{ secrets.SLACK_WEBHOOK }}\" \\\n            -H 'Content-type: application/json' \\\n            -d \"{\\\"text\\\":\\\"🚨 Drift detected in *${{ matrix.layer }}* layer. Check the plan output.\\\"}\"\n```\n\nWe caught 3 unauthorized console changes in the first week alone.\n\n## Parting Wisdom for the Senior Engineer Who Just Inherited a Mess\n\n- **Don't refactor everything at once.** You'll break things and lose credibility.\n- **Document what you find before you fix it.** Screenshot the horrors. You'll need them for the post-mortem and for your performance review.\n- **Get buy-in from leadership BEFORE you start.** \"I need 3 sprints for tech debt\" is a hard sell. \"Our current setup means any infrastructure change has a 40% chance of causing an incident\" gets budget approved.\n- **Every `terraform state mv` should be a separate, reviewed PR.** Not because it's technically necessary, but because when something breaks at step 37 of 50, you want a clean git history to bisect.\n- **The goal isn't perfect Terraform. 
The goal is Terraform that your team can safely operate at 2 AM.** If a junior engineer can't run `terraform plan` without fear, your refactor isn't done.\n\n## TL;DR for the Scrollers\n\n| Anti-Pattern | Fix | Priority |\n|---|---|---|\n| Monolith state file | Split by blast radius and change frequency | P0 |\n| Copy-paste environments | Modules + environment configs | P1 |\n| `-auto-approve` in CI | Plan artifacts + manual approval gates | P0 |\n| Secrets in state/code | Secrets Manager + encrypted state + `ignore_changes` | P0 |\n| God resources with inline JSON | `templatefile` + SSM parameters | P2 |\n| No drift detection | Scheduled `plan` with alerting | P1 |\n\n*If you've ever stared at a Terraform codebase and whispered \"who did this?!\" into the void, you're not alone. We've all been there. The good news? It's fixable. One state move at a time.*\n\n**Found this useful? Follow me for more battle-tested DevOps content. I write about the stuff that actually happens in production, not the happy path from the docs.**", "url": "https://wpnews.pro/news/i-inherited-47000-lines-of-terraform-spaghetti-here-s-how-i-untangled-it-without", "canonical_source": "https://dev.to/sanjaysundarmurthy/i-inherited-47000-lines-of-terraform-spaghetti-heres-how-i-untangled-it-without-burning-1oh5", "published_at": "2026-05-22 08:02:39+00:00", "updated_at": "2026-05-22 08:24:34.972746+00:00", "lang": "en", "topics": ["cloud-computing", "developer-tools", "enterprise-software", "data"], "entities": ["Terraform", "AWS", "CloudFront", "RDS", "IAM", "VPC", "Raj", "VP of Engineering"], "alternates": {"html": "https://wpnews.pro/news/i-inherited-47000-lines-of-terraform-spaghetti-here-s-how-i-untangled-it-without", "markdown": "https://wpnews.pro/news/i-inherited-47000-lines-of-terraform-spaghetti-here-s-how-i-untangled-it-without.md", "text": "https://wpnews.pro/news/i-inherited-47000-lines-of-terraform-spaghetti-here-s-how-i-untangled-it-without.txt", "jsonld": 
"https://wpnews.pro/news/i-inherited-47000-lines-of-terraform-spaghetti-here-s-how-i-untangled-it-without.jsonld"}}