# I Inherited 47,000 Lines of Terraform Spaghetti — Here's How I Untangled It Without Burning Production

> Source: <https://dev.to/sanjaysundarmurthy/i-inherited-47000-lines-of-terraform-spaghetti-heres-how-i-untangled-it-without-burning-1oh5>
> Published: 2026-05-22 08:02:39+00:00

## The Slack Message That Ruined My Monday

"Hey, the previous platform team left. Here's the repo. Good luck 🫡"

I stared at the Git repository. **47,000 lines of Terraform.** One state file. Zero modules. Variables named `x`, `temp2`, and my personal favorite: `DO_NOT_TOUCH_ask_raj`. Raj had left the company two years ago.

If you've been a Senior DevOps Engineer for more than a year, you've inherited *something* like this. Maybe not 47K lines, but you've opened a `main.tf` that made you question your career choices.

This isn't a "Terraform best practices" article. Those are written by people who've never had to run `terraform plan` on a 3,000-resource state file at 2 AM while the VP of Engineering watches.

**This is a survival guide.**

## Anti-Pattern #1: The Monolith State File (aka "The Single Point of Career Failure")

### What I Found

```
# main.tf — 8,400 lines
# "Managed" networking, compute, databases, DNS, IAM, monitoring,
# and somehow... a CloudFront distribution for a marketing site
# that was decommissioned in 2023.

resource "aws_vpc" "main" { ... }
resource "aws_instance" "api_server_1" { ... }
resource "aws_instance" "api_server_2" { ... }
# ... 200 more instances ...
resource "aws_db_instance" "prod_db" { ... }
resource "aws_iam_role" "god_mode" { ... }  # yes, really
```

A single `terraform apply` touched **everything**. Networking, databases, compute, DNS: all entangled like Christmas lights in January. One typo in a security group rule? Congratulations, your `plan` just showed 847 resources to evaluate, and Terraform decided your RDS instance needs replacing.

### The Real Danger

This isn't just messy — it's **operationally catastrophic**. Here's what happens:

- `terraform plan` takes **14 minutes**. Developers stop running it.
- State file locking means only one person can work at a time.
- Blast radius of any mistake = the entire infrastructure.
- New team members are terrified to touch anything (rightfully so).

### How I Fixed It (Without Downtime)

**Step 1: State Surgery with terraform state mv**

```
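# First, I mapped resource dependencies visually
terraform graph | dot -Tsvg > infra-dependency-map.svg

# Keep a backup of the monolith state before any surgery
terraform state pull > monolith-backup.tfstate

# Then, split by domain boundaries.
# Usage is `terraform state mv [options] SOURCE DESTINATION`, with options first.
# Note: -state-out is a legacy flag that only applies to local state files.
terraform state mv -state-out=networking/terraform.tfstate 'aws_vpc.main' 'aws_vpc.main'
terraform state mv -state-out=networking/terraform.tfstate 'aws_subnet.public[0]' 'aws_subnet.public[0]'
terraform state mv -state-out=networking/terraform.tfstate 'aws_subnet.public[1]' 'aws_subnet.public[1]'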
```
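On newer Terraform versions, the same split can also be written declaratively instead of as CLI state surgery. The following is a minimal sketch of that alternative, not what was run in the migration described here. It assumes Terraform 1.7 or later (for `removed` blocks), and the VPC ID and CIDR are placeholders.

```hcl
# --- In the monolith configuration ---
# Delete the original `resource "aws_vpc" "main"` block and add this instead,
# so Terraform forgets the VPC without destroying it.
removed {
  from = aws_vpc.main

  lifecycle {
    destroy = false
  }
}

# --- In the new networking configuration ---
# Adopt the existing VPC. The ID is a placeholder, not a value from the repo.
import {
  to = aws_vpc.main
  id = "vpc-0123456789abcdef0"
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16" # assumed; must match the real VPC for a clean plan
}
```

Both halves show up in `terraform plan` before anything is applied, so the move can be reviewed in a PR like any other change.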

**Step 2: Introduce State Boundaries by Blast Radius**

I split into five state files based on *change frequency* and *blast radius*:

| Layer | Contents | Change Frequency | Blast Radius |
|---|---|---|---|
| `foundation` | VPC, Subnets, Route Tables | Monthly | Critical |
| `security` | IAM, KMS, Security Groups | Weekly | Critical |
| `data` | RDS, ElastiCache, S3 | Rare | Catastrophic |
| `compute` | ECS/EKS, ASGs, ALBs | Daily | High |
| `edge` | CloudFront, Route53, WAF | Weekly | Medium |
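
For the layer marked "Catastrophic", one extra guard worth considering is `prevent_destroy`. This is a hedged sketch rather than part of the original setup. The file path is hypothetical, and the arguments are copied from the database example later in this article.

```hcl
# data/main.tf (hypothetical path inside the data layer)
resource "aws_db_instance" "prod" {
  engine         = "postgres"
  instance_class = "db.r5.2xlarge"
  # ... remaining arguments unchanged ...

  lifecycle {
    # Any plan that would destroy or replace this instance fails with an error
    # instead of quietly scheduling the replacement.
    prevent_destroy = true
  }
}
```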

**Step 3: Wire Them Together with Remote State Data Sources**

```
# In compute/main.tf
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "company-terraform-state"
    key    = "foundation/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_ecs_service" "api" {
  # Reference networking outputs safely
  network_configuration {
    subnets = data.terraform_remote_state.networking.outputs.private_subnet_ids
  }
}
```
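The data source above can only read values that the producing layer explicitly exports as outputs. A minimal sketch of the other side of that contract follows. The file name and the `aws_subnet.private` resource are assumptions, not taken from the original repo.

```hcl
# foundation/outputs.tf (hypothetical file name)
output "vpc_id" {
  description = "ID of the shared VPC"
  value       = aws_vpc.main.id
}

output "private_subnet_ids" {
  description = "Private subnet IDs consumed by the compute layer"
  value       = aws_subnet.private[*].id # assumes the subnets are created with count
}
```

Treat these outputs as a public interface: renaming or removing one breaks every layer that reads it.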

**Result:** `terraform plan` went from 14 minutes to 45 seconds. Team velocity tripled. I stopped getting 2 AM pages about state locks.

## Anti-Pattern #2: The Copy-Paste Empire (aka "Modules at Home")

### What I Found

```
environments/
├── dev/
│   └── main.tf      # 1,200 lines
├── staging/
│   └── main.tf      # 1,200 lines (95% identical to dev)
├── prod/
│   └── main.tf      # 1,200 lines (90% identical... with 47 "hotfixes")
└── dr/
    └── main.tf      # 1,200 lines (copied from prod 8 months ago, never updated)
```

Four copies of the same infrastructure with subtle drift. Staging had a security group rule that prod didn't. DR was missing three services entirely. Nobody knew which differences were intentional.

### Why This Kills Senior Engineers

You can't `diff` your way out of this. The files have diverged in ways that are both intentional (prod has larger instances) and accidental (someone fixed a bug in dev but forgot to propagate it). You have **no source of truth**.

### The Refactoring Strategy That Actually Works

**Don't try to unify everything at once.** I learned this the hard way after a failed "big bang" refactor that took 3 sprints and broke staging for a week.

**Instead, use the Strangler Fig pattern:**

```
# modules/api-platform/main.tf
variable "environment" {
  type = string
  validation {
    condition     = contains(["dev", "staging", "prod", "dr"], var.environment)
    error_message = "Environment must be dev, staging, prod, or dr."
  }
}

variable "config" {
  type = object({
    instance_type    = string
    min_capacity     = number
    max_capacity     = number
    enable_waf       = bool
    multi_az         = bool
    backup_retention = number
  })
}

locals {
  # Environment-specific defaults that document WHY they differ
  env_config = {
    dev = {
      instance_type    = "t3.medium"
      min_capacity     = 1
      max_capacity     = 2
      enable_waf       = false
      multi_az         = false
      backup_retention = 1
    }
    prod = {
      instance_type    = "m5.xlarge"
      min_capacity     = 3
      max_capacity     = 20
      enable_waf       = true
      multi_az         = true
      backup_retention = 35
    }
  }
}
```

**The key insight:** Every environment difference should be **documented in code as a conscious decision**, not hidden in a 1,200-line file as an accidental divergence.
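
With the Strangler Fig approach, one environment at a time switches to the module while the others keep their copy-pasted file. A minimal sketch of the first adopter, assuming the directory layout shown earlier and a relative module path (both hypothetical):

```hcl
# environments/dev/main.tf (hypothetical: dev migrates first, prod last)
module "api_platform" {
  source      = "../../modules/api-platform"
  environment = "dev"

  # Every field is spelled out so a reviewer can see exactly what dev runs
  config = {
    instance_type    = "t3.medium"
    min_capacity     = 1
    max_capacity     = 2
    enable_waf       = false
    multi_az         = false
    backup_retention = 1
  }
}
```

Once dev plans cleanly against the module, staging follows, and prod goes last, after the module has soaked in the lower environments.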

## Anti-Pattern #3: The `terraform apply -auto-approve` YOLO Pipeline

### What I Found in `.gitlab-ci.yml`

```
deploy_prod:
  stage: deploy
  script:
    - terraform init
    - terraform apply -auto-approve  # 🚨 WHAT
  only:
    - main
```

No plan artifact. No approval gate. No diff review. Push to main → infrastructure changes in production. The commit history told the horror story:

```
fix: revert the revert of the fix
fix: actually fix prod this time
fix: ok THIS one fixes it
revert: revert everything from today
```

### What Senior Engineers Actually Need

```
# .github/workflows/terraform.yml
name: "Terraform"

on:
  pull_request:
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Terraform Plan
        id: plan
        run: |
          terraform init
          terraform plan -no-color -out=tfplan \
            -detailed-exitcode 2>&1 | tee plan_output.txt
        continue-on-error: true

      - name: Comment Plan on PR
        uses: actions/github-script@v7
        if: github.event_name == 'pull_request'
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('plan_output.txt', 'utf8');
            const truncated = plan.length > 60000 
              ? plan.substring(0, 60000) + '\n\n... truncated ...' 
              : plan;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Terraform Plan Output\n\`\`\`\n${truncated}\n\`\`\``
            });

      - name: Upload Plan Artifact
        uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: tfplan

  apply:
    needs: plan
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment: production  # Requires manual approval
    steps:
      - uses: actions/checkout@v4

      - name: Download Plan
        uses: actions/download-artifact@v4
        with:
          name: tfplan

      - name: Terraform Apply
        run: |
          # A fresh runner has no providers or backend configured yet
          terraform init
          terraform apply tfplan  # Apply ONLY the reviewed plan
```

**The non-negotiable rules:**

- Plans are generated on PR and attached as artifacts.
- Humans review the diff before any production apply.
- Apply uses the *exact* plan that was reviewed (not a new plan).
- The `production` environment requires manual approval from a senior engineer.

## Anti-Pattern #4: Secrets in State (The Ticking Compliance Bomb)

### What I Found

```
resource "aws_db_instance" "prod" {
  engine               = "postgres"
  instance_class       = "db.r5.2xlarge"
  username             = "admin"
  password             = "Pr0d_P@ssw0rd_2022!"  # I wish I was joking
  publicly_accessible  = true                    # I really wish I was joking
}
```

The password was in the `.tf` file, the state file, the plan output, *and* the Git history. Four places to leak from. And `publicly_accessible = true` was the cherry on this dumpster fire sundae.

### The Fix (That Also Passes Audit)

```
# Use a data source to pull secrets at plan/apply time
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/rds/master-password"
}

resource "aws_db_instance" "prod" {
  engine              = "postgres"
  instance_class      = "db.r5.2xlarge"
  username            = "admin"
  password            = data.aws_secretsmanager_secret_version.db_password.secret_string
  publicly_accessible = false

  # Prevent Terraform from detecting password "drift"
  lifecycle {
    ignore_changes = [password]
  }
}
```
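Another option, if your AWS provider version supports it, is to let RDS own the master password. This is a sketch of that alternative, not the fix applied above. The `username` and `allocated_storage` values are assumptions, and you should check how the switch behaves on an existing instance before trying it in prod.

```hcl
resource "aws_db_instance" "prod" {
  engine              = "postgres"
  instance_class      = "db.r5.2xlarge"
  allocated_storage   = 100         # assumed value
  username            = "app_admin" # hypothetical
  publicly_accessible = false

  # RDS creates and manages the master password in Secrets Manager.
  # No `password` argument is set, so Terraform never handles the value.
  manage_master_user_password = true
}

# Applications look the secret up by ARN rather than reading it from Terraform
output "db_master_secret_arn" {
  value = aws_db_instance.prod.master_user_secret[0].secret_arn
}
```

The password itself would then stay out of the state file, though other sensitive values can still end up there.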

**But that's not enough.** The state file *still* contains sensitive values. The complete solution:

```
# backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/data/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                          # SSE-KMS encryption
    kms_key_id     = "arn:aws:kms:us-east-1:xxx:key/yyy"
    dynamodb_table = "terraform-state-lock"
  }
}
```

Plus strict S3 bucket policies, access logging, and **never** giving developers direct state file access. Use `terraform output` instead.
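
A rough sketch of what the bucket hardening could look like, assuming the state bucket is itself managed by Terraform in a separate, tightly controlled configuration. The resource names and the log bucket are hypothetical, and versioning is an extra not mentioned above.

```hcl
resource "aws_s3_bucket" "state" {
  bucket = "company-terraform-state"
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled" # lets you recover an earlier state after a bad write
  }
}

resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_logging" "state" {
  bucket        = aws_s3_bucket.state.id
  target_bucket = "company-access-logs" # hypothetical log bucket
  target_prefix = "terraform-state/"
}
```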

## Anti-Pattern #5: The "God Resource" With 200 Lines of Nested Blocks

### What I Found

```
resource "aws_ecs_task_definition" "api" {
  family                   = "api"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = 1024
  memory                   = 2048
  execution_role_arn       = aws_iam_role.ecs_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([
    {
      name  = "api"
      image = "company/api:latest"  # 🚨 LATEST TAG IN PROD
      portMappings = [{ containerPort = 8080 }]
      environment = [
        { name = "DB_HOST", value = "prod-db.cluster-xxx.us-east-1.rds.amazonaws.com" },
        { name = "DB_NAME", value = "production" },
        { name = "REDIS_URL", value = "prod-redis.xxx.cache.amazonaws.com:6379" },
        # ... 45 more environment variables hardcoded here ...
      ]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = "/ecs/api"
          "awslogs-region"        = "us-east-1"
          "awslogs-stream-prefix" = "api"
        }
      }
      # ... 80 more lines of health checks, mount points, ulimits ...
    }
  ])
}
```

**The problems compound:**

- Environment variables are hardcoded (not sourced from SSM/Secrets Manager).
- `latest` tag means deployments are non-reproducible.
- The `jsonencode` blob is untestable and un-diffable in PR reviews.
- One change to any env var triggers a full task definition replacement.

### The Refactored Version

```
# Use templatefile for complex JSON — it's testable and readable
resource "aws_ecs_task_definition" "api" {
  family                   = "api-${var.environment}"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = var.task_cpu
  memory                   = var.task_memory
  execution_role_arn       = aws_iam_role.ecs_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = templatefile("${path.module}/templates/api-container.json.tpl", {
    image_tag     = var.image_tag  # Pinned, passed from CI/CD
    environment   = var.environment
    db_host       = data.aws_ssm_parameter.db_host.value
    redis_url     = data.aws_ssm_parameter.redis_url.value
    log_group     = aws_cloudwatch_log_group.api.name
    aws_region    = data.aws_region.current.name
  })
}
```
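The refactored resource leans on a few declarations it doesn't show. Here is a minimal sketch of what they might look like. The SSM parameter paths are hypothetical, and the validation rule is one possible way to keep `latest` out for good.

```hcl
variable "image_tag" {
  type        = string
  description = "Immutable image tag supplied by CI/CD, e.g. a commit SHA"

  validation {
    condition     = var.image_tag != "" && var.image_tag != "latest"
    error_message = "image_tag must be a pinned tag; \"latest\" is not allowed."
  }
}

data "aws_region" "current" {}

# Parameter paths below are hypothetical
data "aws_ssm_parameter" "db_host" {
  name = "/${var.environment}/api/db_host"
}

data "aws_ssm_parameter" "redis_url" {
  name = "/${var.environment}/api/redis_url"
}
```

With the validation in place, a pipeline that forgets to pass a real tag fails at plan time rather than shipping whatever `latest` happens to point to.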

## The Refactoring Playbook (Do This Monday)

After untangling this mess across three months, here's the sequence that works:

### Week 1: Triage and Protect

```
# 1. Enable state file encryption and locking NOW
# 2. Add branch protection — no direct pushes to main
# 3. Run terraform plan and SAVE the output as your baseline
terraform plan -no-color > baseline_plan_$(date +%Y%m%d).txt

# 4. Enable detailed audit logging on your state bucket
```

### Week 2-4: Split the Monolith

```
# Use terraform state list to inventory everything
terraform state list > all_resources.txt
wc -l all_resources.txt  # Mine had 2,847 resources

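# Group by service domain (a rough first pass; review each list by hand)
grep -E 'aws_(vpc|subnet|route)' all_resources.txt | grep -v 'aws_route53' > networking.txt
grep -E 'aws_(iam|kms|security_group)' all_resources.txt > security.txt
grep -E 'aws_(db_|rds|elasticache|s3)' all_resources.txt > data.txt
grep -E 'aws_(ecs|lb|alb|autoscaling|instance)' all_resources.txt > compute.txt
grep -E 'aws_(cloudfront|route53|waf)' all_resources.txt > edge.txt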
```

### Week 5-8: Modularize Incrementally

Move **one service at a time** into a module. After each move:

- Run `terraform plan`: it should show **zero changes**.
- If plan shows changes, you have a bug. Fix it before moving on.
- Get a PR review from another senior engineer.
- Apply and monitor for 24 hours.
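
One way to get that zero-change plan when resources move into a module is a `moved` block, which tells Terraform that an address was renamed rather than deleted and recreated. A minimal sketch, assuming Terraform 1.1 or later and a hypothetical module named `api_platform`. It is an alternative to running `terraform state mv` by hand.

```hcl
# Root configuration, after the ECS resources were moved into a module
moved {
  from = aws_ecs_service.api
  to   = module.api_platform.aws_ecs_service.api
}

moved {
  from = aws_ecs_task_definition.api
  to   = module.api_platform.aws_ecs_task_definition.api
}
```

Because the blocks live in the code, the move is visible in the PR diff and is replayed consistently in every environment that uses the configuration.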

### Week 9-12: Harden the Pipeline

- Add `terraform validate` and `tflint` to CI.
- Add `checkov` or `tfsec` for security scanning.
- Implement drift detection (scheduled plan that alerts on differences).
- Add cost estimation with `infracost`.

## The Drift Detection Cron That Saved Us

This is the thing nobody talks about. Even after a perfect refactor, **drift happens**. Someone clicks in the console. An auto-remediation tool makes changes. A Lambda modifies a security group.

```
# .github/workflows/drift-detection.yml
name: "Drift Detection"

on:
  schedule:
    - cron: '0 6 * * 1-5'  # Every weekday at 06:00 UTC

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        layer: [foundation, security, data, compute, edge]
    steps:
      - uses: actions/checkout@v4

      - name: Terraform Plan (Drift Check)
        id: plan
        working-directory: infrastructure/${{ matrix.layer }}
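        run: |
          terraform init
          # The default shell is `bash -e`, which would abort on exit code 2
          # before the echo below runs, so disable errexit for the plan.
          set +e
          terraform plan -detailed-exitcode -no-color > plan.txt 2>&1
          echo "exitcode=$?" >> "$GITHUB_OUTPUT"
        continue-on-error: true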

      - name: Alert on Drift
        if: steps.plan.outputs.exitcode == '2'
        run: |
          # Exit code 2 = changes detected (drift!)
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H 'Content-type: application/json' \
            -d "{\"text\":\"🚨 Drift detected in *${{ matrix.layer }}* layer. Check the plan output.\"}"
```

We caught 3 unauthorized console changes in the first week alone.

## Parting Wisdom for the Senior Engineer Who Just Inherited a Mess

- **Don't refactor everything at once.** You'll break things and lose credibility.
- **Document what you find before you fix it.** Screenshot the horrors. You'll need them for the post-mortem and for your performance review.
- **Get buy-in from leadership BEFORE you start.** "I need 3 sprints for tech debt" is a hard sell. "Our current setup means any infrastructure change has a 40% chance of causing an incident" gets budget approved.
- **Every `terraform state mv` should be a separate, reviewed PR.** Not because it's technically necessary, but because when something breaks at step 37 of 50, you want a clean git history to bisect.
- **The goal isn't perfect Terraform. The goal is Terraform that your team can safely operate at 2 AM.** If a junior engineer can't run `terraform plan` without fear, your refactor isn't done.

## TL;DR for the Scrollers

| Anti-Pattern | Fix | Priority |
|---|---|---|
| Monolith state file | Split by blast radius and change frequency | P0 |
| Copy-paste environments | Modules + environment configs | P1 |
| `-auto-approve` in CI | Plan artifacts + manual approval gates | P0 |
| Secrets in state/code | Secrets Manager + encrypted state + `ignore_changes` | P0 |
| God resources with inline JSON | `templatefile` + SSM parameters | P2 |
| No drift detection | Scheduled `plan` with alerting | P1 |

*If you've ever stared at a Terraform codebase and whispered "who did this?!" into the void — you're not alone. We've all been there. The good news? It's fixable. One state move at a time.*

