# Database Migration Strategies That Actually Work in Production

> Source: <https://dev.to/zny10289/database-migration-strategies-that-actually-work-in-production-4a8>
> Published: 2026-05-23 20:21:52+00:00

# Database Migration Strategies That Actually Work in Production

Database migrations are the thing that looks simple until you're凌晨3点 trying to recover from a migration that locked your production table for 45 minutes. After running migrations on systems with billions of rows, here's what I've learned about doing them safely.

## The Fundamental Problem

Most migration tutorials show you this:

```
class AddIndexesToOrders < ActiveRecord::Migration[7.0]
  def change
    add_index :orders, :user_id
  end
end
```

This is fine in development. In production with 50 million orders, this will lock your table and bring down your application.

## The Core Principle: Small Changes, Applied Incrementally

Every production-safe migration follows the same pattern: **make the change in small, non-breaking steps that can be rolled back independently.**

### Expanding-Contract Pattern

Instead of one big migration, use three:

**Migration 1: Add new column (nullable)**

```
-- Migration 1: Safe - never locks
ALTER TABLE orders ADD COLUMN user_email VARCHAR(255);

-- Update application to write BOTH old and new columns
-- Deploy this first
```

**Migration 2: Backfill data (in batches)**

``` bash
-- Safe batched backfill
DO $$
DECLARE
  batch_size INT := 10000;
  offset_val INT := 0;
  max_id INT;
  updated INT;
BEGIN
  SELECT MAX(id) INTO max_id FROM orders;

  LOOP
    UPDATE orders 
    SET user_email = (SELECT email FROM users WHERE users.id = orders.user_id)
    WHERE id IN (
      SELECT id FROM orders 
      WHERE user_email IS NULL 
      AND id <= max_id
      ORDER BY id 
      LIMIT batch_size
    );

    GET DIAGNOSTICS updated = ROW_COUNT;
    EXIT WHEN updated = 0;

    -- Prevent lock contention
    PERFORM pg_sleep(0.1);
  END LOOP;
END $$;
```

**Migration 3: Add NOT NULL constraint**

```
-- Now safe because all rows have values
ALTER TABLE orders ALTER COLUMN user_email SET NOT NULL;
```

## Handling Long-Running Migrations

### The Lock Timeout Strategy

```
-- Set a short lock timeout so migration fails fast instead of hanging
SET lock_timeout = '2s';

-- Migration that might need a lock
ALTER TABLE orders ADD COLUMN status VARCHAR(50);

-- If it can't get lock in 2s, it fails immediately
-- Instead of blocking for minutes
```

### Concurrent Index Building

Never use `CREATE INDEX`

in production. Always use `CREATE INDEX CONCURRENTLY`

.

```
-- BAD: Locks table, blocks reads/writes
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- GOOD: Runs without locking, takes longer but zero downtime
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
```

**Critical note:** `CREATE INDEX CONCURRENTLY`

cannot run inside a transaction. Your migration framework needs to handle this.

```
# Rails: Tell it to run outside a transaction
class AddIndexesToOrders < ActiveRecord::Migration[7.0]
  disable_ddl_transaction!

  def change
    add_index :orders, :user_id, algorithm: :concurrently
  end
end
```

## Schema Versioning: The Branching Model

For complex systems, treat database schema like code with proper branching:

```
main (production schema)
  └── staging-test (validate migrations)
        └── feature/user-email-migration (your change)
# Before starting a migration
git checkout main
git pull
git checkout -b migration/user-email-fix

# Run migrations locally against fresh production copy
# Once validated:
git checkout main
git merge migration/user-email-fix
# Deploy migration to production
```

## The Pre-Migration Checklist

```
Before ANY production migration:

□ Tested on production-size dataset (at minimum on staging with production data snapshot)
□ Lock duration estimated (use EXPLAIN ANALYZE)
□ Rollback plan documented
□ Canary/deploy step prepared (migrate 1% of traffic, observe, then full deploy)
□ Alert thresholds set (if migration causes >X% error rate, auto-rollback)
□ Migrations scheduled during low-traffic window
□ On-call engineer aware and standing by
□ Database backup verified (point-in-time recovery tested)
□ Lock timeout set appropriately
□ Query plan examined for full table scans
```

## Real-World Example: Renaming a Column Safely

Renaming a column is a four-migration process:

**Migration 1: Add new column (double-write starts)**

```
ALTER TABLE users ADD COLUMN display_name VARCHAR(100);
# Update application to WRITE to both columns
# User.where(name: 'John').update(display_name: 'John') 
# runs in background
```

**Migration 2: Backfill**

```
UPDATE users SET display_name = name WHERE display_name IS NULL;
-- In batches of 10,000 with 0.1s sleep
```

**Migration 3: Stop reading from old column**

```
# Deploy code that only reads from display_name
# Verify everything works
```

**Migration 4: Drop old column**

```
ALTER TABLE users DROP COLUMN name;
-- Must run outside transaction for PostgreSQL
```

## PostgreSQL-Specific Tools

### pg_repack: Remove Bloat Without Table Locks

```
# Install
CREATE EXTENSION pg_repack;

# Repack a bloated table without locking
pg_repack -d mydb -t orders --no-indexes

# Repack with specific index
pg_repack -d mydb -t orders -i idx_orders_user_id
```

### pg_activity: Monitor Migration Progress

```
# Watch active queries during migration
pg_activity -h localhost -U postgres

# Or query directly
SELECT pid, state, query, query_start, now() - query_start AS duration
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
```

### The Migration Locking Hierarchy

Understanding lock modes prevents surprises:

| Lock Mode | Blocks |
|---|---|
| Access Share | DROP TABLE, TRUNCATE |
| Row Share | DELETE, UPDATE, SELECT FOR UPDATE |
| Row Exclusive | INSERT, UPDATE, DELETE |
| Share Update Exclusive | ANALYZE, CREATE INDEX CONCURRENTLY |
| Share | CREATE INDEX (blocking) |
| Share Row Exclusive | ALTER TABLE |
| Exclusive | REFRESH MATERIALIZED VIEW CONCURRENTLY |
| Access Exclusive | DROP TABLE, TRUNCATE, most ALTER TABLE |

## The Golden Rule

**If your migration takes more than 100ms on a production table, it's wrong.** Go back and break it into smaller pieces.

The goal is always: zero downtime, zero data loss, instant rollback capability.

*What migration horror stories do you have? What's your go-to strategy for risky migrations?*
