Database Migration Strategies That Actually Work in Production

This article provides practical strategies for performing database migrations safely in production environments, emphasizing that standard development approaches can lock tables and cause downtime on large datasets. It recommends breaking migrations into small, reversible steps—such as adding nullable columns, backfilling data in batches, and adding constraints later—and using concurrent index creation to avoid table locks. The article also stresses the importance of testing on production-sized data, documenting rollback plans, setting lock timeouts, and treating schema changes like code with proper version control and deployment procedures.

Database Migration Strategies That Actually Work in Production Database migrations are the thing that looks simple until you're凌晨3点 trying to recover from a migration that locked your production table for 45 minutes. After running migrations on systems with billions of rows, here's what I've learned about doing them safely. The Fundamental Problem Most migration tutorials show you this: class AddIndexesToOrders < ActiveRecord::Migration 7.0 def change add index :orders, :user id end end This is fine in development. In production with 50 million orders, this will lock your table and bring down your application. The Core Principle: Small Changes, Applied Incrementally Every production-safe migration follows the same pattern: make the change in small, non-breaking steps that can be rolled back independently. Expanding-Contract Pattern Instead of one big migration, use three: Migration 1: Add new column nullable -- Migration 1: Safe - never locks ALTER TABLE orders ADD COLUMN user email VARCHAR 255 ; -- Update application to write BOTH old and new columns -- Deploy this first Migration 2: Backfill data in batches bash -- Safe batched backfill DO $$ DECLARE batch size INT := 10000; offset val INT := 0; max id INT; updated INT; BEGIN SELECT MAX id INTO max id FROM orders; LOOP UPDATE orders SET user email = SELECT email FROM users WHERE users.id = orders.user id WHERE id IN SELECT id FROM orders WHERE user email IS NULL AND id <= max id ORDER BY id LIMIT batch size ; GET DIAGNOSTICS updated = ROW COUNT; EXIT WHEN updated = 0; -- Prevent lock contention PERFORM pg sleep 0.1 ; END LOOP; END $$; Migration 3: Add NOT NULL constraint -- Now safe because all rows have values ALTER TABLE orders ALTER COLUMN user email SET NOT NULL; Handling Long-Running Migrations The Lock Timeout Strategy -- Set a short lock timeout so migration fails fast instead of hanging SET lock timeout = '2s'; -- Migration that might need a lock ALTER TABLE orders ADD COLUMN status VARCHAR 50 ; -- If it can't get lock in 2s, it fails immediately -- Instead of blocking for minutes Concurrent Index Building Never use CREATE INDEX in production. Always use CREATE INDEX CONCURRENTLY . -- BAD: Locks table, blocks reads/writes CREATE INDEX idx orders user id ON orders user id ; -- GOOD: Runs without locking, takes longer but zero downtime CREATE INDEX CONCURRENTLY idx orders user id ON orders user id ; Critical note: CREATE INDEX CONCURRENTLY cannot run inside a transaction. Your migration framework needs to handle this. Rails: Tell it to run outside a transaction class AddIndexesToOrders < ActiveRecord::Migration 7.0 disable ddl transaction def change add index :orders, :user id, algorithm: :concurrently end end Schema Versioning: The Branching Model For complex systems, treat database schema like code with proper branching: main production schema └── staging-test validate migrations └── feature/user-email-migration your change Before starting a migration git checkout main git pull git checkout -b migration/user-email-fix Run migrations locally against fresh production copy Once validated: git checkout main git merge migration/user-email-fix Deploy migration to production The Pre-Migration Checklist Before ANY production migration: □ Tested on production-size dataset at minimum on staging with production data snapshot □ Lock duration estimated use EXPLAIN ANALYZE □ Rollback plan documented □ Canary/deploy step prepared migrate 1% of traffic, observe, then full deploy □ Alert thresholds set if migration causes X% error rate, auto-rollback □ Migrations scheduled during low-traffic window □ On-call engineer aware and standing by □ Database backup verified point-in-time recovery tested □ Lock timeout set appropriately □ Query plan examined for full table scans Real-World Example: Renaming a Column Safely Renaming a column is a four-migration process: Migration 1: Add new column double-write starts ALTER TABLE users ADD COLUMN display name VARCHAR 100 ; Update application to WRITE to both columns User.where name: 'John' .update display name: 'John' runs in background Migration 2: Backfill UPDATE users SET display name = name WHERE display name IS NULL; -- In batches of 10,000 with 0.1s sleep Migration 3: Stop reading from old column Deploy code that only reads from display name Verify everything works Migration 4: Drop old column ALTER TABLE users DROP COLUMN name; -- Must run outside transaction for PostgreSQL PostgreSQL-Specific Tools pg repack: Remove Bloat Without Table Locks Install CREATE EXTENSION pg repack; Repack a bloated table without locking pg repack -d mydb -t orders --no-indexes Repack with specific index pg repack -d mydb -t orders -i idx orders user id pg activity: Monitor Migration Progress Watch active queries during migration pg activity -h localhost -U postgres Or query directly SELECT pid, state, query, query start, now - query start AS duration FROM pg stat activity WHERE state = 'idle' ORDER BY duration DESC; The Migration Locking Hierarchy Understanding lock modes prevents surprises: | Lock Mode | Blocks | |---|---| | Access Share | DROP TABLE, TRUNCATE | | Row Share | DELETE, UPDATE, SELECT FOR UPDATE | | Row Exclusive | INSERT, UPDATE, DELETE | | Share Update Exclusive | ANALYZE, CREATE INDEX CONCURRENTLY | | Share | CREATE INDEX blocking | | Share Row Exclusive | ALTER TABLE | | Exclusive | REFRESH MATERIALIZED VIEW CONCURRENTLY | | Access Exclusive | DROP TABLE, TRUNCATE, most ALTER TABLE | The Golden Rule If your migration takes more than 100ms on a production table, it's wrong. Go back and break it into smaller pieces. The goal is always: zero downtime, zero data loss, instant rollback capability. What migration horror stories do you have? What's your go-to strategy for risky migrations?