The Hidden Risks in Database Migration Pipelines

Database migration systems are among the most security-sensitive components in any application stack, yet they routinely receive less scrutiny than application code. During a recent audit of a multi-version upgrade system—one designed to support jumping across several major versions in a single deployment—we uncovered a class of failures that are both common and dangerously silent. The migration framework used advisory locks to ensure only one instance could run migrations at a time, and individual migrations used CREATE INDEX CONCURRENTLY to avoid blocking production reads. These two mechanisms, each sensible in isolation, interacted catastrophically under specific upgrade paths.

The root cause was a deadlock between the advisory lock held by the migration runner and the concurrent index build. PostgreSQL's CREATE INDEX CONCURRENTLY performs two full table scans and must wait for every transaction holding a snapshot older than its second scan to complete. It also cannot run inside a transaction block, so the build proceeded on its own connection while the runner's session-level advisory lock—and the long-lived transaction wrapping the rest of the batch—stayed open until every migration in the batch finished. On tables with moderate write traffic, the concurrent index build waited indefinitely on the runner's open transaction, while the runner waited for the index build to complete. The result was not a reported error but a hung migration process. When the operator eventually killed the session, PostgreSQL left behind an invalid, half-built index—one that occupied disk space and was maintained on every write, but was never considered by the query planner. These phantom indexes accumulated silently across upgrade cycles.
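The hazardous pattern can be sketched as follows. This is a minimal illustration, not the audited system's code: `execute` is a stub standing in for a live PostgreSQL connection, and the lock key, table, and index names are hypothetical.

```python
# Sketch of the migration-runner pattern that hung: a session-level
# advisory lock is acquired up front and released only after the whole
# batch -- including CREATE INDEX CONCURRENTLY -- has finished.
# `execute` is a stub standing in for a real database connection.

MIGRATION_LOCK_KEY = 723_551  # illustrative application-defined lock key

executed = []  # records the SQL a real connection would receive

def execute(sql: str) -> None:
    executed.append(sql.strip())

def run_batch(migrations: list[str]) -> None:
    execute(f"SELECT pg_advisory_lock({MIGRATION_LOCK_KEY})")
    try:
        for stmt in migrations:
            # A CONCURRENTLY build here must wait out every transaction with
            # an older snapshot -- including the runner's own open work --
            # while the runner refuses to release its hold until the batch
            # ends. Each side waits on the other: a silent hang, not an error.
            execute(stmt)
    finally:
        execute(f"SELECT pg_advisory_unlock({MIGRATION_LOCK_KEY})")

run_batch([
    "ALTER TABLE orders ADD COLUMN shipped_at timestamptz",
    "CREATE INDEX CONCURRENTLY idx_orders_shipped ON orders (shipped_at)",
])
```

The structural flaw is that the lock's lifetime spans the entire batch rather than each individual statement, which is exactly what the remediation below unwinds.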

The Squashing Problem

A second finding from the same engagement concerned migration squashing. To keep the migration directory manageable, the team periodically squashed old migrations into a single baseline file that represented the current schema. This is a common and generally reasonable practice, but it erased a critical piece of information: the original upgrade path. When a new deployment needed to jump from version N to version N+5, the squashed baseline assumed a clean install. Instances running intermediate versions had schema states that the squashed migration did not account for—columns that had been renamed rather than dropped, check constraints that had been modified in-place, and partial indexes from the deadlock issue described above. The squashed migration would fail or, worse, succeed with a schema that diverged subtly from what the application expected.
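A toy model makes the divergence concrete. Here schemas are plain dicts of table name to column set, and the migration history is hypothetical: one incremental migration renames a column, while the squashed baseline only knows how to create the final shape.

```python
# Toy model of the squashing hazard. Hypothetical history: v1 creates
# users(id, email); v2 renames email -> email_address. The squashed
# baseline assumes a clean install and only creates the final schema.

def apply_incremental(schema: dict, from_version: int) -> dict:
    if from_version < 2:
        cols = schema["users"]
        cols.discard("email")        # the rename is preserved as a rename
        cols.add("email_address")
    return schema

def apply_squashed(schema: dict) -> dict:
    # Create-if-absent, never rename: correct for a fresh install only.
    schema.setdefault("users", set()).add("email_address")
    return schema

# The same v1 instance, upgraded both ways:
incremental = apply_incremental({"users": {"id", "email"}}, from_version=1)
squashed = apply_squashed({"users": {"id", "email"}})

print(incremental["users"])  # {'id', 'email_address'} -- correct
print(squashed["users"])     # the stale 'email' column silently survives
```

The squashed path "succeeds" while leaving the instance with a schema the application never expects—the worse of the two failure modes described above.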

Remediation

Our remediation guidance addressed both the immediate technical issues and the architectural pattern. First, we recommended decoupling the migration runner from application startup entirely. Migrations should execute as a distinct operational step—a Kubernetes Job, a CI pipeline stage, or a manual operator action—not as an implicit side effect of deploying new application code. This eliminates the time pressure that leads operators to kill stuck migrations and prevents the advisory lock from being held across the entire migration batch. Second, for concurrent index creation specifically, we recommended replacing the blocking wait with a polling loop against pg_stat_progress_create_index, with a configurable timeout and automatic cleanup of invalid indexes on failure. A post-migration health check should query pg_index for any entries where indisvalid = false and either rebuild them or alert the operator.
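The polling-and-cleanup step can be sketched like this. It is a simplified version of the recommendation, assuming the CREATE runs on its own connection while a second session polls; `query` is an injected callable (SQL in, rows out) so the control flow is testable without a database, and the index name is illustrative.

```python
import time

# Sketch: instead of blocking inside the runner, poll
# pg_stat_progress_create_index until the build finishes or a timeout
# expires, then sweep pg_index for invalid leftovers.

def build_index_with_timeout(query, create_sql, timeout_s=600, poll_s=5):
    # In practice the CREATE must run on its own connection, outside any
    # transaction block, while this loop polls from a second session.
    query(create_sql)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        rows = query("SELECT phase FROM pg_stat_progress_create_index")
        if not rows:  # no build in progress: it finished or failed
            break
        time.sleep(poll_s)
    # Post-build health check: anything left invalid is cleaned up so it
    # cannot linger as a phantom index; the caller may then retry or alert.
    invalid = query(
        "SELECT indexrelid::regclass AS name FROM pg_index WHERE NOT indisvalid"
    )
    for (name,) in invalid:
        query(f"DROP INDEX CONCURRENTLY {name}")
    return [name for (name,) in invalid]
```

Returning the list of dropped indexes gives the operator an explicit signal rather than the silent accumulation observed in the audit.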

Third, we recommended maintaining explicit version-to-version migration paths alongside the squashed baseline, even if this means a larger migration directory. The squashed file serves new installations; the incremental files serve upgrades. This dual-path approach adds modest maintenance overhead but eliminates an entire class of schema drift bugs that are extremely difficult to diagnose in production. The broader principle is one we see violated repeatedly: database migration infrastructure must be treated as critical security infrastructure, subject to the same review standards, failure testing, and operational monitoring as the database itself.
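The dual-path resolution amounts to a small planning function: fresh installs take the squashed baseline, while any instance with a recorded version must walk the explicit incremental chain. A minimal sketch, with hypothetical version numbers and file names:

```python
# Sketch of dual-path migration planning: the squashed baseline serves
# new installations; incremental files serve upgrades. All file names
# and version numbers here are illustrative.

SQUASHED_BASELINE = "0000_baseline.sql"
INCREMENTAL = {  # from_version -> (to_version, migration file)
    3: (4, "0004_rename_email.sql"),
    4: (5, "0005_add_shipping.sql"),
    5: (6, "0006_check_constraints.sql"),
}

def plan_upgrade(installed, target):
    if installed is None:  # fresh install: the baseline is safe
        return [SQUASHED_BASELINE]
    path, version = [], installed
    while version < target:
        if version not in INCREMENTAL:
            # Refuse to guess: a missing link means the upgrade path was
            # erased (e.g. by over-aggressive squashing).
            raise RuntimeError(f"no migration path from version {version}")
        version, step = INCREMENTAL[version]
        path.append(step)
    return path

print(plan_upgrade(None, 6))  # ['0000_baseline.sql']
print(plan_upgrade(3, 6))     # the incremental chain, never the baseline
```

Failing loudly when a link is missing is the point: a gap in the chain becomes a deploy-time error rather than silent schema drift.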

Finally, every migration pipeline should include an automated integrity check that compares the actual schema of a running database against the expected schema derived from the migration history. Schema drift—whether from failed migrations, manual interventions, or squashing artifacts—is a latent vulnerability. It creates conditions where the application's assumptions about data types, constraints, and indexes diverge from reality, leading to data corruption, access control bypasses, or denial of service when a query plan changes unexpectedly due to a missing or invalid index.
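The core of such an integrity check is a structural diff. In production the "actual" side would be read from PostgreSQL's catalogs (e.g. information_schema); this sketch models both sides as dicts of table name to column set, with hypothetical names:

```python
# Sketch of a schema drift check: compare the schema a live database
# actually has against the schema the migration history says it should.

def schema_drift(expected: dict, actual: dict) -> list:
    drift = []
    for table, cols in expected.items():
        have = actual.get(table)
        if have is None:
            drift.append(f"missing table: {table}")
            continue
        for col in sorted(cols - have):
            drift.append(f"missing column: {table}.{col}")
        for col in sorted(have - cols):
            drift.append(f"unexpected column: {table}.{col}")
    for table in sorted(set(actual) - set(expected)):
        drift.append(f"unexpected table: {table}")
    return drift

expected = {"users": {"id", "email_address"}}
actual = {"users": {"id", "email", "email_address"}}  # squashing artifact
print(schema_drift(expected, actual))  # ['unexpected column: users.email']
```

Run as a post-migration gate, a non-empty result should block the deployment, turning latent drift into an immediate, actionable failure.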