Migrating to PostgreSQL: A Cloud of Mistakes
We were engaged to perform a post-incident assessment after a database migration from a fleet of SQLite files to a single managed PostgreSQL instance caused a multi-day production degradation. The client operated a code intelligence platform that had accumulated approximately one terabyte of data across thousands of per-repository SQLite databases. The decision to consolidate into PostgreSQL was architecturally sound—SQLite's single-writer limitation was becoming a bottleneck, and the operational burden of managing thousands of individual database files was unsustainable. The migration itself, however, was executed under conditions that turned a routine data movement exercise into a case study in compounding failures.
The Migration Window
The migration was planned for a 12-hour maintenance window over a weekend. The team wrote a migration script that iterated through each SQLite file, extracted rows, and inserted them into the corresponding PostgreSQL tables using batch inserts. The script ran successfully within the window. What the team did not do was create indexes on the target tables before loading data—a deliberate choice, as creating indexes after bulk load is faster than maintaining them during insert. The problem was that the post-load index creation step was not included in the migration script. It was tracked as a follow-up task. When the migration completed and traffic was restored, the application began issuing queries against tables that had no indexes whatsoever.
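Deferring index creation during a bulk load is a sound optimisation only if the builds are part of the same script rather than a follow-up ticket. A minimal sketch of the intended pattern, using hypothetical table and index names rather than the client's actual schema:

```sql
-- Bulk load into an index-free table, then build indexes and refresh
-- planner statistics in the same script, before traffic is restored.
-- (Table, column, and index names here are illustrative assumptions.)
COPY symbols (repo_id, path, payload) FROM '/tmp/symbols.csv' WITH (FORMAT csv);

-- Post-load schema completion belongs in the migration script itself:
CREATE INDEX idx_symbols_repo_id ON symbols (repo_id);
CREATE UNIQUE INDEX idx_symbols_repo_path ON symbols (repo_id, path);

-- Gather statistics so the planner does not start cold after cutover.
ANALYZE symbols;
```

Treating the index builds and ANALYZE as part of the load, not as cleanup, is what makes "load fast, index after" safe.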
The most heavily queried table was 5.5 GB of structured data with an associated 261 GB TOAST table containing compressed binary payloads. Every query against this table required a sequential scan. The managed PostgreSQL instance, provisioned for normal indexed query patterns, saturated at 80% CPU within minutes of traffic restoration. Query timeouts cascaded through the application. The team recognised the missing indexes quickly, but the remediation itself introduced a second failure. Running a plain CREATE INDEX on a 5.5 GB table with active traffic would block all writes to the table for the duration of the build. The correct approach, CREATE INDEX CONCURRENTLY, was attempted, but the managed instance had a conservative temp_file_limit setting that was too low for the sort operations the build required at this table size. The concurrent build aborted partway through, and the failure went unnoticed, leaving behind an invalid index that was maintained on every write without accelerating any read.
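This failure mode is detectable and has a standard remedy: an aborted CREATE INDEX CONCURRENTLY leaves a catalogue entry marked invalid, which cannot be repaired in place and must be dropped and rebuilt. A sketch of the check and the retry, with a hypothetical index name:

```sql
-- List indexes left invalid by failed concurrent builds.
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- Raise the temp file ceiling before retrying; on managed instances
-- this may be restricted and require a provider support ticket.
ALTER SYSTEM SET temp_file_limit = '64GB';
SELECT pg_reload_conf();

-- Drop the invalid index and rebuild it without blocking writes.
-- (Hypothetical index name for illustration.)
DROP INDEX CONCURRENTLY idx_symbols_repo_id;
CREATE INDEX CONCURRENTLY idx_symbols_repo_id ON symbols (repo_id);
```

Polling pg_index.indisvalid after every concurrent build, rather than trusting the absence of an obvious error, would have surfaced the problem immediately.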
The Branch Workflow Leak
A third contributing factor emerged during our review of the team's development workflow. The migration had been developed on an experimental branch that used a simplified schema—tables without indexes, without foreign key constraints, without check constraints—to accelerate iteration on the migration logic. When the branch was merged into the main branch for the production migration, the simplified schema definition was included in the merge. A code review had approved the migration script itself but had not examined the schema file, which was treated as a supporting artifact rather than a production-critical component. The schema that shipped to production was the development schema, not the production schema. This is a pattern we see repeatedly in infrastructure code: schema definitions, Terraform configurations, and Kubernetes manifests are reviewed with less rigour than application code, despite having equal or greater impact on production behaviour.
Lessons
The remediation required increasing temp_file_limit on the managed instance (which required a support ticket to the cloud provider), dropping the invalid indexes, rebuilding them concurrently, and verifying each index with pg_index.indisvalid. Total time from migration completion to stable production: four days. Our assessment report recommended several structural changes: migration scripts must include all schema components (indexes, constraints, grants) as atomic units rather than follow-up tasks; managed database instances must be load-tested with the actual query patterns and data volumes before migration, not just provisioned based on storage estimates; and schema files must be treated as first-class production artifacts in code review, with automated diff checks that flag missing indexes relative to a known-good baseline.
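The last recommendation, automated diff checks against a known-good baseline, can be implemented directly against the system catalogue. A hedged sketch, assuming a version-controlled baseline table whose name is an illustration rather than anything the client used:

```sql
-- Baseline of indexes that must exist in production, maintained in
-- version control and loaded before verification. (Hypothetical name.)
CREATE TABLE index_baseline (
    schemaname text,
    tablename  text,
    indexname  text,
    PRIMARY KEY (schemaname, tablename, indexname)
);

-- Flag expected indexes that are missing from the live catalogue;
-- any rows returned should fail the migration's verification step.
SELECT b.schemaname, b.tablename, b.indexname
FROM index_baseline b
LEFT JOIN pg_indexes i USING (schemaname, tablename, indexname)
WHERE i.indexname IS NULL;
```

Run as a post-migration gate, this query would have caught the development schema before traffic was restored.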
The broader security lesson is that database migrations are not just operational events—they are attack surface transitions. The period between completing a migration and achieving full operational readiness (indexes built, statistics gathered, connection pools tuned) is a window of elevated vulnerability. An adversary who can predict or observe a migration window knows that the system is operating without its normal defences. Query performance degradation creates retry storms that amplify any concurrent attack. A missing unique index means the database is not enforcing uniqueness at all, potentially allowing duplicate records that bypass application-level validation. Every migration plan should include an explicit security assessment of the transitional state, not just the final state.