2020 · Architecture

Eight Architectures in 527 Commits

Architectural security reviews of mature systems often reveal that the current design is the result of accretion rather than intention. During an engagement reviewing a code intelligence backend, we traced the system's evolution through eight major architectural iterations over the course of 527 commits. Each transition was driven by an operational problem, and each solution introduced its own security tradeoffs. The history of this system provides a useful case study in how architectural evolution creates and reshapes attack surface over time.

The initial implementation was an Express/TypeScript service that processed code intelligence data synchronously. The first major transition replaced an external graph database (Dgraph) with SQLite bundles after benchmarks showed Dgraph was 25 times slower for the access patterns required. This is a common and defensible architectural choice, but it moved data from a networked service with its own access control layer into local files that the application process could read and write directly. The attack surface shifted from network-based (authentication, TLS, query injection) to filesystem-based (path traversal, symlink attacks, file permission errors). The next iteration introduced asynchronous processing to handle growing data volumes, followed by a Redis-backed job queue using the node-resque library.
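To make the filesystem-side risks concrete, the following is a minimal sketch of a containment check for bundle paths. The function name, directory layout, and file naming are illustrative assumptions, not details of the reviewed system.

```python
import os


def resolve_bundle_path(base_dir: str, bundle_name: str) -> str:
    """Return the real path of a bundle, refusing anything outside base_dir.

    Hypothetical helper: the name and the one-file-per-bundle layout are
    assumptions made for illustration.
    """
    base = os.path.realpath(base_dir)
    # realpath collapses ".." segments and follows symlinks, so the check
    # below sees the location that would actually be opened.
    candidate = os.path.realpath(os.path.join(base, bundle_name))
    if candidate == base or os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"bundle path escapes {base!r}: {bundle_name!r}")
    return candidate
```

A check of this kind addresses traversal and pre-existing symlinks, but it remains a check-then-use pattern: if an attacker can modify the directory between the check and the open, the result can still be subverted, so restrictive permissions on the bundle directory are needed as well.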

The Queue Evolution

The queue subsystem's evolution is particularly instructive from a security perspective. The initial Redis queue (node-resque) suffered from stuck workers: a job that had been dequeued when its worker process crashed was never completed and remained in a "working" state indefinitely, requiring manual intervention. The replacement, Bull, solved the stuck worker problem with automatic job timeouts and retry logic. However, both node-resque and Bull treated Redis as their persistent state store, while Redis itself was configured as an ephemeral cache with eviction policies enabled. Under memory pressure, Redis would evict queue entries, silently dropping jobs. This is a data integrity failure: queued work disappeared without error, without notification, and without any audit trail.
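A mismatch of this kind can be checked mechanically. The sketch below assumes the reviewer has already obtained the instance's settings as a dictionary, for instance from the output of Redis's CONFIG GET command. The function name is hypothetical.

```python
from typing import Dict, Optional


def eviction_warning(config: Dict[str, str]) -> Optional[str]:
    """Return a warning if a Redis instance used as a queue may evict keys.

    `config` is assumed to map Redis setting names to their values.
    """
    policy = config.get("maxmemory-policy")
    if policy is None:
        return "maxmemory-policy not reported; cannot confirm queue durability"
    # Conservative rule: only "noeviction" guarantees that Redis rejects
    # writes under memory pressure instead of discarding existing keys.
    if policy != "noeviction":
        return (
            f"maxmemory-policy is {policy!r}: "
            "queue keys may be evicted under memory pressure"
        )
    return None
```

The rule is deliberately strict. The volatile-* policies only evict keys that carry an expiry, so a queue whose keys never expire might be safe under them, but confirming that requires inspecting the keys rather than the configuration.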

The resolution was to replace Redis as the queue backend with PostgreSQL. This consolidated all persistent state — data, metadata, and job queue — into a single transactional store with durability guarantees. From a security assessment perspective, this consolidation was a significant improvement. A single data store means a single set of access controls, a single audit log, a single backup and recovery procedure, and a single encryption-at-rest configuration. When persistent state is distributed across Redis, PostgreSQL, and the filesystem, each store requires its own security configuration, and the gaps between them become the most likely locations for vulnerabilities.
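The durability property can be illustrated with a small transactional queue. The sketch below uses Python's built-in sqlite3 module as a stand-in so that it runs without a database server. The table and column names are assumptions, and it is not the reviewed system's implementation. In PostgreSQL, the claim step is commonly written with SELECT ... FOR UPDATE SKIP LOCKED so that concurrent workers do not block one another.

```python
import sqlite3
from typing import Optional, Tuple

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    state   TEXT NOT NULL DEFAULT 'queued'
)
"""


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    # isolation_level=None selects autocommit mode, so the explicit
    # BEGIN/COMMIT statements below control the transaction boundaries.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn: sqlite3.Connection, payload: str) -> int:
    cur = conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    return cur.lastrowid


def claim_next(conn: sqlite3.Connection) -> Optional[Tuple[int, str]]:
    """Atomically mark the oldest queued job as processing and return it."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, payload FROM jobs "
            "WHERE state = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE jobs SET state = 'processing' WHERE id = ?", (row[0],)
            )
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Because a job is a row rather than a cache entry, it can only leave the queue through an explicit state change, and that state change is recorded in the same store and covered by the same backups as the rest of the data.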

The Language Transition

The final major transitions, a full rewrite from TypeScript to Go followed by the removal of a separate API layer in favour of direct database access, further reduced the system's attack surface. The Go rewrite eliminated an entire category of runtime type errors that TypeScript's compilation could not catch (as documented in our circular import vulnerability findings). The API layer removal eliminated a network boundary that had introduced authentication, serialisation, and versioning complexity without providing a corresponding security benefit, because both sides of the API ran in the same trust domain with the same credentials.

The broader lesson from this engagement is that each architectural transition in a system's history leaves residue: configuration files, database tables, network policies, and code paths that were correct for a previous architecture but are vestigial or actively harmful in the current one. A Redis instance that was once the queue backend may still be running, still network-accessible, still containing stale data. A filesystem path that once held SQLite bundles may still have permissive access controls. Architectural security reviews must examine not only the current design but the history of transitions, because the attack surface of a system is the union of all architectures it has ever been, minus whatever cleanup was performed at each transition. In our experience, the cleanup is rarely complete.
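The "union minus cleanup" framing can be written down directly as set arithmetic. In the sketch below, the component labels and the decommissioning record are hypothetical examples rather than findings from the engagement.

```python
from typing import Iterable, Set


def residual_attack_surface(
    architectures: Iterable[Set[str]], decommissioned: Set[str]
) -> Set[str]:
    """Union of every component ever deployed, minus what was cleaned up."""
    exposed: Set[str] = set()
    for components in architectures:
        exposed |= components
    return exposed - decommissioned


# Hypothetical history: one set of exposed components per architecture.
history = [
    {"dgraph (network)"},
    {"sqlite bundles (filesystem)"},
    {"sqlite bundles (filesystem)", "redis queue (network)"},
    {"postgres (network)"},
]
decommissioned = {"dgraph (network)"}
current_design = {"postgres (network)"}

surface = residual_attack_surface(history, decommissioned)
# Residue: components still exposed that the current design does not need.
residue = surface - current_design
```

In this illustration the residue consists of the SQLite bundle directory and the Redis queue, which are the components a review focused only on the current design would not think to look for.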

We recommend that organisations maintain an architectural decision log that records not only what was adopted but also what was deprecated and what decommissioning steps were taken. When a review turns up Redis instances with no active consumers, database tables with no referencing code, or firewall rules for services that no longer exist, remediating them after the fact costs, in our experience, more than decommissioning would have cost at the time of the transition. Architectural hygiene is a security control.
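One possible shape for such a log entry is sketched below. The field names and the example steps are assumptions chosen for illustration. The point is only that decommissioning steps are recorded alongside the adoption decision and can be queried for outstanding work.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Transition:
    """One entry in a hypothetical architectural decision log."""

    adopted: str
    deprecated: str
    # Maps each decommissioning step to whether it has been completed.
    decommission_steps: Dict[str, bool] = field(default_factory=dict)

    def outstanding(self) -> List[str]:
        return [step for step, done in self.decommission_steps.items() if not done]


def open_items(log: List[Transition]) -> Dict[str, List[str]]:
    """Return the unfinished decommissioning steps per deprecated component."""
    return {t.deprecated: t.outstanding() for t in log if t.outstanding()}


log = [
    Transition(
        adopted="PostgreSQL job queue",
        deprecated="Redis job queue",
        decommission_steps={
            "stop workers reading from Redis": True,
            "shut down Redis instance": False,
            "remove firewall rule for Redis port": False,
        },
    ),
]
pending = open_items(log)
```

A reviewer, or a periodic job, can then treat any non-empty result from `open_items` as a list of residue to investigate.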
