Securing Commit Graph Traversal at Scale

Code intelligence systems that operate on version-controlled repositories must answer a deceptively simple question: given a user viewing a particular commit, which uploaded data is relevant? In a Git repository, this means determining which prior uploads are "visible" from the current commit—that is, reachable by traversing the commit graph backwards through parent links. The naive implementation, and the one we encountered during a security assessment of a code intelligence platform, performed this traversal in real time: for every user query, the system loaded the commit graph into memory and executed a breadth-first or depth-first search to identify all ancestor commits, then filtered uploaded data against that set. For repositories with a few thousand commits, this was imperceptible. For repositories with over 100,000 commits—common in any long-lived enterprise codebase—this traversal consumed multiple seconds of CPU time and hundreds of megabytes of memory per query.
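The naive approach can be sketched as a breadth-first walk over parent links. This is an illustrative reconstruction, not the platform's actual code; the types and names are assumptions.

```go
package main

import "fmt"

// Commit models a node in the commit DAG; Parents holds the hashes of the
// commit's parent commits. These types are illustrative only.
type Commit struct {
	Hash    string
	Parents []string
}

// visibleUploads performs the naive request-path traversal: a breadth-first
// walk over every ancestor of `from`, collecting uploads attached to each
// visited commit. Both CPU time and memory grow with the size of the
// ancestor set, which is what made this approach exploitable.
func visibleUploads(commits map[string]Commit, uploads map[string][]int, from string) []int {
	visited := map[string]bool{}
	queue := []string{from}
	var result []int
	for len(queue) > 0 {
		hash := queue[0]
		queue = queue[1:]
		if visited[hash] {
			continue
		}
		visited[hash] = true
		result = append(result, uploads[hash]...)
		queue = append(queue, commits[hash].Parents...)
	}
	return result
}

func main() {
	// A linear history c1 <- c2 <- c3 with one upload at c1.
	commits := map[string]Commit{
		"c1": {Hash: "c1"},
		"c2": {Hash: "c2", Parents: []string{"c1"}},
		"c3": {Hash: "c3", Parents: []string{"c2"}},
	}
	uploads := map[string][]int{"c1": {42}}
	fmt.Println(visibleUploads(commits, uploads, "c3")) // prints [42]
}
```

With 100,000 commits, this walk touches 100,000 map entries per query, which is exactly the per-request cost described above.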

The Algorithmic Complexity Attack

The security implication was immediate. An authenticated user with access to a large repository could issue repeated code intelligence queries that each triggered a full graph traversal. Because the traversal was performed in the request-handling goroutine with no timeout or resource limit, a modest number of concurrent requests against repositories with deep commit histories could exhaust the application server's memory and CPU. This is a classic algorithmic complexity vulnerability: the system's worst-case cost grew linearly with the size of the commit graph on every request, and an attacker could select inputs (repository and commit) that reliably triggered the worst case. No exploit code was necessary: the normal API with normal parameters was sufficient to cause denial of service.

The remediation we recommended and helped implement replaced real-time graph traversal with pre-computed visibility maps stored in PostgreSQL. A background worker, triggered by new uploads and repository synchronisation events, computed the set of uploads visible from each commit and stored these associations in an indexed table. The computation itself still required graph traversal, but it ran asynchronously, outside the request path, with configurable resource limits and retry logic. The worker processed the commit graph incrementally: when a new upload arrived, it only needed to compute visibility for commits that were descendants of the upload's commit, not the entire graph. This incremental approach meant that even for repositories with millions of commits, the per-upload computation was proportional to the number of new commits since the last computation, not the total repository size.
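The incremental step can be illustrated as follows. This is a sketch under stated assumptions: it supposes the worker has precomputed child links (reverse edges) available, and it holds visibility sets in memory where the real worker persisted them to PostgreSQL; all names are hypothetical.

```go
package main

import "fmt"

// addUpload sketches the incremental step the background worker runs when a
// new upload arrives: rather than recomputing visibility for the whole
// repository, it walks only the descendants of the upload's commit (via
// child links) and records the new upload in each descendant's set.
func addUpload(children map[string][]string, visibility map[string]map[int]bool, commit string, uploadID int) {
	stack := []string{commit}
	seen := map[string]bool{}
	for len(stack) > 0 {
		hash := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[hash] {
			continue
		}
		seen[hash] = true
		if visibility[hash] == nil {
			visibility[hash] = map[int]bool{}
		}
		visibility[hash][uploadID] = true
		stack = append(stack, children[hash]...)
	}
}

func main() {
	// c1 <- c2 <- c3, plus an unrelated branch c1 <- c4.
	children := map[string][]string{"c1": {"c2", "c4"}, "c2": {"c3"}}
	visibility := map[string]map[int]bool{}
	addUpload(children, visibility, "c2", 7) // new upload 7 at commit c2
	fmt.Println(visibility["c3"][7], visibility["c4"][7]) // prints true false
}
```

The work done here is proportional to the number of descendants of the upload's commit, which for a fresh upload near the branch head is small regardless of total repository size.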

Query-Time Performance

At query time, determining which uploads were visible from a given commit became a simple indexed lookup: SELECT upload_id FROM commit_visibility WHERE commit_hash = $1. This query executed in under a millisecond regardless of repository size, because the work of traversing the commit graph had been amortised across the background computation. The query path no longer allocated memory proportional to the commit graph, no longer consumed CPU proportional to the graph's depth, and no longer presented an attack surface that scaled with repository size. The system's worst-case query performance became its average-case performance: constant time, constant memory.
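A table backing that lookup can be as simple as the following sketch. The table and column names follow the query quoted above; the rest of the schema is an assumption for illustration.

```sql
-- Illustrative schema: one row per (commit, visible upload) pair,
-- populated by the background worker. A real deployment would also
-- key rows on a repository identifier.
CREATE TABLE commit_visibility (
    commit_hash text    NOT NULL,
    upload_id   integer NOT NULL
);

-- The index is what makes the query-time lookup independent of
-- repository size: the planner resolves the WHERE clause with an
-- index scan rather than touching the whole table.
CREATE INDEX commit_visibility_lookup
    ON commit_visibility (commit_hash, upload_id);
```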

There were engineering subtleties in the implementation. The background worker needed to handle merge commits correctly—in a repository with frequent merges, the commit DAG is not a simple chain but a complex graph where visibility sets must be computed as the union of visibility sets from all parent branches. The worker also needed to handle force-pushes, where commits that were previously reachable become unreachable, requiring visibility entries to be invalidated. We implemented this as a mark-and-sweep process: the worker maintained a "dirty" flag on commits whose visibility might be stale, and periodically recomputed visibility for dirty commits in topological order.
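The merge-handling and mark-and-sweep logic described above can be sketched together: process dirty commits in topological order, rebuilding each visibility set as the union of its parents' sets plus the commit's own uploads. As before, this is an in-memory illustration with hypothetical names, not the platform's actual implementation.

```go
package main

import "fmt"

// recomputeDirty revisits commits flagged dirty in topological order
// (parents before children) and rebuilds each one's visibility set as the
// union of its parents' sets plus its own uploads. Merge commits fall out
// naturally: with several parents, the union spans every branch.
func recomputeDirty(order []string, parents map[string][]string, uploads map[string][]int, dirty map[string]bool, visibility map[string]map[int]bool) {
	for _, hash := range order {
		if !dirty[hash] {
			continue
		}
		set := map[int]bool{}
		for _, p := range parents[hash] {
			for id := range visibility[p] {
				set[id] = true
			}
		}
		for _, id := range uploads[hash] {
			set[id] = true
		}
		visibility[hash] = set
		delete(dirty, hash)
	}
}

func main() {
	// Two branches off c1 merge at c4; uploads 1 and 2 live on different branches.
	order := []string{"c1", "c2", "c3", "c4"} // topological: parents first
	parents := map[string][]string{"c2": {"c1"}, "c3": {"c1"}, "c4": {"c2", "c3"}}
	uploads := map[string][]int{"c2": {1}, "c3": {2}}
	dirty := map[string]bool{"c1": true, "c2": true, "c3": true, "c4": true}
	visibility := map[string]map[int]bool{}
	recomputeDirty(order, parents, uploads, dirty, visibility)
	fmt.Println(visibility["c4"][1], visibility["c4"][2]) // prints true true
}
```

A force-push is handled by the same machinery: commits whose reachability may have changed are simply marked dirty, and the next sweep rebuilds their sets from the (new) parent links.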

The broader principle is one we emphasise in every application security engagement: unbounded computation in the request path is a vulnerability, full stop. It does not matter whether the computation is a graph traversal, a regular expression match, a recursive descent parse, or a database query with no result limit. If an attacker can select an input that causes the computation to consume resources proportional to some unbounded parameter, they can deny service. The fix is always the same in structure: move the expensive computation out of the request path, pre-compute results where possible, and ensure that query-time operations have bounded, predictable resource consumption. The commit graph visibility system is a particularly clean example of this pattern, but we see variations of the same vulnerability in nearly every system that processes graph-structured or tree-structured data at user-controlled scale.