
Resurrecting gosumdb: Independently Auditing the Go Module Supply Chain

When you run go get, you’re implicitly trusting Google’s sum.golang.org checksum database (gosumdb) to serve the correct cryptographic hash for your dependencies. This system is a massive leap forward for supply chain security, but only if we can independently verify its trustworthiness.

That’s why I built gosumdb (the project, not the service): an independent, third-party witness designed to audit Google’s sum.golang.org. It continuously fetches the database’s Merkle tree, verifies its cryptographic integrity, and checks for anomalies.

After a bit more than a month of planned downtime, I’m glad to announce that gosumdb is back online. This post is about how I re-architected it to handle millions of hashes on a single, inexpensive VM.

(For the complete background on the project’s origins, see the original posts: Part 1, Part 2, Part 3)

The Challenge: Surviving the OOM Killer

The original gosumdb implementation was simple but resource-hungry. The Go checksum database is a Merkle tree where module versions are hashed, and those hashes are hashed again, eight levels deep. The raw data for this tree now amounts to 2.8 GB.

My first implementation stored this entire tree in memory for processing. This works on machines with a lot of RAM, but with fewer resources the Linux OOM (out-of-memory) killer would step in to save the machine, shutting down the gosumdb process.

I was able to trade RAM for disk space and make gosumdb run on just a few hundred MB of RAM.

The Solution

I broke the problem into two parts: handling the massive Merkle tree itself and efficiently checking for duplicates among millions of entries.

Taming the Merkle Tree: From In-Memory to On-Disk

The 2.8 GB of hash data doesn’t need to live in RAM. A Merkle tree’s hash coordinates can be mapped to a dense linear ordering. This means I can treat a single large file as my data structure, using WriteAt and ReadAt to store or retrieve the 32 bytes of any given hash at a specific, calculated index.

Every 32-byte chunk in the file represents a hash. As long as I stick to the dense linear ordering, I don’t need any other markers or complex database logic.
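
To make that concrete, here is a minimal sketch of the idea (not the actual gosumdb code): a small type that translates a position in the dense ordering into a byte offset and reads or writes the 32-byte hash stored there.

```go
package hashstore

import "os"

const hashSize = 32 // every hash in the tree is exactly 32 bytes

// hashStore treats a single file as a flat array of 32-byte hashes,
// indexed by position in the dense linear ordering of the Merkle tree.
type hashStore struct {
	f *os.File
}

// writeHash stores a hash at its calculated position in the file.
func (s *hashStore) writeHash(index int64, h [hashSize]byte) error {
	_, err := s.f.WriteAt(h[:], index*hashSize)
	return err
}

// readHash retrieves the hash stored at the given position.
func (s *hashStore) readHash(index int64) ([hashSize]byte, error) {
	var h [hashSize]byte
	_, err := s.f.ReadAt(h[:], index*hashSize)
	return h, err
}
```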

With this approach, I can happily write gigabytes of module hashes to a file. Linux handles the complicated work of paging data between disk and RAM as needed. A few gigabytes of disk space are cheap, and the OOM killer is no longer a threat.

The Duplicate Problem: Finding a Needle in 46 Million Records

The second major resource hog was the duplicate checker. gosumdb’s security model relies on a critical promise: every module version (e.g., my-module@v1.0.0) exists only once in the entire database. If anyone could inject a second, different hash for the same version, they could selectively serve a malicious version to certain users.

This created a classic engineering problem: “How do you find duplicate strings in more than 46 million entries if you can’t fit them all in RAM?”

The solution I implemented is “hash partitions.” Here’s how it works:

  1. Take a module string, like "my-module@v1.0.0", and hash it to a number (e.g., 39568375).
  2. Take that number modulo 1024 (an arbitrary number of partitions), which gives a result like 1015.
  3. Append the original string ("my-module@v1.0.0") to a file named partition-1015.txt.
  4. Repeat this for all 46 million+ entries.

Any identical strings will hash to the same number and end up in the same partition file.
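
In Go, the partitioning step can look roughly like this. This is only a sketch of the scheme described above; the hash function (FNV-1a) and the file handling are my illustrative choices, not necessarily what gosumdb uses.

```go
package partition

import (
	"fmt"
	"hash/fnv"
	"os"
)

const numPartitions = 1024 // arbitrary number of partitions, as described above

// partitionFor hashes a module string and maps it to one of the partitions.
func partitionFor(entry string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(entry))
	return h.Sum32() % numPartitions
}

// appendToPartition appends the original string to its partition file
// (partition-<n>.txt).
func appendToPartition(entry string) error {
	name := fmt.Sprintf("partition-%d.txt", partitionFor(entry))
	f, err := os.OpenFile(name, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintln(f, entry)
	return err
}
```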

Once all entries are processed, I can check for duplicates one file at a time. In practice, these partition files are only about 6 MB each and fit trivially into RAM. I can then use standard in-memory sorting, sweep through the sorted list, and check whether any adjacent strings are equal.
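
The per-partition check is then just a sort and a linear sweep. Again a sketch rather than the production code:

```go
package partition

import (
	"bufio"
	"os"
	"sort"
)

// findDuplicates returns every module string that appears more than once
// in a partition file. Each file is small enough to sort entirely in RAM.
func findDuplicates(path string) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var entries []string
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		entries = append(entries, scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		return nil, err
	}

	sort.Strings(entries)

	var dups []string
	for i := 1; i < len(entries); i++ {
		if entries[i] == entries[i-1] { // adjacent equal strings = duplicate
			dups = append(dups, entries[i])
		}
	}
	return dups, nil
}
```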

Why an Independent Witness Matters

Large parts of my implementation are unique: I didn’t copy existing projects, and starting this in July 2023 also makes me an early bird. I have logged all reports since I started monitoring Go’s database.

I’m happy to report that the anomaly of a duplicate module version with a different hash has never occurred. This gives us high, independent confidence that all modules served via the module proxy are unaltered, which is exactly what we expect. Needless to say, the root hash has always been correct too.

I verify all hashes from the raw data up and don’t rely on intermediary tiles served by Google. This makes the witness not just an integrity check, but a full verification of all accessible data.
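
To illustrate what “from the raw data up” means: the leaf hash of a record and the hash of a parent node can be recomputed locally, for example with the golang.org/x/mod/sumdb/tlog package, and compared against the values stored on disk. The tlog functions below are real; the surrounding wiring is only illustrative.

```go
package verify

import "golang.org/x/mod/sumdb/tlog"

// recomputeParent rebuilds an interior node of the Merkle tree from the
// raw record data of its two children, without trusting any served tile.
func recomputeParent(left, right []byte) tlog.Hash {
	l := tlog.RecordHash(left)  // leaf hash of the left record
	r := tlog.RecordHash(right) // leaf hash of the right record
	return tlog.NodeHash(l, r)  // parent hash, one level above the leaves
}
```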

A Personal Motivation

I’ve been working on gosumdb in my free time, driven by an interest that comes from 10 years of professional Go development. I started when Go didn’t even have modules and the big question was “What is your GOPATH?!” We used third-party tools like virtualgo just to manage our workflows.

When Go 1.11 introduced modules (and Go 1.13 brought the checksum database), I was excited to see a simpler and more secure way of managing dependencies. Few languages secure their supply chain at this level. Having spent two years working professionally in supply chain security, I was inspired to investigate gosumdb deeply and write the witness this article is about.

The gosumdb witness is back online as another layer in the Swiss cheese model that secures our infrastructure, available at https://monitor.raphting.dev/gosumdb/.

By Raphael Sprenger