The Go Blog

Perfectly Reproducible, Verified Go Toolchains

Russ Cox
28 August 2023

One of the key benefits of open-source software is that anyone can read the source code and inspect what it does. And yet most software, even open-source software, is downloaded in the form of compiled binaries, which are much more difficult to inspect. If an attacker wanted to run a supply chain attack on an open-source project, the least visible way would be to replace the binaries being served while leaving the source code unmodified.

The best way to address this kind of attack is to make open-source software builds reproducible, meaning that a build that starts with the same sources produces the same outputs every time it runs. That way, anyone can verify that posted binaries are free of hidden changes by building from authentic sources and checking that the rebuilt binaries are bit-for-bit identical to the posted binaries. That approach proves the binaries have no backdoors or other changes not present in the source code, without having to disassemble or look inside them at all. Since anyone can verify the binaries, independent groups can easily detect and report supply chain attacks.

As supply chain security becomes more important, so do reproducible builds, because they provide a simple way to verify the posted binaries for open-source projects.

Go 1.21.0 is the first Go toolchain with perfectly reproducible builds. Earlier toolchains were possible to reproduce, but only with significant effort, and probably no one did: they just trusted that the binaries posted on go.dev/dl were the correct ones. Now it’s easy to “trust but verify.”

This post explains what goes into making builds reproducible, examines the many changes we had to make to Go to make Go toolchains reproducible, and then demonstrates one of the benefits of reproducibility by verifying the Ubuntu package for Go 1.21.0.

Making a Build Reproducible

Computers are generally deterministic, so you might think all builds would be equally reproducible. That’s only true from a certain point of view. Let’s call a piece of information a relevant input when the output of a build can change depending on that input. A build is reproducible if it can be repeated with all the same relevant inputs. Unfortunately, lots of build tools turn out to incorporate inputs that we would usually not realize are relevant and that might be difficult to recreate or provide as input. Let’s call an input an unintentional input when it turns out to be relevant but we didn’t mean it to be.

The most common unintentional input in build systems is the current time. If a build writes an executable to disk, the file system records the current time as the executable’s modification time. If the build then packages that file using a tool like “tar” or “zip”, the modification time is written into the archive. We certainly didn’t want our build to change based on the current time, but it does. So the current time turns out to be an unintentional input to the build. Worse, most programs don’t let you provide the current time as an input, so there is no way to repeat this build. To fix this, we might set the time stamps on created files to Unix time 0 or to a specific time read from one of the build’s source files. That way, the current time is no longer a relevant input to the build.

Common relevant inputs to a build include:

the specific version of the source code to build;
the specific versions of dependencies that will be included in the build;
the operating system running the build, which may affect path names in the resulting binaries;
the architecture of the CPU on the build system,
which may affect which optimizations the compiler uses or the layout of certain data structures;
the compiler version being used, as well as compiler options passed to it, which affect how the code is compiled;
the name of the directory containing the source code, which may appear in debug information;
the user name, group name, uid, and gid of the account running the build, which may appear in file metadata in an archive;
and many more.

To have a reproducible build, every relevant input must be configurable in the build, and then the binaries must be posted alongside an explicit configuration listing every relevant input. If you’ve done that, you have a reproducible build. Congratulations!

We’re not done, though. If the binaries can only be reproduced if you first find a computer with the right architecture, install a specific operating system version, compiler version, put the source code in the right directory, set your user identity correctly, and so on, that may be too much work in practice for anyone to bother.

We want builds to be not just reproducible but easy to reproduce. To do that, we need to identify relevant inputs and then, instead of documenting them, eliminate them. The build obviously has to depend on the source code being built, but everything else can be eliminated. When a build’s only relevant input is its source code, let’s call that perfectly reproducible.

Perfectly Reproducible Builds for Go

As of Go 1.21, the Go toolchain is perfectly reproducible: its only relevant input is the source code for that build. We can build a specific toolchain (say, Go for Linux/x86-64) on a Linux/x86-64 host, or a Windows/ARM64 host, or a FreeBSD/386 host, or any other host that supports Go, and we can use any Go bootstrap compiler, including bootstrapping all the way back to Go 1.4’s C implementation, and we can vary any other details. None of that changes the toolchains that are built. If we start with the same toolchain source code, we will get the exact same toolchain binaries out.

This perfect reproducibility is the culmination of efforts dating back originally to Go 1.10, although most of the effort was concentrated in Go 1.20 and Go 1.21. This section highlights some of the most interesting relevant inputs that we eliminated.

Reproducibility in Go 1.10

Go 1.10 introduced a content-aware build cache that decides whether targets are up-to-date based on a fingerprint of the build inputs instead of file modification times. Because the toolchain itself is one of those build inputs, and because Go is written in Go, the bootstrap process would only converge if the toolchain build on a single machine was reproducible. The overall toolchain build looks like this:

We start by building the sources for the current Go toolchain using an earlier Go version, the bootstrap toolchain (Go 1.10 used Go 1.4, written in C; Go 1.21 uses Go 1.17). That produces “toolchain1”, which we use to build everything again, producing “toolchain2”, which we use to build everything again, producing “toolchain3”.

Toolchain1 and toolchain2 have been built from the same sources but with different Go implementations (compilers and libraries), so their binaries are certain to be different. However, if both Go implementations are non-buggy, correct implementations, toolchain1 and toolchain2 should behave exactly the same. In particular, when presented with the Go 1.X sources, toolchain1’s output (toolchain2) and toolchain2’s output (toolchain3) should be identical, meaning toolchain2 and toolchain3 should be identical.

At least, that’s the idea. Making that true in practice required removing a couple unintentional inputs:

Randomness. Map iteration and running work in multiple goroutines serialized with locks both introduce randomness in the order that results may be generated. This randomness can make the toolchain produce one of several different possible outputs each time it runs. To make the build reproducible, we had to find each of these and sort the relevant list of items before using it to generate output.

Bootstrap Libraries. Any library used by the compiler that can choose from multiple different correct outputs might change its output from one Go version to the next. If that library output change causes a compiler output change, then toolchain1 and toolchain2 will not be semantically identical, and toolchain2 and toolchain3 will not be bit-for-bit identical.

The canonical example is the sort package, which can place elements that compare equal in any order it likes. A register allocator might sort to prioritize commonly used variables, and the linker sorts symbols in the data section by size. To completely eliminate any effect from the sorting algorithm, the comparison function used must never report two distinct elements as equal. In practice, this invariant turned out to be too onerous to impose on every use of sort in the toolchain, so instead we arranged to copy the Go 1.X sort package into the source tree that is presented to the bootstrap compiler. That way, the compiler uses the same sort algorithm when using the bootstrap toolchain as it does when built with itself.

Another package we had to copy was compress/zlib, because the linker writes compressed debug information, and optimizations to compression libraries can change the exact output. Over time, we’ve added other packages to that list too. This approach has the added benefit of allowing the Go 1.X compiler to use new APIs added to those packages immediately, at the cost that those packages must be written to compile with older versions of Go.

Reproducibility in Go 1.20

Work on Go 1.20 prepared for both easy reproducible builds and toolchain management by removing two more relevant inputs from the toolchain build.

Host C toolchain. Some Go packages, most notably net, default to using cgo on most operating systems. In some cases, such as macOS and Windows, invoking system DLLs using cgo is the only reliable way to resolve host names. When we use cgo, though, we invoke the host C toolchain (meaning a specific C compiler and C library), and different toolchains have different compilation algorithms and library code, producing different outputs. The build graph for a cgo package looks like:

The host C toolchain is therefore a relevant input to the pre-compiled net.a that ships with the toolchain. For Go 1.20, we decided to fix this by removing net.a from the toolchain. That is, Go 1.20 stopped shipping pre-compiled packages to seed the build cache with. Now, the first time a program uses package net, the Go toolchain compiles it using the local system’s C toolchain and caches that result. In addition to removing a relevant input from toolchain builds and making toolchain downloads smaller, not shipping pre-compiled packages also makes toolchain downloads more portable. If we build package net on one system with one C toolchain and then compile other parts of the program on a different system with a different C toolchain, in general there is no guarantee that the two parts can be linked together.

One reason we shipped the pre-compiled net package in the first place was to allow building programs that used package net even on systems without a C toolchain installed. If there’s no pre-compiled package, what happens on those systems? The answer varies by operating system, but in all cases we arranged for the Go toolchain to continue to work well for building pure Go programs without a host C toolchain.

On macOS, we rewrote package net using the underlying mechanisms that cgo would use, without any actual C code. This avoids invoking the host C toolchain but still emits a binary that refers to the required system DLLs. This approach is only possible because every Mac has the same dynamic libraries installed. Making the non-cgo macOS package net use the system DLLs also meant that cross-compiled macOS executables now use the system DLLs for network access, resolving a long-standing feature request.
On Windows, package net already made direct use of DLLs without C code, so nothing needed to be changed.
On Unix systems, we cannot assume a specific DLL interface to network code, but the pure Go version works fine for systems that use typical IP and DNS setups. Also, it is much easier to install a C toolchain on Unix systems than it is on macOS and especially Windows. We changed the go command to enable or disable cgo automatically based on whether the system has a C toolchain installed. Unix systems without a C toolchain fall back to the pure Go version of package net, and in the rare cases where that’s not good enough, they can install a C toolchain.

Having dropped the pre-compiled packages, the only part of the Go toolchain that still depended on the host C toolchain was binaries built using package net, specifically the go command. With the macOS improvements, it was now viable to build those commands with cgo disabled, completely removing the host C toolchain as an input, but we left that final step for Go 1.21.

Host dynamic linker. When programs use cgo on a system using dynamically linked C libraries, the resulting binaries contain the path to the system’s dynamic linker, something like /lib64/ld-linux-x86-64.so.2. If the path is wrong, the binaries don’t run. Typically each operating system/architecture combination has a single correct answer for this path. Unfortunately, musl-based Linuxes like Alpine Linux use a different dynamic linker than glibc-based Linuxes like Ubuntu. To make Go run at all on Alpine Linux, in Go bootstrap process looked like this:

The bootstrap program cmd/dist inspected the local system’s dynamic linker and wrote that value into a new source file compiled along with the rest of the linker sources, effectively hard-coding that default into the linker itself. Then when the linker built a program from a set of compiled packages, it used that default. The result is that a Go toolchain built on Alpine is different from a toolchain built on Ubuntu: the host configuration is a relevant input to the toolchain build. This is a reproducibility problem but also a portability problem: a Go toolchain built on Alpine doesn’t build working binaries or even run on Ubuntu, and vice versa.

For Go 1.20, we took a step toward fixing the reproducibility problem by changing the linker to consult the host configuration when it is running, instead of having a default hard-coded at toolchain build time:

This fixed the portability of the linker binary on Alpine Linux, although not the overall toolchain, since the go command still used package net and therefore cgo and therefore had a dynamic linker reference in its own binary. Just as in the previous section, compiling the go command without cgo enabled would fix this, but we left that change for Go 1.21. (We didn’t feel there was enough time left in the Go 1.20 cycle to test such that change properly.)

Reproducibility in Go 1.21

For Go 1.21, the goal of perfect reproducibility was in sight, and we took care of the remaining, mostly small, relevant inputs that remained.

Host C toolchain and dynamic linker. As discussed above, Go 1.20 took important steps toward removing the host C toolchain and dynamic linker as relevant inputs. Go 1.21 completed the removal of these relevant inputs by building the toolchain with cgo disabled. This improved portability of the toolchain too: Go 1.21 is the first Go release where the standard Go toolchain runs unmodified on Alpine Linux systems.

Removing these relevant inputs made it possible to cross-compile a Go toolchain from a different system without any loss in functionality. That in turn improved the supply chain security of the Go toolchain: we can now build Go toolchains for all target systems using a trusted Linux/x86-64 system, instead of needing to arrange a separate trusted system for each target. As a result, Go 1.21 is the first release to include posted binaries for all systems at go.dev/dl/.

Source directory. Go programs include full paths in the runtime and debugging metadata, so that when a program crashes or is run in a debugger, stack traces include the full path to the source file, not just the name of the file in an unspecified directory. Unfortunately, including the full path makes the directory where the source code is stored a relevant input to the build. To fix this, Go 1.21 changed the release toolchain builds to install commands like the compiler using go install -trimpath, which replaces the source directory with the module path of the code. If a released compiler crashes, the stack trace will print paths like cmd/compile/main.go instead of /home/user/go/src/cmd/compile/main.go. Since the full paths would refer to a directory on a different machine anyway, this rewrite is no loss. On the other hand, for non-release builds, we keep the full path, so that when developers working on the compiler itself cause it to crash, IDEs and other tools reading those crashes can easily find the correct source file.

Host operating system. Paths on Windows systems are backslash-separated, like cmd\compile\main.go. Other systems use forward slashes, like cmd/compile/main.go. Although earlier versions of Go had normalized most of these paths to use forward slashes, one inconsistency had crept back in, causing slightly different toolchain builds on Windows. We found and fixed the bug.

Host architecture. Go runs on a variety of ARM systems and can emit code using a software library for floating-point math (SWFP) or using hardware floating-point instructions (HWFP). Toolchains defaulting to one mode or the other will necessarily differ. Like we saw with the dynamic linker earlier, the Go bootstrap process inspected the build system to make sure that the resulting toolchain worked on that system. For historical reasons, the rule was “assume SWFP unless the build is running on an ARM system with flo

Vytvořeno 2y | 28. 8. 2023 18:10:11

Chcete-li přidat komentář, přihlaste se