Assessing the reproducibility workflow in freedesktop-sdk

Introduction

Freedesktop SDK is a free, community-developed, and open-source project that simplifies the process of creating various software artefacts. Common use cases include building containers, Flatpak runtimes, snaps, and complete operating systems.

In this article, we analyse the reproducibility workflow currently used in Freedesktop SDK, which validates the full dependency graph, and explore a more lightweight alternative that prioritises speed whilst still providing meaningful validation to the engineer at the artefact level. We also provide metrics to help quantify the performance gap and discuss the tradeoffs between the two approaches. Freedesktop SDK serves as the example repository throughout, and the alternative workflow aligns more closely with how upstream Linux distributions handle reproducibility.

Summarising reproducibility

Reproducibility in the context of Software Engineering can mean different things depending on the environment and requirements. In this article, we consider bit-for-bit reproducibility, i.e., the ability to generate an identical binary given the same source code and build environment.

The concept of reproducibility as a principle applies broadly to scientific and engineering disciplines, where the results from an experiment should be achievable with a high degree of reliability when the study is replicated. The Wikipedia article on reproducible builds provides a broad overview of the concept and its historical context within Software Engineering. It defines reproducible builds as:

"Reproducible builds, also known as deterministic compilation, is a process of building software which ensures the resulting binary code can be reproduced. Source code compiled deterministically will always output the same binary."

In theory, this sounds both useful and doable. In practice, achieving reproducibility is harder than it sounds. Build processes can be influenced by a range of factors, such as timestamps, environment variables, filesystem ordering, and differences in toolchain behaviour. Unless you control these factors, you risk introducing unpredictable variations in the final artefact, which decreases confidence in the system as a whole.
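To make two of these factors concrete, the following minimal sketch packs the same files into a tar archive twice. It is purely illustrative and is not taken from either script discussed in this article; the file names, contents, and timestamps are made up. When the entry order and modification times are left to vary, the archives differ, and when both are pinned, the archives are identical.

```python
import hashlib
import io
import tarfile


def build_archive(files, mtime, sort_entries):
    """Pack {name: bytes} into an in-memory tar archive and return its bytes."""
    names = sorted(files) if sort_entries else list(files)
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name in names:
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            info.mtime = mtime  # the timestamp recorded in the archive header
            archive.addfile(info, io.BytesIO(files[name]))
    return buffer.getvalue()


def digest(data):
    return hashlib.sha256(data).hexdigest()


# Hypothetical build output: same content, but listed in a different order,
# as might happen when a filesystem returns directory entries differently.
files_run_a = {"bin/tool": b"\x7fELF", "share/doc": b"hello"}
files_run_b = {"share/doc": b"hello", "bin/tool": b"\x7fELF"}

# Uncontrolled: entry order and timestamps differ between the two runs.
uncontrolled_a = digest(build_archive(files_run_a, mtime=1_700_000_000, sort_entries=False))
uncontrolled_b = digest(build_archive(files_run_b, mtime=1_700_000_060, sort_entries=False))

# Controlled: entries are sorted and the timestamp is pinned to a fixed value.
controlled_a = digest(build_archive(files_run_a, mtime=0, sort_entries=True))
controlled_b = digest(build_archive(files_run_b, mtime=0, sort_entries=True))

print(uncontrolled_a == uncontrolled_b)  # False
print(controlled_a == controlled_b)      # True
```

Real build systems face many more sources of variation than these two, but the principle is the same: every input that can vary between runs has to be either fixed or removed.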

The reproducible-builds project is one well-known community effort focused on improving reproducible builds. It holds regular talks and events each year and publishes monthly reports.

The official definition on their website is:

“Reproducible builds are a set of software development practices that create an independently-verifiable path from source to binary code.”

In short, reproducibility is widely regarded as an important topic within Software Engineering, particularly in environments that prioritise trustworthiness and long-term maintainability.

BuildStream plays a crucial role in mitigating this problem: it lets you describe software stacks declaratively and control the build environment, which reduces sources of non-determinism and makes reproducibility achievable in practice.

At Codethink, reproducibility is one of the key concepts integrated into our Eclipse Trustable Software Framework.

The current solution inside freedesktop-sdk

The freedesktop-sdk repository currently uses buildstream-reprotest to establish the reproducibility of the BuildStream elements that reside inside it.

buildstream-reprotest is a Python script that uses the subprocess module to invoke BuildStream and orchestrate a reproducibility check for a given BuildStream element. The script evaluates the entire dependency graph and, in effect, asks the question: Is the entire dependency graph reproducible at every level?

The script runs the following phases:

Phase 1: Build the target element, pulling from the remote cache where possible and falling back to a local build where necessary.
Phase 2: Iterate through the dependency graph in reverse order, with the remote cache disabled, and build each element from scratch.
Phase 3: Run diffoscope on the outputs of Phases 1 and 2 to identify any differences.

By definition, this methodology assesses the reproducibility of the whole build process. There is nothing incorrect about this approach, but it is inherently time-consuming.

This approach introduces the following key challenges:

High computational cost: rebuilding all dependencies from scratch.
Time inefficiency: running comparisons at every layer of the graph.
Limited iteration speed: not ideal for rapid development feedback loops.

This raises an important question for us all:

Do we always need full dependency graph validation to get useful reproducibility signals?

A target level approach

To address this, we developed a Bash script that focuses on target-level reproducibility rather than full graph validation. We chose Bash primarily because it requires minimal boilerplate, is highly interactive, and has low resource usage.

Instead of asking:

Is the entire dependency graph reproducible?

We are now asking:

If we rebuild this element in isolation, do we get the same artefact?

The distinction between the two questions enables a faster feedback cycle, while also addressing many real-world reproducibility issues.
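The difference in work between the two questions can be sketched with a toy dependency graph. The graph and element names below are hypothetical and far smaller than anything in freedesktop-sdk; the sketch only counts which elements each approach would have to rebuild, and does not invoke BuildStream.

```python
from graphlib import TopologicalSorter

# Hypothetical toy graph: each element maps to the elements it depends on.
GRAPH = {
    "app.bst": {"libfoo.bst", "libbar.bst"},
    "libfoo.bst": {"base.bst"},
    "libbar.bst": {"base.bst"},
    "base.bst": set(),
}


def closure(graph, target):
    """Return the target plus all of its transitive dependencies."""
    seen, stack = set(), [target]
    while stack:
        element = stack.pop()
        if element not in seen:
            seen.add(element)
            stack.extend(graph[element])
    return seen


def full_graph_rebuilds(graph, target):
    """Full-graph validation: rebuild every element, dependencies first."""
    needed = closure(graph, target)
    subgraph = {name: graph[name] for name in needed}
    return list(TopologicalSorter(subgraph).static_order())


def target_level_rebuilds(graph, target):
    """Target-level validation: rebuild only the target, reusing its dependencies."""
    return [target]


full = full_graph_rebuilds(GRAPH, "app.bst")
target_only = target_level_rebuilds(GRAPH, "app.bst")
print(len(full), len(target_only))  # 4 1
```

With hundreds of dependencies per element, as in the metrics later in this article, the gap between the two counts grows accordingly.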

Workflow overview

The buildstream-reprotest-script executes the following phases:

Phase 1:
- Build the target element with all dependencies, using remote cache where possible.
- Fetch sources to ensure we can build without the cache/internet.
- Check out the resulting artefact into a temporary directory.
- Remove the target element from the local cache whilst keeping the dependencies.
Phase 2:
- Disable cache via a temporary configuration file.
- Rebuild the target element without rebuilding its dependencies.
- Check out the new artefact.
Phase 3:
- Run diffoscope between the two outputs.
- Generate a detailed HTML report.
- Return the comparison exit status.

The key design decision here is to reuse the dependencies built in Phase 1 during Phase 2. Because we do not spend time rebuilding each dependency from scratch, assessing the reproducibility of an element takes significantly less time. The reduction is particularly useful for large dependency graphs, e.g., those within the freedesktop-sdk environment.
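The three phases can be sketched as a plan of commands. This is a rough illustration rather than the script itself: the real script is written in Bash and its exact invocations are not reproduced in this article, so the commands below are assumptions based on the BuildStream 2 and diffoscope command-line interfaces. The function name, directory names, and configuration file path are hypothetical, and the contents of the cache-disabling configuration file are deliberately left out.

```python
import shlex


def plan_commands(element, first_checkout, second_checkout, no_cache_config, report):
    """Return the commands for each phase without running any of them."""
    # Assumed invocations; the real script may use different options.
    bst_no_cache = ["bst", "--config", no_cache_config]
    return {
        "Phase 1": [
            ["bst", "build", element],
            ["bst", "source", "fetch", element],
            ["bst", "artifact", "checkout", element, "--directory", first_checkout],
            # Drop only the target, so its dependencies stay in the local cache.
            ["bst", "artifact", "delete", element],
        ],
        "Phase 2": [
            bst_no_cache + ["build", element],
            bst_no_cache + ["artifact", "checkout", element, "--directory", second_checkout],
        ],
        "Phase 3": [
            # diffoscope exits with a non-zero status when the inputs differ.
            ["diffoscope", "--html", report, first_checkout, second_checkout],
        ],
    }


plan = plan_commands(
    element="components/nasm.bst",
    first_checkout="checkout-first",       # hypothetical directory name
    second_checkout="checkout-second",     # hypothetical directory name
    no_cache_config="no-cache.conf",       # hypothetical temporary config file
    report="report.html",
)
for phase, commands in plan.items():
    print(phase)
    for command in commands:
        print("  " + shlex.join(command))
```

Keeping the plan as data makes it easy to inspect before anything is executed; each command list could then be handed to subprocess.run, stopping at the first failure.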

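We should note that this workflow aligns closely with how upstream Linux distributions approach reproducibility: they typically rebuild an individual package against its already-built dependencies rather than rebuilding everything beneath it.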

The pros and cons of buildstream-reprotest-script

Pros:

Ability to assess the reproducibility of a BuildStream element.
Faster than the buildstream-reprotest equivalent.
Helps narrow down whether a reproducibility failure originates in the target element itself or in one of its dependencies.

Cons:

Does not assess the reproducibility of the dependencies, and thus cannot assess the reproducibility of the entire build dependency graph, which is a Directed Acyclic Graph (DAG).
Does not detect reproducibility issues introduced by dependencies that are reused from cache rather than rebuilt.

Metrics

Element	No. of dependencies	Time for execution (s)	Time for diffoscope (s)	Run type
components/podman.bst	291	4716	372	Cold
components/podman.bst	291	472	365	Hot
vm/desktop/filesystem.bst	726	4669	118	Cold
vm/desktop/filesystem.bst	726	203	128	Hot
abi/sdk-lib-headers.bst	615	3205	10	Cold
abi/sdk-lib-headers.bst	615	42	11	Hot
extensions/mesa/mesa.bst	220	3801	44	Cold
extensions/mesa/mesa.bst	220	369	46	Hot
components/nasm.bst	154	2369	24	Cold
components/nasm.bst	154	62	22	Hot
oci/layers/flatpak-stack.bst	618	5630	288	Cold
oci/layers/flatpak-stack.bst	618	408	314	Hot

The run type is either Cold or Hot: a Cold run starts with no local cache, whereas a Hot run starts with a populated local cache.

The table above lists the elements that we tested the buildstream-reprotest-script against. Whilst we did not run the original buildstream-reprotest workflow for a direct comparison, prior experience indicates that full dependency graph reproducibility testing can take between 5 and 15 hours, depending on the graph size.

The original Merge Request contains a detailed series of tests against a specific instance of freedesktop-sdk, which includes two interesting cases:

components/nasm.bst
oci/toolbox-oci.bst

The buildstream-reprotest-script helped us deduce that the non-reproducibility of components/nasm.bst is due to a dependency within the pipeline. In addition, the script helped expose an issue with the cache keys associated with oci/toolbox-oci.bst.

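In summary, although we have no direct measurement of buildstream-reprotest to compare against, the metrics suggest that our script is significantly faster than the 5 to 15 hours we would expect from full-graph testing. They also raise an interesting follow-up question: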

Should the diffoscope step be optional in fast validation workflows?
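The figures behind this question can be derived from the table with a few lines of arithmetic. The sketch below copies the timings from the table and computes, for each element, the ratio of Cold to Hot execution time and the Hot diffoscope time relative to the Hot execution time. Note that the Cold to Hot ratio shows the effect of the local cache, not a comparison against buildstream-reprotest, and that the table does not state whether the execution time includes the diffoscope time, so the second figure should be read as a ratio rather than a strict share.

```python
# (element, cold execution s, hot execution s, hot diffoscope s), copied from the table.
RUNS = [
    ("components/podman.bst", 4716, 472, 365),
    ("vm/desktop/filesystem.bst", 4669, 203, 128),
    ("abi/sdk-lib-headers.bst", 3205, 42, 11),
    ("extensions/mesa/mesa.bst", 3801, 369, 46),
    ("components/nasm.bst", 2369, 62, 22),
    ("oci/layers/flatpak-stack.bst", 5630, 408, 314),
]


def summarise(runs):
    """Return (element, cold/hot ratio, diffoscope/hot ratio) for each row."""
    return [
        (element, cold / hot, diffoscope / hot)
        for element, cold, hot, diffoscope in runs
    ]


for element, cold_over_hot, diffoscope_over_hot in summarise(RUNS):
    print(
        f"{element:30} cold/hot = {cold_over_hot:5.1f}x   "
        f"diffoscope/hot = {diffoscope_over_hot:4.0%}"
    )
```

For these six elements, the Cold to Hot ratio ranges from roughly 10x to 76x, and for components/podman.bst and oci/layers/flatpak-stack.bst the diffoscope time is about three quarters of the Hot execution time, which is why the diffoscope step stands out in fast validation workflows.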

Practical Impact

In practice, this approach:

Reduces testing time significantly.
Enables faster debugging cycles.
Provides actionable feedback early.
Helps isolate whether issues originate in the target or its dependencies.

Conclusion

Reproducibility testing doesn’t always need to be exhaustive to be useful. In practice, a target-level check enables engineers to detect non-determinism earlier in the development cycle, rather than discovering it during expensive full-graph validation.

By focusing on target-level validation, we can improve:

Speed.
Practicality.
Engineering insight.

The critical question for us all:

Can target-level validation complement the existing full dependency graph validation within Freedesktop SDK (and similar environments)?

Future work

Potential improvements for our buildstream-reprotest-script include:

Making logging depth dynamic, i.e., removing our brittle assumption and reliance on caller 1.
Providing an option to perform full dependency graph reproducibility testing.
Making the diffoscope operation optional.
Capturing metadata for deeper diagnostics.
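As one possible direction for making the diffoscope operation optional, the two checked-out artefacts could first be compared with a cheap content hash, with diffoscope reserved for the cases where the hashes differ. The sketch below is an assumption about how that might look and is not part of the current script; the function names are hypothetical. It only covers file paths, file contents, and symlink targets, so it ignores permissions, ownership, and timestamps, which a stricter check would also need to consider.

```python
import hashlib
import os
import tempfile


def tree_digest(root):
    """Hash paths, file contents and symlink targets under root in a stable order."""
    digest = hashlib.sha256()
    for directory, subdirs, files in os.walk(root):
        subdirs.sort()  # fix the traversal order so the digest is stable
        for name in sorted(subdirs + files):
            path = os.path.join(directory, name)
            digest.update(os.fsencode(os.path.relpath(path, root)) + b"\0")
            if os.path.islink(path):
                digest.update(b"L" + os.fsencode(os.readlink(path)))
            elif os.path.isfile(path):
                with open(path, "rb") as handle:
                    digest.update(b"F" + handle.read())
            else:
                digest.update(b"D")  # a directory entry
            digest.update(b"\0")
    return digest.hexdigest()


def needs_diffoscope(first_checkout, second_checkout):
    """Return True when the trees differ and a detailed report is worthwhile."""
    return tree_digest(first_checkout) != tree_digest(second_checkout)


# Demonstration with two throwaway directories standing in for the checkouts.
with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
    for root in (first, second):
        os.makedirs(os.path.join(root, "usr", "bin"))
        with open(os.path.join(root, "usr", "bin", "tool"), "wb") as handle:
            handle.write(b"same bytes")
    identical_result = needs_diffoscope(first, second)

    # Change one byte stream in the second tree and compare again.
    with open(os.path.join(second, "usr", "bin", "tool"), "ab") as handle:
        handle.write(b"!")
    changed_result = needs_diffoscope(first, second)

print(identical_result, changed_result)  # False True
```

The tradeoff is that a hash only says whether the outputs differ, not where or why, so diffoscope would remain the tool of choice once a difference has been found.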