Introduction
Freedesktop SDK is a free, community-developed, and open-source project that simplifies the process of creating various software artefacts. Common use cases include building containers, Flatpak runtimes, snaps, and complete operating systems.
In this article, we explore a more lightweight approach to assessing reproducibility than the full dependency graph validation that Freedesktop SDK currently uses. It prioritises speed whilst still providing meaningful validation to the engineer at the artefact level. We also provide metrics to help quantify the performance gap and discuss the tradeoffs between the two approaches. Freedesktop SDK serves as our example repository: we analyse its existing reproducibility workflow and explore an alternative strategy that aligns more closely with how upstream Linux distributions handle reproducibility.
Summarising reproducibility
Reproducibility in the context of Software Engineering has different definitions depending on the environment and requirements. In this article, we consider bit-for-bit reproducibility, i.e., the ability to generate an identical binary given the same source code input and build environment.
The concept of reproducibility as a principle applies broadly to scientific and engineering disciplines, where the results from an experiment should be achievable with a high degree of reliability when the study is replicated. The Wikipedia article on reproducible builds provides a broad overview of the concept and its historical context within Software Engineering. It defines reproducible builds as:
"Reproducible builds, also known as deterministic compilation, is a process of building software which ensures the resulting binary code can be reproduced. Source code compiled deterministically will always output the same binary."
In theory, this sounds both useful and doable. In reality, achieving reproducibility is harder than it sounds. Build processes can be influenced by a range of factors, such as timestamps, environment variables, filesystem ordering, and differences in toolchain behaviour. Unless you control these factors, you risk introducing unpredictable variations in the final artefact, which decreases confidence in the system as a whole.
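To make two of these factors concrete, here is a minimal Python sketch of how timestamps and entry ordering affect a bit-for-bit comparison. It is an illustration only and is not taken from either tool discussed in this article: the `pack` and `digest` helpers and the sample file names are hypothetical, and an in-memory tar archive stands in for a real build artefact.

```python
import hashlib
import io
import tarfile


def pack(files, mtime, sort_entries=True):
    """Pack a {name: bytes} mapping into an in-memory, uncompressed tar archive."""
    # Sorting removes any dependence on the order in which entries were discovered.
    names = sorted(files) if sort_entries else list(files)
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.USTAR_FORMAT) as archive:
        for name in names:
            entry = tarfile.TarInfo(name=name)
            entry.size = len(files[name])
            entry.mtime = mtime  # pinned by the caller instead of read from the clock
            archive.addfile(entry, io.BytesIO(files[name]))
    return buffer.getvalue()


def digest(blob):
    """Return the SHA-256 hex digest used for the bit-for-bit comparison."""
    return hashlib.sha256(blob).hexdigest()


# The same two files, discovered in a different order (hypothetical sample data).
sources = {"b.txt": b"beta\n", "a.txt": b"alpha\n"}
reordered = {"a.txt": b"alpha\n", "b.txt": b"beta\n"}

controlled = digest(pack(sources, mtime=0))
print("both factors controlled:", controlled == digest(pack(reordered, mtime=0)))
print("timestamp differs:", controlled == digest(pack(sources, mtime=1)))
print(
    "ordering left uncontrolled:",
    digest(pack(sources, 0, sort_entries=False))
    == digest(pack(reordered, 0, sort_entries=False)),
)
```

Only the first comparison uses identical inputs with both factors pinned. The other two change nothing about the file contents, yet they alter the archive bytes and therefore the digest.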
The reproducible-builds project is one well-known community effort focused on improving reproducible builds. They have regular talks and events that take place each year. In addition, they publish monthly reports.
The official definition on their website is:
“Reproducible builds are a set of software development practices that create an independently-verifiable path from source to binary code.”
In short, reproducibility is widely regarded as an important topic within Software Engineering, particularly in environments that prioritise trustworthiness and long-term maintainability.
BuildStream plays a crucial role in reducing this problem because it lets you describe software stacks declaratively and control the build environment, which reduces sources of non-determinism and makes reproducibility achievable in practice.
At Codethink, reproducibility is one of the key concepts integrated into our Eclipse Trustable Software Framework.
The current solution inside freedesktop-sdk
The freedesktop-sdk repository currently uses buildstream-reprotest to establish the reproducibility of the BuildStream elements it contains.
buildstream-reprotest is a Python script that uses Python's subprocess module to invoke BuildStream and orchestrate a reproducibility check for a given BuildStream element. The script evaluates the entire dependency graph and asks the question: Is the entire dependency graph reproducible at every level?
The script runs the following phases:
- Phase 1: Build the target element and pull the remote cache where possible, and fall back to manually compiling if necessary.
- Phase 2: Iterate through the dependency graph in reverse order, disable the remote cache and build each element manually from scratch.
- Phase 3: Use the output from Phases 1 and 2 and execute diffoscope to identify any differences in the output.
Applying this methodology, by definition, assesses the reproducibility of the build process. Our argument is that, although there is nothing incorrect about this approach, it is time-consuming due to its very nature.
This approach introduces the following key challenges:
- High computational cost: rebuilding all dependencies from scratch.
- Time inefficiency: running comparisons at every layer of the graph.
- Limited iteration speed: not ideal for rapid development feedback loops.
This raises an important question for us all:
Do we always need full dependency graph validation to get useful reproducibility signals?
A target-level approach
To address this, we developed a Bash script, buildstream-reprotest-script, that focuses on target-level reproducibility rather than full graph validation. We chose Bash primarily because it requires minimal boilerplate, is highly interactive, and has low resource usage.
Instead of asking:
Is the entire dependency graph reproducible?
We are now asking:
If we rebuild this element in isolation, do we get the same artefact?
The distinction between the two questions enables a faster feedback cycle, while also addressing many real-world reproducibility issues.
Workflow overview
The buildstream-reprotest-script executes the following phases:
- Phase 1:
  - Build the target element with all dependencies, using remote cache where possible.
  - Fetch sources to ensure we can build without the cache/internet.
  - Check out the resulting artefact into a temporary directory.
  - Remove the target element from the local cache whilst keeping the dependencies.
- Phase 2:
  - Disable cache via a temporary configuration file.
  - Rebuild the target element without rebuilding its dependencies.
  - Check out the new artefact.
- Phase 3:
  - Run diffoscope between the two outputs.
  - Generate a detailed HTML report.
  - Return the comparison exit status.
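The phases above can be sketched as a plan of commands. This is a hedged illustration in Python rather than the Bash script itself: the function name and its parameters are hypothetical, the mapping of each step to a `bst` or `diffoscope` invocation is our assumption and has not been checked against the script, and the contents of the cache-disabling configuration file are not covered in this article, so only its path appears here. The sketch builds the plan without executing anything.

```python
import shlex


def plan_target_check(element, first_dir, second_dir, no_cache_config, report):
    """Return the three phases as lists of argument vectors, without running them."""
    # Assumption: these subcommands and options mirror the steps listed above;
    # the real script may invoke BuildStream differently.
    phase_one = [
        ["bst", "build", element],
        ["bst", "source", "fetch", "--deps", "all", element],
        ["bst", "artifact", "checkout", "--directory", str(first_dir), element],
        # Delete only the target, so its dependencies stay in the local cache.
        ["bst", "artifact", "delete", element],
    ]
    # Assumption: the temporary configuration file is passed via --config.
    with_config = ["bst", "--config", str(no_cache_config)]
    phase_two = [
        with_config + ["build", element],
        with_config + ["artifact", "checkout", "--directory", str(second_dir), element],
    ]
    phase_three = [
        ["diffoscope", "--html", str(report), str(first_dir), str(second_dir)],
    ]
    return {"phase 1": phase_one, "phase 2": phase_two, "phase 3": phase_three}


if __name__ == "__main__":
    plan = plan_target_check(
        "components/nasm.bst",        # element taken from the metrics table
        "/tmp/checkout-first",        # hypothetical paths
        "/tmp/checkout-second",
        "/tmp/no-cache.conf",
        "/tmp/report.html",
    )
    for phase, commands in plan.items():
        print(phase)
        for command in commands:
            print("  " + shlex.join(command))
```

Keeping the plan as data makes it easy to review before anything is executed. A runner would pass each argument vector to `subprocess.run` and use the exit status of the diffoscope command as the overall result.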
The key design decision here is to reuse the dependencies built in Phase 1 during Phase 2. This significantly reduces the time taken to assess the reproducibility of a project because we’re not devoting time to rebuilding each dependency from scratch. The time reduction is particularly useful when processing large dependency graphs, e.g., those within the freedesktop-sdk environment.
We should note that this workflow aligns closely with how upstream Linux distributions approach reproducibility.
The pros and cons of buildstream-reprotest-script
Pros:
- Ability to assess the reproducibility of a BuildStream element.
- Faster than the buildstream-reprotest equivalent.
- Helps expose situations where the target element itself is reproducible and a dependency is actually at fault.
Cons:
- Does not assess the reproducibility of the dependencies and thus cannot assess the reproducibility of the entire Directed Acyclic Graph (DAG), which represents the build dependencies.
- Does not detect reproducibility issues introduced by dependencies that are reused from cache rather than rebuilt.
Metrics
| Element | No. of dependencies | Time for execution (s) | Time for diffoscope (s) | Run type |
|---|---|---|---|---|
| components/podman.bst | 291 | 4716 | 372 | Cold |
| components/podman.bst | 291 | 472 | 365 | Hot |
| vm/desktop/filesystem.bst | 726 | 4669 | 118 | Cold |
| vm/desktop/filesystem.bst | 726 | 203 | 128 | Hot |
| abi/sdk-lib-headers.bst | 615 | 3205 | 10 | Cold |
| abi/sdk-lib-headers.bst | 615 | 42 | 11 | Hot |
| extensions/mesa/mesa.bst | 220 | 3801 | 44 | Cold |
| extensions/mesa/mesa.bst | 220 | 369 | 46 | Hot |
| components/nasm.bst | 154 | 2369 | 24 | Cold |
| components/nasm.bst | 154 | 62 | 22 | Hot |
| oci/layers/flatpak-stack.bst | 618 | 5630 | 288 | Cold |
| oci/layers/flatpak-stack.bst | 618 | 408 | 314 | Hot |
The run type is either Cold or Hot: a Cold run starts with no local cache, and a Hot run starts with local cache.
The above table shows us a series of elements that we tested the buildstream-reprotest-script against. Whilst we did not execute the original buildstream-reprotest workflow for direct comparison, prior experience indicates that full dependency graph reproducibility testing can take between 5 and 15 hours, depending on the graph size.
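The Cold and Hot figures in the table can be compared directly. The Python sketch below computes, for each element, how many times longer the Cold run took than the Hot run, and what fraction of the Hot run the diffoscope step would account for. The second figure rests on an assumption: the table does not state whether the execution time includes the diffoscope time, so treat that ratio as indicative only. The `summarise` helper is a hypothetical name.

```python
# (element, cold execution s, hot execution s, diffoscope s during the hot run),
# copied from the metrics table above.
RUNS = [
    ("components/podman.bst", 4716, 472, 365),
    ("vm/desktop/filesystem.bst", 4669, 203, 128),
    ("abi/sdk-lib-headers.bst", 3205, 42, 11),
    ("extensions/mesa/mesa.bst", 3801, 369, 46),
    ("components/nasm.bst", 2369, 62, 22),
    ("oci/layers/flatpak-stack.bst", 5630, 408, 314),
]


def summarise(runs):
    """Return per-element ratios derived from the measured times."""
    rows = []
    for element, cold, hot, hot_diffoscope in runs:
        rows.append(
            {
                "element": element,
                "cold_over_hot": cold / hot,
                # Assumption: execution time includes the diffoscope step.
                "diffoscope_share_of_hot": hot_diffoscope / hot,
            }
        )
    return rows


if __name__ == "__main__":
    for row in summarise(RUNS):
        print(
            f"{row['element']:32} "
            f"cold/hot = {row['cold_over_hot']:5.1f}x, "
            f"diffoscope/hot = {row['diffoscope_share_of_hot']:.0%}"
        )
```

Under that assumption, the comparison step rather than the rebuild accounts for most of the Hot run for elements such as components/podman.bst and oci/layers/flatpak-stack.bst.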
The original Merge Request contains a detailed series of tests against a specific instance of freedesktop-sdk, which includes two interesting cases:
[1] components/nasm.bst
[2] oci/toolbox-oci.bst
The buildstream-reprotest-script helped us deduce that the non-reproducibility of [1] is due to a dependency within the pipeline. In addition, the script helped expose an issue with the cache keys associated with [2].
In summary, the metrics suggest that our script is significantly faster than full dependency graph testing, and they raise an interesting follow-up question:
Should the diffoscope step be optional in fast validation workflows?
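One way to explore this question is a cheap digest comparison of the two checkouts, with diffoscope run only when that comparison finds a difference. The Python sketch below is an assumption about how such a pre-check could look and is not part of buildstream-reprotest-script; the function names are hypothetical. It compares file contents and symlink targets only, ignoring permissions, ownership, timestamps, special files, and empty directories, so it is a coarser check than diffoscope.

```python
import hashlib
import os


def tree_digests(root):
    """Map each relative path under root to a digest of its content."""
    digests = {}
    for directory, subdirs, filenames in os.walk(root):
        for name in subdirs + filenames:
            path = os.path.join(directory, name)
            relative = os.path.relpath(path, root)
            if os.path.islink(path):
                # Record the link target instead of following the link.
                digests[relative] = "symlink -> " + os.readlink(path)
            elif os.path.isfile(path):
                with open(path, "rb") as handle:
                    digests[relative] = hashlib.sha256(handle.read()).hexdigest()
    return digests


def differing_paths(first_root, second_root):
    """Return the sorted relative paths that are missing or differ between two trees."""
    first = tree_digests(first_root)
    second = tree_digests(second_root)
    return sorted(
        path
        for path in first.keys() | second.keys()
        if first.get(path) != second.get(path)
    )
```

An empty result from `differing_paths` would mean the contents match, and the returned paths would otherwise show where a detailed diffoscope report is worth generating.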
Practical Impact
In practice, this approach:
- Reduces testing time significantly.
- Enables faster debugging cycles.
- Provides actionable feedback early.
- Helps isolate whether issues originate in the target or its dependencies.
Conclusion
Reproducibility testing doesn’t always need to be exhaustive to be useful. In practice, we can adjust the mechanism so that engineers detect non-determinism earlier in the development cycle, rather than discovering it during expensive full-graph validation.
By focusing on target-level validation, we can improve:
- Speed.
- Practicality.
- Engineering insight.
The critical question for us all:
Can target-level validation complement the existing full dependency graph validation within Freedesktop SDK (and similar environments)?
Future work
Potential improvements for our buildstream-reprotest-script include:
- Making logging depth dynamic, i.e., removing our brittle assumption and reliance on caller 1.
- Providing an option to perform full dependency graph reproducibility testing.
- Making the diffoscope operation optional.
- Capturing metadata for deeper diagnostics.