Codethink Trustable Reproducible Linux (CTRL OS) integrates open-source software components to provide a stable but leading-edge platform for safety-critical designs.[1] We use the Eclipse Trustable Software Framework (TSF) to continuously quality-assure CTRL OS as it and its components evolve. This case study shows how TSF's RAFIA methodology allowed us to close a feedback loop by detecting and remedying defects both in an upstream open-source component and, crucially, in our initial analysis of that component.[2]
Setting the scene
TSF sets out a number of generic and high-level Trustable Assertions: testable statements which a robust software integration project should aim to fulfil through analysis and evidence. For this case study, we consider the TA-MISBEHAVIOURS assertion, which reads: “Prohibited misbehaviours for [CTRL OS] are identified, and mitigations are specified, verified and validated based on analysis”.[3] TSF is not prescriptive; instead, it encourages projects to think critically about their own unique circumstances and choose appropriate methods to achieve and validate this.
TSF proposes, but does not require, RAFIA (Risk Analysis, Fault Injection and Automation) as an alternative process model for the development of complex safety-critical software systems.[4] RAFIA is deliberately cyclical and continuous, making it well suited to TSF's goal of continuous compliance, and setting it clearly apart from the V-model traditionally applied in these situations, where validation of test results against the original design may be a one-off activity.
[Diagram: the RAFIA loop. Source: https://codethinklabs.gitlab.io/trustable/trustable/extensions/rafia/index.html]
The RAFIA process has four ordered but cyclical stages: Analysis, Specification, Implementation, and Testing. Critically, and unusually, the Analysis stage takes the test results from the previous iteration and uses them to re-analyse the system's design, which in turn informs the test specifications for the next iteration. For the first complete iteration of a RAFIA loop, however, an initial risk analysis of the system must be undertaken, typically without reference to test results. RAFIA strongly recommends that projects use the STPA methodology to enumerate the behaviours that the system should exhibit, and the misbehaviours that should be either absent or detected and mitigated. The results of this analysis then drive testing of the system.
STPA (System-Theoretic Process Analysis) is a technique for analysing complex systems and identifying modes of failure that arise from ‘unsafe’ interactions between components; these interactions may involve otherwise correct behaviour from components as well as errors or faults. STPA is highly systematic in nature, and requires practitioners to sequentially identify the following (a minimal code sketch of these outputs follows the list):
- The scope of their system
- Losses (or harms) that the system should prevent
- Hazards that might lead to a loss occurring
- Interactions between the system’s components
- How each interaction may be “unsafe” and lead to a hazard
- Constraints that must apply to the system and its components to prevent or mitigate hazards
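Purely as an illustration of how these outputs nest together, here is a minimal Python sketch. The structure and every name in it are our own invention for this article, not part of STPA, TSF, or any Codethink tooling:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Hazard:
    """A system state that, in worst-case conditions, leads to a loss."""
    description: str
    losses: list[str] = field(default_factory=list)

@dataclass
class UnsafeControlAction:
    """A control action that becomes hazardous in a particular context."""
    controller: str   # component issuing the action, e.g. "Safety Monitor"
    action: str       # e.g. "send pet signal to Watchdog"
    context: str      # the circumstances that make the action unsafe
    leads_to: list[Hazard] = field(default_factory=list)

@dataclass
class Constraint:
    """A testable statement that prevents or mitigates a hazard."""
    statement: str
    mitigates: Optional[UnsafeControlAction] = None

# Hypothetical example: petting the Watchdog while an indicator is stale.
stale_pet = UnsafeControlAction(
    controller="Safety Monitor",
    action="send pet signal",
    context="an expected indicator has not been received in time",
    leads_to=[Hazard("unsafe state is not mitigated externally",
                     losses=["harm to people or equipment"])],
)
c1 = Constraint("Petting ceases when any expected indicator is stale",
                mitigates=stale_pet)
```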
TSF provides guidance for using STPA in the Risk Analysis stage of RAFIA, and the STPA Handbook describes the procedure and its rationale in greater detail.[5][6] For the RAFIA procedure, however, the most important output of STPA is its set of constraints. We close the RAFIA loop by writing a test specification that checks these constraints hold for a real system, implementing that specification, and then analysing the test results to determine whether the STPA model is flawed or our software misbehaves. In a TSF context, because STPA constraints are testable statements, they can be represented and continuously scored in a Trustable directed acyclic graph (DAG) for every iteration of the software, just like the overarching TA-MISBEHAVIOURS assertion itself.
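To illustrate the idea of continuous scoring, and only that, a toy DAG might aggregate evidence as below. The real Trustable tooling has its own graph formats and scoring rules; none of the identifiers or numbers here are its API:

```python
def score(node: dict) -> float:
    """Toy scoring rule: a leaf scores its test pass rate; an internal
    node scores the mean of its children. Real TSF scoring differs."""
    children = node.get("children", [])
    if not children:
        passed, total = node["results"]
        return passed / total
    return sum(score(child) for child in children) / len(children)

# Hypothetical snapshot for one iteration of the software:
ta_misbehaviours = {
    "id": "TA-MISBEHAVIOURS",
    "children": [
        {"id": "CONSTRAINT-1", "results": (498, 500)},  # soak-test samples
        {"id": "CONSTRAINT-2", "results": (500, 500)},
    ],
}
print(f"{score(ta_misbehaviours):.3f}")  # 0.998
```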
Case study: integration of the Safety Monitor
For the integration of Codethink's open-source Safety Monitor – a safety application that controls an external ‘mitigation of last resort’ – into CTRL OS, we chose to follow the RAFIA process to assure ourselves that the mitigation is always triggered when the OS enters an unsafe state, but never triggered while CTRL OS is in a safe state.[7]
At the start of the RAFIA process, our safety engineer undertook an initial STPA analysis, which began by considering the scope of the Safety Monitor system and its components. In summary, we have:
- Watchdog: a device external to CTRL OS which transitions the system to a safe state (for example, by transferring responsibility to another system, or simply rebooting) if it does not receive input (a “pet” signal) every 100 ms
- Critical Processes: programs running on CTRL OS to provide the customer’s safety-related function(s)
- Safety Monitor: a program running on CTRL OS which receives updates from Critical Processes (called Extrinsic Indicators) and sends pet signals to the Watchdog every 100 ms only if all of the following hold (see the sketch after this list):
- All received Extrinsic Indicators are within predefined numerical bounds
- All expected Extrinsic Indicators have been received
- No internal corruption or malfunction has been detected
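A minimal sketch of that petting decision, assuming invented names and a deliberately simplified model throughout (the real Safety Monitor is considerably more involved; see its repository [7] for the actual implementation):

```python
from dataclasses import dataclass

@dataclass
class Reading:
    """The most recent value received for one Extrinsic Indicator."""
    value: float
    received_at: float  # monotonic timestamp, seconds

def should_pet(latest: dict[str, Reading],
               expected: set[str],
               bounds: dict[str, tuple[float, float]],
               now: float, timeout_s: float, healthy: bool) -> bool:
    """Return True only if it is safe to pet the Watchdog."""
    for name in expected:
        reading = latest.get(name)
        # Every expected Extrinsic Indicator must have arrived in time...
        if reading is None or now - reading.received_at > timeout_s:
            return False
        # ...and must lie within its predefined numerical bounds.
        low, high = bounds[name]
        if not low <= reading.value <= high:
            return False
    # No internal corruption or malfunction may have been detected.
    return healthy
```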
Detecting a defect in the software
Taking our system requirements and system components as inputs, we followed the STPA procedure and eventually generated a list of testable constraints, amongst them Constraint 1: If any of the monitored Extrinsic Indicators are not updated at the specified cadence and frequency, the Watchdog-petting thread exits.
Continuing with the RAFIA process, we wrote a specification to test Constraint 1 on real hardware running CTRL OS, both in pre-merge testing (to assert that the constraint is still met for each proposed change to CTRL OS) and in post-merge soak testing (to assert that the constraint is still met for CTRL OS's releases and mainline branch over thousands of repetitions). As part of this, we applied the next element of RAFIA: Fault Injection. This involves deliberately causing the misbehaviours identified by STPA, to determine whether the system either prevents them, or detects them and responds appropriately.
Implementing the specification produced a test in which a mock Critical Process was configured not to send the expected Extrinsic Indicators to the Safety Monitor, and the Safety Monitor's logs were checked to see whether the Watchdog petting ceased and, if so, how long that took. In a real system, the Watchdog would trigger an appropriate response at the system level, such as restarting a misbehaving component, but our tests in this case measure how long the system takes to respond at all once an unsafe condition is reported. However, when we came to analyse the results of the soak tests (around 500 samples each week), we were surprised to see aberrant response times from the Safety Monitor.
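The shape of such a test might look like the following hedged sketch. Every name, helper, and log message here is invented for illustration; it is not the real CTRL OS test suite:

```python
import re
import time

PETTING_STOPPED = re.compile(r"petting stopped")  # invented log message

def test_constraint_1(mock_critical_process, safety_monitor_log):
    """Fault injection for Constraint 1: stop the mock Critical Process's
    Extrinsic Indicators, then measure how long until petting ceases."""
    fault_time = time.monotonic()
    mock_critical_process.stop_reporting()  # induce the misbehaviour

    deadline = fault_time + 300.0  # generous upper bound, in seconds
    while time.monotonic() < deadline:
        if PETTING_STOPPED.search(safety_monitor_log.read()):
            response_time = time.monotonic() - fault_time
            # The configured Timeout plus a margin; see the next section.
            assert response_time < 150.0
            return
        time.sleep(0.1)
    raise AssertionError("petting never ceased after fault injection")
```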
According to Safety Monitor's published User Documentation, system integrators can configure a Timeout for each Extrinsic Indicator, after which the Watchdog-petting action will stop if no such Extrinsic Indicators are received.[8] Having consulted the upstream documentation and CTRL OS's Safety Monitor configuration files, we expected this timeout to be 135 seconds for the Constraint 1 test case. When the soak test results instead reported a mean of 117 seconds, we started to investigate.
In the implementation of the Constraint 1 test case, the mock Critical Process's configuration was temporarily overridden so that Extrinsic Indicator reporting would cease; this involved restarting the Critical Process. Following conversations with upstream Safety Monitor engineers and a joint interrogation of the test logs, we realised that Safety Monitor maintained an internal buffer of Extrinsic Indicators, did not empty this buffer when a Critical Process restarted, and calculated the Timeout from the last Extrinsic Indicator in the buffer rather than from the point at which it detected the Critical Process's restart. As the system had been running in a “happy path” state before the test case induced the fault, the state of this buffer was not deterministic at the start of the test, so neither was the point from which Safety Monitor counted the Timeout, nor, therefore, the measured response time itself.
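To make the defect concrete, here is a deliberately simplified sketch of the two timeout anchorings. The numbers are chosen to mirror the 135 s versus 117 s discrepancy and, like all the names, are invented rather than taken from Safety Monitor's code:

```python
def deadline_observed(indicator_times: list[float], timeout_s: float) -> float:
    """Observed behaviour: the countdown was anchored to the newest
    Extrinsic Indicator still in the buffer, which could predate the
    Critical Process's restart, so the deadline landed early."""
    return max(indicator_times) + timeout_s

def deadline_fixed(indicator_times: list[float], restart_at: float,
                   timeout_s: float) -> float:
    """Upstream fix: the buffer is cleared on restart, anchoring the
    countdown to the restart itself and making it predictable."""
    indicator_times.clear()
    return restart_at + timeout_s

# Indicators buffered up to t = 82 s; process restarts at t = 100 s;
# configured Timeout of 135 s.
buffered = [80.0, 81.0, 82.0]
print(deadline_observed(buffered, 135.0) - 100.0)      # 117.0 s after restart
print(deadline_fixed(buffered, 100.0, 135.0) - 100.0)  # 135.0 s after restart
```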
The CTRL OS team raised an issue in the open, which was triaged, implemented, and tested by the upstream team.[9] As a result, Safety Monitor now clears Extrinsic Indicator buffers when a process restarts, the measured response times in these scenarios are now consistent and predictable, and the expected behaviour is clearly documented upstream. This was verified through soak tests against a CTRL OS image containing a patched Safety Monitor.
In this way, the RAFIA process detected and resolved an obscure misbehaviour in an upstream software component of a complex system, but, more importantly, it also provided documentation so that future STPA analyses in the RAFIA loop will consider this buffering behaviour.
Detecting a defect in the analysis
While methodically inspecting the soak test logs to determine the reason for the unexpectedly short response times, we noticed messages such as “safety-monitor: Adding pause indicator”, which prompted the CTRL OS team to check Safety Monitor's upstream documentation and source code for an explanation. This revealed that an Extrinsic Indicator can be “paused” by any process sending the requisite message to the Safety Monitor.[10] Moreover, Safety Monitor enabled this pausing behaviour for all monitored Extrinsic Indicators by default, meaning that if a rogue process A somehow sent Safety Monitor the “pause” signal for a critical process B, then B could have stopped providing Extrinsic Indicators altogether without petting ceasing.
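Extending the earlier hedged sketch of the petting decision, the hazard is easy to see: a paused indicator is simply exempted from the staleness check (names invented as before):

```python
def should_pet_with_pause(latest, expected, bounds, now, timeout_s,
                          healthy, paused: set[str]) -> bool:
    """As should_pet() above, but honouring a set of paused indicators."""
    for name in expected:
        if name in paused:
            # HAZARD: a paused indicator is never checked for staleness,
            # so a rogue "pause" message can silence a Critical Process
            # without petting ever ceasing.
            continue
        reading = latest.get(name)
        if reading is None or now - reading.received_at > timeout_s:
            return False
        low, high = bounds[name]
        if not low <= reading.value <= high:
            return False
    return healthy
```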
We identified this behaviour as a pathway to a potentially dangerous failure of CTRL OS, in which a fault in a Critical Process would not be mitigated externally. However, this failure mode was completely absent from the initial STPA. We handled this in two ways. Firstly, following the RAFIA process, we revisited our STPA analysis with the new information about pausing from the tests, treating pausing as a Control Action to be systematically analysed according to the methodology. Secondly, we raised a further issue upstream so that pausing must be explicitly enabled in CTRL OS's configuration of Safety Monitor, allowing us to later argue through our TSF graph that new constraints around unsafe “pause” control actions were satisfied.[11] Here, the thorough test data analysis that RAFIA mandated for one STPA constraint revealed a different gap in the STPA itself, allowing immediate refinements to the model and neatly closing the development loop back to Risk Analysis.
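In the toy DAG notation from earlier, such a refinement just adds new testable leaves beneath TA-MISBEHAVIOURS; the identifier and results below are hypothetical placeholders:

```python
ta_misbehaviours["children"].append({
    # Hypothetical constraint derived from the new "pause" Control Action:
    # pausing must be explicitly enabled, and unexpected pause messages
    # must not suppress Watchdog-petting checks.
    "id": "CONSTRAINT-PAUSE",
    "results": (500, 500),  # placeholder soak results once tests exist
})
```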
Conclusions
Historically, integrators have emphasised the “Theoretic” element of STPA, so when gaps in the analysis open up in the face of engineering reality, the analysis is re-done from first principles. As we have shown, TSF’s RAFIA process brings together safety, testing, and upstream expertise so that data (even anecdotal) from tests – with real hardware and software in the loop – can immediately be used to readjust the expectations of analysis and design, eliminating uncertainty early without necessarily starting again, just as software itself may evolve by incremental steps.
To find out more about how TSF, RAFIA, and CTRL OS can help you nail risks in complex software projects, please get in touch at connect@codethink.co.uk.
References
- https://codethinklabs.gitlab.io/trustable/trustable/index.html
- https://codethinklabs.gitlab.io/trustable/trustable/trustable/TA.html#ta-misbehaviours
- https://codethinklabs.gitlab.io/trustable/trustable/extensions/rafia/index.html
- https://codethinklabs.gitlab.io/trustable/trustable/extensions/stpa/index.html
- https://psas.scripts.mit.edu/home/get_file.php?name=STPA_handbook.pdf
- https://gitlab.com/CodethinkLabs/safety-monitor/safety-monitor
- https://codethinklabs.gitlab.io/safety-monitor/safety-monitor/user/usage.html#monitoring-services
- https://gitlab.com/CodethinkLabs/safety-monitor/safety-monitor/-/issues/41
- https://gitlab.com/CodethinkLabs/safety-monitor/safety-monitor/-/blob/b621f7015d69a91074a9d83276ce201aecc1fe8f/docs/usage.md#pausing-a-metric
- https://gitlab.com/CodethinkLabs/safety-monitor/safety-monitor/-/issues/48