Codethink Trustable Reproducible Linux (CTRL OS) integrates open-source software components to provide a stable but leading-edge platform for safety-critical designs.[1] We use the Eclipse Trustable Software Framework (TSF) to continuously quality-assure CTRL OS as it and its components evolve. This case study shows how TSF's RAFIA methodology allowed us to close a feedback loop by detecting and remedying defects both in an upstream open-source component and, crucially, in our initial analysis of that component.[2]
Setting the scene
TSF sets out a number of generic and high-level Trustable Assertions: testable statements which a robust software integration project should aim to fulfil through analysis and evidence. For this case study, we consider the TA-MISBEHAVIOURS assertion, which reads: “Prohibited misbehaviours for [CTRL OS] are identified, and mitigations are specified, verified and validated based on analysis”.[3] TSF is not prescriptive; instead, it encourages projects to think critically about their own unique circumstances and choose appropriate methods to achieve and validate this.
TSF proposes, but does not require, RAFIA (Risk Analysis, Fault Injection and Automation) as an alternative process model for the development of complex safety-critical software systems.[4] RAFIA is deliberately cyclical and continuous, making it well suited to TSF's goal of continuous compliance, and setting it clearly apart from the V-model traditionally applied in these situations, where validation of test results against the original design may be a one-off activity.
[Diagram: the RAFIA loop. Source: https://codethinklabs.gitlab.io/trustable/trustable/extensions/rafia/index.html]
The RAFIA process has four ordered but cyclical stages: Analysis, Specification, Implementation, and Testing. Critically, and unusually, the Analysis stage takes the test results from the previous iteration and uses them to re-analyse the system's design, which in turn informs the test specifications for the next iteration. For the first complete iteration of a RAFIA loop, however, an initial risk analysis of the system must be undertaken, typically without reference to test results. RAFIA strongly recommends that projects use the STPA methodology to enumerate the behaviours that the system should exhibit, and the misbehaviours that should be either absent or detected and mitigated. The results of this analysis then drive testing of the system.
STPA (System-Theoretic Process Analysis) is a technique for analysing complex systems and identifying modes of failure that arise from ‘unsafe’ interactions between components; these interactions may involve otherwise correct behaviour from components as well as errors or faults. STPA is highly systematic in nature, and requires practitioners to sequentially identify the following (a minimal code sketch of these outputs follows the list):
- The scope of their system
- Losses (or harms) that the system should prevent
- Hazards that might lead to a loss occurring
- Interactions between the system’s components
- How each interaction may be “unsafe” and lead to a hazard
- Constraints that must apply to the system and its components to prevent or mitigate hazards
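Purely as an illustration of how these outputs nest together, here is a minimal Python sketch. The structure and every name in it are our own invention for this article, not part of STPA, TSF, or any Codethink tooling:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Hazard:
    """A system state that, in worst-case conditions, leads to a loss."""
    description: str
    losses: list[str] = field(default_factory=list)

@dataclass
class UnsafeControlAction:
    """A control action that becomes hazardous in a particular context."""
    controller: str   # component issuing the action, e.g. "Safety Monitor"
    action: str       # e.g. "send pet signal to Watchdog"
    context: str      # the circumstances that make the action unsafe
    leads_to: list[Hazard] = field(default_factory=list)

@dataclass
class Constraint:
    """A testable statement that prevents or mitigates a hazard."""
    statement: str
    mitigates: Optional[UnsafeControlAction] = None

# Hypothetical example: petting the Watchdog while an indicator is stale.
stale_pet = UnsafeControlAction(
    controller="Safety Monitor",
    action="send pet signal",
    context="an expected indicator has not been received in time",
    leads_to=[Hazard("unsafe state is not mitigated externally",
                     losses=["harm to people or equipment"])],
)
c1 = Constraint("Petting ceases when any expected indicator is stale",
                mitigates=stale_pet)
```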
TSF provides guidance for using STPA in the Risk Analysis stage of RAFIA, and the STPA Handbook describes the procedure and its rationale in greater detail.[5][6] For the RAFIA procedure, however, the most important output of STPA is its set of constraints. We close the RAFIA loop by writing a test specification that checks these constraints hold for a real system, implementing that specification, and then analysing the test results to determine whether the STPA model is flawed or our software misbehaves. In a TSF context, because STPA constraints are testable statements, they can be represented and continuously scored in a Trustable directed acyclic graph (DAG) for every iteration of the software, just like the overarching TA-MISBEHAVIOURS assertion itself.
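To illustrate the idea of continuous scoring, and only that, a toy DAG might aggregate evidence as below. The real Trustable tooling has its own graph formats and scoring rules; none of the identifiers or numbers here are its API:

```python
def score(node: dict) -> float:
    """Toy scoring rule: a leaf scores its test pass rate; an internal
    node scores the mean of its children. Real TSF scoring differs."""
    children = node.get("children", [])
    if not children:
        passed, total = node["results"]
        return passed / total
    return sum(score(child) for child in children) / len(children)

# Hypothetical snapshot for one iteration of the software:
ta_misbehaviours = {
    "id": "TA-MISBEHAVIOURS",
    "children": [
        {"id": "CONSTRAINT-1", "results": (498, 500)},  # soak-test samples
        {"id": "CONSTRAINT-2", "results": (500, 500)},
    ],
}
print(f"{score(ta_misbehaviours):.3f}")  # 0.998
```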
Case study: integration of the Safety Monitor
For the integration of Codethink's open-source Safety Monitor – a safety application that controls an external ‘mitigation of last resort’ – into CTRL OS, we chose to follow the RAFIA process to assure ourselves that the mitigation is always triggered when the OS enters an unsafe state, but never triggered while CTRL OS is in a safe state.[7]
At the start of the RAFIA process, our safety engineer undertook an initial STPA analysis, which began by considering the scope of the Safety Monitor system and its components. In summary, we have:
- Watchdog: a device external to CTRL OS which transitions the system to a safe state (for example, by transferring responsibility to another system, or simply rebooting) if it does not receive input (a “pet” signal) every 100 ms
- Critical Processes: programs running on CTRL OS to provide the customer’s safety-related function(s)
- Safety Monitor: a program running on CTRL OS which receives updates from Critical Processes (called Extrinsic Indicators) and sends pet signals to the Watchdog every 100 ms only if all of the following hold (see the sketch after this list):
- All received Extrinsic Indicators are within predefined numerical bounds
- All expected Extrinsic Indicators have been received
- No internal corruption or malfunction has been detected
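A minimal sketch of that petting decision, assuming invented names and a deliberately simplified model throughout (the real Safety Monitor is considerably more involved; see its repository [7] for the actual implementation):

```python
from dataclasses import dataclass

@dataclass
class Reading:
    """The most recent value received for one Extrinsic Indicator."""
    value: float
    received_at: float  # monotonic timestamp, seconds

def should_pet(latest: dict[str, Reading],
               expected: set[str],
               bounds: dict[str, tuple[float, float]],
               now: float, timeout_s: float, healthy: bool) -> bool:
    """Return True only if it is safe to pet the Watchdog."""
    for name in expected:
        reading = latest.get(name)
        # Every expected Extrinsic Indicator must have arrived in time...
        if reading is None or now - reading.received_at > timeout_s:
            return False
        # ...and must lie within its predefined numerical bounds.
        low, high = bounds[name]
        if not low <= reading.value <= high:
            return False
    # No internal corruption or malfunction may have been detected.
    return healthy
```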
Detecting a defect in the software
Taking our system requirements and system components as inputs, we followed the STPA procedure and eventually generated a list of testable constraints, amongst them Constraint 1: If any of the monitored Extrinsic Indicators are not updated at the specified cadence and frequency, the Watchdog-petting thread exits.
Continuing with the RAFIA process, we wrote a specification to test Constraint 1 on real hardware running CTRL OS, both in pre-merge testing (to assert that the constraint is still met for each proposed change to CTRL OS) and in post-merge soak testing (to assert that the constraint is still met for CTRL OS's releases and mainline branch over thousands of repetitions). As part of this, we applied the next element of RAFIA: Fault Injection. This involves deliberately causing the misbehaviours identified by STPA, to determine whether the system either prevents them, or detects them and responds appropriately.
Implementing the specification produced a test in which a mock Critical Process was configured not to send the expected Extrinsic Indicators to the Safety Monitor, and the Safety Monitor's logs were checked to see whether the Watchdog petting ceased and, if so, how long that took. In a real system, the Watchdog would trigger an appropriate response at the system level, such as restarting a misbehaving component, but our tests in this case measure how long the system takes to respond at all once an unsafe condition is reported. However, when we came to analyse the results of the soak tests (around 500 samples each week), we were surprised to see aberrant response times from the Safety Monitor.
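The shape of such a test might look like the following hedged sketch. Every name, helper, and log message here is invented for illustration; it is not the real CTRL OS test suite:

```python
import re
import time

PETTING_STOPPED = re.compile(r"petting stopped")  # invented log message

def test_constraint_1(mock_critical_process, safety_monitor_log):
    """Fault injection for Constraint 1: stop the mock Critical Process's
    Extrinsic Indicators, then measure how long until petting ceases."""
    fault_time = time.monotonic()
    mock_critical_process.stop_reporting()  # induce the misbehaviour

    deadline = fault_time + 300.0  # generous upper bound, in seconds
    while time.monotonic() < deadline:
        if PETTING_STOPPED.search(safety_monitor_log.read()):
            response_time = time.monotonic() - fault_time
            # The configured Timeout plus a margin; see the next section.
            assert response_time < 150.0
            return
        time.sleep(0.1)
    raise AssertionError("petting never ceased after fault injection")
```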
According to Safety Monitor's published User Documentation, system integrators can configure a Timeout for each Extrinsic Indicator, after which the Watchdog-petting action will stop if no such Extrinsic Indicators are received.[8] Having consulted the upstream documentation and CTRL OS's Safety Monitor configuration files, we expected this timeout to be 135 seconds for the Constraint 1 test case. When the soak test results instead reported a mean of 117 seconds, we started to investigate.
In the implementation of the Constraint 1 test case, the mock Critical Process's configuration was temporarily overridden so that Extrinsic Indicator reporting would cease; this involved restarting the Critical Process. Following conversations with upstream Safety Monitor engineers and a joint interrogation of the test logs, we realised that Safety Monitor maintained an internal buffer of Extrinsic Indicators, did not empty this buffer when a Critical Process restarted, and calculated the Timeout from the last Extrinsic Indicator in the buffer rather than from the point at which it detected the Critical Process's restart. As the system had been running in a “happy path” state before the test case induced the fault, the state of this buffer was not deterministic at the start of the test, so neither was the point from which Safety Monitor counted the Timeout, nor, therefore, the measured response time itself.
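To make the defect concrete, here is a deliberately simplified sketch of the two timeout anchorings. The numbers are chosen to mirror the 135 s versus 117 s discrepancy and, like all the names, are invented rather than taken from Safety Monitor's code:

```python
def deadline_observed(indicator_times: list[float], timeout_s: float) -> float:
    """Observed behaviour: the countdown was anchored to the newest
    Extrinsic Indicator still in the buffer, which could predate the
    Critical Process's restart, so the deadline landed early."""
    return max(indicator_times) + timeout_s

def deadline_fixed(indicator_times: list[float], restart_at: float,
                   timeout_s: float) -> float:
    """Upstream fix: the buffer is cleared on restart, anchoring the
    countdown to the restart itself and making it predictable."""
    indicator_times.clear()
    return restart_at + timeout_s

# Indicators buffered up to t = 82 s; process restarts at t = 100 s;
# configured Timeout of 135 s.
buffered = [80.0, 81.0, 82.0]
print(deadline_observed(buffered, 135.0) - 100.0)      # 117.0 s after restart
print(deadline_fixed(buffered, 100.0, 135.0) - 100.0)  # 135.0 s after restart
```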
The CTRL OS team raised an issue in the open, which was triaged, implemented, and tested by the upstream team.[9] As a result, Safety Monitor now clears Extrinsic Indicator buffers when a process restarts, the measured response times in these scenarios are now consistent and predictable, and the expected behaviour is clearly documented upstream. This was verified through soak tests against a CTRL OS image containing a patched Safety Monitor.
In this way, the RAFIA process detected and resolved an obscure misbehaviour in an upstream software component of a complex system, but, more importantly, it also provided documentation so that future STPA analyses in the RAFIA loop will consider this buffering behaviour.
Detecting a defect in the analysis
While methodically inspecting the soak test logs to determine the reason for the unexpectedly short response times, we noticed messages such as “safety-monitor: Adding pause indicator”, which prompted the CTRL OS team to check Safety Monitor's upstream documentation and source code for an explanation. This revealed that an Extrinsic Indicator can be “paused” by any process sending the requisite message to the Safety Monitor.[10] Moreover, Safety Monitor enabled this pausing behaviour for all monitored Extrinsic Indicators by default, meaning that if a rogue process A somehow sent Safety Monitor the “pause” signal for a critical process B, then B could have stopped providing Extrinsic Indicators altogether without petting ceasing.
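Extending the earlier hedged sketch of the petting decision, the hazard is easy to see: a paused indicator is simply exempted from the staleness check (names invented as before):

```python
def should_pet_with_pause(latest, expected, bounds, now, timeout_s,
                          healthy, paused: set[str]) -> bool:
    """As should_pet() above, but honouring a set of paused indicators."""
    for name in expected:
        if name in paused:
            # HAZARD: a paused indicator is never checked for staleness,
            # so a rogue "pause" message can silence a Critical Process
            # without petting ever ceasing.
            continue
        reading = latest.get(name)
        if reading is None or now - reading.received_at > timeout_s:
            return False
        low, high = bounds[name]
        if not low <= reading.value <= high:
            return False
    return healthy
```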
We identified this behaviour as a pathway to a potentially dangerous failure of CTRL OS, in which a fault in a Critical Process would not be mitigated externally. However, this failure mode was completely absent from the initial STPA. We handled this in two ways. Firstly, following the RAFIA process, we revisited our STPA analysis with the new information about pausing from the tests, treating pausing as a Control Action to be systematically analysed according to the methodology. Secondly, we raised a further issue upstream so that pausing must be explicitly enabled in CTRL OS's configuration of Safety Monitor, allowing us to later argue through our TSF graph that new constraints around unsafe “pause” control actions were satisfied.[11] Here, the thorough test data analysis that RAFIA mandated for one STPA constraint revealed a different gap in the STPA itself, allowing immediate refinements to the model and neatly closing the development loop back to Risk Analysis.
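In the toy DAG notation from earlier, such a refinement just adds new testable leaves beneath TA-MISBEHAVIOURS; the identifier and results below are hypothetical placeholders:

```python
ta_misbehaviours["children"].append({
    # Hypothetical constraint derived from the new "pause" Control Action:
    # pausing must be explicitly enabled, and unexpected pause messages
    # must not suppress Watchdog-petting checks.
    "id": "CONSTRAINT-PAUSE",
    "results": (500, 500),  # placeholder soak results once tests exist
})
```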
Conclusions
Historically, integrators have emphasised the “Theoretic” element of STPA, so when gaps in the analysis open up in the face of engineering reality, the analysis is re-done from first principles. As we have shown, TSF’s RAFIA process brings together safety, testing, and upstream expertise so that data (even anecdotal) from tests – with real hardware and software in the loop – can immediately be used to readjust the expectations of analysis and design, eliminating uncertainty early without necessarily starting again, just as software itself may evolve by incremental steps.
To find out more about how TSF, RAFIA, and CTRL OS can help you nail risks in complex software projects, please get in touch at connect@codethink.co.uk.
References
- https://codethinklabs.gitlab.io/trustable/trustable/index.html
- https://codethinklabs.gitlab.io/trustable/trustable/trustable/TA.html#ta-misbehaviours
- https://codethinklabs.gitlab.io/trustable/trustable/extensions/rafia/index.html
- https://codethinklabs.gitlab.io/trustable/trustable/extensions/stpa/index.html
- https://psas.scripts.mit.edu/home/get_file.php?name=STPA_handbook.pdf
- https://gitlab.com/CodethinkLabs/safety-monitor/safety-monitor
- https://codethinklabs.gitlab.io/safety-monitor/safety-monitor/user/usage.html#monitoring-services
- https://gitlab.com/CodethinkLabs/safety-monitor/safety-monitor/-/issues/41
- https://gitlab.com/CodethinkLabs/safety-monitor/safety-monitor/-/blob/b621f7015d69a91074a9d83276ce201aecc1fe8f/docs/usage.md#pausing-a-metric
- https://gitlab.com/CodethinkLabs/safety-monitor/safety-monitor/-/issues/48