The role of software in safety-critical systems - and of open source software in particular - is a topic that we have explored in previous articles; it was also the focus of the recent Safety and Open Source devroom at FOSDEM. Applying functional safety techniques to complex or software-intensive systems can be very challenging, and some of the established approaches have been criticised for failing to scale and adapt to meet these challenges. One recurring theme that you may encounter when reading about this topic is the assertion that safety is a property of systems rather than individual components, including software components.
Paul Albertella explains the background to this assertion, and why it is a recurrent theme.
Note: Safety discussions often include terms that can have a range of meanings in common speech, which are intended to have a more precise meaning in the discussion. Please see the Glossary at the end of this article for clarification of the terms shown in italics.
Background: Functional safety
When we are talking about safety in the context of software, we are normally referring to functional safety as it applies to electronic and electro-mechanical systems. This is a set of engineering practices that seek to reduce the level of risk in a device or system to an acceptable level, where the definition of 'acceptable' is determined by the nature of the system, its intended purpose and the type of risks involved.
As summarised by the International Electrotechnical Commission (IEC):
"Functional safety identifies potentially dangerous conditions that could result in harm and automatically enables corrective actions to avoid or reduce the impact of an incident. It is part of the overall safety of a system or device that depends on automatic safeguards responding to a hazardous event."
Functional safety engineering practices, and internationally-recognised standards such as IEC 61508 that formally describe them, focus on identifying the failures that lead to accidents and other hazardous events, and specifying how the system as a whole - and features identified as 'safety functions' in particular - can detect and respond to these, in order to avoid or minimise harmful consequences.
One of these practices is hazard and risk analysis, which is used to identify and examine the conditions that can lead to harmful outcomes (hazards), and to evaluate both the probability of these conditions occurring and the severity of the potential consequences (risks). This kind of analysis will always be informed by previous problems or accidents, and the known limitations of hardware or software components, but it also uses systematic techniques to try to identify new problems.
It's important to note, however, that functional safety practices are not expected to eliminate all potential hazards, or to reduce the identified risks to zero, only to reduce those risks to an acceptable level. A key role for the safety standards is in qualifying what is meant by acceptable in a given context, and describing what can be considered sufficient when evaluating how those risks have been mitigated.
To understand what this means, it is useful to distinguish between four broad objectives in safety, which we can characterise using the following questions:
1. What do we mean by 'safe', and how can we achieve and maintain this desired state?
2. How can we implement or realise the safety measures identified in #1?
3. How can we be confident that the criteria in #1 and the measures in #2 are sufficient?
4. How can we be confident that the processes and tools that we use to achieve #1, #2 and #3 are sufficient?
It's useful to make these distinctions when reading about safety, because a proposition or a technique that might apply to one of these objectives will not necessarily be helpful when applied to another.
The safety standards that I'm familiar with, such as IEC 61508 and ISO 26262, are primarily concerned with objectives #2, #3 and #4, but they also list techniques for elaborating #1, and identify some general criteria that are typically applied for this objective.
These standards also permit us to break down (decompose) systems into components, and to examine the distinct roles that software and hardware components play in the safety of a system. They also encourage us to consider the tools that we use to create or refine these elements, to examine how they contribute to objectives #2 and #3, and consider how objective #4 applies to them.
Part of the motivation here is the desire to have reusable components and tools, with clearly-defined and widely-applicable safety-relevant characteristics, which we can feel confident about using for different contexts and systems. There is explicit support for this at a system component level in ISO 26262 (the functional safety standard for road vehicles), in the form of the 'Safety Element out of Context' (SEooC) concept.
System vs component
The "Safety is a system property" assertion is associated with a school of thought that we might label the 'system theoretic safety perspective', which has been notably articulated by Dr. Nancy Leveson, who observed:
“Because safety is an emergent property, it is not possible to take a single system component, like a software module or a single human action, in isolation and assess its safety. A component that is perfectly safe in one system or in one environment may not be when used in another.”
-- Engineering a Safer World: Systems Thinking Applied to Safety
This might at first seem to be incompatible with a desire for reusable components. In my opinion, however, it does not mean that we can only undertake safety tasks in reference to a complete and specific system. Rather, it asserts that we cannot perform a complete safety analysis for a given component unless we also consider the wider context of that component.
Furthermore, this perspective applies primarily to objective #1 (although it also has some bearing on #3), which means that there are some tools and techniques relating to safety whose merits can be examined without explicit reference to a system context.
When we are trying to define what we mean by safe, however, and considering the completeness of that definition, we cannot ignore that context. For a software component in particular, this means answering the following questions:
- What system (or kind of system) are we considering?
- What is this system's intended purpose, and how does the software component contribute to it?
- What environment (or kinds of environment) is it intended to operate within, and how might this affect the software component?
- What does 'safe' mean in the context of this system, and how is the software component relevant to achieving or maintaining this desired state?
The system context that we consider might be concrete and complete (e.g. a specific integration of hardware and software as part of a product), or partial (e.g. only describing how our component interacts with the rest of the system) or an abstraction (e.g. a set of assumptions and constraints for a given category of system), but without it we cannot make meaningful statements about what may be considered safe. And without defining that context, we have no way of evaluating the completeness (i.e. whether it is sufficient) of our analysis.
Of course, we can still consider a given piece of software in isolation, examining how its properties might contribute to the safety of a system, or represent a threat to its safe operation. But without at least an implied system context - and an understanding of what 'safe' means in that context - it is potentially misleading to label the software, or its properties, as 'safe'.
Hence, when safety is something that we need to consider, we should always attempt to answer these questions, even if our answers are unsatisfactory or provisional (e.g. "We have not considered the impact of any environmental factors on the operation of the system"). If we fail to consider and describe the context in which our software will operate - or if we base our reasoning on an assumed context that we have not described - then gaps or flaws in that reasoning may go unnoticed, with possibly dangerous consequences.
Furthermore, if we consider an abstract context - a system context that is partial, provisional, or conditional upon a set of assumptions or requirements - then any claims that we make about safety must be equally partial, provisional or conditional. As a minimum, these safety claims will need to be re-validated when the software is used in a real-world product or application (a concrete context), to ensure that the stated conditions apply. To have confidence in such claims, however, the analysis that underpins these claims will also need to be repeated or re-validated, to identify new hazards, and to consider whether the risks associated with existing hazards are altered by the new context.
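One way to picture this conditionality is to record each safety claim together with the assumptions it depends on, and then check those assumptions against what is actually known about a concrete product. The following is only a minimal sketch: the `SafetyClaim` type, the `unverified_assumptions` helper, and the watchdog example are all hypothetical, and are not taken from any standard or tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyClaim:
    """A claim about a component that only holds under stated assumptions."""
    statement: str
    assumptions: frozenset[str]


def unverified_assumptions(claim: SafetyClaim, confirmed: set[str]) -> set[str]:
    """Return the assumptions that the concrete context has not confirmed."""
    return set(claim.assumptions) - confirmed


# Hypothetical claim made against an abstract context.
claim = SafetyClaim(
    statement="Watchdog resets the controller within 50 ms of a stall",
    assumptions=frozenset({
        "hardware timer runs from an independent clock",
        "reset line is wired to the controller",
        "operating temperature stays within the rated range",
    }),
)

# Facts established for one hypothetical concrete product.
confirmed_in_product = {
    "hardware timer runs from an independent clock",
    "reset line is wired to the controller",
}

# Any assumption left over means the claim cannot yet be relied upon.
outstanding = unverified_assumptions(claim, confirmed_in_product)
claim_holds = not outstanding
```

Here one assumption remains unconfirmed, so the claim stays provisional. A check like this only covers the stated conditions: it does nothing to reveal new hazards introduced by the concrete context, which is why the underlying analysis also needs to be revisited.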
The value of performing a full safety process for a component using such an abstract context, as with an SEooC, is thus questionable. Performing hazard and risk analysis in such a context can certainly be valuable, allowing us to identify and characterise the hazards that may apply for a defined class of systems, and develop safety measures that can be used to mitigate them. However, we can't be confident that either the analysis or the measures are sufficient until we consider them in a concrete context. If we claim that they are sufficient for the abstract context, then there is a risk that the necessary analysis and re-validation will be omitted when they are used in a concrete context.
Failures and hazards
The importance of considering a wider system context is only one aspect of the system theoretic safety perspective. Equally important is the observation that hazards, particularly in complex systems, can manifest even when all of the components in a system are performing their specified function correctly.
Functional safety practices have tended to focus on reducing the risk of a hazardous outcome in the event of a failure, by which we mean the manifestation of a fault: something that prevents a component or system from fulfilling its intended purpose. Commonly used hazard and risk analysis techniques, such as Failure Mode and Effect Analysis (FMEA) and Fault Tree Analysis (FTA), involve systematically examining the ways in which components can fail in order to understand the effect of this on the system. The probability of these failures occurring is then calculated and used to evaluate the associated risks of a hazard, which are then used to identify where safety measures - activities or technical solutions to detect, prevent or respond to failures - are required.
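To make the probability calculation concrete, here is a small sketch of how a fault tree combines basic-event probabilities through AND and OR gates. The gate formulas are the standard ones for statistically independent events; the tree itself, the event names and the probability values are invented for illustration.

```python
from math import prod


def and_gate(probabilities):
    """All inputs must occur: multiply (assumes independent events)."""
    return prod(probabilities)


def or_gate(probabilities):
    """At least one input occurs: 1 - P(none occur) (assumes independence)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical failure probabilities per hour of operation.
p_sensor_a = 1e-3
p_sensor_b = 1e-3
p_controller = 1e-5

# Hypothetical top event: the command is lost if both redundant sensors
# fail, OR if the controller fails.
p_both_sensors = and_gate([p_sensor_a, p_sensor_b])
p_top = or_gate([p_both_sensors, p_controller])
```

With these illustrative numbers the top event probability is roughly 1.1e-5, dominated by the controller. Note that the result is only as good as the independence assumption and the completeness of the tree: an interaction that nobody modelled as a basic event contributes nothing to the total.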
When the root cause of such failures is known (e.g. a hardware component with specified physical limitations or need for periodic maintenance), then this kind of analysis can be an effective way to mitigate the consequences of a failure, or to identify how the consequences of a failure can cascade through the system in a 'chain of events' to cause an accident. This enables the development of fault-tolerant systems, which can remain safe even when one or more components fail. However, this analytical approach has acknowledged limitations when used in isolation, and when used to identify multi-factorial systemic failures.
Part of the reason for this is the focus on failures and the event-chain model. When we examine accidents from a different perspective, it becomes apparent that accidents can occur even in the absence of any failure, due to unforeseen interactions between components or environmental conditions (external factors that affect the system state), or a combination of factors that may involve several components. Furthermore, these factors may be external to the system or components as specified, most notably when human interaction contributes to a hazard.
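A toy example may help to show how a hazard can arise without any failure. In the sketch below, both components behave exactly as their (hypothetical) specifications require, yet their combination produces a hazardous system state. The train door scenario, the threshold value and all function names are invented for illustration.

```python
# Hypothetical sensor specification: speeds below this are reported as "stopped".
STOPPED_THRESHOLD_M_S = 0.5


def speed_sensor_reports_stopped(true_speed_m_s: float) -> bool:
    """Meets its spec: report 'stopped' when speed is below the threshold."""
    return true_speed_m_s < STOPPED_THRESHOLD_M_S


def door_controller_opens(stopped_signal: bool, open_requested: bool) -> bool:
    """Meets its spec: open on request when told the train is stopped."""
    return stopped_signal and open_requested


def hazardous(true_speed_m_s: float, doors_open: bool) -> bool:
    """System-level hazard: doors are open while the train is actually moving."""
    return doors_open and true_speed_m_s > 0.0


# The train is creeping forward at 0.3 m/s when a passenger requests the doors.
true_speed = 0.3
doors_open = door_controller_opens(
    speed_sensor_reports_stopped(true_speed), open_requested=True
)
hazard_present = hazardous(true_speed, doors_open)
```

Neither function contains a fault, so an analysis that only asks "what happens if this component fails?" has nothing to find. The hazard only becomes visible when the two specifications are examined together, in the context of what the system as a whole must never do.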
To identify this kind of hazard, we need to examine a system from a different perspective, considering not just how individual components or elements of the system may fail, but how their interactions, and the influence of external factors that can affect their state (their environment), may combine to cause undesirable outcomes.
At the simplest level, this perspective can be applied by re-framing safety objective #1:
- What undesirable outcomes (losses) can occur?
- What sets of conditions (hazards) can lead to a loss?
- What criteria (constraints) must be satisfied in order to avoid these hazards?
Having identified hazards in this way, safety goals and requirements may then be described in terms of the constraints that must apply to the behaviour of the system in order to avoid or minimise the impact of these hazards.
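The three questions above suggest a simple traceability structure: constraints trace to hazards, and hazards trace to losses. The sketch below is one hypothetical way to record that, with a check for hazards that no constraint yet addresses. The identifiers and descriptions are illustrative examples only, not the output of a real analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Loss:
    ident: str
    description: str


@dataclass(frozen=True)
class Hazard:
    ident: str
    description: str
    leads_to: tuple  # identifiers of the losses this hazard can lead to


@dataclass(frozen=True)
class Constraint:
    ident: str
    description: str
    addresses: tuple  # identifiers of the hazards this constraint avoids


# Illustrative entries for a hypothetical road vehicle.
losses = [
    Loss("L-1", "Loss of life or injury to people"),
    Loss("L-2", "Damage to the vehicle or to objects outside it"),
]
hazards = [
    Hazard("H-1", "Vehicle does not maintain a safe distance from objects ahead",
           ("L-1", "L-2")),
    Hazard("H-2", "Vehicle leaves its intended lane", ("L-1", "L-2")),
]
constraints = [
    Constraint("SC-1", "Vehicle must maintain a safe distance from objects ahead",
               ("H-1",)),
]


def unaddressed_hazards(hazards, constraints):
    """Return identifiers of hazards that no constraint addresses."""
    covered = {h for c in constraints for h in c.addresses}
    return [h.ident for h in hazards if h.ident not in covered]


gaps = unaddressed_hazards(hazards, constraints)
```

In this example `gaps` contains `H-2`, showing that a constraint is still needed for the lane-departure hazard. Keeping the links explicit also makes it easier to revisit the analysis when the system context changes.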
This is the approach taken by System-Theoretic Process Analysis (STPA), a hazard analysis technique based on a relatively new accident model (STAMP: Systems-Theoretic Accident Modeling and Processes), which was developed by Dr. Nancy Leveson of MIT in direct response to the perceived limitations of existing event-chain models.
Comparisons of STPA and FMEA, and of both with FTA, examining their relative ease of use and effectiveness at identifying unique hazards, found that they "deliver similar analysis results" and have their own strengths and weaknesses; a more recent paper explored the potential benefits of combining STPA and FMEA to address some of these weaknesses. However, the merits of applying systems thinking to safety go beyond hazard analysis.
STPA's approach to modelling control structures equips engineers with an analytical technique that is flexible enough to include factors ranging from signals exchanged between microprocessors to the impact of new legislation on the organisation producing them. While this can open up a breadth of analysis that may seem counter-productive, when applied correctly the method ensures the correct focus, by explicitly defining the scope of analysis at the outset.
The most immediate application of this technique is as a hazard and risk analysis technique, but it also provides a framework for developing reusable and extensible safety requirements. These can be iteratively developed and refined at different levels of abstraction, and can examine different levels of the control hierarchy, as the system is developed. The technique can also be applied at a much earlier stage in the safety and development lifecycles, and remains useful throughout. It can also be used to analyse the processes involved in these lifecycles, to identify how measures intended to increase safety might best mitigate risks introduced via these processes (e.g. applying a security patch to a piece of software).
STPA facilitates a progression from identifying hazards, through defining constraints, to validating design and verification measures, and identifying the causes of an issue during integration and testing. If used in this wider mode, the technique has the potential to deliver far greater benefits, complemented by classic bottom-up hazard and risk analysis techniques (where appropriate) rather than replacing them.
Complex software is playing an ever-increasing role in systems where safety is a critical consideration, notably in vehicles that include advanced driver-assistance systems (ADAS) and in the development of autonomous driving capabilities, but also in medical applications, civil infrastructure and industrial automation.
When thinking about safety as it applies to software, logic dictates that we must consider the system and the wider context within which that software is used. This is necessary because hazards are an emergent property of this system context: they cannot be fully understood in the context of the software component alone. Furthermore, hazards do not only occur when a component fails: they may result from unanticipated interactions between components, or the influence of external conditions (including human interactions), even when all of the components involved are working correctly.
This systemic perspective is often missing from conventional functional safety practices and the associated standards, because they focus on hazards that arise from component failures. ISO 26262, for example, defines functional safety as "absence of unreasonable risk due to hazards caused by malfunctioning behaviour of E/E systems." The limitations of this perspective are explicitly recognised in the more recent ISO 21448, which notes that:
"For some systems, which rely on sensing the internal or external environment, there can be potentially hazardous behaviour caused by the intended functionality or performance limitation of a system that is free from faults addressed in the ISO 26262 series."
-- ISO 21448:2019 Road vehicles - Safety of the intended functionality
However, while this standard is a step in the right direction, and specifically mentions STPA as a technique for identifying hazards that arise from "usage of the system in a way not intended by the manufacturer of the system", it stops short of recommending this kind of analysis as a matter of course for systems involving complex software.
One explanation for this omission may be found in the historical origins of functional safety practices, which were largely developed in the context of electro-mechanical systems and relatively simple software components. In the automotive industry in particular, standardised functional safety practices are also explicitly built around components, reflecting the way that responsibility for safety is distributed throughout a vehicle manufacturer's supply chain.
Breaking down systems into discrete components is a familiar and necessary strategy in systems engineering, allowing different developers to work more efficiently, and promoting specialisation and re-use. However, because safety must be understood at a system level, it can be counter-productive to evaluate a component in isolation and then label it as 'safe', even for a tightly-specified use. This can lead to missed hazards, but it also means that the majority of safety engineering effort is expended on defining and validating a component in isolation, instead of examining its role in a wider, more concrete context.
As an alternative, top-down analytical techniques such as STPA can be used to identify and characterise hazards at both a system and a component level, and to analyse the processes used to develop them. Safety requirements are then derived from this analysis in the form of constraints, which can be iteratively developed at various levels of abstraction, or levels of a system-component hierarchy. By providing a common language to inform safety activities at all levels, these constraints can then be used to validate component behaviour and safety measures across the system, not only at the level of the component, or in an abstract system context.
This approach is not incompatible with the bottom-up, failure-focussed techniques that are prevalent in functional safety practices, but by providing a way to re-frame and re-focus safety engineering efforts at the system level, it may ensure that those efforts are more effective at identifying and mitigating the hazards that slip through existing nets.
Glossary
The following definitions, many of them borrowed directly from the STPA Handbook, clarify the meaning of the terms highlighted in italics in this article.
abstract context: A system context that is partial, provisional or conditional, where missing or unspecified aspects of the context are described using assumptions or requirements. This may be contrasted with a concrete context.
component: A discrete element or part of a system, or systems. A component in one frame of reference may be considered a system in another.
concrete context: A system context that corresponds to a real-world system (e.g. a product) with specified components and environment. This may be contrasted with an abstract context.
constraints: Unambiguous criteria pertaining to the operation of a system. Constraints are described using “must” or “must not” rather than “shall”; a distinction is made between requirements (system goals or mission) and constraints on how those goals can be achieved.
environment: An aspect of the system context, which may include any external factor that may have an effect upon it. Depending on the nature and boundaries of the system, this might be anything that is external to it: an aspect of the physical world (e.g. weather) for a sensor, or the CPU hardware for an operating system.
failure: The manifestation of a fault: something that prevents a component or system from fulfilling its intended purpose.
hazard: A system state or set of conditions that, together with a particular set of worst-case environmental conditions, will lead to a loss.
loss: An undesirable outcome associated with the operation of a system, involving something of value to its stakeholders (users, producers, customers, operators, etc).
process: A formalised set of practices that are undertaken as part of a development lifecycle.
risk: Describes the probability of an undesirable outcome (one that may lead to a loss) and the severity of the consequences.
safety measure: An activity or technical solution that is intended to prevent a hazard, reduce the probability of the associated risk, or minimise the severity of the consequences.
SEooC: Safety Element out of Context. In the ISO 26262 standard, this term is used to describe a component that is subjected to a safety certification process for an abstract context. See the ISO 26262 definition for more information.
sufficient: What is considered acceptable for a given domain or category of systems when considering what safety measures need to be undertaken to identify or mitigate risks, and what criteria these need to satisfy.
system: A set of components that act together as a whole to achieve some common goal, objective, or end. A system may contain subsystems and may also be part of a larger system. It may have both hardware and software components.
system context: A defined scope of analysis, which encompasses a system (or component), its intended purpose and the factors (including its environment) that may have a bearing upon that purpose. Some of these factors may be implied by the identified purpose (e.g. a car driven on public highways is subject to weather and traffic regulations).
tool: A software or hardware solution that is used as part of the development process for a system or component. If a tool is responsible for providing a safety measure (e.g. constructing or verifying a component), then it has a bearing on safety, even though it does not form part of the resultant system or component.