Many embedded and automotive vendors recommend a hypervisor-based architecture as the best way to achieve electronic control unit (ECU) consolidation. The idea is to isolate functions into guest operating system virtual machines and restrict access to sensitive resources. In the consolidated architectures they propose, several guest operating systems share one ECU on top of a hypervisor.
We suggest that this approach is fundamentally incorrect, as follows:
- Any operating system that we trust to run safety-critical processes on a multicore processor must be able to guarantee that it schedules all resources so that safety-critical processes get what they need and are properly isolated from other processes. Without such guarantees, the operating system would not be fit for safety-critical use in any case.
- If we have an operating system which provides trustable guarantees, we can rely on it to isolate and schedule multiple safety-critical processes for a single multicore processor. We don’t need additional separation. We either trust the OS or we don’t.
- Given that we are bound to have multiple boxes, we do not actually need an architecture that supports multiple OSes on the same physical machine. We could dedicate some boxes to Linux, and others to QNX, for example. Or port the missing functionality to the other OS (we’ll be porting anyway, since a lot of the functions are currently bare-metal).
- Multithreading operating systems are already designed to handle multiple applications. Moreover:
- Each guest requires its own copy of the operating system and system libraries. The guests are consuming memory multiple times for the same things. We do not have lots of spare memory to waste on copies of system software. This would only be justifiable if evidenced improvements in safety/reliability of the system can be shown to be worth the cost.
- Adding a hypervisor increases the amount of software in the system. More software means more bugs, more security vulnerabilities, and an increased attack surface. As one executive commented “How do they think they can close one door, by opening two?”
- Once we put a hypervisor underneath an OS, all guarantees that we had for the OS itself no longer apply. All bets are off. A critical bug or vulnerability in the hypervisor can definitely take down the system (the same is true for any operating system, of course, and for the underlying hardware).
- More software means more things to update, which is more complicated, more error-prone and higher risk than updating a single software stack.
- The only fundamental justification for using a hypervisor is to support multiple different operating systems on a single CPU. Since we expect that each system/vehicle requires a set of domain controllers, it seems that we can avoid that situation altogether just by dedicating some controllers to run QNX, some to run Linux and so on.
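The memory point in the list above can be made concrete with a rough back-of-the-envelope sketch. Every figure below is a hypothetical placeholder chosen for illustration, not a measurement of any real hypervisor, OS or vehicle; the function names are also our own.

```python
def hypervisor_footprint_mib(guests, os_mib, hypervisor_mib, apps_mib):
    """Memory used when every guest carries its own OS and system libraries."""
    return hypervisor_mib + guests * os_mib + apps_mib


def single_os_footprint_mib(os_mib, apps_mib):
    """Memory used when all applications share one OS instance."""
    return os_mib + apps_mib


# Hypothetical figures, for illustration only.
GUESTS = 3            # e.g. infotainment, cluster and HUD as separate guests
OS_MIB = 256          # assumed footprint of one OS plus its system libraries
HYPERVISOR_MIB = 32   # assumed footprint of the hypervisor itself
APPS_MIB = 1024       # assumed total footprint of the applications

with_hv = hypervisor_footprint_mib(GUESTS, OS_MIB, HYPERVISOR_MIB, APPS_MIB)
without_hv = single_os_footprint_mib(OS_MIB, APPS_MIB)
overhead = with_hv - without_hv

print(f"with hypervisor:    {with_hv} MiB")
print(f"single OS:          {without_hv} MiB")
print(f"duplicated/system:  {overhead} MiB")
```

The model is deliberately simple: it ignores any memory sharing or deduplication a particular hypervisor might offer, so it should be read as an upper-bound style estimate under the stated assumptions.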
Consolidation Without Hypervisors
If we choose an operating system that handles scheduling and separation of multiple applications across available resources (memory and CPU cores), the consolidated approach is simply to run those applications side by side on that one OS.
Thus if we want to combine Infotainment, Cluster and HUD, the desired approach would be to combine all of these functions on a single unit running Linux, QNX or similar. Separation can be achieved in various ways (for example namespaces or Discretionary Access Control) depending on the OS, and the architect clearly needs to establish that the OS is fit for this purpose. For example, in discussion around this document it was suggested that in critical scenarios the OS must perform as a Separation Kernel.
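As a small illustration of one of the separation mechanisms mentioned above, the following sketch shows Discretionary Access Control on a POSIX-like OS: a process creates a file that only its own user may read or write. The file name and contents are hypothetical, and this demonstrates only the mechanism; it says nothing about whether a given OS is fit for safety-critical use.

```python
import os
import stat
import tempfile


def create_private_file(directory, name, data):
    """Create a file readable and writable only by the owning user."""
    path = os.path.join(directory, name)
    # O_EXCL makes creation fail if the file already exists.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, data)
        # Set the mode explicitly so the result does not depend on the umask.
        os.fchmod(fd, 0o600)
    finally:
        os.close(fd)
    return path


def permission_bits(path):
    """Return only the permission bits of the file mode."""
    return stat.S_IMODE(os.stat(path).st_mode)


with tempfile.TemporaryDirectory() as workdir:
    # "cluster-state.bin" is a made-up name for illustration.
    path = create_private_file(workdir, "cluster-state.bin", b"speed=0\n")
    mode = permission_bits(path)
    others_can_read = bool(mode & (stat.S_IRGRP | stat.S_IROTH))
    print(oct(mode), "readable by group/others:", others_can_read)
```

In a consolidated system each function (for example the cluster and the infotainment stack) would typically run as a different user, so that the kernel refuses cross-function access to such files. Namespaces and mandatory access control would layer further restrictions on top, depending on the OS.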
There was a lot of useful discussion around the contents of this document which we distil down to two key points:
1) Improved security
"While adding the complexity of a hypervisor (or separation kernel) increases potential vulnerability attack surface, separation makes exploitation more difficult."
It’s not clear that there is any data/research about in-the-wild exploitations, to indicate whether the claimed increase in exploitation difficulty actually pays off. And as Geer’s Law states
Also, note that this approach does not mitigate the various hardware-level vulnerabilities exposed in modern microprocessors (Rowhammer, Spectre, Meltdown, etc.).
2) Reduced costs and risks
The main justification for consolidation is to reduce engineering cost and risk because:
- fewer boxes, less material, less weight, less physical space, less wiring
- we can reuse the same code
- no need to revalidate the whole ECU when we make a change in one of the 'guests' (depending on implementation a 'guest' may be a stack with an OS, or a single executable)
However there are some clear additional costs and risks:
- direct costs of the 'hypervisor' or ‘separation kernel’ itself (including costs for licensing, support, porting, integration and validation)
- reduction in performance due to the hypervisor or separation kernel itself
- reduction in available memory (which can affect performance also) due to guest OS footprints
- increased complexity and risk for security updates (now we may need to update a hypervisor and/or multiple guest OSes)
- another link in the 'chain of trust', resulting in an increase in the attack surface
- another vendor in the supply chain
- depending on the choice of hypervisor/separation-kernel, another binary blob where we have to trust the vendor for long-term support
- risk of missed issues due to the assumption of 'we don't need to re-validate'
- uncertainty leading to wrong implementation (will decision-makers safely distinguish between a dedicated minimal 'hypervisor' of 900 lines as described, a 'separation kernel', and some shiny 'product' based on KVM or similar?)
- risk that the 're-use' turns out to require 're-work' costs because the vendor contribution doesn't play nicely as a guest after all
- blame gaming between vendors when there are issues with shared functions such as power management, diagnostics, etc.
- risk that people assume 'hypervisor' solves all the problems, until results show that it didn't
- costs associated with recalls and/or accidents arising from failure to address the above