Many embedded/automotive vendors recommend that electronic control unit (ECU) consolidation is best achieved by adopting an architecture with a hypervisor. The idea is to isolate functions into guest operating system virtual machines and restrict their access to sensitive resources. Examples of the consolidated architecture look something like this:
We suggest that this approach is fundamentally incorrect, as follows:
- Any operating system that we trust to run safety-critical processes on a multicore processor must be able to guarantee scheduling of all resources, so that safety-critical processes get what they need and are properly isolated from other processes. Without such guarantees, the operating system would not be fit for safety-critical use in any case.
- If we have an operating system which provides trustable guarantees, we can rely on it to isolate and schedule multiple safety-critical processes for a single multicore processor. We don’t need additional separation. We either trust the OS or we don’t.
- Given that we are bound to have multiple boxes, we do not actually need an architecture that supports multiple OSes on the same physical machine. We could dedicate some boxes to Linux, and others to QNX, for example. Or port the missing functionality to the other OS (we’ll be porting anyway, since a lot of the functions are currently bare-metal).
- Multitasking operating systems are already designed to handle multiple applications. Moreover:
- Each guest requires its own copy of the operating system and system libraries. The guests are consuming memory multiple times for the same things. We do not have lots of spare memory to waste on copies of system software. This would only be justifiable if evidenced improvements in safety/reliability of the system can be shown to be worth the cost.
- Adding a hypervisor increases the amount of software in the system. More software means more bugs, more security vulnerabilities, and an increased attack surface. As one executive commented: “How do they think they can close one door by opening two?”
- Once we put a hypervisor underneath an OS, all guarantees that we had for the OS itself no longer apply. All bets are off. A critical bug or vulnerability in the hypervisor can take down the whole system (the same is true for any operating system, of course, and for the underlying hardware).
- More software means more things to update, which is more complicated, more error-prone and higher-risk than updating a single software stack.
- The only fundamental justification for using a hypervisor is to support multiple different operating systems on a single CPU. Since we expect that each system/vehicle requires a set of domain controllers, it seems that we can avoid that situation altogether just by dedicating some controllers to run QNX, some to run Linux and so on.
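The point about trusting the OS itself to partition resources can be illustrated with a minimal sketch. On Linux, for example, CPU affinity alone lets us dedicate cores to particular processes with no hypervisor involved (the core numbers below are purely illustrative, not taken from any real safety design):

```python
import os

# Sketch: on Linux, the OS alone can partition CPU resources between
# processes. Here we pin the current process to core 0, leaving the
# remaining cores free for other (e.g. safety-critical) workloads.
# A real system would derive the core allocation from its safety design.

available = os.sched_getaffinity(0)   # cores this process may currently run on
print(f"initially allowed cores: {sorted(available)}")

os.sched_setaffinity(0, {0})          # restrict this process to core 0 only
print(f"now restricted to: {sorted(os.sched_getaffinity(0))}")
```

Real-time scheduling classes (e.g. `SCHED_FIFO`) extend the same idea to CPU time, again using only OS facilities.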
Consolidation Without Hypervisors
If we choose an operating system that handles scheduling and separation of multiple applications across the available resources (memory and CPU cores), the overall consolidated approach is simply:
Thus if we want to combine Infotainment, Cluster and HUD, the desired approach is to combine all of these functions on a single unit running Linux, QNX or similar. Separation can be achieved in various ways (for example, namespaces or Discretionary Access Control) depending on the OS, and the architect clearly needs to establish that the OS is fit for this purpose. For example, in discussion around this document it was suggested that in critical scenarios the OS must perform as a Separation Kernel.
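As a minimal sketch of the Discretionary Access Control route mentioned above (the file and mode here are hypothetical, chosen only for illustration): on a POSIX OS, a data file belonging to one function can be restricted so that processes running under other user IDs cannot open it at all:

```python
import os
import stat
import tempfile

# Sketch: Discretionary Access Control on a POSIX OS. A per-function
# data file is made readable/writable only by its owning user, so
# processes running under other (non-root) UIDs cannot open it.
# The file here is a throwaway temp file, purely for illustration.

fd, path = tempfile.mkstemp()
os.close(fd)

os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)   # mode 0600: owner only

mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))                              # prints 0o600

os.remove(path)
```

Namespaces go further on Linux, giving each process group its own view of mounts, PIDs and networking, again without any hypervisor in the stack.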
There was a lot of useful discussion around the contents of this document which we distil down to two key points:
1) Improved security
"While adding the complexity of a hypervisor (or separation kernel) increases potential vulnerability attack surface, separation makes exploitation more difficult."
It’s not clear that there is any data or research on in-the-wild exploitations to indicate whether the claimed increase in exploitation difficulty actually pays off. And as Geer’s Law states, “Any security technology whose effectiveness can’t be empirically determined is indistinguishable from blind luck.”
Also, note that this approach does not mitigate the various hardware-level vulnerabilities exposed in modern microprocessors (Rowhammer, Spectre, Meltdown, etc.).
2) Reduced costs and risks
The main justification for consolidation is to reduce engineering cost and risk, because:
- fewer boxes, less material, less weight, less physical space, less wiring
- we can reuse the same code
- no need to revalidate the whole ECU when we make a change in one of the 'guests' (depending on implementation a 'guest' may be a stack with an OS, or a single executable)
However there are some clear additional costs and risks:
- direct costs of the 'hypervisor' or ‘separation kernel’ itself (including costs for licensing, support, porting, integration and validation)
- reduction in performance due to the hypervisor or separation kernel itself
- reduction in available memory (which can affect performance also) due to guest OS footprints
- increased complexity and risk for security updates (now we may need to update a hypervisor and/or multiple guest OSes)
- another link in the 'chain of trust', resulting in an increase in the attack surface
- another vendor in the supply chain
- depending on the choice of hypervisor/separation-kernel, another binary blob where we have to trust the vendor for long-term support
- risk of missed issues due to the assumption of 'we don't need to re-validate'
- uncertainty leading to wrong implementation (will decision-makers safely distinguish between a dedicated minimal 'hypervisor' of 900 lines as described, a 'separation kernel', and some shiny 'product' based on KVM or similar?)
- risk that the 're-use' turns out to require 're-work' costs because the vendor contribution doesn't play nicely as a guest after all
- blame gaming between vendors when there are issues with shared functions such as power management, diagnostics, etc.
- risk that people assume 'hypervisor' solves all the problems, until results show that it didn't
- costs associated with recalls and/or accidents arising from failure to address the above