In 2018, Codethink worked with Lukas Bulwahn from BMW on a project investigating some core functionality of the Linux kernel that is relevant to safety considerations of an assumed system. We wanted to share some of our findings from this work, as it could prove useful to communities interested in using Linux in a safety-critical context.
The Linux kernel is already used in various business-critical systems. The impact of an incorrect computation in such a system may be serious, whether this takes the form of a significant loss of financial assets (e.g. in a trading application), a breach of sensitive information (e.g. from online banking), or harm to people (e.g. in safety-related systems). For this reason, the development of such applications involves more rigorous software engineering practices, including intensive testing and other verification and validation practices. However, applying such measures to the application software alone cannot address all possible risks. What if the hardware operations that the application relies upon go wrong? Or if the underlying operating system corrupts the trusted application’s context while it is running?
We might improve our confidence in detecting hardware errors by, for example, running the same application on two different machines, ideally from two different vendors with two different instruction sets, and comparing the results. But what steps could we take to gain more confidence in an operating system?
We reviewed various methods that current kernels use to increase the system safety, and we also looked at one example of how an operating system could mess up an application’s context - by unintentionally modifying the application memory - and what could be done to reduce the risks of this unintentional access when using the Linux kernel. You can find more details of the work and the results on CodethinkLabs, including the test code and details of the kernel features and tools that we used.
Kernel to user-space protection
The system memory of Linux is divided into two areas: kernel-space and user-space. This separation serves to provide memory protection and hardware protection from malicious or errant software behaviour.
- Kernel-space is where the kernel code is stored, and executed. The kernel code is executed under CPU Protection Ring 0, and it has access to all of the machine's instructions and system memory.
- User-space (also called userland) is where the user programs and libraries live. They are executed under Ring 3, and have limited access to system resources. A set of API calls - the system calls - are sent to the kernel to request memory and physical hardware access when necessary.
Whilst a user-space program is not allowed to access kernel memory, the kernel can access user memory. However, the kernel must never execute user-space memory, and it must never access user-space memory unless it is explicitly expected to do so. This property is also relevant for security: a violation might indicate that the kernel itself has been compromised and that malicious code is attempting to collect sensitive application data. The Kernel Self-Protection design concept, which protects against security flaws in the kernel itself, has therefore already considered this scenario and devised methods and measures to detect and protect against such threats; these are equally relevant for safety considerations.
There are two methods to ensure that the property under investigation holds:
Static Analysis: A set of accessor functions, such as copy_to_user(), are available for user-space access from the kernel; combined with a static analysis tool in the kernel, they provide protection by several mechanisms:
- Verifying that the pointer lies within a valid user range.
- Handling any paging triggered by the access as if it came from the user.
- Handling any exception raised by the access as an error return from the copy.
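To make the first and last of these mechanisms concrete, here is a minimal user-space model in C. It is not the kernel's implementation: `struct user_range`, `model_access_ok()` and `model_copy_to_user()` are hypothetical names invented for this sketch. The one property borrowed from the real accessor is the return convention, where the result is the number of bytes that could not be copied, so 0 means success.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the address range that belongs to user space. */
struct user_range {
    uintptr_t start; /* first valid address */
    size_t    size;  /* number of valid bytes */
};

/* Is [addr, addr + n) entirely inside the range? Written to avoid overflow. */
static int model_access_ok(const struct user_range *r, uintptr_t addr, size_t n)
{
    uintptr_t off;

    if (addr < r->start)
        return 0;
    off = addr - r->start;
    return off <= r->size && n <= r->size - off;
}

/*
 * Model of the accessor: returns the number of bytes NOT copied.
 * A rejected pointer is reported as an error return, never as a crash.
 */
static size_t model_copy_to_user(const struct user_range *r,
                                 void *to, const void *from, size_t n)
{
    if (!model_access_ok(r, (uintptr_t)to, n))
        return n;        /* outside the user range: nothing is copied */
    memcpy(to, from, n); /* in the kernel, a fault here also becomes an error return */
    return 0;
}
```

A caller treats any non-zero result as a failure, which mirrors how kernel code typically turns a failed copy into an error code for the system call.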
Runtime analysis: There are several features in modern x86 CPUs to aid isolation of user space code from kernel access. These are Supervisor Mode Execution Prevention (SMEP) and Supervisor Mode Access Prevention (SMAP).
SMEP prohibits the kernel from executing any code that is in user memory. The default is to enable SMEP if the CPU feature is available (unless otherwise requested on the command line by passing nosmep to the kernel).
To determine if the feature is enabled in the CPU use the following command:
cat /proc/cpuinfo | grep smep
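The same check can be done from a program. The sketch below is one possible approach, not code from the project: `line_has_flag()` and `cpuinfo_has_flag()` are hypothetical helpers, and the code assumes the x86 layout of /proc/cpuinfo, where features appear on a line starting with "flags" (ARM kernels use a "Features" line instead). Unlike a plain grep, it matches whole words only.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole whitespace-separated word in `line`. */
static int line_has_flag(const char *line, const char *flag)
{
    size_t len = strlen(flag);
    const char *p = line;

    if (len == 0)
        return 0;
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

        if (starts && ends)
            return 1;
        p++;
    }
    return 0;
}

/*
 * Return 1 if `flag` is listed on the first "flags" line of `path`,
 * 0 if it is not, and -1 if the file cannot be opened.
 */
static int cpuinfo_has_flag(const char *path, const char *flag)
{
    char line[8192];
    FILE *f = fopen(path, "r");
    int found = 0;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "flags", 5) == 0) {
            found = line_has_flag(line, flag);
            break;
        }
    }
    fclose(f);
    return found;
}
```

Calling `cpuinfo_has_flag("/proc/cpuinfo", "smep")` reports whether the CPU advertises the feature; the same call with "smap" covers SMAP.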
SMAP is an x86 feature that allows controlled access to less-privileged memory from kernel context, and is supported by most Intel and AMD CPU cores.
The Linux kernel since 3.8 has had SMAP, which is configured via the CONFIG_X86_SMAP option. This defaults to 'y' (enabled) and can only be changed if CONFIG_EXPERT=y is set. There is also a kernel parameter called nosmap, which can be passed at boot time.
There are also ARM equivalents, Privileged Execute Never (PXN) and Privileged Access Never (PAN).
We constructed some test systems and created some test code to do both static analysis (build time) and dynamic checking (run time) to evaluate the effectiveness of these protection features.
To ensure that user-space-provided pointers are not used in the wrong way, the kernel provides a special tag to be added while using those pointers. Any function with a user-space pointer as a parameter requires the __user attribute. This is then checked to make sure that the __user attribute is not accidentally removed or added to a pointer. These pointers are also checked for any attempt to use the memory referenced directly.
__user attribute is defined in
#define __user __attribute__((noderef, address_space(1)))
#define __kernel __attribute__((address_space(0)))
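As an illustration of what the annotation buys, the following stand-alone C file reproduces the pattern outside the kernel. It is a sketch under assumptions: `model_copy_from_user()`, `get_value_ok()` and `get_value_bad()` are hypothetical names, and the accessor is a trivial stand-in rather than the real kernel function. The file relies on sparse defining `__CHECKER__`, so the attributes are visible only to the checker and vanish for an ordinary compiler.

```c
#include <string.h>

/* Annotations exist only when sparse is checking the file. */
#ifdef __CHECKER__
# define __user  __attribute__((noderef, address_space(1)))
# define __force __attribute__((force))
#else
# define __user
# define __force
#endif

/* Hypothetical stand-in for a kernel accessor; returns bytes NOT copied. */
static unsigned long model_copy_from_user(void *to, const void __user *from,
                                          unsigned long n)
{
    /* __force marks this address-space crossing as deliberate. */
    memcpy(to, (const void __force *)from, n);
    return 0;
}

/* Correct use: the __user pointer is only ever handed to the accessor. */
static int get_value_ok(const int __user *uptr, int *out)
{
    return model_copy_from_user(out, uptr, sizeof(*out)) ? -1 : 0;
}

/* Incorrect use: a direct dereference, which sparse is expected to flag. */
static int get_value_bad(const int __user *uptr)
{
    return *uptr;
}
```

A normal compiler accepts both functions without complaint, which is exactly why a separate checker is needed. Running sparse over the file should produce a warning about dereferencing a noderef expression in `get_value_bad()` and nothing for `get_value_ok()`.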
The kernel supports static analysis tools in the build system to find code problems. We used sparse in our analysis. We built our own version of sparse with some patches to address some of the issues that we identified during testing (see below).
Sparse is built on a simple C parser library, which can be used to look for common issues, such as violations of address-space. It is integrated into the kernel build system and can be run during the build on the files being recompiled (by adding C=1 to the make command line) or on all source files regardless of whether they need rebuilding (C=2).
We wrote a sparse test which injects various deliberate __user errors to demonstrate that sparse can and does show typical issues by default when run over the source code.
In our tests, sparse picked up all the issues we expected, with one exception. There is an issue with passing __user pointers into variadic functions like printk(). As of this time, we are working on adding proper printf formatting support; you can see examples of the changes here.
Sparse did identify some issues in the kernel's access to userspace, some of which may be regarded as false positives. The full reports are available here:
In our selective follow-up investigation of 25 reported findings, we identified 12 false alarms — cases where the kernel needs to access user-space memory to implement the intended functionality and appropriate cautions are implemented while doing so — and 13 cases where more in-depth analysis in future work is required, probably best addressed in discussion with the developers and maintainers of this code.
To get a rough estimate of how much of the codebase the tool analyses, we measured coverage by counting the number of C files checked by sparse for that build against the objects built. Comparing that with the number of C files included in the overall build gives a good indication of how many files are already covered, and a detailed inspection of the list of files not covered lets us determine further investigations for those parts.
However, the analysis does not take all the C files in the kernel source tree into account, as some are for other architectures or build configurations; so, our results and insights cannot simply be reused for other build configurations, but the investigation for other systems needs to be done with the specific build configuration at hand.
We created a kernel module that includes patches to expose some testing APIs, and a number of tests that work via various kernel interfaces, to allow both performance and robustness of the hardware functionality to be tested. These tests only currently cover a subset of the user to kernel interfaces (file read and write) to try to get an idea whether the copy to and from user interfaces are working, and check whether protection mechanisms can cope with a basic set of expected operations.
A VM test system was built using a KVM-based virtual machine with the Debian operating system. This allowed us to test the features described (SMAP/PAN), test the driver code and verify that the features work under VMs. Testing was undertaken using a 4.18 kernel with some minor modifications to facilitate testing; these can be found here.
To allow consistent benchmarking, minimal test systems were also built using the following hardware. The Intel test system was an Intel Core i5-7400 (3.0GHz) with 8GB RAM and a minimal Debian 10 operating system installed. The ARM test system was a Raspberry Pi 3 Model B, with Raspbian 9.4 installed on a 16GiB MMC card.
The test module issues a direct read and a direct write from user-space on a kernel with SMAP enabled.
In both these cases the kernel detects a page fault and kills the process that caused the access (dd). The OOPS contains no special notification that the access was caused by an SMAP trap. The OOPS error code, together with CR4 bit 21 being set, should indicate whether it was SMAP triggered.
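When reading such an OOPS by hand, the relevant bits can be decoded as in the sketch below. The bit positions are the architectural ones (CR4 bit 20 for SMEP, bit 21 for SMAP; page-fault error code bit 0 for a present page, bit 2 for a user-mode access). Everything else is an assumption made for illustration: the helper names are hypothetical, `looks_like_smap_fault()` is only a heuristic, and the user address limit assumes x86-64 with 4-level paging.

```c
#include <stdint.h>

#define CR4_SMEP (UINT64_C(1) << 20) /* Supervisor Mode Execution Prevention */
#define CR4_SMAP (UINT64_C(1) << 21) /* Supervisor Mode Access Prevention */

#define PF_PROT  (UINT64_C(1) << 0)  /* fault on a present page */
#define PF_USER  (UINT64_C(1) << 2)  /* access came from user mode */

/* Assumed upper bound of user addresses (x86-64, 4-level paging). */
#define USER_ADDR_LIMIT (UINT64_C(1) << 47)

static int cr4_smep_enabled(uint64_t cr4)
{
    return (cr4 & CR4_SMEP) != 0;
}

static int cr4_smap_enabled(uint64_t cr4)
{
    return (cr4 & CR4_SMAP) != 0;
}

/*
 * Heuristic only: the fault is consistent with an SMAP trap when SMAP is
 * enabled, the access came from kernel mode, the page was present, and
 * the faulting address lies in the user half of the address space.
 */
static int looks_like_smap_fault(uint64_t error_code, uint64_t cr4,
                                 uint64_t fault_addr)
{
    return cr4_smap_enabled(cr4) &&
           (error_code & PF_PROT) != 0 &&
           (error_code & PF_USER) == 0 &&
           fault_addr < USER_ADDR_LIMIT;
}
```

For example, a CR4 value of 0x3606e0 has both bit 20 and bit 21 set, so both helpers report the feature as enabled.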
The ARM system produces a "page domain fault" OOPS and the process is terminated. For testing this was changed into a catchable fault so that the testing process is not continually killed.
Our further test results also show that enabling SMAP and PAN has a relatively small impact on performance.
Static testing tools already fully integrated in the kernel build, such as sparse, can be used at build time to identify issues where there is improper use of user-pointers, to enable these to be fixed.
Hardware features, such as SMAP and PAN, can offer runtime protection on both x86 and ARM systems that support them. Our testing shows that these features add between 1% and 5% to the execution time of the test syscalls, which is acceptable for various systems given the benefits these hardware features provide.
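A figure of this kind can be approximated with a simple user-space micro-benchmark. The sketch below is not the test code used in this work: `time_read_ns()` and `overhead_percent()` are hypothetical helpers that time repeated read() calls and express the difference between two runs as a percentage of the baseline.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/* Relative overhead of `measured_ns` over `baseline_ns`, in percent. */
static double overhead_percent(double baseline_ns, double measured_ns)
{
    return (measured_ns - baseline_ns) / baseline_ns * 100.0;
}

/*
 * Average wall-clock time in nanoseconds for one read() of `len` bytes
 * from `path`, measured over `iters` calls. Returns -1.0 on any error.
 */
static double time_read_ns(const char *path, size_t len, long iters)
{
    char buf[4096];
    struct timespec t0, t1;
    long i;
    int fd;

    if (len > sizeof(buf) || iters <= 0)
        return -1.0;
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iters; i++) {
        if (read(fd, buf, len) < 0) {
            close(fd);
            return -1.0;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    return ((double)(t1.tv_sec - t0.tv_sec) * 1e9 +
            (double)(t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}
```

One way to use it would be to record `time_read_ns("/dev/zero", 4096, 1000000)` on a kernel booted with nosmap, repeat the measurement with SMAP enabled, and pass the two results to `overhead_percent()`. Results from such a loop vary between runs, so several repetitions on an otherwise idle machine are advisable.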
These methods and features do not fully protect against deliberate accesses to user memory (the kmod we created for testing shows it is possible to build copies of the code that do this for the kernel-user API) or against buffer overflows when copying data. Nor do they rule out problems caused by other means of accessing memory, such as DMA.
We presented here some long-standing methods and features in the Linux kernel that can be used to build confidence in the kernel for business- or safety-critical applications. Our critical review did uncover a few points of improvement, which we followed up within the kernel community. This shows that the existing tools can give us some confidence that the kernel is unlikely to modify the application memory by unintentional access. However, there are further ways that the kernel user-space access can impact applications, and these also deserve deeper investigation in the future.
Lukas Bulwahn comments:
"This investigation shows the clear contribution of existing methods to argue with confidence on the absence of certain unintended behaviour. It starts from a very clear and specific goal, comprehensive from a system architecture perspective, derives clearly contributing methods, and practically describes and validates the methods' effect, coverage and limits. It does not only remain on a highly abstract level of argumentation, but it really connects its argumentation with validation tests, and with the actual findings in the kernel source code and hence provides high confidence of its validity, while improving the current state of kernel verification for the minor gaps identified. This is certainly a nice investigation of how a system property is systematically analysed, verified in detail and comprehensively documented for others to comprehend and reproduce. It may serve as a guiding example for others to provide results in the area of software dependability and especially for safety-related systems."