As part of Codethink's interest in RISC-V I have been following the RISC-V kernel list. Whilst looking through the postings the following bug (more information here) came up, titled:
[syzbot] BUG: unable to handle kernel access to user memory in schedule_tail
It contained a kernel Oops (see below) that looked interesting, so I decided to look at the bug and check out the kernel source and config file supplied in the bug. The bug was intriguing because there is a lot of interest in RISC-V, and I had previously worked on checking the Linux kernel's user access protection.
Unable to handle kernel access to user memory without uaccess routines at virtual address 000000002749f0d0
Oops [#1]
Modules linked in:
CPU: 1 PID: 4875 Comm: syz-executor.0 Not tainted 5.12.0-rc2-syzkaller-00467-g0d7588ab9ef9 #0
Hardware name: riscv-virtio,qemu (DT)
epc : schedule_tail+0x72/0xb2 kernel/sched/core.c:4264
ra : task_pid_vnr include/linux/sched.h:1421 [inline]
ra : schedule_tail+0x70/0xb2 kernel/sched/core.c:4264
epc : ffffffe00008c8b0 ra : ffffffe00008c8ae sp : ffffffe025d17ec0
gp : ffffffe005d25378 tp : ffffffe00f0d0000 t0 : 0000000000000000
t1 : 0000000000000001 t2 : 00000000000f4240 s0 : ffffffe025d17ee0
s1 : 000000002749f0d0 a0 : 000000000000002a a1 : 0000000000000003
a2 : 1ffffffc0cfac500 a3 : ffffffe0000c80cc a4 : 5ae9db91c19bbe00
a5 : 0000000000000000 a6 : 0000000000f00000 a7 : ffffffe000082eba
s2 : 0000000000040000 s3 : ffffffe00eef96c0 s4 : ffffffe022c77fe0
s5 : 0000000000004000 s6 : ffffffe067d74e00 s7 : ffffffe067d74850
s8 : ffffffe067d73e18 s9 : ffffffe067d74e00 s10: ffffffe00eef96e8
s11: 000000ae6cdf8368 t3 : 5ae9db91c19bbe00 t4 : ffffffc4043cafb2
t5 : ffffffc4043cafba t6 : 0000000000040000
status: 0000000000000120 badaddr: 000000002749f0d0 cause: 000000000000000f
Call Trace:
[<ffffffe00008c8b0>] schedule_tail+0x72/0xb2 kernel/sched/core.c:4264
[<ffffffe000005570>] ret_from_exception+0x0/0x14
Dumping ftrace buffer:
(ftrace buffer empty)
---[ end trace b5f8f9231dc87dda ]---
Build the kernel
The report contained the git repo and config used to build, so the first step is to build the kernel to give an image to run and the source output to look at. I've skipped the actions taken to get and build as they are no different from normal kernel development.
Given there is already a bug report, the first action is to assume the kernel we built is similar to the bot built to allow the disassembly, look at the instructions at the bug's location, and map these to the source code.
$ riscv64-linux-gnu-objdump -D kernel/sched/core.o
000000000000ea44 <schedule_tail>:
ea44: 1101 addi sp,sp,-32
ea46: e822 sd s0,16(sp)
ea48: ec06 sd ra,24(sp)
ea4a: e426 sd s1,8(sp)
ea4c: e04a sd s2,0(sp)
ea4e: 1000 addi s0,sp,32
ea50: 00000097 auipc ra,0x0
ea54: 000080e7 jalr ra # ea50 <schedule_tail+0xc>
...
eaa0: f00007b7 lui a5,0xf0000
eaa4: 83e9 srli a5,a5,0x1a
eaa6: 0297e163 bltu a5,s1,eac8 <.L3667>
eaaa: 00040937 lui s2,0x40
eaae: 10092073 csrs sstatus,s2
eab2: 4601 li a2,0
eab4: 4581 li a1,0
eab6: 8512 mv a0,tp
eab8: 00000097 auipc ra,0x0
eabc: 000080e7 jalr ra # eab8 <.LBB16792+0x6>
eac0: 4781 li a5,0
eac2: c088 sw a0,0(s1)
eac4: 10093073 csrc sstatus,s2
eac8: 00000097 auipc ra,0x0
eacc: 000080e7 jalr ra # eac8 <.L3667>
The above shows that 0xea44+0x70
is eab4
. In this dissasembly this is part of the put_user()
in schedule_tail()
function. It is likely given the function is small that this is the right place to look.
$ riscv64-linux-gnu-objdump -Dr -S kernel/sched/core.o
...
put_user(task_pid_vnr(current), current->set_child_tid);
ea8a: 6585 lui a1,0x1
ea8c: 0a858593 addi a1,a1,168 # 10a8 <.LVL499+0x6>
ea90: 00000517 auipc a0,0x0
...
return __task_pid_nr_ns(tsk, PIDTYPE_PID, NULL);
eab2: 4601 li a2,0
eab4: 4581 li a1,0
eab6: 8512 mv a0,tp
The above disassembly shows this build is not precisely the same as the one being built by the test bot. The abort's place in the original report is a li
instruction which means load immediate that should never fault. Looking further down the code, there is a sw
(store word) instruction at 0xeac2
. The Oops also back this up as the badaddr is showing the access is to 0x000000002749f0d0
, which is also register s1
and matches the sw a0,0(s1)
that stores the value in register a0
to the address found in s1
plus 0.
Build a test environment
To run tests, the first attempt is to use the qemu-riscv64 and a Debian rootfs. The Debian install can then be used with qemu-static-riscv64 to install the necessary build tools and use the instructions supple to install the syz-stress tool.
Run the syz-stress and reproduce
Now that the bug can be reproduced and the same methods as above can be used to track down where it is. This shows that the original disassembly and tracking was correct, and the bug is input user at the 'sw' instruction previously identified.
The relevant code for syz-stress was built using a chroot and copied into the test environment. Some basic instructions are included in the mailing list post that was part of the initial bug discussion.
Note, the initial stress testing failed with mmap()
issues which seem to be a side effect of the initial qemu vm using 9pfs as the rootfs. Making a real ext4 rootfs worked, and was a bit faster.
Analysing the issue
At this point, the bug has been confirmed to be reproducible and to exist in the initially identified place. The next step is to try and plan what to find and fix the bug.
If we take the output as true, and the crash was in a store to user space that has been detected as an "access to user memory without uaccess", then this list is our initial condition:
- the fault handler has not mapped the address properly
- the fault handler mapped the address but did not return correctly
- the code has been corrupted at run time
- the code got called into by accident (bad pointer elsewhere)
- the compiler did not compile the code correctly
- the user pointer has been corrupted
The above list might be easier to think about if we firstly look at how the kernel accesses user space. By default, the kernel does not allow access to user memory except via specific accessor functions. So the most accessible place to start would be with the code in <asm/uaccess.h>
which implements the user space accessors. These calls include put_user()
as shown above.
User space accessors are architecture-specific (the above <asm/uaccess.h>
is in the file arch/riscv/include/asm/uaccess.h
). This file implements the architecture-specific memory protection code. For the RISC-V case the hardware has a control bit in the status CSR to say if the kernel mode can access the user memory. Linux refers to this bit as SR_SUM, defined in arch/riscv/include/asm/csr.h
.
The following macros are in the user access code to control the SR_SUM, which is also where the put_user()
is implemented. The following two macros are used to enable user access:
#define __enable_user_access() \
__asm__ __volatile__ ("csrs sstatus, %0" : : "r" (SR_SUM) : "memory")
#define __disable_user_access() \
__asm__ __volatile__ ("csrc sstatus, %0" : : "r" (SR_SUM) : "memory")
#define put_user(x, ptr) \
({ \
__typeof__(*(ptr)) __user *__p = (ptr); \
might_fault(); \
access_ok(__p, sizeof(*__p)) ? \
__put_user((x), __p) : \
-EFAULT; \
})
The put_user()
macro uses __put_user()
to do the access with the appropriate memory protection set via `__enable_user_access().
#define __put_user(x, ptr) \
({ \
__typeof__(*(ptr)) __user *__gu_ptr = (ptr); \
long __pu_err = 0; \
\
__chk_user_ptr(__gu_ptr); \
\
__enable_user_access(); \
__put_user_nocheck(x, __gu_ptr, __pu_err); \
__disable_user_access(); \
\
__pu_err; \
})
If we look at the code for schedule tail, the instructions at addresses 0xeaaa
and 0xeaea
load the SR_SUM constant and the perform the csrs
that is in __enable_user_access()
.
eaa0: f00007b7 lui a5,0xf0000
eaa4: 83e9 srli a5,a5,0x1a
eaa6: 0297e163 bltu a5,s1,eac8 <.L3667>
eaaa: 00040937 lui s2,0x40
eaae: 10092073 csrs sstatus,s2
eab2: 4601 li a2,0
eab4: 4581 li a1,0
eab6: 8512 mv a0,tp
eab8: 00000097 auipc ra,0x0
eabc: 000080e7 jalr ra # eab8 <.LBB16792+0x6>
eac0: 4781 li a5,0
eac2: c088 sw a0,0(s1)
eac4: 10093073 csrc sstatus,s2
eac8: 00000097 auipc ra,0x0
eacc: 000080e7 jalr ra # eac8 <.L3667>
From this and the register dump above, the code was likely properly executed from where it was intended to. The 'ra' register's return address is for the jalr
instruction at 0xeabc
, and the s2 register contains the 0x40000
value loaded at 0xeaaa.
The next thing to note is that the code for __enable_user_access
is above the code that does the first argument to put_user (which is the call to task_pid_vnr()
) that got inlined into the schedule_tail()
code. This may not be the bug, but it is a possible security issue as the task_pid_vnr()
will run with user memory protection off.
To confirm the sequence, we can dump the same region and look at the relocations. In the following output, we can see the call is to __task_pid_nr_ns()
, which is what task_pid_vnr macros expand out to:
$ riscv64-linux-gnu-objdump -Dr kernel/sched/core.
eab2: 4601 li a2,0
eab4: 4581 li a1,0
eab6: 8512 mv a0,tp
eab8: 00000097 auipc ra,0x0
eab8: R_RISCV_CALL __task_pid_nr_ns
eab8: R_RISCV_RELAX *ABS*
eabc: 000080e7 jalr ra # eab8 <.LBB16792+0x6>
Going back to the dump, it is clear the status register has the SR_SUM flag cleared when the abort happened, and the code does not look like it should have gotten there with the flag cleared. We can probably discard a gcc or corruption bug as candidates, which leaves a fault handler or an issue in the code being called.
Verify we should have aborted
The first thing is to check the sstatus flags just before the instruction to see if they are still set. This could point to either a qemu issue clearing the sstatus or the code within __task_pid_nr_ns
call causing the flag to be cleared.
The following patch was attempted to check the flags; however, this made the system unstable.
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index d2e472c4c63a..977c0584f767 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -20,6 +20,13 @@
#include "../kernel/head.h"
+asmlinkage void bad_putuser_mstatus(unsigned addr, unsigned flags)
+{
+ printk(KERN_ERR "bad mflags from put_user, addr %x, flags %x\n",
+ addr, flags);
+ BUG_ON(1);
+}
+
static void die_kernel_fault(const char *msg, unsigned long addr,
struct pt_regs *regs)
{
diff --git a/arch/riscv/include/asm/uaccess.h b/arch/riscv/include/asm/uaccess.h
index 824b2c9da75b..7338b82631fc 100644
--- a/arch/riscv/include/asm/uaccess.h
+++ b/arch/riscv/include/asm/uaccess.h
@@ -1,3 +1,4 @@
+
/* SPDX-License-Identifier: GPL-2.0-only */
/*
* Copyright (C) 2012 Regents of the University of California
@@ -224,8 +225,17 @@ do { \
uintptr_t __tmp; \
__typeof__(*(ptr)) __x = x; \
__asm__ __volatile__ ( \
+ " csrr %1, sstatus\n" \
+ " srli %1, %1, 18\n" \
+ " andi %1, %1, 1\n" \
+ " beqz %1, 5f\n" \
"1:\n" \
" " insn " %z3, %2\n" \
+ " j 2f\n" \
+ "5:\n" \
+ " li a0, 0\n" \
+ " csrr a1, sstatus\n" \
+ " jump bad_putuser_mstatus, %1\n" \
"2:\n" \
" .section .fixup,\"ax\"\n" \
" .balign 4\n" \
The next attempt was to get the same functionality in C to check for the flag being cleared.
+static inline void check_user_access(void)
+{
+ unsigned long status = read_sstatus();
+ if ((status & (1 << 18)) == 0) {
+ printk(KERN_ERR "%s: status register 0x%lx in io (now %lx)\n", __func__, status, read_sstatus());
+ BUG_ON(1);
+ }
+}
+
+
/**
* __put_user: - Write a simple value into user space, with less checking.
* @x: Value to copy to user space.
@@ -321,7 +342,13 @@ do { \
__chk_user_ptr(__gu_ptr); \
\
__enable_user_access(); \
- __put_user_nocheck(x, __gu_ptr, __pu_err); \
+ if (sizeof(*__gu_ptr) <= 4) { \
+ unsigned long tx = x; \
+ check_user_access(); \
+ __put_user_nocheck(tx, __gu_ptr, __pu_err); \
+ } else { \
+ __put_user_nocheck(x, __gu_ptr, __pu_err); \
+ } \
__disable_user_access(); \
\
__pu_err; \
This produced the following output, which shows the code now traps on the BUG_ON() in our patch instead of the fault in sw.:
[ 479.600372][ T5772] can: request_module (can-proto-0) failed.
[ 517.002492][ T5865] check_user_access: status register 0x122 in io (now 122)
[ 517.004272][ T5865] ------------[ cut here ]------------
[ 517.005015][ T5865] kernel BUG at arch/riscv/include/asm/uaccess.h:304!
[ 517.006219][ T5865] Kernel BUG [#1]
[ 517.006736][ T5865] Modules linked in:
[ 517.007421][ T5865] CPU: 3 PID: 5865 Comm: syz-executor Not tainted 5.12.0-rc2-00017-ge5e6131f91c8-dirty #80
[ 517.008424][ T5865] Hardware name: riscv-virtio,qemu (DT)
[ 517.009119][ T5865] epc : schedule_tail+0xd8/0xda
[ 517.009848][ T5865] ra : schedule_tail+0xd8/0xda
[ 517.010520][ T5865] epc : ffffffe00008d04a ra : ffffffe00008d04a sp : ffffffe08be43ec0
[ 518.229029][ T5865] gp : ffffffe00528b130 tp : ffffffe0861cc440 t0 : ffffffe005d01b67
[ 518.230023][ T5865] t1 : ffffffc4117c8778 t2 : 0000000000000000 s0 : ffffffe08be43ee0
[ 518.230756][ T5865] s1 : 00000000178ca0d0 a0 : 0000000000000038 a1 : 00000000000f0000
[ 518.231526][ T5865] a2 : 0000000000000002 a3 : ffffffe0000d6c24 a4 : fec417340f8e1a00
[ 518.232506][ T5865] a5 : fec417340f8e1a00 a6 : 0000000000f00000 a7 : ffffffe08be43bc7
[ 518.233404][ T5865] s2 : 0000000000040000 s3 : ffffffe080994440 s4 : ffffffe0808e3fe0
[ 518.234278][ T5865] s5 : 0000000000004020 s6 : ffffffe0809948d8 s7 : ffffffe0dad26650
[ 518.235087][ T5865] s8 : ffffffe0dad25c18 s9 : 0000000000000000 s10: ffffffe089ce2280
[ 518.235885][ T5865] s11: 000000785a6f92a8 t3 : fec417340f8e1a00 t4 : ffffffc4117c8777
[ 518.236664][ T5865] t5 : ffffffc4117c8779 t6 : ffffffe08be43bc8
[ 518.237252][ T5865] status: 0000000000000120 badaddr: 0000000000000000 cause: 0000000000000003
[ 518.238028][ T5865] Call Trace:
[ 518.238422][ T5865] [<ffffffe00008d04a>] schedule_tail+0xd8/0xda
[ 518.239174][ T5865] [<ffffffe000005570>] ret_from_exception+0x0/0x14
[ 518.242135][ T5865] ---[ end trace 470e0fb737142c59 ]---
[ 518.243067][ T5865] Kernel panic - not syncing: Fatal exception
[ 518.243817][ T5865] SMP: stopping secondary CPUs
There was also a bug in the read_sstatus() implementation as there was no memory barrier in the original shown here. The following OOPS was triggered as the code produced put the CSR read before the code to set the CSR. (note the now value is showing the correct bit is set when the printk was executed):
[ 47.858102][ T1] check_user_access: status register 0x22 in io (now 40022)
[ 47.859931][ T1] ------------[ cut here ]------------
[ 47.860515][ T1] kernel BUG at arch/riscv/include/asm/uaccess.h:304!
[ 47.861435][ T1] Kernel BUG [#1]
[ 47.861835][ T1] Modules linked in:
[ 47.862358][ T1] CPU: 4 PID: 1 Comm: systemd Not tainted 5.12.0-rc2-00017-ge5e6131f91c8-dirty #79
[ 47.863021][ T1] Hardware name: riscv-virtio,qemu (DT)
[ 47.863563][ T1] epc : do_sys_poll+0x852/0x854
[ 47.864330][ T1] ra : do_sys_poll+0x852/0x854
[ 47.864880][ T1] epc : ffffffe000401f36 ra : ffffffe000401f36 sp : ffffffe07ffa7a20
[ 47.865461][ T1] gp : ffffffe00528b188 tp : ffffffe07ffb8000 t0 : ffffffe005d01b67
[ 47.866042][ T1] t1 : ffffffc40fff4ee4 t2 : 0000000000000000 s0 : ffffffe07ffa7e60
[ 47.866622][ T1] s1 : 0000000000000001 a0 : 0000000000000039 a1 : 00000000000f0000
[ 47.867198][ T1] a2 : 0000000000000002 a3 : ffffffe0000d6b66 a4 : d55d48e460f03000
[ 47.867766][ T1] a5 : d55d48e460f03000 a6 : 0000000000f00000 a7 : ffffffe07ffa7727
[ 47.868334][ T1] s2 : 0000000000000001 s3 : 0000003fffeaa8f6 s4 : ffffffe07ffa7a8c
[ 47.868909][ T1] s5 : 0000000000040000 s6 : 0000000000000000 s7 : 0000000000000022
[ 47.869482][ T1] s8 : 0000000000000000 s9 : 0000000000000001 s10: 0000000000000000
[ 47.870042][ T1] s11: 0000000000000022 t3 : d55d48e460f03000 t4 : ffffffc40fff4ee3
[ 47.870609][ T1] t5 : ffffffc40fff4ee5 t6 : ffffffe07ffa7728
[ 47.871087][ T1] status: 0000000000040120 badaddr: 0000000000000000 cause: 0000000000000003
[ 47.871704][ T1] Call Trace:
[ 47.872037][ T1] [<ffffffe000401f36>] do_sys_poll+0x852/0x854
[ 47.872714][ T1] [<ffffffe000402ed8>] sys_ppoll+0x110/0x136
[ 47.873309][ T1] [<ffffffe000005562>] ret_from_syscall+0x0/0x2
[ 47.874727][ T1] ---[ end trace 9529102862d244e4 ]---
The following is the correct read_sstatus() to properly read the value.
#define read_sstatus() ({ unsigned long result; asm(" csrr %0, sstatus" : "=r"(result) :: "memory"); result; })
The resut of this testing implies that the issue is somewhere within the code to get the task's pid.
Alternate user access tracking
The next idea was to try and add a task flag to say that we're in the user access so that we can check if our own code cleared it:
--- a/arch/riscv/include/asm/uaccess.h
+++ b/arch/riscv/include/asm/uaccess.h
@@ -21,11 +21,22 @@
#include <asm/extable.h>
#include <asm/asm.h>
-#define __enable_user_access() \
- __asm__ __volatile__ ("csrs sstatus, %0" : : "r" (SR_SUM) : "memory")
-#define __disable_user_access() \
+#define __enable_user_access() \
+ do { extern void notify_uaccess(unsigned x); notify_uaccess(1); __asm__ __volatile__ ("csrs sstatus, %0" : : "r" (SR_SUM) : "memory"); ; } while(0)
+#define __disable_user_access() \
+ do { extern void notify_uaccess(unsigned x); notify_uaccess(0); __asm__ __volatile__ ("csrc sstatus, %0" : : "r" (SR_SUM) : "memory"); } while(0)
+
With the following patch that tracks the state in the thread_struct assoicated with each process.
diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index 021ed64ee608..af7ea0460384 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -35,6 +35,7 @@ struct thread_struct {
unsigned long s[12]; /* s[0]: frame pointer */
struct __riscv_d_ext_state fstate;
unsigned long bad_cause;
+ unsigned long in_uaccess;
};
#define INIT_THREAD { \
diff --git a/arch/riscv/include/asm/switch_to.h b/arch/riscv/include/asm/switch_to.h
index 407bcc96a710..69442885d27d 100644
--- a/arch/riscv/include/asm/switch_to.h
+++ b/arch/riscv/include/asm/switch_to.h
@@ -72,6 +72,7 @@ do { \
struct task_struct *__next = (next); \
if (has_fpu) \
__switch_to_aux(__prev, __next); \
+ if (__prev->thread.in_uaccess != __next->thread.in_uaccess) { printk(KERN_INFO "switch_to is changing uaccess!\n"); } \
((last) = __switch_to(__prev, __next)); \
} while (0)
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 977c0584f767..3426da4aa307 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -20,6 +20,11 @@
#include "../kernel/head.h"
+void notify_uaccess(unsigned to)
+{
+ current->thread.in_uaccess = to;
+}
+
asmlinkage void bad_putuser_mstatus(unsigned addr, unsigned flags)
{
printk(KERN_ERR "bad mflags from put_user, addr %x, flags %x\n",
These additions showed that the thread had not cancelled the SR_SUM when the oops happened (apologies, the OOPS with the extra information was lost). The next thing is to have a look at the task_pid_vnr(current) call (this is defined in include/linux/sched.h) to be an inline call to '__task_pid_nr_ns(tsk, PIDTYPE_PID, NULL)' which is defined in the kernel/pid.c file.
The code for __task_pid_nr_ns does not look very large, although it does do some call outs to other functions:
pid_t __task_pid_nr_ns(struct task_struct *task, enum pid_type type,
struct pid_namespace *ns)
{
pid_t nr = 0;
rcu_read_lock();
if (!ns)
ns = task_active_pid_ns(current);
nr = pid_nr_ns(rcu_dereference(*task_pid_ptr(task, type)), ns);
rcu_read_unlock();
return nr;
}
The disassembly of the start of the function (shown below) also shows we have coverage tracing functions. Trying to trace the full function now will be difficult (readelf shows the function is 908 bytes).
$ riscv64-linux-gnu-objdump -Dr kernel/pid.o
0: 7179 addi sp,sp,-48
2: f022 sd s0,32(sp)
4: ec26 sd s1,24(sp)
6: e84a sd s2,16(sp)
8: f406 sd ra,40(sp)
a: e44e sd s3,8(sp)
c: e052 sd s4,0(sp)
e: 1800 addi s0,sp,48
10: 84aa mv s1,a0
12: 892e mv s2,a1
14: 00000097 auipc ra,0x0
14: R_RISCV_CALL __sanitizer_cov_trace_pc
14: R_RISCV_RELAX *ABS*
18: 000080e7 jalr ra # 14 <pid_nr_ns+0x14>
000000000000001c <.LVL1>:
1c: cc8d beqz s1,56 <.L4>
1c: R_RISCV_RVC_BRANCH .L4
1e: 00000097 auipc ra,0x0
1e: R_RISCV_CALL __sanitizer_cov_trace_pc
1e: R_RISCV_RELAX *ABS*
22: 000080e7 jalr ra # 1e <.LVL1+0x2>
Assume somehow switching tasks is not preserving SR_SUM
We now know something is clearing the SR_SUM flag, and it is not our code. A review of the entry code looks like it correctly clears and stores the previous state when entering due to an exception such as a page fault. The next target would be to assume that voluntarily switching out, such as sleep or re-schedule, is the next point to look at.
From previous work, I know that the scheduler will use switch_to(prev, next) to change from 'prev' to 'next' task (defined in arch/riscv/include/asm/switch_to.h), which does the job of transferring the control. The actual register saving is done in assembly in a sub-call of __switch_to() defined in arch/riscv/kernel/entry.S
Looking at __switch_to
The following is the section of __switch_to assembly code which enters as the 'prev' thread and exits as 'next'. The code stores the values that C would expect not to be changed by a procedure call and then reloads them from the new task. Once this is done, it then swaps the current thread information for that CPU.
ENTRY(__switch_to)
/* Save context into prev->thread */
li a4, TASK_THREAD_RA
add a3, a0, a4
add a4, a1, a4
REG_S ra, TASK_THREAD_RA_RA(a3)
REG_S sp, TASK_THREAD_SP_RA(a3)
REG_S s0, TASK_THREAD_S0_RA(a3)
REG_S s1, TASK_THREAD_S1_RA(a3)
REG_S s2, TASK_THREAD_S2_RA(a3)
REG_S s3, TASK_THREAD_S3_RA(a3)
REG_S s4, TASK_THREAD_S4_RA(a3)
REG_S s5, TASK_THREAD_S5_RA(a3)
REG_S s6, TASK_THREAD_S6_RA(a3)
REG_S s7, TASK_THREAD_S7_RA(a3)
REG_S s8, TASK_THREAD_S8_RA(a3)
REG_S s9, TASK_THREAD_S9_RA(a3)
REG_S s10, TASK_THREAD_S10_RA(a3)
REG_S s11, TASK_THREAD_S11_RA(a3)
/* Restore context from next->thread */
REG_L ra, TASK_THREAD_RA_RA(a4)
REG_L sp, TASK_THREAD_SP_RA(a4)
REG_L s0, TASK_THREAD_S0_RA(a4)
REG_L s1, TASK_THREAD_S1_RA(a4)
REG_L s2, TASK_THREAD_S2_RA(a4)
REG_L s3, TASK_THREAD_S3_RA(a4)
REG_L s4, TASK_THREAD_S4_RA(a4)
REG_L s5, TASK_THREAD_S5_RA(a4)
REG_L s6, TASK_THREAD_S6_RA(a4)
REG_L s7, TASK_THREAD_S7_RA(a4)
REG_L s8, TASK_THREAD_S8_RA(a4)
REG_L s9, TASK_THREAD_S9_RA(a4)
REG_L s10, TASK_THREAD_S10_RA(a4)
REG_L s11, TASK_THREAD_S11_RA(a4)
/* Swap the CPU entry around. */
lw a3, TASK_TI_CPU(a0)
lw a4, TASK_TI_CPU(a1)
sw a3, TASK_TI_CPU(a1)
sw a4, TASK_TI_CPU(a0)
/* The offset of thread_info in task_struct is zero. */
move tp, a1
ret
ENDPROC(__switch_to)
This looks like a good place to add the relevant saving and restoring of the SR_SUM bit. (see the patch below).
Looking for leaking SR_SUM at run time
static int test_thread1(void *data)
{
unsigned int cpu = (unsigned int)data;
unsigned long status;
pr_info("%s: thread starting on cpu %d\n", __func__, cpu);
while (!kthread_should_stop()) {
status = rd_sstatus();
if (status & SR_SUM)
printk_ratelimited("%s: found sstaus=0x%08lx\n",
__func__, status);
msleep(1);
}
pr_info("%s: thread exiting\n", __func__);
return 0;
}
Running the above as a thread on each cpu, we caught one instance under stress in around 2000 seconds of execution. Note, we could have also added a test for SR_SUM around switch_to() and this may be something to try at a later date.
[ 1192.124018] test_thread1: found sstaus=0x00040022
Fixing the issue
The bug is now confirmed as the kernel's change of task not saving the SR_SUM flag. This means the flag not only leaks into other tasks, but the schedule within put_user() such as in the schedule_tail() code can also see other tasks clearing the flag.
The first fix would be to ensure the switch_to() code saves and restores the state of sstatus SR_SUM flag. This is shown in the patch below, which changes the assembly in __switch_to(), which does the thread switch's inner bit.
A second fix is hinted at in the original test code: to move the call to anything complex outside of the code block, which does the user space memory access. This also has the advantage of reducing the code being executed with no user memory protection.
These are the fix patches for both, which have been submitted for review on the linux-riscv list:
Patch for put_user() ordering
The <asm/uaccess.h> header has a problem with put_user(a, ptr) if
the 'a' is not a simple variable, such as a function. This can lead
to the compiler producing code as so:
1: enable_user_access()
2: evaluate 'a' into register 'r'
3: put 'r' to 'ptr'
4: disable_user_acess()
The issue is that 'a' is now being evaluated with the user memory
protections disabled. So we try and force the evaulation by assinging
'x' to __val at the start, and hoping the compiler barriers in
enable_user_access() do the job of ordering step 2 before step 1.
This has shown up in a bug where 'a' sleeps and thus schedules out
and loses the SR_SUM flag. This isn't sufficient to fully fix, but
should reduce the window of opportunity. The first instance of this
we found is in scheudle_tail() where the code does:
$ less -N kernel/sched/core.c
4263 if (current->set_child_tid)
4264 put_user(task_pid_vnr(current), current->set_child_tid);
Here, the task_pid_vnr(current) is called within the block that has
enabled the user memory access. This can be made worse with KASAN
which makes task_pid_vnr() a rather large call with plenty of
opportunity to sleep.
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Reported-by: syzbot+e74b94fe601ab9552d69@syzkaller.appspotmail.com
Suggested-by: Arnd Bergman <arnd@arndb.de>
--
Changes since v1:
- fixed formatting and updated the patch description with more info
Cc: dvyukov@google.com
Cc: linux-riscv@lists.infradead.org
Cc: paul.walmsley@sifive.com
Cc: palmer@dabbelt.com
Cc: alex@ghiti.fr
Cc: linux-kernel@lists.codethink.co.uk
Cc: hch@infradead.org
---
arch/riscv/include/asm/uaccess.h | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/arch/riscv/include/asm/uaccess.h b/arch/riscv/include/asm/uaccess.h
index 824b2c9da75b..7bf90d462ec9 100644
--- a/arch/riscv/include/asm/uaccess.h
+++ b/arch/riscv/include/asm/uaccess.h
@@ -306,7 +306,10 @@ do { \
* data types like structures or arrays.
*
* @ptr must have pointer-to-simple-variable type, and @x must be assignable
- * to the result of dereferencing @ptr.
+ * to the result of dereferencing @ptr. The @x is copied inside the macro
+ * to avoid code re-ordering where @x gets evaulated within the block that
+ * enables user-space access (thus possibly bypassing some of the protection
+ * this feautre provides).
*
* Caller must check the pointer with access_ok() before calling this
* function.
@@ -316,12 +319,13 @@ do { \
#define __put_user(x, ptr) \
({ \
__typeof__(*(ptr)) __user *__gu_ptr = (ptr); \
+ __typeof__(*__gu_ptr) __val = (x); \
long __pu_err = 0; \
\
__chk_user_ptr(__gu_ptr); \
\
__enable_user_access(); \
- __put_user_nocheck(x, __gu_ptr, __pu_err); \
+ __put_user_nocheck(__val, __gu_ptr, __pu_err); \
__disable_user_access(); \
\
__pu_err; \
Patch for SR_SUM saving over __switch_to()
RFC: riscv: save the SR_SUM status over switches
When threads/tasks are switched we need to ensure the old execution's
SR_SUM state is saved and the new thread has the old SR_SUM state
restored.
The issue is seen under heavy load especially with the syz-stress tool
running, with crashes as follows in schedule_tail:
Unable to handle kernel access to user memory without uaccess routines at virtual address 000000002749f0d0
Oops [#1]
Modules linked in:
CPU: 1 PID: 4875 Comm: syz-executor.0 Not tainted 5.12.0-rc2-syzkaller-00467-g0d7588ab9ef9 #0
Hardware name: riscv-virtio,qemu (DT)
epc : schedule_tail+0x72/0xb2 kernel/sched/core.c:4264
ra : task_pid_vnr include/linux/sched.h:1421 [inline]
ra : schedule_tail+0x70/0xb2 kernel/sched/core.c:4264
epc : ffffffe00008c8b0 ra : ffffffe00008c8ae sp : ffffffe025d17ec0
gp : ffffffe005d25378 tp : ffffffe00f0d0000 t0 : 0000000000000000
t1 : 0000000000000001 t2 : 00000000000f4240 s0 : ffffffe025d17ee0
s1 : 000000002749f0d0 a0 : 000000000000002a a1 : 0000000000000003
a2 : 1ffffffc0cfac500 a3 : ffffffe0000c80cc a4 : 5ae9db91c19bbe00
a5 : 0000000000000000 a6 : 0000000000f00000 a7 : ffffffe000082eba
s2 : 0000000000040000 s3 : ffffffe00eef96c0 s4 : ffffffe022c77fe0
s5 : 0000000000004000 s6 : ffffffe067d74e00 s7 : ffffffe067d74850
s8 : ffffffe067d73e18 s9 : ffffffe067d74e00 s10: ffffffe00eef96e8
s11: 000000ae6cdf8368 t3 : 5ae9db91c19bbe00 t4 : ffffffc4043cafb2
t5 : ffffffc4043cafba t6 : 0000000000040000
status: 0000000000000120 badaddr: 000000002749f0d0 cause: 000000000000000f
Call Trace:
[<ffffffe00008c8b0>] schedule_tail+0x72/0xb2 kernel/sched/core.c:4264
[<ffffffe000005570>] ret_from_exception+0x0/0x14
Dumping ftrace buffer:
(ftrace buffer empty)
---[ end trace b5f8f9231dc87dda ]---
The issue comes from the put_user() in schedule_tail (kernel/sched/core.c)
doing the following:
asmlinkage __visible void schedule_tail(struct task_struct *prev)
{
...
if (current->set_child_tid)
put_user(task_pid_vnr(current), current->set_child_tid);
...
}
the put_user() macro causes the code sequence to come out as follows:
1: __enable_user_access()
2: reg = task_pid_vnr(current);
3: *current->set_child_tid = reg;
4: __disable_user_access()
This means the task_pid_vnr() is being called with user-access enabled
which itself is not a good idea, but that is a seperate issue. Here we
have a function that /might/ sleep being called with the SR_SUM and if
it does, then it returns with the SR_SUM flag possibly cleared thus
causing the above abort.
To try and deal with this, and stop the SR_SUM leaking out into other
threads (this has also been tested and see under stress. It can rarely
happen but it /does/ under load) make sure the __switch_to() will save
and restore the SR_SUM flag, and clear it possibly for the next thread
if it does not need it.
Note, test code to be supplied once other checks have been finished.
There may be further issues with the mstatus flags with this, this
can be discussed further once some initial testing has been done.
diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index 021ed64ee608..e2398c6e7af9 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -35,6 +35,7 @@ struct thread_struct {
unsigned long s[12]; /* s[0]: frame pointer */
struct __riscv_d_ext_state fstate;
unsigned long bad_cause;
+ unsigned long flags;
};
#define INIT_THREAD { \
diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
index 9ef33346853c..0036570752d9 100644
--- a/arch/riscv/kernel/asm-offsets.c
+++ b/arch/riscv/kernel/asm-offsets.c
@@ -29,6 +29,7 @@ void asm_offsets(void)
OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
+ OFFSET(TASK_THREAD_FLAGS, task_struct, thread.flags);
OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
@@ -172,6 +173,10 @@ void asm_offsets(void)
offsetof(struct task_struct, thread.s[11])
- offsetof(struct task_struct, thread.ra)
);
+ DEFINE(TASK_THREAD_FLAGS_RA,
+ offsetof(struct task_struct, thread.flags)
+ - offsetof(struct task_struct, thread.ra)
+ );
DEFINE(TASK_THREAD_F0_F0,
offsetof(struct task_struct, thread.fstate.f[0])
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 744f3209c48d..6137f6c2a2ad 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -417,7 +417,14 @@ ENTRY(__switch_to)
REG_S s9, TASK_THREAD_S9_RA(a3)
REG_S s10, TASK_THREAD_S10_RA(a3)
REG_S s11, TASK_THREAD_S11_RA(a3)
+ /* save (and disable the user space access flag) */
+ li s0, SR_SUM
+ csrrc s1, CSR_STATUS, s0
+ REG_S s1, TASK_THREAD_FLAGS_RA(a3)
+
/* Restore context from next->thread */
+ REG_L s0, TASK_THREAD_FLAGS_RA(a4)
+ csrs CSR_STATUS, s0
REG_L ra, TASK_THREAD_RA_RA(a4)
REG_L sp, TASK_THREAD_SP_RA(a4)
REG_L s0, TASK_THREAD_S0_RA(a4)
Conclusion
The step of shadowing the enablement of user-access was probably not necessary (although interesting side quest), and going straight to checking if the call to get the pid had cleared the SR_SUM flag would have been sufficient. Just patching the code to move the __put_user()
code order would have fixed the issue in this case, the saving SR_SUM over __switch_to()
is also a good fix in the case where the SR_SUM may be enabled over several calls.
Future work
The first work will be sorting reviews for the patches above. There is already feedback on whether storing SR_SUM is necessary over the switch_to()
call.
The next thing is to check whether there are any other flags that switch_to()
should be saving. This will be discussed in future posts.
Keep up with our RISC-V news
Fill out the form to receive our latest updates on the open CPU architecture in your inbox.
Related blog posts:
- Written by Ben Dooks and Kejia Hu: Investigating kernel user-space access >>
- Codethink's research: RISC-V: Codethink's first research about the open instruction set >>
Other Content
- A new way to develop on Linux - Part II
- GUADEC 2024
- Developing a cryptographically secure bootloader for RISC-V in Rust
- Philip Martin, Meet the Team
- Improving systemd’s integration testing infrastructure (part 1)
- A new way to develop on Linux
- RISC-V Summit Europe 2024
- Safety Frontier: A Retrospective on ELISA
- Codethink sponsors Outreachy
- The Linux kernel is a CNA - so what?
- GNOME OS + systemd-sysupdate
- Codethink has achieved ISO 9001:2015 accreditation
- Outreachy internship: Improving end-to-end testing for GNOME
- Lessons learnt from building a distributed system in Rust
- FOSDEM 2024
- Introducing Web UI QAnvas and new features of Quality Assurance Daemon
- Outreachy: Supporting the open source community through mentorship programmes
- Using Git LFS and fast-import together
- Testing in a Box: Streamlining Embedded Systems Testing
- SDV Europe: What Codethink has planned
- How do Hardware Security Modules impact the automotive sector? The final blog in a three part discussion
- How do Hardware Security Modules impact the automotive sector? Part two of a three part discussion
- How do Hardware Security Modules impact the automotive sector? Part one of a three part discussion
- Automated Kernel Testing on RISC-V Hardware
- Automated end-to-end testing for Android Automotive on Hardware
- GUADEC 2023
- Embedded Open Source Summit 2023
- RISC-V: Exploring a Bug in Stack Unwinding
- Adding RISC-V Vector Cryptography Extension support to QEMU
- Introducing Our New Open-Source Tool: Quality Assurance Daemon
- Long Term Maintainability
- FOSDEM 2023
- Think before you Pip
- BuildStream 2.0 is here, just in time for the holidays!
- A Valuable & Comprehensive Firmware Code Review by Codethink
- GNOME OS & Atomic Upgrades on the PinePhone
- Flathub-Codethink Collaboration
- Codethink proudly sponsors GUADEC 2022
- Tracking Down an Obscure Reproducibility Bug in glibc
- Web app test automation with `cdt`
- FOSDEM Testing and Automation talk
- Protecting your project from dependency access problems
- Porting GNOME OS to Microchip's PolarFire Icicle Kit
- YAML Schemas: Validating Data without Writing Code
- Deterministic Construction Service
- Codethink becomes a Microchip Design Partner
- Hamsa: Using an NVIDIA Jetson Development Kit to create a fully open-source Robot Nano Hand
- Using STPA with software-intensive systems
- Codethink achieves ISO 26262 ASIL D Tool Certification
- RISC-V: running GNOME OS on SiFive hardware for the first time
- Automated Linux kernel testing
- Native compilation on Arm servers is so much faster now
- Full archive