Wed 21 April 2021

RISC-V User space access Oops

As part of Codethink's interest in RISC-V I have been following the RISC-V kernel list. Whilst looking through the postings the following bug (more information here) came up, titled:

[syzbot] BUG: unable to handle kernel access to user memory in schedule_tail

It contained a kernel Oops (see below) that looked interesting, so I decided to look at the bug and check out the kernel source and config file supplied in the bug. The bug was intriguing because there is a lot of interest in RISC-V, and I had previously worked on checking the Linux kernel's user access protection.

Unable to handle kernel access to user memory without uaccess routines at virtual address 000000002749f0d0
Oops [#1]
Modules linked in:
CPU: 1 PID: 4875 Comm: syz-executor.0 Not tainted 5.12.0-rc2-syzkaller-00467-g0d7588ab9ef9 #0
Hardware name: riscv-virtio,qemu (DT)
epc : schedule_tail+0x72/0xb2 kernel/sched/core.c:4264
 ra : task_pid_vnr include/linux/sched.h:1421 [inline]
 ra : schedule_tail+0x70/0xb2 kernel/sched/core.c:4264
epc : ffffffe00008c8b0 ra : ffffffe00008c8ae sp : ffffffe025d17ec0
 gp : ffffffe005d25378 tp : ffffffe00f0d0000 t0 : 0000000000000000
 t1 : 0000000000000001 t2 : 00000000000f4240 s0 : ffffffe025d17ee0
 s1 : 000000002749f0d0 a0 : 000000000000002a a1 : 0000000000000003
 a2 : 1ffffffc0cfac500 a3 : ffffffe0000c80cc a4 : 5ae9db91c19bbe00
 a5 : 0000000000000000 a6 : 0000000000f00000 a7 : ffffffe000082eba
 s2 : 0000000000040000 s3 : ffffffe00eef96c0 s4 : ffffffe022c77fe0
 s5 : 0000000000004000 s6 : ffffffe067d74e00 s7 : ffffffe067d74850
 s8 : ffffffe067d73e18 s9 : ffffffe067d74e00 s10: ffffffe00eef96e8
 s11: 000000ae6cdf8368 t3 : 5ae9db91c19bbe00 t4 : ffffffc4043cafb2
 t5 : ffffffc4043cafba t6 : 0000000000040000
status: 0000000000000120 badaddr: 000000002749f0d0 cause: 000000000000000f
Call Trace:
[<ffffffe00008c8b0>] schedule_tail+0x72/0xb2 kernel/sched/core.c:4264
[<ffffffe000005570>] ret_from_exception+0x0/0x14
Dumping ftrace buffer:
   (ftrace buffer empty)
---[ end trace b5f8f9231dc87dda ]---

Build the kernel

The report contained the git repo and config used to build, so the first step is to build the kernel to give an image to run and the source output to look at. I've skipped the actions taken to get and build as they are no different from normal kernel development.

Given there is already a bug report, the first action is to assume the kernel we built is similar to the bot built to allow the disassembly, look at the instructions at the bug's location, and map these to the source code.

$ riscv64-linux-gnu-objdump -D kernel/sched/core.o

000000000000ea44 <schedule_tail>:
    ea44:       1101                    addi    sp,sp,-32
    ea46:       e822                    sd      s0,16(sp)
    ea48:       ec06                    sd      ra,24(sp)
    ea4a:       e426                    sd      s1,8(sp)
    ea4c:       e04a                    sd      s2,0(sp)
    ea4e:       1000                    addi    s0,sp,32
    ea50:       00000097                auipc   ra,0x0
    ea54:       000080e7                jalr    ra # ea50 <schedule_tail+0xc>
...

    eaa0:       f00007b7                lui     a5,0xf0000
    eaa4:       83e9                    srli    a5,a5,0x1a
    eaa6:       0297e163                bltu    a5,s1,eac8 <.L3667>
    eaaa:       00040937                lui     s2,0x40
    eaae:       10092073                csrs    sstatus,s2
    eab2:       4601                    li      a2,0
    eab4:       4581                    li      a1,0
    eab6:       8512                    mv      a0,tp
    eab8:       00000097                auipc   ra,0x0
    eabc:       000080e7                jalr    ra # eab8 <.LBB16792+0x6>
    eac0:       4781                    li      a5,0
    eac2:       c088                    sw      a0,0(s1)
    eac4:       10093073                csrc    sstatus,s2
    eac8:       00000097                auipc   ra,0x0
    eacc:       000080e7                jalr    ra # eac8 <.L3667>

The above shows that 0xea44+0x70 is eab4. In this dissasembly this is part of the put_user() in schedule_tail() function. It is likely given the function is small that this is the right place to look.

$ riscv64-linux-gnu-objdump -Dr -S kernel/sched/core.o
...
         put_user(task_pid_vnr(current), current->set_child_tid);
    ea8a:       6585                    lui     a1,0x1
    ea8c:       0a858593                addi    a1,a1,168 # 10a8 <.LVL499+0x6>
    ea90:       00000517                auipc   a0,0x0
...
        return __task_pid_nr_ns(tsk, PIDTYPE_PID, NULL);
    eab2:       4601                    li      a2,0
    eab4:       4581                    li      a1,0
    eab6:       8512                    mv      a0,tp

The above disassembly shows this build is not precisely the same as the one being built by the test bot. The abort's place in the original report is a li instruction which means load immediate that should never fault. Looking further down the code, there is a sw (store word) instruction at 0xeac2. The Oops also back this up as the badaddr is showing the access is to 0x000000002749f0d0, which is also register s1 and matches the sw a0,0(s1) that stores the value in register a0 to the address found in s1 plus 0.

Build a test environment

To run tests, the first attempt is to use the qemu-riscv64 and a Debian rootfs. The Debian install can then be used with qemu-static-riscv64 to install the necessary build tools and use the instructions supple to install the syz-stress tool.

Run the syz-stress and reproduce

Now that the bug can be reproduced and the same methods as above can be used to track down where it is. This shows that the original disassembly and tracking was correct, and the bug is input user at the 'sw' instruction previously identified.

The relevant code for syz-stress was built using a chroot and copied into the test environment. Some basic instructions are included in the mailing list post that was part of the initial bug discussion.

Note, the initial stress testing failed with mmap() issues which seem to be a side effect of the initial qemu vm using 9pfs as the rootfs. Making a real ext4 rootfs worked, and was a bit faster.

Analysing the issue

At this point, the bug has been confirmed to be reproducible and to exist in the initially identified place. The next step is to try and plan what to find and fix the bug.

If we take the output as true, and the crash was in a store to user space that has been detected as an "access to user memory without uaccess", then this list is our initial condition:

  • the fault handler has not mapped the address properly
  • the fault handler mapped the address but did not return correctly
  • the code has been corrupted at run time
  • the code got called into by accident (bad pointer elsewhere)
  • the compiler did not compile the code correctly
  • the user pointer has been corrupted

The above list might be easier to think about if we firstly look at how the kernel accesses user space. By default, the kernel does not allow access to user memory except via specific accessor functions. So the most accessible place to start would be with the code in <asm/uaccess.h> which implements the user space accessors. These calls include put_user() as shown above.

User space accessors are architecture-specific (the above <asm/uaccess.h> is in the file arch/riscv/include/asm/uaccess.h). This file implements the architecture-specific memory protection code. For the RISC-V case the hardware has a control bit in the status CSR to say if the kernel mode can access the user memory. Linux refers to this bit as SR_SUM, defined in arch/riscv/include/asm/csr.h .

The following macros are in the user access code to control the SR_SUM, which is also where the put_user() is implemented. The following two macros are used to enable user access:

#define __enable_user_access()                                                  \
        __asm__ __volatile__ ("csrs sstatus, %0" : : "r" (SR_SUM) : "memory")
#define __disable_user_access()                                                 \
        __asm__ __volatile__ ("csrc sstatus, %0" : : "r" (SR_SUM) : "memory")
#define put_user(x, ptr)                                        \
({                                                              \
        __typeof__(*(ptr)) __user *__p = (ptr);                 \
        might_fault();                                          \
        access_ok(__p, sizeof(*__p)) ?          \
                __put_user((x), __p) :                          \
                -EFAULT;                                        \
})

The put_user() macro uses __put_user() to do the access with the appropriate memory protection set via `__enable_user_access().

#define __put_user(x, ptr)                                      \
({                                                              \
        __typeof__(*(ptr)) __user *__gu_ptr = (ptr);            \
        long __pu_err = 0;                                      \
                                                                \
        __chk_user_ptr(__gu_ptr);                               \
                                                                \
        __enable_user_access();                                 \
        __put_user_nocheck(x, __gu_ptr, __pu_err);              \
        __disable_user_access();                                \
                                                                \
        __pu_err;                                               \
})

If we look at the code for schedule tail, the instructions at addresses 0xeaaa and 0xeaea load the SR_SUM constant and the perform the csrs that is in __enable_user_access().

    eaa0:       f00007b7                lui     a5,0xf0000
    eaa4:       83e9                    srli    a5,a5,0x1a
    eaa6:       0297e163                bltu    a5,s1,eac8 <.L3667>
    eaaa:       00040937                lui     s2,0x40
    eaae:       10092073                csrs    sstatus,s2
    eab2:       4601                    li      a2,0
    eab4:       4581                    li      a1,0
    eab6:       8512                    mv      a0,tp
    eab8:       00000097                auipc   ra,0x0
    eabc:       000080e7                jalr    ra # eab8 <.LBB16792+0x6>
    eac0:       4781                    li      a5,0
    eac2:       c088                    sw      a0,0(s1)
    eac4:       10093073                csrc    sstatus,s2
    eac8:       00000097                auipc   ra,0x0
    eacc:       000080e7                jalr    ra # eac8 <.L3667>

From this and the register dump above, the code was likely properly executed from where it was intended to. The 'ra' register's return address is for the jalr instruction at 0xeabc, and the s2 register contains the 0x40000 value loaded at 0xeaaa.

The next thing to note is that the code for __enable_user_access is above the code that does the first argument to put_user (which is the call to task_pid_vnr()) that got inlined into the schedule_tail() code. This may not be the bug, but it is a possible security issue as the task_pid_vnr() will run with user memory protection off.

To confirm the sequence, we can dump the same region and look at the relocations. In the following output, we can see the call is to __task_pid_nr_ns(), which is what task_pid_vnr macros expand out to:

$ riscv64-linux-gnu-objdump -Dr kernel/sched/core.
    eab2:       4601                    li      a2,0
    eab4:       4581                    li      a1,0
    eab6:       8512                    mv      a0,tp
    eab8:       00000097                auipc   ra,0x0
                        eab8: R_RISCV_CALL      __task_pid_nr_ns
                        eab8: R_RISCV_RELAX     *ABS*
    eabc:       000080e7                jalr    ra # eab8 <.LBB16792+0x6>

Going back to the dump, it is clear the status register has the SR_SUM flag cleared when the abort happened, and the code does not look like it should have gotten there with the flag cleared. We can probably discard a gcc or corruption bug as candidates, which leaves a fault handler or an issue in the code being called.

Verify we should have aborted

The first thing is to check the sstatus flags just before the instruction to see if they are still set. This could point to either a qemu issue clearing the sstatus or the code within __task_pid_nr_ns call causing the flag to be cleared.

The following patch was attempted to check the flags; however, this made the system unstable.

diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index d2e472c4c63a..977c0584f767 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -20,6 +20,13 @@

 #include "../kernel/head.h"

+asmlinkage void bad_putuser_mstatus(unsigned addr, unsigned flags)
+{
+       printk(KERN_ERR "bad mflags from put_user, addr %x, flags %x\n",
+              addr, flags);
+       BUG_ON(1);
+}
+ 
 static void die_kernel_fault(const char *msg, unsigned long addr,
                struct pt_regs *regs)
 {
diff --git a/arch/riscv/include/asm/uaccess.h b/arch/riscv/include/asm/uaccess.h
index 824b2c9da75b..7338b82631fc 100644
--- a/arch/riscv/include/asm/uaccess.h
+++ b/arch/riscv/include/asm/uaccess.h
@@ -1,3 +1,4 @@
+
 /* SPDX-License-Identifier: GPL-2.0-only */
 /*
  * Copyright (C) 2012 Regents of the University of California
@@ -224,8 +225,17 @@ do {                                                               \
        uintptr_t __tmp;                                        \
        __typeof__(*(ptr)) __x = x;                             \
        __asm__ __volatile__ (                                  \
+               "       csrr %1, sstatus\n"                     \
+               "       srli %1, %1, 18\n"                      \
+               "       andi %1, %1, 1\n"                       \
+               "       beqz %1, 5f\n"                          \
                "1:\n"                                          \
                "       " insn " %z3, %2\n"                     \
+               "       j 2f\n"                                 \
+               "5:\n"                                          \
+               "       li a0, 0\n"                             \
+               "       csrr a1, sstatus\n"                     \
+               "       jump bad_putuser_mstatus, %1\n"         \
                "2:\n"                                          \
                "       .section .fixup,\"ax\"\n"               \
                "       .balign 4\n"                            \

The next attempt was to get the same functionality in C to check for the flag being cleared.

+static inline void check_user_access(void)
+{
+       unsigned long status = read_sstatus();
+       if ((status & (1 << 18)) == 0) {
+               printk(KERN_ERR "%s: status register 0x%lx in io (now %lx)\n", __func__, status, read_sstatus());
+               BUG_ON(1);
+       }
+}
+       
+
 /**
  * __put_user: - Write a simple value into user space, with less checking.
  * @x:   Value to copy to user space.
@@ -321,7 +342,13 @@ do {                                                               \
        __chk_user_ptr(__gu_ptr);                               \
                                                                \
        __enable_user_access();                                 \
-       __put_user_nocheck(x, __gu_ptr, __pu_err);              \
+       if (sizeof(*__gu_ptr) <= 4) {                           \
+               unsigned long tx = x;                           \
+               check_user_access();                            \
+               __put_user_nocheck(tx, __gu_ptr, __pu_err);     \
+       } else {                                                \
+               __put_user_nocheck(x, __gu_ptr, __pu_err);      \
+       }                                                       \
        __disable_user_access();                                \
                                                                \
        __pu_err;                                               \

This produced the following output, which shows the code now traps on the BUG_ON() in our patch instead of the fault in sw.:

[  479.600372][ T5772] can: request_module (can-proto-0) failed.
[  517.002492][ T5865] check_user_access: status register 0x122 in io (now 122)
[  517.004272][ T5865] ------------[ cut here ]------------
[  517.005015][ T5865] kernel BUG at arch/riscv/include/asm/uaccess.h:304!
[  517.006219][ T5865] Kernel BUG [#1]
[  517.006736][ T5865] Modules linked in:
[  517.007421][ T5865] CPU: 3 PID: 5865 Comm: syz-executor Not tainted 5.12.0-rc2-00017-ge5e6131f91c8-dirty #80
[  517.008424][ T5865] Hardware name: riscv-virtio,qemu (DT)
[  517.009119][ T5865] epc : schedule_tail+0xd8/0xda
[  517.009848][ T5865]  ra : schedule_tail+0xd8/0xda
[  517.010520][ T5865] epc : ffffffe00008d04a ra : ffffffe00008d04a sp : ffffffe08be43ec0
[  518.229029][ T5865]  gp : ffffffe00528b130 tp : ffffffe0861cc440 t0 : ffffffe005d01b67
[  518.230023][ T5865]  t1 : ffffffc4117c8778 t2 : 0000000000000000 s0 : ffffffe08be43ee0
[  518.230756][ T5865]  s1 : 00000000178ca0d0 a0 : 0000000000000038 a1 : 00000000000f0000
[  518.231526][ T5865]  a2 : 0000000000000002 a3 : ffffffe0000d6c24 a4 : fec417340f8e1a00
[  518.232506][ T5865]  a5 : fec417340f8e1a00 a6 : 0000000000f00000 a7 : ffffffe08be43bc7
[  518.233404][ T5865]  s2 : 0000000000040000 s3 : ffffffe080994440 s4 : ffffffe0808e3fe0
[  518.234278][ T5865]  s5 : 0000000000004020 s6 : ffffffe0809948d8 s7 : ffffffe0dad26650
[  518.235087][ T5865]  s8 : ffffffe0dad25c18 s9 : 0000000000000000 s10: ffffffe089ce2280
[  518.235885][ T5865]  s11: 000000785a6f92a8 t3 : fec417340f8e1a00 t4 : ffffffc4117c8777
[  518.236664][ T5865]  t5 : ffffffc4117c8779 t6 : ffffffe08be43bc8
[  518.237252][ T5865] status: 0000000000000120 badaddr: 0000000000000000 cause: 0000000000000003
[  518.238028][ T5865] Call Trace:
[  518.238422][ T5865] [<ffffffe00008d04a>] schedule_tail+0xd8/0xda
[  518.239174][ T5865] [<ffffffe000005570>] ret_from_exception+0x0/0x14
[  518.242135][ T5865] ---[ end trace 470e0fb737142c59 ]---
[  518.243067][ T5865] Kernel panic - not syncing: Fatal exception
[  518.243817][ T5865] SMP: stopping secondary CPUs

There was also a bug in the read_sstatus() implementation as there was no memory barrier in the original shown here. The following OOPS was triggered as the code produced put the CSR read before the code to set the CSR. (note the now value is showing the correct bit is set when the printk was executed):

[   47.858102][    T1] check_user_access: status register 0x22 in io (now 40022)
[   47.859931][    T1] ------------[ cut here ]------------
[   47.860515][    T1] kernel BUG at arch/riscv/include/asm/uaccess.h:304!
[   47.861435][    T1] Kernel BUG [#1]
[   47.861835][    T1] Modules linked in:
[   47.862358][    T1] CPU: 4 PID: 1 Comm: systemd Not tainted 5.12.0-rc2-00017-ge5e6131f91c8-dirty #79
[   47.863021][    T1] Hardware name: riscv-virtio,qemu (DT)
[   47.863563][    T1] epc : do_sys_poll+0x852/0x854
[   47.864330][    T1]  ra : do_sys_poll+0x852/0x854
[   47.864880][    T1] epc : ffffffe000401f36 ra : ffffffe000401f36 sp : ffffffe07ffa7a20
[   47.865461][    T1]  gp : ffffffe00528b188 tp : ffffffe07ffb8000 t0 : ffffffe005d01b67
[   47.866042][    T1]  t1 : ffffffc40fff4ee4 t2 : 0000000000000000 s0 : ffffffe07ffa7e60
[   47.866622][    T1]  s1 : 0000000000000001 a0 : 0000000000000039 a1 : 00000000000f0000
[   47.867198][    T1]  a2 : 0000000000000002 a3 : ffffffe0000d6b66 a4 : d55d48e460f03000
[   47.867766][    T1]  a5 : d55d48e460f03000 a6 : 0000000000f00000 a7 : ffffffe07ffa7727
[   47.868334][    T1]  s2 : 0000000000000001 s3 : 0000003fffeaa8f6 s4 : ffffffe07ffa7a8c
[   47.868909][    T1]  s5 : 0000000000040000 s6 : 0000000000000000 s7 : 0000000000000022
[   47.869482][    T1]  s8 : 0000000000000000 s9 : 0000000000000001 s10: 0000000000000000
[   47.870042][    T1]  s11: 0000000000000022 t3 : d55d48e460f03000 t4 : ffffffc40fff4ee3
[   47.870609][    T1]  t5 : ffffffc40fff4ee5 t6 : ffffffe07ffa7728
[   47.871087][    T1] status: 0000000000040120 badaddr: 0000000000000000 cause: 0000000000000003
[   47.871704][    T1] Call Trace:
[   47.872037][    T1] [<ffffffe000401f36>] do_sys_poll+0x852/0x854
[   47.872714][    T1] [<ffffffe000402ed8>] sys_ppoll+0x110/0x136
[   47.873309][    T1] [<ffffffe000005562>] ret_from_syscall+0x0/0x2
[   47.874727][    T1] ---[ end trace 9529102862d244e4 ]---

The following is the correct read_sstatus() to properly read the value.

#define read_sstatus() ({ unsigned long result; asm("  csrr %0, sstatus" : "=r"(result) :: "memory"); result; })

The resut of this testing implies that the issue is somewhere within the code to get the task's pid.

Alternate user access tracking

The next idea was to try and add a task flag to say that we're in the user access so that we can check if our own code cleared it:

--- a/arch/riscv/include/asm/uaccess.h
+++ b/arch/riscv/include/asm/uaccess.h
@@ -21,11 +21,22 @@
 #include <asm/extable.h>
 #include <asm/asm.h>

-#define __enable_user_access()                                                 \
-       __asm__ __volatile__ ("csrs sstatus, %0" : : "r" (SR_SUM) : "memory")
-#define __disable_user_access()                                                        \
+#define __enable_user_access()                                                 \
+       do { extern void notify_uaccess(unsigned x); notify_uaccess(1);  __asm__ __volatile__ ("csrs sstatus, %0" : : "r" (SR_SUM) : "memory"); ; } while(0)
+#define __disable_user_access()                                                \
+       do { extern void notify_uaccess(unsigned x); notify_uaccess(0); __asm__ __volatile__ ("csrc sstatus, %0" : : "r" (SR_SUM) : "memory"); } while(0)
+

With the following patch that tracks the state in the thread_struct assoicated with each process.

diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index 021ed64ee608..af7ea0460384 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -35,6 +35,7 @@ struct thread_struct {
        unsigned long s[12];    /* s[0]: frame pointer */
        struct __riscv_d_ext_state fstate;
        unsigned long bad_cause;
+       unsigned long in_uaccess;
 };

 #define INIT_THREAD {                                  \
diff --git a/arch/riscv/include/asm/switch_to.h b/arch/riscv/include/asm/switch_to.h
index 407bcc96a710..69442885d27d 100644
--- a/arch/riscv/include/asm/switch_to.h
+++ b/arch/riscv/include/asm/switch_to.h
@@ -72,6 +72,7 @@ do {                                                  \
        struct task_struct *__next = (next);            \
        if (has_fpu)                                    \
                __switch_to_aux(__prev, __next);        \
+       if (__prev->thread.in_uaccess != __next->thread.in_uaccess) { printk(KERN_INFO "switch_to is changing uaccess!\n"); } \
        ((last) = __switch_to(__prev, __next));         \
 } while (0)

diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 977c0584f767..3426da4aa307 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -20,6 +20,11 @@

 #include "../kernel/head.h"

+void notify_uaccess(unsigned to)
+{
+       current->thread.in_uaccess = to;
+}
+
 asmlinkage void bad_putuser_mstatus(unsigned addr, unsigned flags)
 {
        printk(KERN_ERR "bad mflags from put_user, addr %x, flags %x\n",

These additions showed that the thread had not cancelled the SR_SUM when the oops happened (apologies, the OOPS with the extra information was lost). The next thing is to have a look at the task_pid_vnr(current) call (this is defined in include/linux/sched.h) to be an inline call to '__task_pid_nr_ns(tsk, PIDTYPE_PID, NULL)' which is defined in the kernel/pid.c file.

The code for __task_pid_nr_ns does not look very large, although it does do some call outs to other functions:

pid_t __task_pid_nr_ns(struct task_struct *task, enum pid_type type,
                        struct pid_namespace *ns)
{
        pid_t nr = 0;

        rcu_read_lock();
        if (!ns)
                ns = task_active_pid_ns(current);
        nr = pid_nr_ns(rcu_dereference(*task_pid_ptr(task, type)), ns);
        rcu_read_unlock();

        return nr;
}

The disassembly of the start of the function (shown below) also shows we have coverage tracing functions. Trying to trace the full function now will be difficult (readelf shows the function is 908 bytes).

$ riscv64-linux-gnu-objdump -Dr kernel/pid.o

       0:       7179                    addi    sp,sp,-48
       2:       f022                    sd      s0,32(sp)
       4:       ec26                    sd      s1,24(sp)
       6:       e84a                    sd      s2,16(sp)
       8:       f406                    sd      ra,40(sp)
       a:       e44e                    sd      s3,8(sp)
       c:       e052                    sd      s4,0(sp)
       e:       1800                    addi    s0,sp,48
      10:       84aa                    mv      s1,a0
      12:       892e                    mv      s2,a1
      14:       00000097                auipc   ra,0x0
                        14: R_RISCV_CALL        __sanitizer_cov_trace_pc
                        14: R_RISCV_RELAX       *ABS*
      18:       000080e7                jalr    ra # 14 <pid_nr_ns+0x14>

000000000000001c <.LVL1>:
      1c:       cc8d                    beqz    s1,56 <.L4>
                        1c: R_RISCV_RVC_BRANCH  .L4
      1e:       00000097                auipc   ra,0x0
                        1e: R_RISCV_CALL        __sanitizer_cov_trace_pc
                        1e: R_RISCV_RELAX       *ABS*
      22:       000080e7                jalr    ra # 1e <.LVL1+0x2>

Assume somehow switching tasks is not preserving SR_SUM

We now know something is clearing the SR_SUM flag, and it is not our code. A review of the entry code looks like it correctly clears and stores the previous state when entering due to an exception such as a page fault. The next target would be to assume that voluntarily switching out, such as sleep or re-schedule, is the next point to look at.

From previous work, I know that the scheduler will use switch_to(prev, next) to change from 'prev' to 'next' task (defined in arch/riscv/include/asm/switch_to.h), which does the job of transferring the control. The actual register saving is done in assembly in a sub-call of __switch_to() defined in arch/riscv/kernel/entry.S

Looking at __switch_to

The following is the section of __switch_to assembly code which enters as the 'prev' thread and exits as 'next'. The code stores the values that C would expect not to be changed by a procedure call and then reloads them from the new task. Once this is done, it then swaps the current thread information for that CPU.

ENTRY(__switch_to)
        /* Save context into prev->thread */
        li    a4,  TASK_THREAD_RA
        add   a3, a0, a4
        add   a4, a1, a4
        REG_S ra,  TASK_THREAD_RA_RA(a3)
        REG_S sp,  TASK_THREAD_SP_RA(a3)
        REG_S s0,  TASK_THREAD_S0_RA(a3)
        REG_S s1,  TASK_THREAD_S1_RA(a3)
        REG_S s2,  TASK_THREAD_S2_RA(a3)
        REG_S s3,  TASK_THREAD_S3_RA(a3)
        REG_S s4,  TASK_THREAD_S4_RA(a3)
        REG_S s5,  TASK_THREAD_S5_RA(a3)
        REG_S s6,  TASK_THREAD_S6_RA(a3)
        REG_S s7,  TASK_THREAD_S7_RA(a3)
        REG_S s8,  TASK_THREAD_S8_RA(a3)
        REG_S s9,  TASK_THREAD_S9_RA(a3)
        REG_S s10, TASK_THREAD_S10_RA(a3)
        REG_S s11, TASK_THREAD_S11_RA(a3)
        /* Restore context from next->thread */
        REG_L ra,  TASK_THREAD_RA_RA(a4)
        REG_L sp,  TASK_THREAD_SP_RA(a4)
        REG_L s0,  TASK_THREAD_S0_RA(a4)
        REG_L s1,  TASK_THREAD_S1_RA(a4)
        REG_L s2,  TASK_THREAD_S2_RA(a4)
        REG_L s3,  TASK_THREAD_S3_RA(a4)
        REG_L s4,  TASK_THREAD_S4_RA(a4)
        REG_L s5,  TASK_THREAD_S5_RA(a4)
        REG_L s6,  TASK_THREAD_S6_RA(a4)
        REG_L s7,  TASK_THREAD_S7_RA(a4)
        REG_L s8,  TASK_THREAD_S8_RA(a4)
        REG_L s9,  TASK_THREAD_S9_RA(a4)
        REG_L s10, TASK_THREAD_S10_RA(a4)
        REG_L s11, TASK_THREAD_S11_RA(a4)
        /* Swap the CPU entry around. */
        lw a3, TASK_TI_CPU(a0)
        lw a4, TASK_TI_CPU(a1)
        sw a3, TASK_TI_CPU(a1)
        sw a4, TASK_TI_CPU(a0)
        /* The offset of thread_info in task_struct is zero. */
        move tp, a1
        ret
ENDPROC(__switch_to)

This looks like a good place to add the relevant saving and restoring of the SR_SUM bit. (see the patch below).

Looking for leaking SR_SUM at run time

static int test_thread1(void *data)
{
        unsigned int cpu = (unsigned int)data;
        unsigned long status;

        pr_info("%s: thread starting on cpu %d\n", __func__, cpu);

        while (!kthread_should_stop()) {
                status = rd_sstatus();
                if (status & SR_SUM)
                        printk_ratelimited("%s: found sstaus=0x%08lx\n",
                                           __func__, status);
                msleep(1);
        }

        pr_info("%s: thread exiting\n", __func__);
        return 0;
}

Running the above as a thread on each cpu, we caught one instance under stress in around 2000 seconds of execution. Note, we could have also added a test for SR_SUM around switch_to() and this may be something to try at a later date.

[ 1192.124018] test_thread1: found sstaus=0x00040022

Fixing the issue

The bug is now confirmed as the kernel's change of task not saving the SR_SUM flag. This means the flag not only leaks into other tasks, but the schedule within put_user() such as in the schedule_tail() code can also see other tasks clearing the flag.

The first fix would be to ensure the switch_to() code saves and restores the state of sstatus SR_SUM flag. This is shown in the patch below, which changes the assembly in __switch_to(), which does the thread switch's inner bit.

A second fix is hinted at in the original test code: to move the call to anything complex outside of the code block, which does the user space memory access. This also has the advantage of reducing the code being executed with no user memory protection.

These are the fix patches for both, which have been submitted for review on the linux-riscv list:

Patch for put_user() ordering

The <asm/uaccess.h> header has a problem with put_user(a, ptr) if
the 'a' is not a simple variable, such as a function. This can lead
to the compiler producing code as so:

1:  enable_user_access()
2:  evaluate 'a' into register 'r'
3:  put 'r' to 'ptr'
4:  disable_user_acess()

The issue is that 'a' is now being evaluated with the user memory
protections disabled. So we try and force the evaulation by assinging
'x' to __val at the start, and hoping the compiler barriers in
 enable_user_access() do the job of ordering step 2 before step 1.

This has shown up in a bug where 'a' sleeps and thus schedules out
and loses the SR_SUM flag. This isn't sufficient to fully fix, but
should reduce the window of opportunity. The first instance of this
we found is in scheudle_tail() where the code does:

$ less -N kernel/sched/core.c

   4263         if (current->set_child_tid)
   4264                 put_user(task_pid_vnr(current), current->set_child_tid);

Here, the task_pid_vnr(current) is called within the block that has
enabled the user memory access. This can be made worse with KASAN
which makes task_pid_vnr() a rather large call with plenty of
opportunity to sleep.

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Reported-by: syzbot+e74b94fe601ab9552d69@syzkaller.appspotmail.com
Suggested-by: Arnd Bergman <arnd@arndb.de>

--
Changes since v1:
- fixed formatting and updated the patch description with more info

Cc: dvyukov@google.com
Cc: linux-riscv@lists.infradead.org
Cc: paul.walmsley@sifive.com
Cc: palmer@dabbelt.com
Cc: alex@ghiti.fr
Cc: linux-kernel@lists.codethink.co.uk
Cc: hch@infradead.org
---
 arch/riscv/include/asm/uaccess.h | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/include/asm/uaccess.h b/arch/riscv/include/asm/uaccess.h
index 824b2c9da75b..7bf90d462ec9 100644
--- a/arch/riscv/include/asm/uaccess.h
+++ b/arch/riscv/include/asm/uaccess.h
@@ -306,7 +306,10 @@ do {                               \
  * data types like structures or arrays.
  *
  * @ptr must have pointer-to-simple-variable type, and @x must be assignable
- * to the result of dereferencing @ptr.
+ * to the result of dereferencing @ptr. The @x is copied inside the macro
+ * to avoid code re-ordering where @x gets evaulated within the block that
+ * enables user-space access (thus possibly bypassing some of the protection
+ * this feautre provides).
  *
  * Caller must check the pointer with access_ok() before calling this
  * function.
@@ -316,12 +319,13 @@ do {                              \
 #define __put_user(x, ptr)                 \
 ({                             \
    __typeof__(*(ptr)) __user *__gu_ptr = (ptr);        \
+   __typeof__(*__gu_ptr) __val = (x);          \
    long __pu_err = 0;                  \
                                \
    __chk_user_ptr(__gu_ptr);               \
                                \
    __enable_user_access();                 \
-   __put_user_nocheck(x, __gu_ptr, __pu_err);      \
+   __put_user_nocheck(__val, __gu_ptr, __pu_err);      \
    __disable_user_access();                \
                                \
    __pu_err;                       \

Patch for SR_SUM saving over __switch_to()

   RFC: riscv: save the SR_SUM status over switches

    When threads/tasks are switched we need to ensure the old execution's
    SR_SUM state is saved and the new thread has the old SR_SUM state
    restored.

    The issue is seen under heavy load especially with the syz-stress tool
    running, with crashes as follows in schedule_tail:

    Unable to handle kernel access to user memory without uaccess routines at virtual address 000000002749f0d0
    Oops [#1]
    Modules linked in:
    CPU: 1 PID: 4875 Comm: syz-executor.0 Not tainted 5.12.0-rc2-syzkaller-00467-g0d7588ab9ef9 #0
    Hardware name: riscv-virtio,qemu (DT)
    epc : schedule_tail+0x72/0xb2 kernel/sched/core.c:4264
     ra : task_pid_vnr include/linux/sched.h:1421 [inline]
     ra : schedule_tail+0x70/0xb2 kernel/sched/core.c:4264
    epc : ffffffe00008c8b0 ra : ffffffe00008c8ae sp : ffffffe025d17ec0
     gp : ffffffe005d25378 tp : ffffffe00f0d0000 t0 : 0000000000000000
     t1 : 0000000000000001 t2 : 00000000000f4240 s0 : ffffffe025d17ee0
     s1 : 000000002749f0d0 a0 : 000000000000002a a1 : 0000000000000003
     a2 : 1ffffffc0cfac500 a3 : ffffffe0000c80cc a4 : 5ae9db91c19bbe00
     a5 : 0000000000000000 a6 : 0000000000f00000 a7 : ffffffe000082eba
     s2 : 0000000000040000 s3 : ffffffe00eef96c0 s4 : ffffffe022c77fe0
     s5 : 0000000000004000 s6 : ffffffe067d74e00 s7 : ffffffe067d74850
     s8 : ffffffe067d73e18 s9 : ffffffe067d74e00 s10: ffffffe00eef96e8
     s11: 000000ae6cdf8368 t3 : 5ae9db91c19bbe00 t4 : ffffffc4043cafb2
     t5 : ffffffc4043cafba t6 : 0000000000040000
    status: 0000000000000120 badaddr: 000000002749f0d0 cause: 000000000000000f
    Call Trace:
    [<ffffffe00008c8b0>] schedule_tail+0x72/0xb2 kernel/sched/core.c:4264
    [<ffffffe000005570>] ret_from_exception+0x0/0x14
    Dumping ftrace buffer:
       (ftrace buffer empty)
    ---[ end trace b5f8f9231dc87dda ]---

    The issue comes from the put_user() in schedule_tail (kernel/sched/core.c)
    doing the following:

    asmlinkage __visible void schedule_tail(struct task_struct *prev)
    {
    ...
            if (current->set_child_tid)
                    put_user(task_pid_vnr(current), current->set_child_tid);
    ...
    }

    the put_user() macro causes the code sequence to come out as follows:

    1:      __enable_user_access()
    2:      reg = task_pid_vnr(current);
    3:      *current->set_child_tid = reg;
    4:      __disable_user_access()

    This means the task_pid_vnr() is being called with user-access enabled
    which itself is not a good idea, but that is a seperate issue. Here we
    have a function that /might/ sleep being called with the SR_SUM and if
    it does, then it returns with the SR_SUM flag possibly cleared thus
    causing the above abort.

    To try and deal with this, and stop the SR_SUM leaking out into other
    threads (this has also been tested and see under stress. It can rarely
    happen but it /does/ under load) make sure the __switch_to() will save
    and restore the SR_SUM flag, and clear it possibly for the next thread
    if it does not need it.

    Note, test code to be supplied once other checks have been finished.

    There may be further issues with the mstatus flags with this, this
    can be discussed further once some initial testing has been done.


diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index 021ed64ee608..e2398c6e7af9 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -35,6 +35,7 @@ struct thread_struct {
        unsigned long s[12];    /* s[0]: frame pointer */
        struct __riscv_d_ext_state fstate;
        unsigned long bad_cause;
+       unsigned long flags;
 };

 #define INIT_THREAD {                                  \
diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
index 9ef33346853c..0036570752d9 100644
--- a/arch/riscv/kernel/asm-offsets.c
+++ b/arch/riscv/kernel/asm-offsets.c
@@ -29,6 +29,7 @@ void asm_offsets(void)
        OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
        OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
        OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
+       OFFSET(TASK_THREAD_FLAGS, task_struct, thread.flags);
        OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
        OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
        OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
@@ -172,6 +173,10 @@ void asm_offsets(void)
                  offsetof(struct task_struct, thread.s[11])
                - offsetof(struct task_struct, thread.ra)
        );
+       DEFINE(TASK_THREAD_FLAGS_RA,
+                 offsetof(struct task_struct, thread.flags)
+               - offsetof(struct task_struct, thread.ra)
+       );

        DEFINE(TASK_THREAD_F0_F0,
                  offsetof(struct task_struct, thread.fstate.f[0])
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 744f3209c48d..6137f6c2a2ad 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -417,7 +417,14 @@ ENTRY(__switch_to)
        REG_S s9,  TASK_THREAD_S9_RA(a3)
        REG_S s10, TASK_THREAD_S10_RA(a3)
        REG_S s11, TASK_THREAD_S11_RA(a3)
+       /* save (and disable the user space access flag) */
+       li    s0, SR_SUM
+       csrrc s1, CSR_STATUS, s0
+       REG_S s1,  TASK_THREAD_FLAGS_RA(a3)
+
        /* Restore context from next->thread */
+       REG_L s0,  TASK_THREAD_FLAGS_RA(a4)
+       csrs  CSR_STATUS, s0
        REG_L ra,  TASK_THREAD_RA_RA(a4)
        REG_L sp,  TASK_THREAD_SP_RA(a4)
        REG_L s0,  TASK_THREAD_S0_RA(a4)

Conclusion

The step of shadowing the enablement of user-access was probably not necessary (although interesting side quest), and going straight to checking if the call to get the pid had cleared the SR_SUM flag would have been sufficient. Just patching the code to move the __put_user() code order would have fixed the issue in this case, the saving SR_SUM over __switch_to() is also a good fix in the case where the SR_SUM may be enabled over several calls.

Future work

The first work will be sorting reviews for the patches above. There is already feedback on whether storing SR_SUM is necessary over the switch_to() call.

The next thing is to check whether there are any other flags that switch_to() should be saving. This will be discussed in future posts.

Related blog posts:

Other Articles

Get in touch to find out how Codethink can help you

sales@codethink.co.uk +44 161 660 9930