Thu 08 June 2023

RISC-V: Exploring a Bug in Stack Unwinding

On one of our RISC-V projects, we were recently debugging a complex driver probing sequence in the U-Boot bootloader. Along the way we improved the debugging facilities available for U-Boot on RISC-V, specifically allowing developers and system integrators to see the callstack in the case that there is a crash during the early boot sequence. Read on to find out what we did.

Adding Stack Unwinding

The U-Boot exception code prints out the state of the system when an exception happens (for example pointer outside of usable memory) but for RISC-V this is just the current cpu registers. To add more information, we decided to add code to print the call stack and since this is a useful feature we submitted this patch.

There are several ways to do stack unwinding, either using table built by the compiler or using a frame pointer register which points to the current stack frame. We decided to go with the frame pointer as that would be the easiest to write code for (approx 23 lines) from this example. The only downside is the code is bigger so this feature is put behind a config option CONFIG_FRAMEPOINTER

For GCC, the default is not to use a frame pointer so in U-Boot when the configuration for unwinding is enabled the compiler flags need the -mno-omit-frame-pointer option. As noted above this makes the U-Boot code larger as each function requires a few more instructions to save the frame pointer and to ensure it is updated. For RISC-V, the frame pointer is in register s0.

The stack frame is always setup so that this pointer can be used to find the previous frame and caller's address and defined by the architecture and compiler options used. The frame pointer (fp) always points to the following data on the stack and thus can be used to follow back to the first frame. On main entry to U-Boot we set the fp to NULL or use the global data gd pointer to avoid changing some of the other entry points:

fp[-1] = return address
fp[-2] = previous frame pointer

In the above example, the fp is a pointer type, so on 64bit systems it is stored as an 8-byte unsigned value.

The Bug Report

After the initial submission, we got a report our patch was not working and the unwinder was going off into places that could not be valid code. Since this was working for us for our build, we tried to dig into the report in case this was being caused by a difference in build config, compiler version or other unforseen issues. We asked the reporter to send some dumps from the code to aid investigation. When we saw the code it was clear there was something we had not seen going on:

0000002000817228 <board_init_r>:
  2000817228:   7d9612ef                jal     t0,2000879200 <__riscv_save_2>
  200081722c:   81aa                    mv      gp,a0

There is a call to __riscv_save_2 here instead of the sequence we expected below which stores the registers onto the stack and updates the frame register to the new frame.

000000200081b3ce <board_init_r>:
  200081b3ce:   1101                    add     sp,sp,-32
  200081b3d0:   e822                    sd      s0,16(sp)
  200081b3d2:   e04a                    sd      s2,0(sp)
  200081b3d4:   ec06                    sd      ra,24(sp)
  200081b3d6:   e426                    sd      s1,8(sp)
  200081b3d8:   1000                    add     s0,sp,32
  200081b3da:   81aa                    mv      gp,a0

Since we hadn't seen this before, we did a quick search for the __riscv_save_2 and this turns out to be a provided by the compiler support library for a feature -msave-restore. The save-restore is a RISC-V compiler option to try and make function call start and end code smaller by using a library routine to do it.

The issue is obviously that the function the error was injected into looked like there was a valid stack frame pointer, but there was no actual frame unwinding data saved and therefore no way to follow this function back. It seems that no-one had noticed before that using the -msave-restore option clashed with the -mno-omit-frame-pointer and that the reporter's local change to either compiler or U-Boot build options caused an issue we had not seen.

How to Fix

The quick fix is to disable the save-restore code, but due to the reporter's local changes to U-Boot we can't change the configuration system to disable having both in at the same time. Better would be to update the compiler to warn, error or disable one of these clashing options. The last and probably the most difficult would be to add support for having both by updating both the compiler and compiler library to have versions of __riscv_save_%0 which would work with the frame pointer.

Since we are using GCC for our builds, the first action was to clone the code and look into the places that generate the code for function wrappers to see what would be required. During the initial checkout and test of GCC, it turns out that we could not repeat the issue. Looking through the gcc/config/riscv/riscv.cc file it turns out this has been fixed recently in another update. In refactoring the test for whether to use the save/restore lib-calls a test function was added which also added a check for the use of frame pointer, and if that was set avoid using the save/restore calls as so:

+riscv_avoid_save_libcall (void)
+{
+  if (!TARGET_SAVE_RESTORE
+      || crtl->calls_eh_return
+      || frame_pointer_needed
+      || cfun->machine->interrupt_handler_p
+      || cfun->machine->varargs_size != 0
+      || crtl->args.pretend_args_size != 0)
+    return true;
+
+  return false;
+}

This means future GCC will not have this issue, which is an improvement. Either this patch could be backported to earlier versions or something similar could be added to avoid this.

LLVM is left as as a future exercise.

Improving GCC

It would be great if the two options could be used together, but when starting to modify gcc/config/riscv/riscv.cc and the gcc/config/riscv/riscv.md we ran into issues with not knowing how the internals of gcc well enough to decipher the compiler errors we got when trying to build some test code. Adding UNSPECV_GPR_SAVE_FP as a version of UNSPECV_GPR_SAVE just gave us an unrecognizable instruction error, as shown below:

test.c: In function ‘main’:
test.c:12:1: error: unrecognizable insn:
   12 | }
      | ^
(insn/f 39 11 40 3 (parallel [
            (unspec_volatile [
                    (const_int 3 [0x3])
                ] UNSPECV_GPR_SAVE_FP)

If we want to go further with this, then we would have to work out what we missed in this addition to get gcc to output the relevant code to call the frame-pointer versions when frame_pointer_needed is set.

Summary

When adding features it is difficult to work out all the permutations of configuration, build environment and where the code is running. Working with the community to iron out issues quickly is good, and hopefully if there are things you cannot fix then someone else can look at fixing.

Codethink are available to help with your RISC-V project, please contact us to find out more about our services.

The U-Boot logo thumbnail is used under CC-BY-SA-4 license.

Other Content

Get in touch to find out how Codethink can help you

sales@codethink.co.uk +44 161 660 9930

Contact us