Fri 02 June 2023

Adding RISC-V Vector Cryptography Extension support to QEMU

RISC-V is an open source instruction set architecture (ISA) based on reduced instruction set computer (RISC) principles. Codethink has been working with the RISC-V CPU architecture for several years. We've done some internal projects around hardware design, toolchain support and porting a desktop environment. We also do commercial work in this area, and a project team recently added support to QEMU for an extension to the RISC-V instruction set that provides Vector Cryptography. Read on for details of how they did this work.

Our task, sponsored by SiFive, was to add full support to QEMU (a generic system emulator capable of simulating different architectures) for RISC-V's vector cryptography (vcrypto) extension set. This extension set provides instructions for implementing various cryptographic algorithms, including AES, SHA-2 and the ShangMi suites. Adding support to QEMU is one of the required steps for getting the extension ratified.

Unlike its scalar equivalent (which had already been implemented in QEMU), the vector cryptography extension leverages vector registers to increase the throughput of cryptographic operations. Such registers can be of varying bit lengths and are divided into element groups of some smaller length. A cryptographic operation can then be applied to all the element groups at once.
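As a rough illustration of the element-group idea, the Python sketch below splits a register's bytes into fixed-width groups and applies one operation to each. The 256-bit register length, the 128-bit group width and the helper names are assumptions chosen for the example, not values taken from the specification or from QEMU.

```python
def split_element_groups(register: bytes, group_bits: int) -> list[bytes]:
    """Split a register's bytes into consecutive element groups."""
    group_bytes = group_bits // 8
    if len(register) % group_bytes != 0:
        raise ValueError("register length must be a multiple of the group width")
    return [register[i:i + group_bytes]
            for i in range(0, len(register), group_bytes)]


def apply_to_groups(register: bytes, group_bits: int, op) -> bytes:
    """Apply op to every element group and reassemble the register."""
    return b"".join(op(g) for g in split_element_groups(register, group_bits))


# A hypothetical 256-bit register holding two 128-bit element groups.
register = bytes(range(32))
groups = split_element_groups(register, 128)
# Stand-in operation: invert every byte of each group.
inverted = apply_to_groups(register, 128, lambda g: bytes(b ^ 0xFF for b in g))
```

A longer register simply yields more groups, while the per-group operation stays the same.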

Vector processing has a similar goal to SIMD, but is claimed to have some subtle advantages. With vector processing the CPU is given maximal information about the data it is operating on, which may in principle allow it to implement optimisations such as "vector chaining" [1]. The RISC-V implementation is also more flexible: for example, the instructions are independent of vector register length.

QEMU

Instruction setup

For each instruction in the vcrypto extension we had to add support for:

  1. decoding the instruction's bit pattern

  2. checking the instruction was valid

  3. translating it into the host's instruction set.

To show how this was done, let's work through vaesz, one of the simplest vector crypto instructions, as an example. The specification has an entry for this instruction (current at the time of writing). The instruction is part of the Zvkned extension, which implements the AES block cipher.

As you can see in the specification document, this instruction takes as arguments two vector registers, labelled vd and vs2 (only the first element group from vs2 is used – hence the term scalar). vd is then overwritten by the output from the instruction.

To support this instruction in QEMU, the first step is to add its bitwise encoding to target/riscv/insn32.decode, allowing QEMU to recognise the instruction when it appears in a binary. The encoding is given by the table in the "Encoding (Vector-Scalar)" section of the specification. Note that OP-P and OPMVV correspond to 1110111 and 010 respectively.
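To make the decoding step concrete, here is a small Python model of matching a 32-bit instruction word against a fixed bit pattern. It is not QEMU's decoder. The field positions follow the standard RISC-V vector arithmetic layout, and the OP-P and OPMVV values come from the text above; the funct6, vm and vs1 constants for vaesz are assumptions for illustration and should be checked against the specification.

```python
OPCODE_OP_P = 0b1110111    # bits 6..0, as stated in the text
FUNCT3_OPMVV = 0b010       # bits 14..12, as stated in the text
# Assumed values for vaesz; verify against the specification.
FUNCT6_VAESZ = 0b101001    # bits 31..26
VM_VAESZ = 1               # bit 25
VS1_VAESZ = 0b00111        # bits 19..15


def bits(word: int, hi: int, lo: int) -> int:
    """Return bits hi..lo (inclusive) of a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_vaesz(word: int):
    """Return the register operands if word matches the pattern, else None."""
    if bits(word, 6, 0) != OPCODE_OP_P:
        return None
    if bits(word, 14, 12) != FUNCT3_OPMVV:
        return None
    if bits(word, 31, 26) != FUNCT6_VAESZ or bits(word, 25, 25) != VM_VAESZ:
        return None
    if bits(word, 19, 15) != VS1_VAESZ:
        return None
    return {"vd": bits(word, 11, 7), "vs2": bits(word, 24, 20)}


# Hand-assembled word with vd = v2 and vs2 = v4 under the assumed encoding.
word = ((FUNCT6_VAESZ << 26) | (VM_VAESZ << 25) | (4 << 20)
        | (VS1_VAESZ << 15) | (FUNCT3_OPMVV << 12) | (2 << 7) | OPCODE_OP_P)
```

In QEMU itself this matching is described declaratively as a pattern line in insn32.decode rather than written by hand.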

The RISC-V vector spec defines several parameters that can be used to tune the behaviour of the vector instructions. Some settings can render an instruction illegal, so the next step is to have QEMU check for this. For our instruction vaesz, the requirements are contained within the if clause of the pseudo code. The instruction is illegal if (vstart%EGS)<>0, meaning the starting location of the data within the register does not fall on an element group boundary, or if LMUL*VLEN < EGW, meaning there is not enough register space for at least one element group.
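The two conditions can be written down directly. The sketch below is a Python model of the check, not QEMU's code. It assumes EGS = 4 and EGW = 128 (four 32-bit elements per AES element group) and uses a Fraction for LMUL so that fractional settings such as 1/2 are handled exactly.

```python
from fractions import Fraction

EGS = 4     # elements per element group (assumed for vaesz)
EGW = 128   # element group width in bits (assumed: 4 x 32-bit elements)


def vaesz_is_illegal(vstart: int, lmul: Fraction, vlen: int) -> bool:
    """Mirror the two conditions quoted from the pseudo code's if clause."""
    misaligned_start = (vstart % EGS) != 0
    too_little_space = lmul * vlen < EGW
    return misaligned_start or too_little_space


# VLEN = 128 with LMUL = 1/2 leaves only 64 bits: too small for one group.
half_register = vaesz_is_illegal(0, Fraction(1, 2), 128)
```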

Next we need to handle translating the RISC-V instruction into the host's instruction set. As you might expect, QEMU already has quite a lot of tooling set up for this. We just needed to implement the pseudo code written in the spec as C code, and QEMU would then handle the dynamic translation to the host's instruction set using its backend, TCG. You may be able to decipher that the pseudo code iterates across the element groups in vd and applies a bitwise XOR against the scalar in vs2. QEMU treats vector registers as arrays, so this was relatively simple to implement.
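The loop structure can be sketched in a few lines. This is a Python model rather than the C helper that went into QEMU, with registers represented as lists of 32-bit elements. The initial AES round-key addition is an XOR, so each element group of vd is XORed with the first element group of vs2. EGS = 4 and the function name are assumptions for the example.

```python
EGS = 4  # 32-bit elements per element group (assumed, as for AES)


def vaesz(vd: list[int], vs2: list[int], vstart: int, vl: int) -> list[int]:
    """Return the new vd: each element group XORed with vs2's first group."""
    key = vs2[:EGS]                  # only the first group of vs2 is used
    result = list(vd)
    for group in range(vstart // EGS, vl // EGS):
        base = group * EGS
        for j in range(EGS):
            result[base + j] = (vd[base + j] ^ key[j]) & 0xFFFFFFFF
    return result


vd = [0, 1, 2, 3, 4, 5, 6, 7]
vs2 = [0xFF, 0xFF, 0xFF, 0xFF, 9, 9, 9, 9]   # second group is ignored
out = vaesz(vd, vs2, vstart=0, vl=8)
```

Applying the same operation twice with the same key restores the original register contents, which makes a handy sanity check.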

Implementing more sophisticated instructions follows the same principles as the simple one outlined here; it just involves more complicated pseudo code.

Testing

The vcrypto specification was in its early stages when our project began and had no prior implementation, so testing was paramount to ensure we had understood and implemented the specification as written.

Our sponsor provided a test suite that they had written internally, which we used to test our implementation. This suite consisted of auto-generated assembly code containing positive and negative test cases, which we could run within QEMU.

It was important to have rapid tests that we could run ourselves to verify our work against the latest specification, so we developed 'framework' tests. These ran in Linux userland and generated JIT instructions with random parameters. They covered the main weaknesses of the sponsor's test suite, namely that its assembly code was harder to debug than C and that it lagged behind the specification. However, they were limited in that they could only test positive cases.
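The sketch below shows one way to generate instruction words with random parameters. It is not the framework described above, only a minimal Python illustration. The helper names are hypothetical, and the funct6, vm and vs1 values in the vaesz encoding are assumptions that should be checked against the specification.

```python
import random

# Fixed bits of the assumed vaesz encoding: funct6, vm, vs1, funct3, opcode.
VAESZ_FIXED = ((0b101001 << 26) | (1 << 25) | (0b00111 << 15)
               | (0b010 << 12) | 0b1110111)


def encode_vaesz(vd: int, vs2: int) -> int:
    """Place the two register numbers into the fixed instruction pattern."""
    if not (0 <= vd < 32 and 0 <= vs2 < 32):
        raise ValueError("register numbers must be in 0..31")
    return VAESZ_FIXED | (vs2 << 20) | (vd << 7)


def random_cases(count: int, seed: int) -> list[tuple[int, int, int]]:
    """Return (vd, vs2, instruction word) triples from a seeded generator."""
    rng = random.Random(seed)
    cases = []
    for _ in range(count):
        vd, vs2 = rng.randrange(32), rng.randrange(32)
        cases.append((vd, vs2, encode_vaesz(vd, vs2)))
    return cases


cases = random_cases(100, seed=1)
```

Seeding the generator keeps a failing case reproducible, which matters when the instruction stream is different on every run.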

Endianness

Running the above test suites on our x86-64 laptops gave us some confidence that our implementation was correct. However, it didn't provide the full picture. To understand why, we need to go off on a (hopefully) interesting tangent.

There are two (sane) standards for loading data in and out of CPU registers: little endian (LE) and big endian (BE). With LE, the data's least significant byte is stored in the smallest memory address and the most significant in the largest. It is vice versa for BE.
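The difference is easy to see by serialising the same 32-bit value both ways. This short Python example uses an arbitrary value chosen so that each byte is distinct.

```python
value = 0x0A0B0C0D

# Little endian: least significant byte (0x0D) at the lowest address.
little = value.to_bytes(4, "little")

# Big endian: most significant byte (0x0A) at the lowest address.
big = value.to_bytes(4, "big")

# Reading LE bytes as if they were BE yields a byte-swapped value.
misread = int.from_bytes(little, "big")
```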

Like almost all CPUs in personal computers, RISC-V CPUs are LE. QEMU, on the other hand, can run on LE and BE hosts. Hence someone may try emulating an LE RISC-V CPU on some BE host. If we weren't careful in our handling of vector registers we could've introduced some subtle bugs in this scenario, so we endeavoured to be as careful as possible.
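To illustrate the kind of bug that can creep in, the Python sketch below models a hypothetical emulator that stores a guest vector register as 64-bit words in host byte order. Whether QEMU uses this exact layout is not covered here, so treat the layout and the helper names as assumptions. Under that layout, a plain array access for a 32-bit element picks the wrong half of each word on a BE host unless the index is adjusted.

```python
import struct


def store_register(words: list[int], host_order: str) -> bytes:
    """Lay out 64-bit words in memory the way a host of that order would."""
    prefix = "<" if host_order == "little" else ">"
    return struct.pack(f"{prefix}{len(words)}Q", *words)


def read_elem32_naive(mem: bytes, idx: int, host_order: str) -> int:
    """Read 32-bit element idx with a plain host-order array access."""
    prefix = "<" if host_order == "little" else ">"
    return struct.unpack_from(f"{prefix}I", mem, 4 * idx)[0]


def read_elem32_portable(mem: bytes, idx: int, host_order: str) -> int:
    """On a BE host the two halves of each 64-bit word swap places."""
    if host_order == "big":
        idx ^= 1
    return read_elem32_naive(mem, idx, host_order)


# Guest view: four 32-bit elements, element 0 in the low half of word 0.
elems = [0x10, 0x11, 0x12, 0x13]
words = [(elems[1] << 32) | elems[0], (elems[3] << 32) | elems[2]]
mem_le = store_register(words, "little")
mem_be = store_register(words, "big")
```

On the LE layout both readers agree, which is why this class of bug stays hidden until the code runs on a BE host.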

Ideally we would've run our test suites on BE CPUs but alas we had no access to such hardware, as BE consumer hardware has been largely phased out in favour of LE CPUs that can run in BE mode, such as some ARM chips. Somewhat bizarrely though, QEMU-in-QEMU is actually a supported use of the program. Hence we could run our QEMU tests within a QEMU emulation of BE hardware! Finding a pre-built BE Linux image proved tricky, so we opted for running FreeBSD in a 64-bit PowerPC emulation. As is typical of FreeBSD, they provide helpful documentation for doing this.

With this in hand we could ensure that our tests passed on BE CPUs 🥳. The only drawback to this approach came from the inherent performance hit associated with emulation. Building QEMU and running tests went from taking a few minutes on a laptop to multiple hours when done within the PowerPC emulation.

Upstreaming

The upstreaming process has been a cycle of sending patches by email and implementing feedback. This began with an RFC, followed by formal submissions, of which the 4th revision is currently being prepared: https://lists.nongnu.org/archive/html/qemu-devel/2023-04/msg05580.html. The vcrypto spec has ostensibly been frozen so hopefully future changes to the patchset will be minimal!

Cover photo by Laura Ockel on Unsplash.


  1. There is a talk at the 2020 RISC-V Summit that discusses the concept of vector chaining. See, for example, 12:17.
