Thu 27 September 2018

Configuring Linux to stabilise latency

by Beth White, 2018. Tags: Linux FOSS latency configuration deterministic performance

Configuring Linux systems to stabilise latency

Over the last few months, Codethink have investigated whether Linux systems can be configured to behave deterministically, so that performance becomes more predictable over time, and better overall, when the kernel is tuned in certain ways. The investigation was kicked off by Niall Dalton at Tensyr, who started running tests with certain kernel boot parameters changed and saw positive results. Codethink reproduced Niall's experiments and then attempted further performance improvements.

Possible applications of this would be in critical systems, where processes must be able to run either uninterrupted or with a predictable level of interference factored in.

We conclude that, with appropriate separation and tunings, we can get something approximating soft real-time on Linux without too much issue. However, difficulty arises from interference due to the housekeeping processes.

The results we saw from our tuned systems were far better than our baseline systems, and variance was predictably limited.

Approach

We decided to measure performance through latency, and the variation in latency, seen when running simple processes on a Linux system, both with and without external stresses running. We used three different pieces of hardware: an automotive infotainment rig (Intel Atom E3840 processor, mainline 4.14.55 kernel), a Jetson TK1 board (ARM Cortex-A15 processor, Tegra 4.17 kernel), and a Lenovo ThinkPad X240 running minimal Debian (Intel i7-4600U 2.10GHz processor, mainline 4.14.0 kernel).

Our test program was a simple C program that essentially did the following:

'Get time' -> 'Do some work' -> 'Get time'

The stress program we used was stress-ng. The latency was the difference between the start and end time of each measurement. We ran a number of different work processes:

  • usleep()
  • clock_nanosleep()
  • Memory operations
  • Register operations
  • Kernel operation kill(0,0)
  • Kernel operation clock_gettime()
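The 'Get time' -> 'Do some work' -> 'Get time' loop can be sketched roughly as follows. This is a minimal illustration in Python rather than the C program we actually used; the names `measure`, `summarise` and `WORKLOADS` are hypothetical, and the "memory" entry is only a rough stand-in for a memory-bound workload.

```python
import os
import statistics
import time


def measure(work, iterations=1000):
    """Time `work` repeatedly; return one latency sample (ns) per call."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()  # 'Get time'
        work()                          # 'Do some work'
        end = time.perf_counter_ns()    # 'Get time'
        samples.append(end - start)     # latency = end - start
    return samples


def summarise(samples):
    """Reduce a list of latency samples to a few summary figures."""
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.mean(samples),
        "stdev": statistics.pstdev(samples),
    }


# Hypothetical stand-ins for some of the work processes listed above.
WORKLOADS = {
    "usleep": lambda: time.sleep(100e-6),
    "kill(0,0)": lambda: os.kill(0, 0),  # signal 0 only checks, sends nothing
    "clock_gettime": lambda: time.clock_gettime(time.CLOCK_MONOTONIC),
    "memory": lambda: bytes(bytearray(64 * 1024)),  # allocate and copy 64 KiB
}
```

Calling `summarise(measure(WORKLOADS["clock_gettime"]))` on a Linux machine would give the kind of per-workload figures that can then be compared between a baseline and a tuned kernel.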

We ran a mixture of individual tests with each process running on each isolated CPU core (1-3), and parallel tests with a different process on each core, running simultaneously. With the latter, we could see certain events (for example dips and spikes in the latency) occurring at the same time across all three cores. Example below:

[Figures: Laptop test, CPU1 SysCalls (kill); Laptop test, CPU2 Register Bound; Laptop test, CPU3 Memory Bound]

Kernel Configurations

The following boot parameters (supplied by Niall Dalton) were used:

  • Ignore corrected errors and associated scans that cause periodic latency spikes with mce=ignore_ce
  • Avoid logging of backtraces when a process executes on a CPU for longer than the softlockup threshold with nosoftlockup
  • Limit how often VM statistics are updated by setting vm.stat_interval=120 (a sysctl setting rather than a boot parameter)
  • Stop the kernel from trying to coalesce our pages into huge pages with transparent_hugepage=never
  • Disable CPU idling so that we're running at max performance with processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 (this negates power saving)
  • Isolate CPUs 1, 2 and 3 with isolcpus=1,2,3 (replacing the numbers with whichever CPUs you wish to isolate). We found this to cause the most significant improvement in latency and variance. Our setup was as follows:
    • CPU0 - housekeeping processes
    • CPUs 1-3 - running the latency test
    • Stress applied to CPUs 1-3
  • Enable fully tickless mode for each isolated CPU with nohz_full=1,2,3
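One way to confirm that a machine was actually booted with the parameters listed above is to compare them against /proc/cmdline. The sketch below is only an illustration: `EXPECTED`, `missing_params` and `read_cmdline` are hypothetical names, and vm.stat_interval is left out because it is normally set through sysctl rather than on the kernel command line.

```python
# Boot parameters from the list above, for a setup isolating CPUs 1-3.
EXPECTED = [
    "mce=ignore_ce",
    "nosoftlockup",
    "transparent_hugepage=never",
    "processor.max_cstate=1",
    "idle=poll",
    "intel_idle.max_cstate=0",
    "isolcpus=1,2,3",
    "nohz_full=1,2,3",
]


def missing_params(cmdline, expected=EXPECTED):
    """Return the expected parameters that do not appear in `cmdline`."""
    present = set(cmdline.split())
    return [param for param in expected if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read().strip()
```

On a test machine, `missing_params(read_cmdline())` should return an empty list once the bootloader configuration has been updated and the system rebooted.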

Various "housekeeping" and "actually do the IO work" threads exist and need to run in the kernel, but we want to keep them away from our designated cores. So, we moved IRQs away from our CPUs and pointed them towards the housekeeping CPU with

echo <CPU bitmask> > /proc/irq/<IRQ number>/smp_affinity
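The CPU bitmask in the command above is a hexadecimal number with one bit per CPU, so the housekeeping CPU0 corresponds to 1 and CPUs 1-3 together correspond to e. A small sketch of the same operation follows; the function names are hypothetical, and it assumes a system with fewer than 32 CPUs (larger masks may need the kernel's comma-separated group format).

```python
def cpu_mask_hex(cpus):
    """Build the hexadecimal affinity mask with one bit set per CPU."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def set_irq_affinity(irq, cpus, root="/proc/irq"):
    """Write the mask for `cpus` to an IRQ's smp_affinity file.

    Needs root on a real system, and the kernel may reject the write
    for IRQs that cannot be moved (raising OSError).
    """
    with open(f"{root}/{irq}/smp_affinity", "w") as f:
        f.write(cpu_mask_hex(cpus))
```

For example, `set_irq_affinity(42, [0])` would point a hypothetical IRQ 42 at the housekeeping CPU, equivalent to echoing 1 into its smp_affinity file.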

The rest of the configurations are as follows (note, each item of hardware used a slightly different set of configurations, depending on its capabilities):

  • PREEMPT_RT patch applied.
  • Set the CPU governor to performance and locked the CPU frequency.
  • Set the clock source to TSC (as opposed to HPET).
  • Set the scheduling policy to FIFO (as opposed to RR, round robin).
  • Set the priority of our process to 20 so it would take precedence over other processes occurring in the kernel.
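The last two items, together with pinning the test to an isolated core, could look something like the sketch below. It is written in Python for illustration only (our test program was in C), the function names are hypothetical, and it needs root or CAP_SYS_NICE on a Linux system to succeed.

```python
import os


def fifo_priority_range():
    """Return the (min, max) priorities the kernel allows for SCHED_FIFO."""
    return (
        os.sched_get_priority_min(os.SCHED_FIFO),
        os.sched_get_priority_max(os.SCHED_FIFO),
    )


def make_realtime(cpu, priority=20):
    """Pin the calling process to `cpu` and run it as SCHED_FIFO."""
    lowest, highest = fifo_priority_range()
    if not lowest <= priority <= highest:
        raise ValueError(f"priority must be in {lowest}..{highest}")
    os.sched_setaffinity(0, {cpu})  # pid 0 means the calling process
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
```

Calling `make_realtime(cpu=1)` before starting a measurement would place the process on isolated CPU1 with FIFO priority 20, matching the setup described above.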

Results

A selection of graphs plotted from the automotive rig data is shown below, with the tuned kernel results plotted against the baseline results. Comparing the two, we can see less variance as well as a marked improvement in overall latency, of anywhere between 40% and 96%, when using the tuned kernel.

[Figures: Rig test, CPU1 Memory Bound; Rig test, CPU1 Register Bound; Rig test, CPU2 SysCalls clock_gettime()]
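For readers who want to compute comparable figures from their own runs, the sketch below shows one plausible way to express improvement and spread as percentages. The formulas are an assumption made for illustration, not necessarily the exact calculation behind our graphs, and the function names are hypothetical.

```python
import statistics


def improvement_percent(baseline, tuned):
    """Reduction in mean latency from baseline to tuned, as a percentage."""
    base = statistics.mean(baseline)
    return 100.0 * (base - statistics.mean(tuned)) / base


def spread_percent(samples):
    """Largest deviation from the mean, as a percentage of the mean."""
    mean = statistics.mean(samples)
    return 100.0 * max(abs(s - mean) for s in samples) / mean
```

With made-up samples, `improvement_percent([100, 100], [40, 60])` gives 50.0 and `spread_percent([95, 100, 105])` gives 5.0.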

Full presentation of our results can be found here; you will need to run a program called Jupyter to view them. Instructions on how to install Jupyter and the necessary dependencies can be found here.

For the truly curious, our GitLab instance is open and can be viewed here. There are a number of wiki pages containing information and results from the tests we ran and developed throughout the project.

Analyses

In the tests run on the laptop, we saw jumps in latency every so often. These correspond with CPU frequency changes seen in the datasets, probably due to thermal variance in the Intel processor. Example below:

[Figure: Laptop test, CPU1 Memory Bound]
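To line latency jumps up against frequency changes, each latency sample can be recorded together with the CPU frequency at that moment. The sketch below assumes the cpufreq sysfs interface (scaling_cur_freq, reported in kHz) is available on the machine; the function names are hypothetical and this is not the logging code from our test program.

```python
import time


def read_cpu_freq_khz(cpu, root="/sys/devices/system/cpu"):
    """Read the current frequency of `cpu` in kHz from cpufreq sysfs."""
    path = f"{root}/cpu{cpu}/cpufreq/scaling_cur_freq"
    with open(path) as f:
        return int(f.read().strip())


def measure_with_freq(work, cpu, iterations=1000,
                      root="/sys/devices/system/cpu"):
    """Return (latency_ns, freq_khz) pairs, one per call to `work`."""
    rows = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        work()
        latency = time.perf_counter_ns() - start
        # Read the frequency outside the timed region so it does not
        # add to the measured latency.
        rows.append((latency, read_cpu_freq_khz(cpu, root)))
    return rows
```

Plotting the two columns against the sample index makes it easy to see whether a step in latency coincides with a step in frequency.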

Since we saw similar (albeit less frequent) frequency changes on the automotive rig, and didn't see these jumps on the Jetson TK1, we conclude that:

a) ARM processors may be less susceptible than Intel processors to CPU frequency changes, and
b) Intel Atom processors may not jump in frequency as often as the Intel i7, but the jumps will likely last for longer when they do occur.

It is worth noting that even with the jumps, the variance remains within ~5-6%.

Another interesting event we saw in the automotive rig parallel tests was a series of periodic spikes in the tuned kernel when under stress, on only one CPU. Example below:

[Figure: Rig parallel tests, CPU3 Memory Bound]

Since we were allowing stress to run across CPUs 1-3, we believe the kernel had allocated it to that core and, despite our process having priority, was being fair to both processes by allowing stress to interrupt intermittently. From this we can conclude that even when we tune and set ourselves to be high priority, without a real-time scheduler we can't stop ourselves being pre-empted eventually. However, the periodicity means this can be factored in, and the results are still much better than baseline.

Conclusions

  • We are seeing largely positive results.
  • Where we don't have positive results, there is justification. Even when Linux forces us to give up the CPU, it's by a predictable amount and for a predictable amount of time, and both of these elements can be factored in. Real-time is not about being super-fast, after all; it's about being predictable.
  • We can confidently say that with the tunings used, you can expect latency variance within ~5%, with ~6% as a maximum.
  • Next steps would be to investigate how far we can isolate the housekeeping processes.
  • Things we don't yet have a grasp of in this context would be other kernel threads, and system management mode.