Codethink is a software company that works on client projects in areas ranging from medical and finance to automotive. Across these areas, clients trust us with a wide variety of problems in their systems. One such problem involved a userspace program that occasionally failed to respond on time. It was challenging to debug because it did not show up on every cycle or every attempt. This is how we used Tracecompass to solve it.
After initial debugging, we were fairly convinced that the problem was caused by process scheduling: the process was not being allocated CPU time when it needed it. The scheduler is the kernel component that decides which runnable thread the CPU will execute next. Since the scheduler was now the suspect, we decided to look at how the different threads were being scheduled.
There are many tools for looking into Linux kernel internals; the most basic of them is ftrace, which can trace functions or kernel events through the hundreds of static event points placed at various places in the kernel code. Other tools build on the same kernel instrumentation and extend what can be done with it; one such tool is LTTng. LTTng is designed to provide low-overhead tracing on production systems. Its tracers achieve this through a combination of techniques such as per-CPU buffering, RCU data structures, and a compact and efficient binary trace format. LTTng disturbs the traced system as little as possible, which makes it feasible to trace subtle race conditions and rare interrupt cascades.
Installing LTTng was a fairly straightforward, two-step process:
- Add the LTTng modules to the kernel.
- Build and install the userspace components.
Once LTTng was working on the client hardware, we looked for tools that could analyse the trace and present it in a user-friendly form. Several viewers are available.
Babeltrace can convert the generated trace to text, or be used to write custom analyses in Python. After evaluating the available options against our task, we chose Tracecompass, an open source application for solving performance and reliability issues by reading and analysing the traces and logs of a system. Its goal is to provide views, graphs, metrics and more to help the user extract useful information from traces, in a way that is more user-friendly and informative than huge text dumps.
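As a rough illustration of the kind of custom analysis on text output mentioned above, the sketch below parses one line of Babeltrace text into a Python dictionary. The line layout (bracketed timestamp, delta, hostname, event name, then `{ key = value }` groups) is an assumption based on Babeltrace's default text output and may differ between versions. The sample line, the hostname `testhost` and the `parse_event` helper are made up for illustration.

```python
import re

# Assumed layout of one line of Babeltrace text output:
# [HH:MM:SS.nnnnnnnnn] (+delta) hostname event_name: { context }, { payload }
LINE_RE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\s+\([^)]*\)\s+(?P<host>\S+)\s+"
    r"(?P<event>[\w:]+):\s+(?P<body>.*)$"
)
# One "key = value" pair; values are either quoted strings or bare tokens.
FIELD_RE = re.compile(r'(\w+) = ("(?:[^"\\]|\\.)*"|[^,}\s]+)')

# Hypothetical sample line, not taken from a real trace.
SAMPLE = (
    '[13:06:42.371543432] (+0.000003281) testhost sched_switch: '
    '{ cpu_id = 1 }, { prev_comm = "swapper/1", prev_tid = 0, prev_prio = 20, '
    'prev_state = 0, next_comm = "test_thread", next_tid = 2263, next_prio = 20 }'
)


def parse_event(line):
    """Return a dict describing one text line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = {}
    for key, raw in FIELD_RE.findall(match.group("body")):
        if raw.startswith('"'):
            fields[key] = raw[1:-1]      # strip the surrounding quotes
        else:
            try:
                fields[key] = int(raw)   # most scheduler fields are integers
            except ValueError:
                fields[key] = raw        # keep anything else as text
    return {
        "time": match.group("time"),
        "host": match.group("host"),
        "event": match.group("event"),
        "fields": fields,
    }
```

Calling `parse_event(SAMPLE)` gives an entry whose `event` is `sched_switch` and whose `fields["next_tid"]` is `2263`, which is enough to filter a text dump down to the threads of interest.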
Once we had decided on the tools, we set up the client system for an auto-test in which a script starts an LTTng session, triggers the steps needed to reproduce the problem, and saves the logs. We left this running overnight and checked the logs the next morning, using a second script to find the particular cycle (if any) in which the problem had been reproduced. After a few days of testing we had collected several cycles of logs showing the problem, and we started looking at the LTTng traces in Tracecompass. Our log collection script had also been capturing top output, which gave us the pid and tid we needed to follow in Tracecompass. Tracecompass then displays the data as graphs that are easy to read, with a legend defining the colour codes the graphs use.
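One cycle of such an overnight auto-test could be sketched as below. The `lttng` subcommands are the standard LTTng command-line ones, but the session naming, the output directory, the choice of events (`sched_switch`, `sched_wakeup`) and the `reproduce` callback are assumptions for illustration, not the script we ran on the client system.

```python
import subprocess


def session_commands(name, output_dir):
    """Return (setup, teardown) lttng command lists for one kernel session."""
    setup = [
        ["lttng", "create", name, f"--output={output_dir}"],
        # Scheduler events only, to keep the trace small (an assumption).
        ["lttng", "enable-event", "--kernel", "sched_switch,sched_wakeup"],
        ["lttng", "start"],
    ]
    teardown = [
        ["lttng", "stop"],
        ["lttng", "destroy", name],
    ]
    return setup, teardown


def run_cycle(cycle, reproduce, run=subprocess.check_call, base_dir="/tmp/traces"):
    """Trace one reproduction attempt and return the trace directory.

    `reproduce` is a hypothetical callable that triggers the problem;
    `run` executes one command and can be replaced for a dry run.
    """
    name = f"cycle-{cycle:04d}"
    output_dir = f"{base_dir}/{name}"
    setup, teardown = session_commands(name, output_dir)
    for cmd in setup:
        run(cmd)
    try:
        reproduce()
    finally:
        # Always stop and destroy the session so the next cycle starts clean.
        for cmd in teardown:
            run(cmd)
    return output_dir
```

Giving each cycle its own session and directory means a later script only has to pick out the cycle number in which the problem occurred and open that one trace.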
After going through the logs we found the problem: a thread was indeed being blocked because the CPU was unavailable. We cannot show the client logs here, but we recreated the original problem by tweaking process priorities, which produces a similar trace in Tracecompass.
In the recreated trace, the two test_thread threads with tid 2263 and tid 2264 are our test threads. At point A, tid 2263 started waiting for the CPU; at point B, tid 2264 started waiting as well, and both stayed in “Wait for CPU” mode until point C. At that point the scheduler gave the CPU to tid 2264, even though tid 2263 had been waiting longer (since A). Thread tid 2263 only got the CPU at point D, when tid 2264 finished. Once we knew what the problem was, it was easy to suggest a fix, and we tested the suggested solution in the same auto-test setup to confirm that it worked before delivering it to the client.
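The waiting intervals read off the graph can also be computed from the scheduler events themselves. The sketch below is a simplified model, not what Tracecompass does internally: it assumes each event has already been reduced to a `(timestamp, name, fields)` tuple using LTTng's `tid` and `next_tid` field names, counts a thread as waiting from its `sched_wakeup` until the `sched_switch` that schedules it in, and ignores preemption. The timestamps are invented so that their order mirrors points A to D.

```python
def wait_for_cpu(events):
    """Return {tid: [wait durations]} measured from wakeup to being scheduled in."""
    runnable_since = {}   # tid -> timestamp of the wakeup still being waited on
    waits = {}
    for timestamp, name, fields in events:
        if name == "sched_wakeup":
            # Keep the earliest wakeup if several arrive before the thread runs.
            runnable_since.setdefault(fields["tid"], timestamp)
        elif name == "sched_switch":
            tid = fields["next_tid"]
            started = runnable_since.pop(tid, None)
            if started is not None:
                waits.setdefault(tid, []).append(timestamp - started)
    return waits


# Invented timestamps (arbitrary units) in the same order as points A, B, C, D.
EVENTS = [
    (10, "sched_wakeup", {"tid": 2263}),                         # A
    (20, "sched_wakeup", {"tid": 2264}),                         # B
    (50, "sched_switch", {"prev_tid": 0, "next_tid": 2264}),     # C
    (80, "sched_switch", {"prev_tid": 2264, "next_tid": 2263}),  # D
]
```

With these made-up events, `wait_for_cpu(EVENTS)` reports a wait of 70 units for tid 2263 and 30 units for tid 2264, the same imbalance the graph shows between A to D and B to C.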