Thu 22 February 2024

Lessons learnt from building a distributed system in Rust

One of Codethink's goals is to promote the existence of safe and reproducible software out in the wild. With its numerous safety guarantees and sane package management, Rust is a natural tool to utilise in this endeavour.

We were recently hired by a client to write a 3-node distributed system in Rust, where each node runs a multistage pipeline in parallel and communicates with the other nodes throughout. The project ran to a short and tight schedule, with rolling deadlines by which pre-defined work packages had to be delivered (this was not stressful at all 😅).

The project was successful in the end and here are some of the lessons we learnt in retrospect.

Rust does not slow down development

A criticism you may often hear levelled against Rust is that it takes longer for developers to write than less pedantic languages like Python or C. Let me assure you, this was very much not the case here!

This project began with us building libraries and sub-components of the system. Only in the final third did we integrate these components into the final services that would be run in production. Due to the project's tight schedule, the pressure was on for us to develop the sub-components quickly and get them right first time.

Rust's design choice to use explicit error return types rather than exceptions significantly reduces the number of hidden points of failure. This made it so much easier to reason about how the sub-components could fail and how we wanted to handle those failures. The ability to propagate errors also let us handle failures all in one place - again, easier to reason about a priori. Remember, this was all before we were able to set up proper integration testing.

Of course, we did not get everything right first time - but it was not far off! I dread to think what kind of knots we would have got wrapped in had we tried to follow this approach in an exception-based language.
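To make the point about explicit error types concrete, here is a minimal sketch rather than code from the project. The names `StageError`, `parse_reading` and `sum_readings`, and the 0-100 range check, are all hypothetical. It shows how every failure mode appears in a function's signature, how `?` propagates errors upwards, and how the caller can then deal with them in a single place.

```rust
use std::fmt;
use std::num::ParseIntError;

// Hypothetical error type for one pipeline stage. Every way the stage can
// fail is listed here, so nothing is hidden from the caller.
#[derive(Debug, PartialEq)]
enum StageError {
    Parse(ParseIntError),
    OutOfRange(i64),
}

impl fmt::Display for StageError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StageError::Parse(e) => write!(f, "could not parse input: {e}"),
            StageError::OutOfRange(v) => write!(f, "value {v} is out of range"),
        }
    }
}

impl std::error::Error for StageError {}

// Lets `?` convert a ParseIntError into a StageError automatically.
impl From<ParseIntError> for StageError {
    fn from(e: ParseIntError) -> Self {
        StageError::Parse(e)
    }
}

// Parses one reading; the signature tells the caller this can fail and how.
fn parse_reading(raw: &str) -> Result<i64, StageError> {
    let value: i64 = raw.trim().parse()?; // parse failure is propagated
    if !(0..=100).contains(&value) {
        return Err(StageError::OutOfRange(value));
    }
    Ok(value)
}

// Sums all readings, stopping at the first error and passing it upwards.
fn sum_readings(raws: &[&str]) -> Result<i64, StageError> {
    let mut total = 0;
    for raw in raws {
        total += parse_reading(raw)?;
    }
    Ok(total)
}

fn main() {
    // The single place where failures from the whole stage are handled.
    match sum_readings(&["10", " 20 ", "30"]) {
        Ok(total) => println!("total = {total}"),
        Err(e) => eprintln!("stage failed: {e}"),
    }
}
```

Because the error type is an ordinary enum, adding a new failure mode forces every exhaustive `match` on it to be revisited at compile time, which is a large part of why this style is easy to reason about before any integration tests exist.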

Navigating async Rust - a pain worth enduring

We used async programming on this project to ensure nodes could perform CPU-bound work whilst concurrently keeping multiple communication channels open with other nodes. Async in Rust is starting to gain a reputation for being an overly complicated beast. You can see withoutboats' discussion of this for a nice detailed summary, but the pertinent point for our purposes is that the complexity of async is partly a consequence of Rust's unwillingness to sacrifice performance [1].

Admittedly, we too found ourselves stumbling over Rust's async complexity several times, particularly with regard to shutdown and cancellation safety. We wanted it to be possible to stop and/or restart the services without corrupting data, and this required ensuring that all tasks, regardless of what thread they were on, listened for and correctly handled certain signals. This is not something the compiler gives you, and it required careful thought on our part.
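The general shape of the shutdown problem can be sketched as follows. This is not the project's code: to stay dependency-free it uses plain OS threads and a shared flag from the standard library rather than an async runtime, and the names `run_worker` and `run_until_shutdown` are hypothetical. The idea it illustrates is the same, though: each worker only checks for the shutdown signal between units of work, so a unit is never abandoned half-way.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Hypothetical worker. Each unit of work writes a "begin" and an "end"
// record; stopping between the two would leave the log corrupted.
fn run_worker(id: usize, shutdown: Arc<AtomicBool>) -> Vec<String> {
    let mut log = Vec::new();
    let mut unit = 0u64;
    // The flag is only consulted here, between units of work.
    while !shutdown.load(Ordering::SeqCst) {
        log.push(format!("worker {id} begin {unit}"));
        thread::sleep(Duration::from_millis(2)); // stand-in for real work
        log.push(format!("worker {id} end {unit}"));
        unit += 1;
    }
    log
}

// Starts the workers, lets them run for a while, then asks all of them to
// stop and waits until each has finished its current unit of work.
fn run_until_shutdown(workers: usize, run_for: Duration) -> Vec<Vec<String>> {
    let shutdown = Arc::new(AtomicBool::new(false));
    let handles: Vec<_> = (0..workers)
        .map(|id| {
            let shutdown = Arc::clone(&shutdown);
            thread::spawn(move || run_worker(id, shutdown))
        })
        .collect();

    thread::sleep(run_for);
    shutdown.store(true, Ordering::SeqCst); // broadcast the signal

    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}

fn main() {
    let logs = run_until_shutdown(3, Duration::from_millis(20));
    for (id, log) in logs.iter().enumerate() {
        // Every log has an even number of records: no unit was cut short.
        println!("worker {id} wrote {} records", log.len());
    }
}
```

In async code the difficulty is that a task can also be dropped at any `.await` point, so the "only stop between units" guarantee has to be designed in deliberately rather than falling out of the loop structure as it does here.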

From our perspective though, this complexity was worth enduring for the sake of performance. Each node in our distributed system could be performing 1500-2000 concurrent operations, which is at the scale where the runtime overheads from, for example, stackful green threads can come into play. We spent the final quarter of this project squeezing every last drop of performance out of our services, so such an overhead might have been intolerable.

Scientific Rust is not quite there yet

One last thing that struck us when working on this project was the gap in maturity between the scientific/statistical libraries in C++ and Python and those in Rust. There is no real equivalent in Rust yet for the advanced algorithms to be found in scipy or sympy on the Python side or the Boost Math library on the C++ side. Crates like statrs, for example, certainly provide useful functions, but they are a long way off feature parity with the aforementioned libraries. In our case we noticed this especially for functions relating to probability distributions.

Now, this is not exactly surprising. The Python libraries had a head start of about a decade on the release of Rust 1.0 alone - and boost even more - so of course there will be differences in maturity. Having said this, a concern of ours is that progress will not be made without greater uptake of Rust within the scientific communities.

It will be interesting to watch this space. Scientists could surely benefit from all the usual advantages Rust brings to writing programs with high complexity - but they also have a reputation for wanting their languages to "get out of their way" and just let them do their clever sciencey stuff. Rust, with its opinionated compiler who is desperate to save you from yourself, is certainly not one of those!

Photo credit: Clint Adair on Unsplash

  1. The rest of it is explained by async Rust still being a work in progress.
