In working on ethercrab I found myself down the rabbit hole of trying to reduce binary size, in this case for the thumbv7m-none-eabi architecture used in embedded. This post is a collection of examples of attempted code size reductions by refactoring sometimes the tiniest of functions. It will hopefully serve as a reference to both me and you, dear reader, in your travels through the Rust lands.
I don't have much knowledge of the deeper bits of the optimisation pipeline in either LLVM or Rust, so these experiments are probably super obvious to anyone with an inkling, but it was a good learning experience for me either way, so I thought I'd share.
🔗1. Clamp length offsets to the buffer length
This is code that will quite happily panic, for what it's worth. The skip function is part of a larger set of, uh, "generator-combinator" functions (they generate stuff instead of parsing it) and will jump over a given number of bytes in the provided buffer, returning the rest.
// Original impl
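A minimal sketch of what an indexing-based skip might look like - the name and the `(usize, &[u8]) -> &[u8]` signature are assumptions, but plain range indexing is the obvious first version, and it's where the panic machinery comes from:

```rust
// Sketch of an indexing-based `skip` (assumed signature). Slice indexing
// with a range inserts a bounds check that panics when `len > buf.len()`,
// pulling panic strings and landing pads into the binary.
fn skip(len: usize, buf: &[u8]) -> &[u8] {
    &buf[len..]
}
```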
// Naive attempt: a little improvement
// Important note: the actual asm compiles to the same 4 instructions as `skip` above, however it
// generates one less exception jump and one less string in the final binary.
// Clamp length: maybe now LLVM knows this won't panic?
// There are no more assertion checks so we're pretty much as small as we can get.
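The clamped version might look something like this (the body is a sketch; the key idea is that a clamped offset is always a valid split point, so the bounds check - and its panic path - disappears):

```rust
// Clamping the offset to the buffer length means the range is always in
// bounds, so LLVM can prove the index never panics and drop the check.
fn skip3(len: usize, buf: &[u8]) -> &[u8] {
    &buf[len.min(buf.len())..]
}
```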
/// Slightly more instructions than `skip3` but maybe a little bit clearer if that matters to you.
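A sketch of that clearer variant (the name `skip4` is mine): `get` makes the fallibility explicit and hands back an empty slice on overflow instead of relying on a clamped index.

```rust
// `get` returns `None` when the range is out of bounds; `unwrap_or`
// substitutes an empty slice. Reads more clearly, at the cost of a
// couple of extra instructions compared with the clamped version.
fn skip4(len: usize, buf: &[u8]) -> &[u8] {
    buf.get(len..).unwrap_or(&[])
}
```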
skip3 seems to be the best here. If the returned buffer length is zero, the other original code that uses it will panic instead, so we've probably just moved the assertion check elsewhere in the wider program rather than removed it entirely.
🔗2. Idiomatic method chaining is smarter than you think
Fewer lines really is faster!
/// Original attempt at getting optimised output, with sad trombone bonus `unsafe` :(
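The original function isn't shown, so here's a hypothetical shape of the `unsafe` approach - a hand-rolled length check followed by `get_unchecked` to skip the compiler's bounds checks, illustrated with taking a `u16` off the front of a buffer:

```rust
// Hypothetical `unsafe` version: check the length by hand, then use
// `get_unchecked` to avoid the compiler-inserted bounds checks.
fn take_u16_unsafe(buf: &[u8]) -> Option<(u16, &[u8])> {
    if buf.len() < 2 {
        return None;
    }

    // SAFETY: `buf` is at least 2 bytes long, checked above.
    let (raw, rest) = unsafe { (buf.get_unchecked(..2), buf.get_unchecked(2..)) };

    Some((u16::from_le_bytes(raw.try_into().unwrap()), rest))
}
```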
/// Look at this nice API!
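And a guess at the "nice API" version of the same thing - entirely safe, chained slice methods that the optimiser can see through just as well:

```rust
// Safe, chained version of the same operation: `get` handles the bounds
// check, and the optimiser elides the `unwrap` because `raw` is provably
// two bytes long.
fn take_u16(buf: &[u8]) -> Option<(u16, &[u8])> {
    buf.get(..2)
        .zip(buf.get(2..))
        .map(|(raw, rest)| (u16::from_le_bytes(raw.try_into().unwrap()), rest))
}
```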
The latter two solutions produce identical assembly, so there's no need for unsafe here - the performance is already there.
There's also an in-between if you find a lot of chained methods hard to read, which is understandable:
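A sketch of that in-between style, using the same hypothetical `u16`-from-buffer example: intermediate bindings and `?` instead of one long chain.

```rust
// Same logic as the chained version, unrolled into intermediate
// bindings with `?` for early return on short buffers.
fn take_u16_mid(buf: &[u8]) -> Option<(u16, &[u8])> {
    let raw = buf.get(..2)?;
    let rest = buf.get(2..)?;
    let value = u16::from_le_bytes(raw.try_into().ok()?);

    Some((value, rest))
}
```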
Again, the assembly is identical to the two above.
🔗3. (currently) nightly:
This PR will soon stabilise some methods that make parsing integers from slices much more pleasant, but do they help with performance?
// Requires Rust nightly at time of writing
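I'm assuming the methods in question are along the lines of the `first_chunk` family (nightly-only at the time; `split_first_chunk` has since been stabilised in Rust 1.77). They hand back fixed-size array references that `from_le_bytes` can consume without a fallible conversion:

```rust
// `split_first_chunk::<4>` returns `Option<(&[u8; 4], &[u8])>`, so the
// array reference feeds straight into `from_le_bytes` - no `try_into`,
// no unwrap, one bounds check.
fn take_u32_le(buf: &[u8]) -> Option<(u32, &[u8])> {
    let (raw, rest) = buf.split_first_chunk::<4>()?;

    Some((u32::from_le_bytes(*raw), rest))
}
```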
The methods all generate the same assembly.
Alright now I'm just impressed. The two following functions generate the same assembly both for x64
Note that the first function is copied from the output of cargo expand - it's generated by a proc macro derive and is therefore pretty terrible code. But the optimiser doesn't seem to care.
unpack_from_slice_new is prettier if that matters. Either way, it's nice to see that prettier code doesn't make for worse code.
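To show the shape such a function takes, here's a hypothetical wire-format struct and hand-rolled parser - the struct, its fields, and the parsing logic are all illustrative, not the real EtherCrab types:

```rust
// Hypothetical packet header, just to illustrate the hand-written style.
#[derive(Debug, PartialEq)]
struct Header {
    len: u16,
    flags: u8,
}

// Illustrative hand-written parser in the `unpack_from_slice_new` style:
// peel fixed-size pieces off the front of the buffer, failing cleanly on
// short input.
fn unpack_from_slice_new(buf: &[u8]) -> Option<Header> {
    let (raw_len, rest) = buf.split_first_chunk::<2>()?;
    let (&flags, _rest) = rest.split_first()?;

    Some(Header {
        len: u16::from_le_bytes(*raw_len),
        flags,
    })
}
```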