A study of Rust micro-optimisations
While working on ethercrab I found myself down the rabbit hole of trying to reduce binary size, in this case for the thumbv7m-none-eabi architecture used in embedded. This post is a collection of examples of attempted code size reductions by refactoring sometimes the tiniest of functions. It will hopefully serve as a reference to both me and you, dear reader, in your travels through the Rust lands.
I don't have much knowledge of the deeper parts of the optimisation pipeline in either LLVM or Rust, so these experiments are probably super obvious to anyone with an inkling, but it was a good learning experience for me either way, so I thought I'd share.
1. Clamp length offsets to the buffer length
This is code that will quite happily panic, for what it's worth. The skip function is part of a larger set of, uh, "generator-combinator" functions (they generate stuff instead of parsing it like nom does) and will jump over a given number of bytes in the provided buffer, returning the rest.
// Original impl
// Naive attempt: a little improvement
//
// Important note: the actual asm compiles to the same 4 instructions as `skip` above, however it
// generates one less exception jump and one less string in the final binary.
// Clamp length: maybe now LLVM knows this won't panic?
//
// There are no more assertion checks so we're pretty much as small as we can get.
/// Slightly more instructions than `skip3` but maybe a little bit clearer if that matters to you.
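Sketched out, the four variants might look something like this. Only `skip3` is named in the surrounding text; the other names, signatures, and function bodies are my reconstruction of what the comments above describe, not the original code:

```rust
/// Original impl: slice indexing panics if `len` is out of range.
pub fn skip(len: usize, buf: &[u8]) -> &[u8] {
    &buf[len..]
}

/// Naive attempt: `split_at` still panics, but generates one less
/// exception jump and one less panic string in the final binary.
pub fn skip2(len: usize, buf: &[u8]) -> &[u8] {
    buf.split_at(len).1
}

/// Clamp length: the index can never exceed `buf.len()`, so LLVM can
/// drop the bounds-check panic entirely.
pub fn skip3(len: usize, buf: &[u8]) -> &[u8] {
    &buf[len.min(buf.len())..]
}

/// Slightly more instructions than `skip3`, but arguably clearer.
pub fn skip4(len: usize, buf: &[u8]) -> &[u8] {
    buf.get(len..).unwrap_or(&[])
}
```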
skip3 seems to be the best here. If the returned buffer length is zero, the other code that uses it will panic instead, so we've probably just moved the assertion check elsewhere in the wider program rather than removed it entirely.
2. Idiomatic method chaining is smarter than you think
Fewer lines really is faster!
use core::hint::unreachable_unchecked;
use core::convert::TryInto;
/// Original attempt at getting optimised output, with sad trombone bonus `unsafe` :(
/// Look at this nice API!
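The two functions might be sketched like this, assuming the task is parsing a little-endian `u32` from the front of a slice; the function names, integer width, and zero fallback are all my assumptions:

```rust
use core::convert::TryInto;
use core::hint::unreachable_unchecked;

/// Original attempt: length check up front, then an `unsafe` escape
/// hatch to convince LLVM the array conversion can't fail.
pub fn unpack_unsafe(buf: &[u8]) -> u32 {
    if buf.len() < 4 {
        return 0;
    }
    match buf[0..4].try_into() {
        Ok(bytes) => u32::from_le_bytes(bytes),
        // SAFETY: the length was checked above, so this is unreachable.
        Err(_) => unsafe { unreachable_unchecked() },
    }
}

/// The nice API: method chaining, no `unsafe` anywhere.
pub fn unpack_chained(buf: &[u8]) -> u32 {
    buf.get(0..4)
        .and_then(|bytes| bytes.try_into().ok())
        .map(u32::from_le_bytes)
        .unwrap_or(0)
}
```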
The two latter solutions produce identical assembly, so there's no need for `unsafe` here - the performance is already there.
There's also an in-between if you find a lot of chained methods hard to read, which is understandable:
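For instance, something along these lines, assuming the same task of parsing a little-endian `u32` from the front of a slice (the name and fallback value are mine):

```rust
use core::convert::TryInto;

/// An in-between: early returns instead of one long method chain.
pub fn unpack_inbetween(buf: &[u8]) -> u32 {
    let Some(bytes) = buf.get(0..4) else {
        return 0;
    };
    let Ok(bytes) = bytes.try_into() else {
        return 0;
    };
    u32::from_le_bytes(bytes)
}
```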
Again, identical assembly as the two above.
3. (currently) nightly: slice_first_last_chunk
This PR will soon stabilise some methods that make parsing integers from slices much more pleasant, but do they help with performance?
// Requires Rust nightly at time of writing
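A sketch of the comparison (`first_chunk` was behind the `slice_first_last_chunk` feature gate at the time of writing; the function names and the `u32` choice are my assumptions):

```rust
use core::convert::TryInto;

/// Using the new `first_chunk` method to grab a fixed-size prefix.
pub fn parse_first_chunk(buf: &[u8]) -> u32 {
    buf.first_chunk::<4>()
        .map(|bytes| u32::from_le_bytes(*bytes))
        .unwrap_or(0)
}

/// The pre-existing stable spelling with `get` + `try_into`.
pub fn parse_get(buf: &[u8]) -> u32 {
    buf.get(0..4)
        .and_then(|bytes| bytes.try_into().ok())
        .map(u32::from_le_bytes)
        .unwrap_or(0)
}
```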
The methods all generate the same assembly.
4. Another scenario
Alright, now I'm just impressed. The two following functions generate the same assembly for both x64 and thumbv7m-none-eabi.
Note that the first function is copied from the output of cargo expand - it's generated by a proc macro derive, and is therefore pretty terrible code. But the optimiser doesn't seem to care.
// https://github.com/rust-lang/rust/pull/117561
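As a hedged sketch of the shape of the two: the derive-expanded body below is my approximation rather than ethercrab's actual expansion, and only the `unpack_from_slice_new` name comes from the original text.

```rust
/// Approximation of proc macro derive output: verbose, with an
/// explicit length check and a manual copy into a temporary array.
pub fn unpack_from_slice(buf: &[u8]) -> Result<u16, ()> {
    if buf.len() < 2 {
        return Err(());
    }
    let mut bytes = [0u8; 2];
    bytes.copy_from_slice(&buf[0..2]);
    Ok(u16::from_le_bytes(bytes))
}

/// Hand-written version using the newly stabilised chunk helpers.
pub fn unpack_from_slice_new(buf: &[u8]) -> Result<u16, ()> {
    buf.first_chunk::<2>()
        .map(|bytes| u16::from_le_bytes(*bytes))
        .ok_or(())
}
```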
unpack_from_slice_new is prettier if that matters. Either way, it's nice to see that prettier code doesn't make for worse output.
5. Pointer offsets just turn into numbers
As you might expect, but it's nice to have the confirmation. The original code is an `unsafe` function that offsets a pointer into an EtherCAT frame.
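A sketch of the sort of function in question; the `ethercat_payload_ptr` name comes from the assembly, while the 14 + 2 header split is my assumption:

```rust
/// Skip the Ethernet header (14 bytes) and then the EtherCAT header
/// (2 bytes here, as an assumption) with two constant offsets.
pub unsafe fn ethercat_payload_ptr(frame: *const u8) -> *const u8 {
    // SAFETY: the caller must guarantee `frame` points to a buffer of
    // at least 16 bytes.
    unsafe { frame.byte_add(14).byte_add(2) }
}
```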
This just turns into:
example::ethercat_payload_ptr:
lea rax, [rdi + 16]
ret
Because both `byte_add`s are given constants.
6. Converting Duration into u64 nanoseconds isn't that bad
EtherCAT uses nanosecond time everywhere, stored as a u64. I'm using core::time::Duration for convenience, but wanted to know if there's much of a performance hit when doing some conversions.
Just converting a Duration to a u64:
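Something like this, with the function names taken from the assembly labels below; note that `Duration::as_nanos` returns a `u128`, hence the two variants:

```rust
use core::time::Duration;

/// Truncate the u128 nanosecond count down to u64.
pub fn nanos_u64(duration: Duration) -> u64 {
    duration.as_nanos() as u64
}

/// Keep the full u128 for comparison.
pub fn nanos_u128(duration: Duration) -> u128 {
    duration.as_nanos()
}
```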
The assembly for thumbv7 comes out as this:
nanos_u64:
movw r3, #51712
movt r3, #15258
muls r1, r3, r1
umlal r2, r1, r0, r3
mov r0, r2
bx lr
nanos_u128:
push {r4, r6, r7, lr}
add r7, sp, #8
movw r12, #51712
movs r4, #0
movt r12, #15258
mov.w lr, #0
umull r0, r3, r0, r12
umlal r3, r4, r1, r12
adds r0, r0, r2
adcs r1, r3, #0
adcs r2, r4, #0
adc r3, lr, #0
pop {r4, r6, r7, pc}
Closer to real life
The IRL code does a bit more maths. Let's see how that checks out.
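The real code isn't shown here, but as a stand-in, imagine something like this hypothetical cycle-timing calculation. The names and arithmetic are entirely my invention; the point is the `u128` division/remainder and the panicking arithmetic that `Duration` maths drags in:

```rust
use core::time::Duration;

/// Hypothetical example: time until the next sync pulse, given the
/// current time, a cycle period, and a fixed offset. `as_nanos`
/// returns u128, so the remainder is a u128 division, and the `-`/`+`
/// operators on Duration carry overflow checks.
pub fn time_to_next_cycle(now: Duration, cycle: Duration, offset: Duration) -> Duration {
    let phase = Duration::from_nanos((now.as_nanos() % cycle.as_nanos()) as u64);
    cycle - phase + offset
}
```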
171 lines of assembly. ROBLOX_OOF.mp3. Lots of divisions and overflow checks.
Optimising: make everything u64
The real code is library-internal, so we can just pass u64s around and convert to a Duration at the public boundary.
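A sketch of the shape this takes, using a hypothetical cycle-timing calculation in pure u64 nanoseconds (names and arithmetic are my invention, not the library's actual internals):

```rust
/// Hypothetical internal helper: all maths stays in u64 nanoseconds,
/// so there's no u128 division; Duration only appears at the public
/// boundary. The `%`, `-`, and `+` here can still panic or overflow.
pub fn time_to_next_cycle_ns(now_ns: u64, cycle_ns: u64, offset_ns: u64) -> u64 {
    let phase_ns = now_ns % cycle_ns;
    cycle_ns - phase_ns + offset_ns
}
```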
Much less assembly now with fewer branches to boot.
Optimising: checked/saturating operations
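Sketched with the same hypothetical cycle-timing shape as above, checked and saturating operations remove the panic paths entirely (the fallback values are my assumption):

```rust
/// Hypothetical internal helper, now panic-free: `checked_rem`
/// handles a zero cycle time, and the saturating ops clamp instead
/// of overflowing. Each one still compiles to a branch, though.
pub fn time_to_next_cycle_ns(now_ns: u64, cycle_ns: u64, offset_ns: u64) -> u64 {
    let phase_ns = now_ns.checked_rem(cycle_ns).unwrap_or(0);
    cycle_ns.saturating_sub(phase_ns).saturating_add(offset_ns)
}
```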
No more panics, which is nice, but still quite a bit of branching. I'll leave this here for now and revisit if performance becomes an issue on embedded systems.