A study of Rust micro-optimisations
In working on ethercrab
I found myself down the rabbit hole
of trying to reduce binary size, in this case for the thumbv7m-none-eabi
architecture used in
embedded. This post is a collection of examples of attempted code size reductions by refactoring
sometimes the most tiny of functions. It will hopefully serve as a reference to both me and you,
dear reader, in your travels through the Rust lands.
I don't have much knowledge of the deeper bits of the optimisation pipeline in both LLVM or Rust, so these experiments are probably super obvious to anyone with an inkling, but it was a good learning experience for me either way so I thought I'd share.
1. Clamp length offsets to the buffer length
This is code that will quite happily panic for what it's worth. The skip
function is part of a
larger set of uh "generator-combinator" functions (generate stuff instead of parsing it like nom
)
and will jump over a given number of bytes in the provided buffer, returning the rest.
// Original impl
// Naive attempt: a little improvement
//
// Important note: the actual asm compiles to the same 4 instructions as `skip` above, however it
// generates one less exception jump and one less string in the final binary.
// Clamp length: maybe now LLVM knows this won't panic?
//
// There are no more assertion checks so we're pretty much as small as we can get.
/// Slightly more instructions than `skip3` but maybe a little bit clearer if that matters to you.
skip3
seems to be the best here. If the returned buffer length is zero, the other original code
that uses it will panic instead so we've probably just moved the assertion check in the wide program
than removed it entirely.
2. Idiomatic method chaining is smarter than you think
Fewer lines really is faster!
use unreachable_unchecked;
use TryInto;
/// Original attempt at getting optimised output, with sad trombone bonus `unsafe`` :(
/// Look at this nice API!
The two latter solutions produce identical assembly, so there's no need for unsafe
here - the
performance is already there.
There's also an in-between if you find a lot of chained methods hard to read, which is understandable:
Again, identical assembly as the two above.
3. (currently) nightly: slice_first_last_chunk
This PR will soon stabilise some methods that make parsing integers from slices much more pleasant, but do they help with performance?
// Requires Rust nightly at time of writing
The methods all generate the same assembly.
4. Another scenario
Alright now I'm just impressed. The two following functions generate the same assembly both for x64
and thumbv7m-none-eabi
.
Note that the first function is copied from the output of cargo expand
- it's generated by a proc
macro derive, therefore is pretty terrible code. But the optimiser doesn't seem to care.
// https://github.com/rust-lang/rust/pull/117561
unpack_from_slice_new
is prettier if that matters. Either way it's nice to see that prettier code
doesn't make for worse code.
4. Pointer offsets just turn into numbers
As you might expect, but it's nice to have the confirmation. Original code is this:
unsafe
This just turns into
example::ethercat_payload_ptr:
lea rax, [rdi + 16]
ret
Because both byte_add
s are given constants.
5. Converting Duration
into u64
nanoseconds isn't that bad
EtherCAT uses nanosecond time everywhere, stored as a u64
. I'm using
core::time::Duration
for convenience, but wanted to know if there's much of a performance hit when
doing some conversions.
Just converting a Duration
to u64
The assembly for thumbv7
comes out as this:
nanos_u64:
movw r3, #51712
movt r3, #15258
muls r1, r3, r1
umlal r2, r1, r0, r3
mov r0, r2
bx lr
nanos_u128:
push {r4, r6, r7, lr}
add r7, sp, #8
movw r12, #51712
movs r4, #0
movt r12, #15258
mov.w lr, #0
umull r0, r3, r0, r12
umlal r3, r4, r1, r12
adds r0, r0, r2
adcs r1, r3, #0
adcs r2, r4, #0
adc r3, lr, #0
pop {r4, r6, r7, pc}
Closer to real life
The IRL code does a bit more maths. Let's see how that checks out.
171 lines of assembly. ROBLOX_OOF.mp3
. Lots of divisions and overflow checks.
Optimising: make everything u64
The real code is library-internal, so we can just pass u64
s around and convert to a Duration
at
the public boundary.
Much less assembly now with fewer branches to boot.
Optimising: checked/saturating operations
No more panics which is nice, but still quite a bit of branching. I'll leave this here for now and revisit if performance becomes an issue on embedded systems.
6. Packing two u16
s into a [u8; 4]
Quite a specific one this, but I need to pack a device address plus register pair of u16
s into an
array so it can be written into a network packet buffer.
Original code:
And the simpler version:
ARM (--target thumbv7m-none-eabi
) assembly is pleasantly reduced for pack2
:
pack1:
mov.w r2, #-16777216
and.w r2, r2, r1, lsl #16
uxtb r1, r1
orr.w r1, r2, r1, lsl #16
and r2, r0, #65280
add r1, r2
uxtb r0, r0
add r0, r1
bx lr
pack2:
uxth r0, r0
orr.w r0, r0, r1, lsl #16
bx lr
As is the x86_64
which is to be expected:
pack1:
movzx ecx, sil
shl esi, 16
and esi, -16777216
shl ecx, 16
or ecx, esi
movzx eax, dil
and edi, 65280
or edi, ecx
or eax, edi
ret
pack2:
shl esi, 16
movzx eax, di
or eax, esi
ret
7. Returning 1
or -1
from a boolean comparison
I needed this for some embedded-graphics
stuff where I'm changing the sign of another variable
based on a condition.
Easy to read:
Job Security Edition:
The assembly for both versions is similar:
; Easy to read
f:
xor eax, eax
add edi, esi
sete al
lea eax, [2*rax - 1]
ret
; Job Security Edition
f:
add edi, esi
mov ecx, 1
mov eax, 255
cmove eax, ecx
ret
Bonus round: thumbv6
and thumbv7
embedded-graphics
is designed to run on tiny microcontrollers, not x86_64, so let's see if there's
much difference down in the low power stuff:
thumbv6m-none-eabi
:
; Easy to read
f:
cmn r0, r1
beq .LBB0_2
movs r0, #0
mvns r0, r0
bx lr
.LBB0_2:
movs r0, #1
bx lr
; Job Security Edition
f:
cmn r0, r1
beq .LBB0_2
movs r0, #255
bx lr
.LBB0_2:
movs r0, #1
bx lr
Not much difference, although it's interesting there's a jump in there.
thumbv7-none-eabi
:
; Easy to read
f:
mov.w r2, #-1
cmn r0, r1
it eq
moveq r2, #1
mov r0, r2
bx lr
; Job Security Edition
f:
movs r2, #255
cmn r0, r1
it eq
moveq r2, #1
mov r0, r2
bx lr
The only difference here is mov.w r2,#-1
turns into a movs r2,#255
.
I much prefer the "easy to read" variant as it's obvious what the code is doing. The performance hit is nonexistent.