A study of Rust micro-optimisations
While working on ethercrab I found myself down the rabbit hole of trying to reduce binary size, in
this case for the thumbv7m-none-eabi target used in embedded development. This post is a collection
of examples of attempted code size reductions, sometimes by refactoring the tiniest of functions. It
will hopefully serve as a reference to both me and you, dear reader, in your travels through the
Rust lands.
I don't have much knowledge of the deeper parts of the optimisation pipeline in either LLVM or rustc, so these experiments are probably super obvious to anyone with an inkling, but it was a good learning experience for me either way, so I thought I'd share.
1. Clamp length offsets to the buffer length
This is code that will quite happily panic, for what it's worth. The `skip` function is part of a
larger set of, uh, "generator-combinator" functions (they generate bytes instead of parsing them,
like nom in reverse) and will jump over a given number of bytes in the provided buffer, returning
the rest.
// Original impl
// Naive attempt: a little improvement
//
// Important note: the actual asm compiles to the same 4 instructions as `skip` above, however it
// generates one less exception jump and one less string in the final binary.
// Clamp length: maybe now LLVM knows this won't panic?
//
// There are no more assertion checks so we're pretty much as small as we can get.
/// Slightly more instructions than `skip3` but maybe a little bit clearer if that matters to you.
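As a rough sketch of the variants being compared (the signatures, and the `skip2`/`skip4` names, are assumptions on my part):

```rust
/// Original impl: plain range indexing, which panics with a formatted
/// message if `len > buf.len()`.
fn skip(buf: &mut [u8], len: usize) -> &mut [u8] {
    &mut buf[len..]
}

/// Naive attempt: still panics, but via `Option::unwrap` with a simpler
/// panic message.
fn skip2(buf: &mut [u8], len: usize) -> &mut [u8] {
    buf.get_mut(len..).unwrap()
}

/// Clamp the offset so the index can never be out of bounds.
fn skip3(buf: &mut [u8], len: usize) -> &mut [u8] {
    let len = len.min(buf.len());
    &mut buf[len..]
}

/// Same clamping idea via `split_at_mut`, maybe a little clearer.
fn skip4(buf: &mut [u8], len: usize) -> &mut [u8] {
    buf.split_at_mut(len.min(buf.len())).1
}
```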
`skip3` seems to be the best here. If the returned buffer length is zero, the other code that uses
it will panic instead, so we've probably just moved the assertion check elsewhere in the wider
program rather than removed it entirely.
2. Idiomatic method chaining is smarter than you think
Fewer lines really is faster!
use core::hint::unreachable_unchecked;
use core::convert::TryInto;
/// Original attempt at getting optimised output, with sad trombone bonus `unsafe` :(
/// Look at this nice API!
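A sketch of the kind of code being compared. The parsing task (a little-endian u32 off the front of a slice) and all the function names here are assumptions:

```rust
use core::convert::TryInto;
use core::hint::unreachable_unchecked;

/// Original attempt: tell LLVM the error branch is impossible.
fn parse_unsafe(buf: &[u8]) -> Option<u32> {
    let chunk = buf.get(0..4)?;
    let chunk: [u8; 4] = match chunk.try_into() {
        Ok(c) => c,
        // SAFETY: `chunk` is exactly 4 bytes long, so the conversion
        // cannot fail.
        Err(_) => unsafe { unreachable_unchecked() },
    };
    Some(u32::from_le_bytes(chunk))
}

/// Safe version: let the compiler figure it out.
fn parse_safe(buf: &[u8]) -> Option<u32> {
    let chunk: [u8; 4] = buf.get(0..4)?.try_into().ok()?;
    Some(u32::from_le_bytes(chunk))
}

/// The nice API: one chain of combinators.
fn parse_chained(buf: &[u8]) -> Option<u32> {
    buf.get(0..4)
        .and_then(|chunk| chunk.try_into().ok())
        .map(u32::from_le_bytes)
}
```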
The two latter solutions produce identical assembly, so there's no need for `unsafe` here: the
performance is already there.
There's also an in-between if you find a lot of chained methods hard to read, which is understandable:
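Something along these lines (again, the name and the exact task are assumed):

```rust
/// Same logic as the chained version, but one step per line using `?`.
fn parse_stepwise(buf: &[u8]) -> Option<u32> {
    let chunk = buf.get(0..4)?;
    let chunk: [u8; 4] = chunk.try_into().ok()?;
    Some(u32::from_le_bytes(chunk))
}
```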
Again, identical assembly as the two above.
3. Currently nightly: `slice_first_last_chunk`
This PR (rust-lang/rust#117561) will soon stabilise some methods that make parsing integers from slices much more pleasant, but do they help with performance?
// Requires Rust nightly at time of writing
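For example, reading a u32 off the front of a slice (a sketch; the function names are mine, and the chunk methods have since been stabilised):

```rust
/// Pre-existing stable-slice approach, for comparison.
fn first_u32(buf: &[u8]) -> Option<u32> {
    buf.get(0..4)
        .and_then(|c| <[u8; 4]>::try_from(c).ok())
        .map(u32::from_le_bytes)
}

/// With `first_chunk`, nightly when this was written.
fn first_u32_chunk(buf: &[u8]) -> Option<u32> {
    buf.first_chunk::<4>().copied().map(u32::from_le_bytes)
}

/// `split_first_chunk` also hands back the rest of the slice.
fn split_u32(buf: &[u8]) -> Option<(u32, &[u8])> {
    let (chunk, rest) = buf.split_first_chunk::<4>()?;
    Some((u32::from_le_bytes(*chunk), rest))
}
```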
The methods all generate the same assembly.
4. Another scenario
Alright, now I'm just impressed. The two following functions generate the same assembly for both
x64 and thumbv7m-none-eabi.
Note that the first function is copied from the output of cargo expand: it's generated by a proc
macro derive and is therefore pretty terrible code. But the optimiser doesn't seem to care.
// https://github.com/rust-lang/rust/pull/117561
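A sketch of the comparison. The expanded shape and field layout are guesses; only the `unpack_from_slice_new` name comes from the text:

```rust
/// Roughly the shape a derive might expand to: a length check followed
/// by element-by-element indexed copies.
fn unpack_from_slice_old(buf: &[u8]) -> Result<u32, ()> {
    if buf.len() < 4 {
        return Err(());
    }
    let mut raw = [0u8; 4];
    raw[0] = buf[0];
    raw[1] = buf[1];
    raw[2] = buf[2];
    raw[3] = buf[3];
    Ok(u32::from_le_bytes(raw))
}

/// Handwritten version using the first-chunk methods from the PR above.
fn unpack_from_slice_new(buf: &[u8]) -> Result<u32, ()> {
    let (raw, _rest) = buf.split_first_chunk::<4>().ok_or(())?;
    Ok(u32::from_le_bytes(*raw))
}
```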
`unpack_from_slice_new` is prettier, if that matters. Either way, it's nice to see that prettier
source doesn't make for worse machine code.
5. Pointer offsets just turn into numbers
As you might expect, but it's nice to have the confirmation. Original code is this:
fn ethercat_payload_ptr(frame: *const u8) -> *const u8 {
    // Hypothetical reconstruction; the constant offsets are chosen to
    // match the assembly below, e.g. Ethernet header (14 bytes) then
    // EtherCAT header (2 bytes).
    unsafe { frame.byte_add(14).byte_add(2) }
}
This just turns into
example::ethercat_payload_ptr:
lea rax, [rdi + 16]
ret
This is because both `byte_add` calls are given constant offsets.
6. Converting Duration into u64 nanoseconds isn't that bad
EtherCAT uses nanosecond time everywhere, stored as a u64. I'm using
core::time::Duration for convenience, but wanted to know if there's much of a performance hit when
doing some conversions.
Just converting a Duration to u64
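Something like this (a sketch, assuming the conversion goes through `as_nanos`):

```rust
use core::time::Duration;

/// Truncating conversion; `as_nanos` returns u128, which only overflows
/// u64 after roughly 584 years of nanoseconds.
fn nanos_u64(time: Duration) -> u64 {
    time.as_nanos() as u64
}

/// Full-width version for comparison.
fn nanos_u128(time: Duration) -> u128 {
    time.as_nanos()
}
```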
The assembly for thumbv7 comes out as this:
nanos_u64:
movw r3, #51712
movt r3, #15258
muls r1, r3, r1
umlal r2, r1, r0, r3
mov r0, r2
bx lr
nanos_u128:
push {r4, r6, r7, lr}
add r7, sp, #8
movw r12, #51712
movs r4, #0
movt r12, #15258
mov.w lr, #0
umull r0, r3, r0, r12
umlal r3, r4, r1, r12
adds r0, r0, r2
adcs r1, r3, #0
adcs r2, r4, #0
adc r3, lr, #0
pop {r4, r6, r7, pc}
Closer to real life
The IRL code does a bit more maths. Let's see how that checks out.
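The real code isn't shown here, but as a purely hypothetical stand-in, imagine snapping a timestamp to the start of its cycle with `Duration` maths:

```rust
use core::time::Duration;

/// Hypothetical stand-in for library-internal maths: snap `now` to the
/// start of its cycle. The division and multiplication happen in u128,
/// and `from_nanos` plus the `as u64` cast add more checks on top.
fn cycle_start(now: Duration, cycle: Duration) -> Duration {
    let cycles = now.as_nanos() / cycle.as_nanos();
    Duration::from_nanos((cycles * cycle.as_nanos()) as u64)
}
```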
171 lines of assembly. ROBLOX_OOF.mp3. Lots of divisions and overflow checks.
Optimising: make everything u64
The real code is library-internal, so we can just pass u64s around and convert to a Duration at
the public boundary.
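A hypothetical example of the same kind of maths with plain u64 nanoseconds:

```rust
/// Same cycle-snapping calculation, but all in u64: no 128-bit
/// arithmetic and no range checks at a `Duration` boundary.
fn cycle_start_ns(now_ns: u64, cycle_ns: u64) -> u64 {
    (now_ns / cycle_ns) * cycle_ns
}
```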
Much less assembly now with fewer branches to boot.
Optimising: checked/saturating operations
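A hypothetical example using checked operations in place of the panicking ones:

```rust
/// Same calculation again, trading the panic paths (division by zero,
/// overflow checks) for `None` handled with a clamped fallback.
fn cycle_start_ns_checked(now_ns: u64, cycle_ns: u64) -> u64 {
    now_ns
        .checked_div(cycle_ns)
        .and_then(|cycles| cycles.checked_mul(cycle_ns))
        .unwrap_or(0)
}
```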
No more panics, which is nice, but still quite a bit of branching. I'll leave this here for now and revisit if performance becomes an issue on embedded systems.
7. Packing two u16s into a [u8; 4]
Quite a specific one this, but I need to pack a device address plus register pair of u16s into an
array so it can be written into a network packet buffer.
Original code:
And the simpler version:
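Sketched out, the two variants might look like this (the names and byte order are assumptions, chosen to line up with the assembly below):

```rust
/// Original-style code: build the array byte by byte.
fn pack1(address: u16, register: u16) -> [u8; 4] {
    let addr = address.to_le_bytes();
    let reg = register.to_le_bytes();
    [addr[0], addr[1], reg[0], reg[1]]
}

/// Simpler version: combine into one u32, then split into bytes.
fn pack2(address: u16, register: u16) -> [u8; 4] {
    (u32::from(address) | (u32::from(register) << 16)).to_le_bytes()
}
```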
ARM (--target thumbv7m-none-eabi) assembly is pleasantly reduced for pack2:
pack1:
mov.w r2, #-16777216
and.w r2, r2, r1, lsl #16
uxtb r1, r1
orr.w r1, r2, r1, lsl #16
and r2, r0, #65280
add r1, r2
uxtb r0, r0
add r0, r1
bx lr
pack2:
uxth r0, r0
orr.w r0, r0, r1, lsl #16
bx lr
As is the x86_64 version, which is to be expected:
pack1:
movzx ecx, sil
shl esi, 16
and esi, -16777216
shl ecx, 16
or ecx, esi
movzx eax, dil
and edi, 65280
or edi, ecx
or eax, edi
ret
pack2:
shl esi, 16
movzx eax, di
or eax, esi
ret
8. Returning 1 or -1 from a boolean comparison
I needed this for some embedded-graphics stuff where I'm changing the sign of another variable
based on a condition.
Easy to read:
Job Security Edition:
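A reconstruction of the general idea (the function bodies are assumed; only the ±1-from-a-comparison behaviour comes from the text):

```rust
/// Easy to read: a plain `if`.
fn sign_easy(a: i32, b: i32) -> i8 {
    if a.wrapping_add(b) == 0 {
        1
    } else {
        -1
    }
}

/// Job Security Edition: map the bool to 0 or 2, then shift down to
/// -1 or 1 with wrapping arithmetic.
fn sign_tricky(a: i32, b: i32) -> i8 {
    ((a.wrapping_add(b) == 0) as u8).wrapping_mul(2).wrapping_sub(1) as i8
}
```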
The assembly for both versions is similar:
; Easy to read
f:
xor eax, eax
add edi, esi
sete al
lea eax, [2*rax - 1]
ret
; Job Security Edition
f:
add edi, esi
mov ecx, 1
mov eax, 255
cmove eax, ecx
ret
Bonus round: thumbv6 and thumbv7
embedded-graphics is designed to run on tiny microcontrollers, not x86_64, so let's see if there's
much difference down in the low power stuff:
thumbv6m-none-eabi:
; Easy to read
f:
cmn r0, r1
beq .LBB0_2
movs r0, #0
mvns r0, r0
bx lr
.LBB0_2:
movs r0, #1
bx lr
; Job Security Edition
f:
cmn r0, r1
beq .LBB0_2
movs r0, #255
bx lr
.LBB0_2:
movs r0, #1
bx lr
Not much difference, although it's interesting that there's a branch in there.
thumbv7-none-eabi:
; Easy to read
f:
mov.w r2, #-1
cmn r0, r1
it eq
moveq r2, #1
mov r0, r2
bx lr
; Job Security Edition
f:
movs r2, #255
cmn r0, r1
it eq
moveq r2, #1
mov r0, r2
bx lr
The only difference here is that `mov.w r2, #-1` turns into `movs r2, #255`.
I much prefer the "easy to read" variant as it's obvious what the code is doing. The performance hit is nonexistent.
9. Checking if a variable is one of several other values
This is a check to see whether `[...].contains()` generates the same code as a naive chain of `if`s.
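A sketch of the two versions (the argument layout, four candidates plus the needle, is assumed from the assembly):

```rust
/// Idiomatic: build a temporary array and use `contains`.
fn is_one_of_contains(a: u32, b: u32, c: u32, d: u32, needle: u32) -> bool {
    [a, b, c, d].contains(&needle)
}

/// Naive: compare against each candidate in turn.
fn is_one_of_naive(a: u32, b: u32, c: u32, d: u32, needle: u32) -> bool {
    needle == a || needle == b || needle == c || needle == d
}
```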
The assembly is nearly identical, but uses registers in a different order, which I thought was odd.
is_one_of:
cmp edi, r8d
sete al
cmp esi, r8d
sete sil
or sil, al
cmp edx, r8d
sete dl
cmp ecx, r8d
sete al
or al, dl
or al, sil
ret
is_one_of:
cmp r8d, edi
sete al
cmp r8d, esi
sete sil
or sil, al
cmp r8d, edx
sete dl
cmp r8d, ecx
sete al
or al, dl
or al, sil
ret