« Posts list

A study of Rust micro-optimisations

In working on ethercrab I found myself down the rabbit hole of trying to reduce binary size, in this case for the thumbv7m-none-eabi architecture used in embedded. This post is a collection of examples of attempted code size reductions by refactoring sometimes the most tiny of functions. It will hopefully serve as a reference to both me and you, dear reader, in your travels through the Rust lands.

I don't have much knowledge of the deeper bits of the optimisation pipeline in both LLVM or Rust, so these experiments are probably super obvious to anyone with an inkling, but it was a good learning experience for me either way so I thought I'd share.

1. Clamp length offsets to the buffer length

Godbolt

This is code that will quite happily panic for what it's worth. The skip function is part of a larger set of uh "generator-combinator" functions (generate stuff instead of parsing it like nom) and will jump over a given number of bytes in the provided buffer, returning the rest.

// Original impl
pub fn skip(len: usize, buf: &mut [u8]) -> &mut [u8] {
    let (_, rest) = buf.split_at_mut(len);

    rest
}

// Naive attempt: a little improvement
//
// Important note: the actual asm compiles to the same 4 instructions as `skip` above, however it
// generates one less exception jump and one less string in the final binary.
pub fn skip2(len: usize, buf: &mut [u8]) -> &mut [u8] {
    &mut buf[len..]
}

// Clamp length: maybe now LLVM knows this won't panic?
//
// There are no more assertion checks so we're pretty much as small as we can get.
pub fn skip3(len: usize, buf: &mut [u8]) -> &mut [u8] {
    let len = len.min(buf.len());

    &mut buf[len..]
}

/// Slightly more instructions than `skip3` but maybe a little bit clearer if that matters to you.
pub fn skip4(len: usize, buf: &mut [u8]) -> &mut [u8] {
    if len >= buf.len() {
        return &mut buf[0..0]
    }

    &mut buf[len..]
}

skip3 seems to be the best here. If the returned buffer length is zero, the other original code that uses it will panic instead so we've probably just moved the assertion check in the wide program than removed it entirely.

2. Idiomatic method chaining is smarter than you think

Godbolt

Fewer lines really is faster!

use core::hint::unreachable_unchecked;
use core::convert::TryInto;

/// Original attempt at getting optimised output, with sad trombone bonus `unsafe`` :(
pub fn unpack_from_slice_naive(buf: &[u8]) -> Result<u32, ()> {
    if buf.len() < 4 {
        return Err(());
    }

    let arr = match buf[0..4].try_into() {
        Ok(arr) => arr,
        // SAFETY: We check the buffer size above
        Err(_) => unsafe { unreachable_unchecked() },
    };

    Ok(u32::from_le_bytes(arr))
}

/// Look at this nice API!
pub fn unpack_from_slice_pleasant(buf: &[u8]) -> Result<u32, ()> {
    buf.get(0..4)
        .ok_or(())
        .and_then(|raw| raw.try_into().map_err(|_| ()))
        .map(u32::from_le_bytes)
}

The two latter solutions produce identical assembly, so there's no need for unsafe here - the performance is already there.

There's also an in-between if you find a lot of chained methods hard to read, which is understandable:

pub fn unpack_from_slice_naive(buf: &[u8]) -> Result<u32, ()> {
    if buf.len() < 4 {
        return Err(());
    }

    buf[0..4].try_into().map(u32::from_le_bytes).map_err(|_| ())
}

Again, identical assembly as the two above.

3. (currently) nightly: slice_first_last_chunk

Godbolt

This PR will soon stabilise some methods that make parsing integers from slices much more pleasant, but do they help with performance?

#![no_std]
// Requires Rust nightly at time of writing
#![feature(slice_first_last_chunk)]

pub fn unpack(buf: &[u8]) -> Result<u32, ()> {
    buf.get(0..4)
        .ok_or(())
        .map(|arr| u32::from_le_bytes(arr.try_into().unwrap()))
        .map_err(|_| ())
}

pub fn unpack_improved(buf: &[u8]) -> Result<u32, ()> {
    buf.get(0..4)
        .ok_or(())
        .and_then(|raw| raw.try_into().map_err(|_| ()))
        .map(u32::from_le_bytes)
}

pub fn unpack_new(buf: &[u8]) -> Result<u32, ()> {
    buf.first_chunk::<4>()
        .ok_or(())
        .map(|arr| u32::from_le_bytes(*arr))
        .map_err(|_| ())
}

The methods all generate the same assembly.

4. Another scenario

Godbolt

Alright now I'm just impressed. The two following functions generate the same assembly both for x64 and thumbv7m-none-eabi.

Note that the first function is copied from the output of cargo expand - it's generated by a proc macro derive, therefore is pretty terrible code. But the optimiser doesn't seem to care.

#![feature(slice_first_last_chunk)]
#![no_std]

// https://github.com/rust-lang/rust/pull/117561

pub struct Pdo {
    index: u16,
    num_entries: u8,
    sync_manager: u8,
    dc_sync: u8,
    name_string_idx: u8,
}

impl Pdo {
    pub fn unpack_from_slice_orig(buf: &[u8]) -> Result<Self, ()> {
        let buf = buf
            .get(0..6usize)
            .ok_or(())?;

        Ok(Self {
            index: u16::from_le_bytes(buf[0..2].try_into().unwrap()),
            num_entries: (buf[2usize] & 0b11111111) >> 0usize,
            sync_manager: (buf[3usize] & 0b11111111) >> 0usize,
            dc_sync: (buf[4usize] & 0b11111111) >> 0usize,
            name_string_idx: (buf[5usize] & 0b11111111) >> 0usize,
        })
    }

    pub fn unpack_from_slice_new(buf: &[u8]) -> Result<Self, ()> {
        let (index, buf) = buf.split_first_chunk::<2>().ok_or(())?;
        let (num_entries, buf) = buf.split_first_chunk::<1>().ok_or(())?;
        let (sync_manager, buf) = buf.split_first_chunk::<1>().ok_or(())?;
        let (dc_sync, buf) = buf.split_first_chunk::<1>().ok_or(())?;
        let (name_string_idx, buf) = buf.split_first_chunk::<1>().ok_or(())?;

        Ok(Self {
            index: u16::from_le_bytes(*index),
            num_entries: num_entries[0],
            sync_manager: sync_manager[0],
            dc_sync: dc_sync[0],
            name_string_idx: name_string_idx[0],
        })
    }
}

unpack_from_slice_new is prettier if that matters. Either way it's nice to see that prettier code doesn't make for worse code.

4. Pointer offsets just turn into numbers

Godbolt

As you might expect, but it's nice to have the confirmation. Original code is this:

unsafe fn ethercat_payload_ptr(this: NonNull<FrameElement<N>>) -> NonNull<u8> {
    NonNull::new_unchecked(
        Self::ptr(this)
            .as_ptr()
            .byte_add(EthernetFrame::<&[u8]>::header_len())
            .byte_add(EthercatFrameHeader::header_len()), // .byte_add(PduFrame::header_len()),
    )
}

This just turns into

example::ethercat_payload_ptr:
        lea     rax, [rdi + 16]
        ret

Because both byte_adds are given constants.

5. Converting Duration into u64 nanoseconds isn't that bad

EtherCAT uses nanosecond time everywhere, stored as a u64. I'm using core::time::Duration for convenience, but wanted to know if there's much of a performance hit when doing some conversions.

Just converting a Duration to u64

Godbolt

#![no_std]

#[no_mangle]
pub fn nanos_u64(input: core::time::Duration) -> u64 {
    input.as_nanos() as u64
}

#[no_mangle]
pub fn nanos_u128(input: core::time::Duration) -> u128 {
    input.as_nanos()
}

The assembly for thumbv7 comes out as this:

nanos_u64:
        movw    r3, #51712
        movt    r3, #15258
        muls    r1, r3, r1
        umlal   r2, r1, r0, r3
        mov     r0, r2
        bx      lr

nanos_u128:
        push    {r4, r6, r7, lr}
        add     r7, sp, #8
        movw    r12, #51712
        movs    r4, #0
        movt    r12, #15258
        mov.w   lr, #0
        umull   r0, r3, r0, r12
        umlal   r3, r4, r1, r12
        adds    r0, r0, r2
        adcs    r1, r3, #0
        adcs    r2, r4, #0
        adc     r3, lr, #0
        pop     {r4, r6, r7, pc}

Closer to real life

Godbolt

The IRL code does a bit more maths. Let's see how that checks out.

pub struct Conf {
    sync0_period: Duration,
    sync0_shift: Duration,
}

pub fn nanos_irl(time: u64, conf: &Conf) -> Duration {
    let cycle_start_offset =
        Duration::from_nanos(time % conf.sync0_period.as_nanos() as u64);

    let time_to_next_iter =
        conf.sync0_period + (conf.sync0_shift - cycle_start_offset);

    time_to_next_iter
}

171 lines of assembly. ROBLOX_OOF.mp3. Lots of divisions and overflow checks.

Optimising: make everything u64

Godbolt

The real code is library-internal, so we can just pass u64s around and convert to a Duration at the public boundary.

pub struct Conf {
    sync0_period: u64,
    sync0_shift: u64,
}

#[no_mangle]
pub fn nanos_irl(time: u64, conf: &Conf) -> Duration {
    let cycle_start_offset = time % conf.sync0_period;

    let time_to_next_iter = conf.sync0_period + (conf.sync0_shift - cycle_start_offset);

    Duration::from_nanos(time_to_next_iter)
}

Much less assembly now with fewer branches to boot.

Optimising: checked/saturating operations

Godbolt

#[no_mangle]
pub fn nanos_irl(time: u64, conf: &Conf) -> Duration {
    let cycle_start_offset = time
        .checked_rem(conf.sync0_period)
        .unwrap_or(0);

    let time_to_next_iter = conf.sync0_period.saturating_add(
        conf.sync0_shift.saturating_sub(cycle_start_offset)
    );

    Duration::from_nanos(time_to_next_iter)
}

No more panics which is nice, but still quite a bit of branching. I'll leave this here for now and revisit if performance becomes an issue on embedded systems.

6. Packing two u16s into a [u8; 4]

Godbolt

Quite a specific one this, but I need to pack a device address plus register pair of u16s into an array so it can be written into a network packet buffer.

Original code:

pub fn pack1(address: u16, register: u16) -> [u8; 4] {
    let address = address.to_le_bytes();
    let register = register.to_le_bytes();

    [address[0], address[1], register[0], register[1]]
}

And the simpler version:

pub fn pack2(address: u16, register: u16) -> [u8; 4] {
    let res = (u32::from(register) << 16) + u32::from(address);

    res.to_le_bytes()
}

ARM (--target thumbv7m-none-eabi) assembly is pleasantly reduced for pack2:

pack1:
        mov.w   r2, #-16777216
        and.w   r2, r2, r1, lsl #16
        uxtb    r1, r1
        orr.w   r1, r2, r1, lsl #16
        and     r2, r0, #65280
        add     r1, r2
        uxtb    r0, r0
        add     r0, r1
        bx      lr

pack2:
        uxth    r0, r0
        orr.w   r0, r0, r1, lsl #16
        bx      lr

As is the x86_64 which is to be expected:

pack1:
        movzx   ecx, sil
        shl     esi, 16
        and     esi, -16777216
        shl     ecx, 16
        or      ecx, esi
        movzx   eax, dil
        and     edi, 65280
        or      edi, ecx
        or      eax, edi
        ret

pack2:
        shl     esi, 16
        movzx   eax, di
        or      eax, esi
        ret

7. Returning 1 or -1 from a boolean comparison

Godbolt

I needed this for some embedded-graphics stuff where I'm changing the sign of another variable based on a condition.

Easy to read:

pub fn f(a: i32, b: i32) -> i32 {
    if a == -b {
        1
    } else {
        -1
    }
}

Job Security Edition:

pub fn f(a: i32, b: i32) -> i32 {
    i32::from((a == -b) as u8 * 2 -1)
}

The assembly for both versions is similar:

; Easy to read
f:
        xor     eax, eax
        add     edi, esi
        sete    al
        lea     eax, [2*rax - 1]
        ret

; Job Security Edition
f:
        add     edi, esi
        mov     ecx, 1
        mov     eax, 255
        cmove   eax, ecx
        ret

Bonus round: thumbv6 and thumbv7

Godbolt

embedded-graphics is designed to run on tiny microcontrollers, not x86_64, so let's see if there's much difference down in the low power stuff:

thumbv6m-none-eabi:

; Easy to read
f:
        cmn     r0, r1
        beq     .LBB0_2
        movs    r0, #0
        mvns    r0, r0
        bx      lr
.LBB0_2:
        movs    r0, #1
        bx      lr

; Job Security Edition
f:
        cmn     r0, r1
        beq     .LBB0_2
        movs    r0, #255
        bx      lr
.LBB0_2:
        movs    r0, #1
        bx      lr

Not much difference, although it's interesting there's a jump in there.

thumbv7-none-eabi:

; Easy to read
f:
        mov.w   r2, #-1
        cmn     r0, r1
        it      eq
        moveq   r2, #1
        mov     r0, r2
        bx      lr

; Job Security Edition
f:
        movs    r2, #255
        cmn     r0, r1
        it      eq
        moveq   r2, #1
        mov     r0, r2
        bx      lr

The only difference here is mov.w r2,#-1 turns into a movs r2,#255.

I much prefer the "easy to read" variant as it's obvious what the code is doing. The performance hit is nonexistent.