
A study of Rust micro-optimisations

In working on ethercrab I found myself down the rabbit hole of trying to reduce binary size, in this case for the thumbv7m-none-eabi architecture used in embedded. This post is a collection of examples of attempted code size reductions, sometimes by refactoring the tiniest of functions. It will hopefully serve as a reference to both me and you, dear reader, in your travels through the Rust lands.

I don't have much knowledge of the deeper bits of the optimisation pipeline in either LLVM or Rust, so these experiments are probably super obvious to anyone with an inkling, but it was a good learning experience for me either way so I thought I'd share.

1. Clamp length offsets to the buffer length

Godbolt

This is code that will quite happily panic, for what it's worth. The skip function is part of a larger set of, uh, "generator-combinator" functions (they generate stuff instead of parsing it like nom does) and will jump over a given number of bytes in the provided buffer, returning the rest.

// Original impl
pub fn skip(len: usize, buf: &mut [u8]) -> &mut [u8] {
    let (_, rest) = buf.split_at_mut(len);

    rest
}

// Naive attempt: a little improvement
//
// Important note: the actual asm compiles to the same 4 instructions as `skip` above, however it
// generates one less exception jump and one less string in the final binary.
pub fn skip2(len: usize, buf: &mut [u8]) -> &mut [u8] {
    &mut buf[len..]
}

// Clamp length: maybe now LLVM knows this won't panic?
//
// There are no more assertion checks so we're pretty much as small as we can get.
pub fn skip3(len: usize, buf: &mut [u8]) -> &mut [u8] {
    let len = len.min(buf.len());

    &mut buf[len..]
}

/// Slightly more instructions than `skip3` but maybe a little bit clearer if that matters to you.
pub fn skip4(len: usize, buf: &mut [u8]) -> &mut [u8] {
    if len >= buf.len() {
        return &mut buf[0..0];
    }

    &mut buf[len..]
}

skip3 seems to be the best here. If the returned buffer length is zero, the other code that uses it will panic instead, so we've probably just moved the assertion check elsewhere in the wider program rather than removed it entirely.
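For example, a hypothetical caller (not from ethercrab) that indexes into the returned slice just hits the bounds check there instead:

// Hypothetical downstream code: if `skip3` clamped the slice to empty
// because `len` was out of range, the indexing here panics instead - the
// check has moved rather than disappeared.
pub fn first_byte_after_skip(buf: &mut [u8]) -> u8 {
    let rest = skip3(4, buf);

    // Panics if `buf` had 4 or fewer bytes.
    rest[0]
}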

2. Idiomatic method chaining is smarter than you think

Godbolt

Fewer lines really is faster!

use core::hint::unreachable_unchecked;
use core::convert::TryInto;

/// Original attempt at getting optimised output, with sad trombone bonus `unsafe` :(
pub fn unpack_from_slice_naive(buf: &[u8]) -> Result<u32, ()> {
    if buf.len() < 4 {
        return Err(());
    }

    let arr = match buf[0..4].try_into() {
        Ok(arr) => arr,
        // SAFETY: We check the buffer size above
        Err(_) => unsafe { unreachable_unchecked() },
    };

    Ok(u32::from_le_bytes(arr))
}

/// Look at this nice API!
pub fn unpack_from_slice_pleasant(buf: &[u8]) -> Result<u32, ()> {
    buf.get(0..4)
        .ok_or(())
        .and_then(|raw| raw.try_into().map_err(|_| ()))
        .map(u32::from_le_bytes)
}

The two functions produce identical assembly, so there's no need for unsafe here - the performance is already there.

There's also an in-between if you find a lot of chained methods hard to read, which is understandable:

pub fn unpack_from_slice_in_between(buf: &[u8]) -> Result<u32, ()> {
    if buf.len() < 4 {
        return Err(());
    }

    buf[0..4].try_into().map(u32::from_le_bytes).map_err(|_| ())
}

Again, this produces assembly identical to the two above.
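If you want to check that the behaviour matches as well as the assembly, a quick test sketch (not part of the Godbolt link above, and run on the host where the test harness is available) against the three functions does the job:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn all_variants_agree() {
        let buf = [0x78, 0x56, 0x34, 0x12, 0xff];

        // Happy path: the first four bytes, little-endian
        assert_eq!(unpack_from_slice_naive(&buf), Ok(0x1234_5678));
        assert_eq!(unpack_from_slice_pleasant(&buf), Ok(0x1234_5678));
        assert_eq!(unpack_from_slice_in_between(&buf), Ok(0x1234_5678));

        // Too-short input: an Err(()) rather than a panic
        assert_eq!(unpack_from_slice_pleasant(&buf[..3]), Err(()));
    }
}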

3. (currently) nightly: slice_first_last_chunk

Godbolt

This PR (rust-lang/rust#117561) will soon stabilise some methods that make parsing integers from slices much more pleasant, but do they help with performance?

#![no_std]
// Requires Rust nightly at time of writing
#![feature(slice_first_last_chunk)]

pub fn unpack(buf: &[u8]) -> Result<u32, ()> {
    buf.get(0..4)
        .ok_or(())
        .map(|arr| u32::from_le_bytes(arr.try_into().unwrap()))
        .map_err(|_| ())
}

pub fn unpack_improved(buf: &[u8]) -> Result<u32, ()> {
    buf.get(0..4)
        .ok_or(())
        .and_then(|raw| raw.try_into().map_err(|_| ()))
        .map(u32::from_le_bytes)
}

pub fn unpack_new(buf: &[u8]) -> Result<u32, ()> {
    buf.first_chunk::<4>()
        .ok_or(())
        .map(|arr| u32::from_le_bytes(*arr))
        .map_err(|_| ())
}

The methods all generate the same assembly.

4. Another scenario

Godbolt

Alright, now I'm just impressed. The following two functions generate the same assembly for both x64 and thumbv7m-none-eabi.

Note that the first function is copied from the output of cargo expand - it's generated by a proc macro derive, so it's pretty terrible code. But the optimiser doesn't seem to care.

#![feature(slice_first_last_chunk)]
#![no_std]

// https://github.com/rust-lang/rust/pull/117561

pub struct Pdo {
    index: u16,
    num_entries: u8,
    sync_manager: u8,
    dc_sync: u8,
    name_string_idx: u8,
}

impl Pdo {
    pub fn unpack_from_slice_orig(buf: &[u8]) -> Result<Self, ()> {
        let buf = buf
            .get(0..6usize)
            .ok_or(())?;

        Ok(Self {
            index: u16::from_le_bytes(buf[0..2].try_into().unwrap()),
            num_entries: (buf[2usize] & 0b11111111) >> 0usize,
            sync_manager: (buf[3usize] & 0b11111111) >> 0usize,
            dc_sync: (buf[4usize] & 0b11111111) >> 0usize,
            name_string_idx: (buf[5usize] & 0b11111111) >> 0usize,
        })
    }

    pub fn unpack_from_slice_new(buf: &[u8]) -> Result<Self, ()> {
        let (index, buf) = buf.split_first_chunk::<2>().ok_or(())?;
        let (num_entries, buf) = buf.split_first_chunk::<1>().ok_or(())?;
        let (sync_manager, buf) = buf.split_first_chunk::<1>().ok_or(())?;
        let (dc_sync, buf) = buf.split_first_chunk::<1>().ok_or(())?;
        let (name_string_idx, buf) = buf.split_first_chunk::<1>().ok_or(())?;

        Ok(Self {
            index: u16::from_le_bytes(*index),
            num_entries: num_entries[0],
            sync_manager: sync_manager[0],
            dc_sync: dc_sync[0],
            name_string_idx: name_string_idx[0],
        })
    }
}

unpack_from_slice_new is prettier, if that matters. Either way, it's nice to see that prettier code doesn't compile down to worse code.
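As a usage sketch (the byte values here are made up for illustration, not a real PDO description), parsing a six byte buffer looks like this when run as a host-side test:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn parse_pdo() {
        // Made-up example bytes: a little-endian u16 index followed by four
        // single-byte fields (plus one trailing byte that gets ignored).
        let raw = [0x00, 0x1a, 0x02, 0x01, 0x00, 0x05, 0xff];

        let pdo = Pdo::unpack_from_slice_new(&raw).unwrap();

        assert_eq!(pdo.index, 0x1a00);
        assert_eq!(pdo.num_entries, 0x02);
        assert_eq!(pdo.name_string_idx, 0x05);
    }
}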

5. Pointer offsets just turn into numbers

Godbolt

As you might expect, but it's nice to have the confirmation. The original code is this:

unsafe fn ethercat_payload_ptr(this: NonNull<FrameElement<N>>) -> NonNull<u8> {
    NonNull::new_unchecked(
        Self::ptr(this)
            .as_ptr()
            .byte_add(EthernetFrame::<&[u8]>::header_len())
            .byte_add(EthercatFrameHeader::header_len()), // .byte_add(PduFrame::header_len()),
    )
}

On x86-64 this just turns into

example::ethercat_payload_ptr:
        lea     rax, [rdi + 16]
        ret

Because both byte_add calls are given constants, the two header lengths get folded into a single 16 byte offset at compile time.
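Here's a minimal self-contained sketch of the same idea. The 14 and 2 byte constants are assumptions standing in for EthernetFrame::header_len() and EthercatFrameHeader::header_len(); the real code gets them from those types.

use core::ptr::NonNull;

// Assumed header lengths for the sake of a self-contained example.
const ETHERNET_HEADER_LEN: usize = 14;
const ETHERCAT_HEADER_LEN: usize = 2;

/// # Safety
///
/// `frame` must point to an allocation at least 16 bytes long.
pub unsafe fn payload_ptr(frame: NonNull<u8>) -> NonNull<u8> {
    // Both offsets are compile time constants, so the optimiser folds them
    // into a single addition: `lea rax, [rdi + 16]` on x86-64.
    NonNull::new_unchecked(
        frame
            .as_ptr()
            .byte_add(ETHERNET_HEADER_LEN)
            .byte_add(ETHERCAT_HEADER_LEN),
    )
}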

6. Converting Duration into u64 nanoseconds isn't that bad

EtherCAT uses nanosecond time everywhere, stored as a u64. I'm using core::time::Duration for convenience, but wanted to know if there's much of a performance hit when doing some conversions.

Just converting a Duration to u64

Godbolt

#![no_std]

#[no_mangle]
pub fn nanos_u64(input: core::time::Duration) -> u64 {
    input.as_nanos() as u64
}

#[no_mangle]
pub fn nanos_u128(input: core::time::Duration) -> u128 {
    input.as_nanos()
}

The assembly for thumbv7 comes out as this:

nanos_u64:
        movw    r3, #51712
        movt    r3, #15258
        muls    r1, r3, r1
        umlal   r2, r1, r0, r3
        mov     r0, r2
        bx      lr

nanos_u128:
        push    {r4, r6, r7, lr}
        add     r7, sp, #8
        movw    r12, #51712
        movs    r4, #0
        movt    r12, #15258
        mov.w   lr, #0
        umull   r0, r3, r0, r12
        umlal   r3, r4, r1, r12
        adds    r0, r0, r2
        adcs    r1, r3, #0
        adcs    r2, r4, #0
        adc     r3, lr, #0
        pop     {r4, r6, r7, pc}
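Truncating to u64 keeps the conversion down to a short multiply-accumulate sequence; asking for the full u128 result costs the longer carry chain above.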

Closer to real life

Godbolt

The IRL code does a bit more maths. Let's see how that checks out.

use core::time::Duration;

pub struct Conf {
    sync0_period: Duration,
    sync0_shift: Duration,
}

pub fn nanos_irl(time: u64, conf: &Conf) -> Duration {
    let cycle_start_offset =
        Duration::from_nanos(time % conf.sync0_period.as_nanos() as u64);

    let time_to_next_iter =
        conf.sync0_period + (conf.sync0_shift - cycle_start_offset);

    time_to_next_iter
}

171 lines of assembly. ROBLOX_OOF.mp3. Lots of divisions and overflow checks.
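Most of that comes from Duration's internal (seconds, nanoseconds) bookkeeping: as_nanos multiplies out to a u128, from_nanos divides back down into seconds plus subsecond nanoseconds, and the Duration add and subtract both carry panic paths for overflow.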

Optimising: make everything u64

Godbolt

The real code is library-internal, so we can just pass u64s around and convert to a Duration at the public boundary.

use core::time::Duration;

pub struct Conf {
    sync0_period: u64,
    sync0_shift: u64,
}

#[no_mangle]
pub fn nanos_irl(time: u64, conf: &Conf) -> Duration {
    let cycle_start_offset = time % conf.sync0_period;

    let time_to_next_iter = conf.sync0_period + (conf.sync0_shift - cycle_start_offset);

    Duration::from_nanos(time_to_next_iter)
}

Much less assembly now, with fewer branches to boot.
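The conversion back to Duration then only happens once, at the public edge of the library. A hypothetical wrapper (name and signature made up for illustration) might look like:

// Hypothetical public entry point: callers think in Duration, the internal
// maths stays in u64 nanoseconds.
pub fn time_to_next_iter(now: Duration, conf: &Conf) -> Duration {
    nanos_irl(now.as_nanos() as u64, conf)
}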

Optimising: checked/saturating operations

Godbolt

#[no_mangle]
pub fn nanos_irl(time: u64, conf: &Conf) -> Duration {
    let cycle_start_offset = time
        .checked_rem(conf.sync0_period)
        .unwrap_or(0);

    let time_to_next_iter = conf.sync0_period.saturating_add(
        conf.sync0_shift.saturating_sub(cycle_start_offset)
    );

    Duration::from_nanos(time_to_next_iter)
}

No more panics, which is nice, but there's still quite a bit of branching. I'll leave this here for now and revisit if performance becomes an issue on embedded systems.