
A study of Rust micro-optimisations

In working on ethercrab I found myself down the rabbit hole of trying to reduce binary size, in this case for the thumbv7m-none-eabi target used in embedded development. This post is a collection of attempted code size reductions made by refactoring sometimes the tiniest of functions. It will hopefully serve as a reference to both me and you, dear reader, in your travels through the Rust lands.

I don't have much knowledge of the deeper parts of the optimisation pipeline in either LLVM or Rust, so these experiments are probably super obvious to anyone with an inkling, but it was a good learning experience for me either way so I thought I'd share.

1. Clamp length offsets to the buffer length

Godbolt

This is code that will quite happily panic, for what it's worth. The skip function is part of a larger set of, uh, "generator-combinator" functions (they generate bytes instead of parsing them, like nom in reverse) and will jump over a given number of bytes in the provided buffer, returning the rest.

// Original impl
pub fn skip(len: usize, buf: &mut [u8]) -> &mut [u8] {
    let (_, rest) = buf.split_at_mut(len);

    rest
}

// Naive attempt: a little improvement
//
// Important note: the actual asm compiles to the same 4 instructions as `skip` above, however it
// generates one less exception jump and one less string in the final binary.
pub fn skip2(len: usize, buf: &mut [u8]) -> &mut [u8] {
    &mut buf[len..]
}

// Clamp length: maybe now LLVM knows this won't panic?
//
// There are no more assertion checks so we're pretty much as small as we can get.
pub fn skip3(len: usize, buf: &mut [u8]) -> &mut [u8] {
    let len = len.min(buf.len());

    &mut buf[len..]
}

/// Slightly more instructions than `skip3` but maybe a little bit clearer if that matters to you.
pub fn skip4(len: usize, buf: &mut [u8]) -> &mut [u8] {
    if len >= buf.len() {
        return &mut buf[0..0];
    }

    &mut buf[len..]
}

skip3 seems to be the best here. If the returned buffer length is zero, the original code that uses it will panic instead, so we've probably just moved the assertion check elsewhere in the wider program rather than removed it entirely.
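As a hypothetical illustration of that point (read_after_skip is my own sketch, not code from ethercrab): a caller that indexes into the clamped, possibly empty slice will just panic at the index instead.

// Hypothetical caller, not from the original crate. `skip3` clamps `len` so it
// can't panic itself, but indexing into the possibly-empty result still can.
pub fn read_after_skip(len: usize, buf: &mut [u8]) -> u8 {
    let rest = skip3(len, buf);

    // When `len >= buf.len()`, `rest` is empty and this index panics: the
    // bounds check has moved to the caller rather than disappearing.
    rest[0]
}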

2. Idiomatic method chaining is smarter than you think

Godbolt

Fewer lines really can be just as fast!

use core::hint::unreachable_unchecked;
use core::convert::TryInto;

/// Original attempt at getting optimised output, with sad trombone bonus `unsafe` :(
pub fn unpack_from_slice_naive(buf: &[u8]) -> Result<u32, ()> {
    if buf.len() < 4 {
        return Err(());
    }

    let arr = match buf[0..4].try_into() {
        Ok(arr) => arr,
        // SAFETY: We check the buffer size above
        Err(_) => unsafe { unreachable_unchecked() },
    };

    Ok(u32::from_le_bytes(arr))
}

/// Look at this nice API!
pub fn unpack_from_slice_pleasant(buf: &[u8]) -> Result<u32, ()> {
    buf.get(0..4)
        .ok_or(())
        .and_then(|raw| raw.try_into().map_err(|_| ()))
        .map(u32::from_le_bytes)
}

Both functions produce identical assembly, so there's no need for unsafe here: the performance is already there.

There's also an in-between if you find a lot of chained methods hard to read, which is understandable:

pub fn unpack_from_slice_in_between(buf: &[u8]) -> Result<u32, ()> {
    if buf.len() < 4 {
        return Err(());
    }

    buf[0..4].try_into().map(u32::from_le_bytes).map_err(|_| ())
}

Again, this produces identical assembly to the two above.
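If you want to convince yourself the behaviour matches too, here's a quick sanity check (the example bytes are mine, not from the original comparison):

#[test]
fn unpack_sanity_check() {
    // Little-endian parse of the first four bytes; trailing bytes are ignored.
    assert_eq!(unpack_from_slice_pleasant(&[0x01, 0x00, 0x00, 0x00, 0xFF]), Ok(1));

    // Buffers shorter than four bytes are rejected instead of panicking.
    assert_eq!(unpack_from_slice_pleasant(&[0x01, 0x00]), Err(()));
}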

3. (currently) nightly: slice_first_last_chunk

Godbolt

This PR (rust-lang/rust#117561) will soon stabilise some methods that make parsing integers from slices much more pleasant, but do they help with performance?

#![no_std]
// Requires Rust nightly at time of writing
#![feature(slice_first_last_chunk)]

pub fn unpack(buf: &[u8]) -> Result<u32, ()> {
    buf.get(0..4)
        .ok_or(())
        .map(|arr| u32::from_le_bytes(arr.try_into().unwrap()))
        .map_err(|_| ())
}

pub fn unpack_improved(buf: &[u8]) -> Result<u32, ()> {
    buf.get(0..4)
        .ok_or(())
        .and_then(|raw| raw.try_into().map_err(|_| ()))
        .map(u32::from_le_bytes)
}

pub fn unpack_new(buf: &[u8]) -> Result<u32, ()> {
    buf.first_chunk::<4>()
        .ok_or(())
        .map(|arr| u32::from_le_bytes(*arr))
        .map_err(|_| ())
}

All three methods generate the same assembly.
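For completeness, a quick hypothetical usage sketch (the example values are mine):

#[test]
fn unpack_new_roundtrip() {
    // 0xDEADBEEF stored little-endian.
    assert_eq!(unpack_new(&[0xEF, 0xBE, 0xAD, 0xDE]), Ok(0xDEADBEEF));

    // Too-short buffers fail cleanly instead of panicking.
    assert_eq!(unpack_new(&[0xEF, 0xBE]), Err(()));
}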

Another scenario

Godbolt

Alright, now I'm just impressed. The following two functions generate the same assembly for both x64 and thumbv7m-none-eabi.

Note that the first function is copied from the output of cargo expand: it's generated by a proc macro derive, so it's pretty terrible code. But the optimiser doesn't seem to care.

#![feature(slice_first_last_chunk)]
#![no_std]

// https://github.com/rust-lang/rust/pull/117561

pub struct Pdo {
    index: u16,
    num_entries: u8,
    sync_manager: u8,
    dc_sync: u8,
    name_string_idx: u8,
}

impl Pdo {
    pub fn unpack_from_slice_orig(buf: &[u8]) -> Result<Self, ()> {
        let buf = buf
            .get(0..6usize)
            .ok_or(())?;

        Ok(Self {
            index: u16::from_le_bytes(buf[0..2].try_into().unwrap()),
            num_entries: (buf[2usize] & 0b11111111) >> 0usize,
            sync_manager: (buf[3usize] & 0b11111111) >> 0usize,
            dc_sync: (buf[4usize] & 0b11111111) >> 0usize,
            name_string_idx: (buf[5usize] & 0b11111111) >> 0usize,
        })
    }

    pub fn unpack_from_slice_new(buf: &[u8]) -> Result<Self, ()> {
        let (index, buf) = buf.split_first_chunk::<2>().ok_or(())?;
        let (num_entries, buf) = buf.split_first_chunk::<1>().ok_or(())?;
        let (sync_manager, buf) = buf.split_first_chunk::<1>().ok_or(())?;
        let (dc_sync, buf) = buf.split_first_chunk::<1>().ok_or(())?;
        let (name_string_idx, buf) = buf.split_first_chunk::<1>().ok_or(())?;

        Ok(Self {
            index: u16::from_le_bytes(*index),
            num_entries: num_entries[0],
            sync_manager: sync_manager[0],
            dc_sync: dc_sync[0],
            name_string_idx: name_string_idx[0],
        })
    }
}

unpack_from_slice_new is prettier, if that matters. Either way, it's nice to see that prettier source doesn't make for worse compiled code.
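As a parting sketch, here's a hypothetical round trip through the new parser (the bytes and the test are mine, and the field access assumes we're inside the same module since the fields are private):

#[test]
fn pdo_unpack() {
    // Hypothetical 6 byte PDO description: a u16 index followed by four u8 fields.
    let raw = [0x00, 0x1A, 0x04, 0x02, 0x01, 0x09];

    let pdo = Pdo::unpack_from_slice_new(&raw).unwrap();

    assert_eq!(pdo.index, 0x1A00);
    assert_eq!(pdo.num_entries, 0x04);
    assert_eq!(pdo.sync_manager, 0x02);
    assert_eq!(pdo.dc_sync, 0x01);
    assert_eq!(pdo.name_string_idx, 0x09);
}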