Optimizing rust

After the post about optimization, Kostya and many commenters (me included) discussed a bit about if there are better ways to optimize that loop without using unsafe code.

Kostya provided me with a test function and multiple implementations from him and I polished and benchmarked the whole thing.

The code

I put the code in a simple project, initially it was a simple main.rs and then it grew a little.

All it started with this function:

pub fn recombine_plane_reference(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    let mut idx0 = 0;
    let mut idx1 = w / 2;
    let mut idx2 = (h / 2) * sstride;
    let mut idx3 = idx2 + idx1;
    let mut oidx0 = 0;
    let mut oidx1 = dstride;

    for _ in 0..(h / 2) {
        for x in 0..(w / 2) {
            let p0 = src[idx0 + x];
            let p1 = src[idx1 + x];
            let p2 = src[idx2 + x];
            let p3 = src[idx3 + x];
            let s0 = p0.wrapping_add(p2);
            let d0 = p0.wrapping_sub(p2);
            let s1 = p1.wrapping_add(p3);
            let d1 = p1.wrapping_sub(p3);
            let o0 = s0.wrapping_add(s1).wrapping_add(2);
            let o1 = d0.wrapping_add(d1).wrapping_add(2);
            let o2 = s0.wrapping_sub(s1).wrapping_add(2);
            let o3 = d0.wrapping_sub(d1).wrapping_add(2);
            dst[oidx0 + x * 2 + 0] = clip8(o0.wrapping_shr(2).wrapping_add(128));
            dst[oidx0 + x * 2 + 1] = clip8(o1.wrapping_shr(2).wrapping_add(128));
            dst[oidx1 + x * 2 + 0] = clip8(o2.wrapping_shr(2).wrapping_add(128));
            dst[oidx1 + x * 2 + 1] = clip8(o3.wrapping_shr(2).wrapping_add(128));
        }
        idx0 += sstride;
        idx1 += sstride;
        idx2 += sstride;
        idx3 += sstride;
        oidx0 += dstride * 2;
        oidx1 += dstride * 2;
    }
}

Benchmark

Kostya used perf to measure the number of samples it takes over a large number of iterations, I wanted to make the benchmark a little more portable so I used the time::PreciseTime Rust provides to measure something a little more coarse, but good enough for our purposes.

We want to see if rewriting the loop using unsafe pointers or using high level iterators provides a decent speedup, no need to be overly precise.

NB: I decided to not use the bencher utility provided with nightly rust to make the code even easier to use.

+fn benchme<F>(name: &str, n: usize, mut f: F)
+    where F : FnMut() {
+    let start = PreciseTime::now();
+    for _ in 0..n {
+        f();
+    }
+    let end = PreciseTime::now();
+    println!("Runtime {} {}", name, start.to(end));
+}

# cargo run --release

Unsafe code

Both me and Kostya have a C background so for him (and for me), was sort of natural embracing unsafe {} and use the raw pointers like we are used to.

pub fn recombine_plane_unsafe(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    unsafe {
        let hw = (w / 2) as isize;
        let mut band0 = src.as_ptr();
        let mut band1 = band0.offset(hw);
        let mut band2 = band0.offset(((h / 2) * sstride) as isize);
        let mut band3 = band2.offset(hw);
        let mut dst0 = dst.as_mut_ptr();
        let mut dst1 = dst0.offset(dstride as isize);
        let hh = (h / 2) as isize;
        for _ in 0..hh {
            let mut b0_ptr = band0;
            let mut b1_ptr = band1;
            let mut b2_ptr = band2;
            let mut b3_ptr = band3;
            let mut d0_ptr = dst0;
            let mut d1_ptr = dst1;
            for _ in 0..hw {
                let p0 = *b0_ptr;
                let p1 = *b1_ptr;
                let p2 = *b2_ptr;
                let p3 = *b3_ptr;
                let s0 = p0.wrapping_add(p2);
                let s1 = p1.wrapping_add(p3);
                let d0 = p0.wrapping_sub(p2);
                let d1 = p1.wrapping_sub(p3);
                let o0 = s0.wrapping_add(s1).wrapping_add(2);
                let o1 = d0.wrapping_add(d1).wrapping_add(2);
                let o2 = s0.wrapping_sub(s1).wrapping_add(2);
                let o3 = d0.wrapping_sub(d1).wrapping_add(2);
                *d0_ptr.offset(0) = clip8((o0 >> 2).wrapping_add(128));
                *d0_ptr.offset(1) = clip8((o1 >> 2).wrapping_add(128));
                *d1_ptr.offset(0) = clip8((o2 >> 2).wrapping_add(128));
                *d1_ptr.offset(1) = clip8((o3 >> 2).wrapping_add(128));
                b0_ptr = b0_ptr.offset(1);
                b1_ptr = b1_ptr.offset(1);
                b2_ptr = b2_ptr.offset(1);
                b3_ptr = b3_ptr.offset(1);
                d0_ptr = d0_ptr.offset(2);
                d1_ptr = d1_ptr.offset(2);
            }
            band0 = band0.offset(sstride as isize);
            band1 = band1.offset(sstride as isize);
            band2 = band2.offset(sstride as isize);
            band3 = band3.offset(sstride as isize);
            dst0 = dst0.offset((dstride * 2) as isize);
            dst1 = dst1.offset((dstride * 2) as isize);
        }
    }
}

The function is faster than baseline:

    Runtime reference   PT1.598052169S
    Runtime unsafe      PT1.222646190S

Explicit upcasts

Kostya noticed that telling rust to use i32 instead of i16 gave some performance boost.

    Runtime reference       PT1.601846926S
    Runtime reference 32bit PT1.371876242S
    Runtime unsafe          PT1.223115917S
    Runtime unsafe 32bit    PT1.124667021S

I’ll keep variants between i16 and i32 to see when it is important and when it is not.

Note: Making code generic over primitive types is currently pretty painful and hopefully will be fixed in the future.

High level abstractions

Most of the comments to Kostya’s original post were about leveraging the high level abstractions to make the compiler understand the code better.

Use Iterators

Rust is able to omit the bound checks if there is a warranty that the code cannot go out of the array boundaries. Using Iterators instead of for loops over an external variables should do the trick.

Use `Chunks`

chunks and chunks_mut take a slice and provides a nice iterator that gets you at-most-N-sized pieces of the input slice.

Since that the code works by line it is sort of natural to use it.

Use `split_at`

split_at and split_at_mut get you independent slices, even mutable. The code is writing two lines at time so having the ability to access mutably two regions of the frame is a boon.

The (read-only) input is divided in bands and the output produced is 2 lines at time. split_at is much better than using hand-made slicing and
split_at_mut is perfect to write at the same time the even and the odd line.

All together

pub fn recombine_plane_chunks_32(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    let hw = w / 2;
    let hh = h / 2;
    let (src1, src2) = src.split_at(sstride * hh);
    let mut src1i = src1.chunks(sstride);
    let mut src2i = src2.chunks(sstride);
    let mut dstch = dst.chunks_mut(dstride * 2);
    for _ in 0..hh {
        let s1 = src1i.next().unwrap();
        let s2 = src2i.next().unwrap();
        let mut d = dstch.next().unwrap();
        let (mut d0, mut d1) = d.split_at_mut(dstride);
        let (b0, b1) = s1.split_at(hw);
        let (b2, b3) = s2.split_at(hw);
        let mut di0 = d0.iter_mut();
        let mut di1 = d1.iter_mut();
        let mut bi0 = b0.iter();
        let mut bi1 = b1.iter();
        let mut bi2 = b2.iter();
        let mut bi3 = b3.iter();
        for _ in 0..hw {
            let p0 = bi0.next().unwrap();
            let p1 = bi1.next().unwrap();
            let p2 = bi2.next().unwrap();
            let p3 = bi3.next().unwrap();
            recombine_core_32(*p0, *p1, *p2, *p3, &mut di0, &mut di1);
        }
    }
}

It is a good improvement over the reference baseline, but still not as fast as unsafe.

    Runtime reference       PT1.621158410S
    Runtime reference 32bit PT1.467441931S
    Runtime unsafe          PT1.226046003S
    Runtime unsafe 32bit    PT1.126615305S
    Runtime chunks          PT1.349947181S
    Runtime chunks 32bit    PT1.350027322S

Use of `zip` or `izip`

Using next().unwrap() feels clumsy and force the iterator to be explicitly mutable. The loop can be written in a nicer way using the system provided zip and the itertools-provided izip.

zip works fine for 2 iterators, then you start piling up (so, (many, (tuples, (that, (feels, lisp))))) (or (feels (lisp, '(so, many, tuples))) according to a reader). izip flattens the result so it is sort of nicers.

pub fn recombine_plane_zip_16(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    let hw = w / 2;
    let hh = h / 2;
    let (src1, src2) = src.split_at(sstride * hh);
    let src1i = src1.chunks(sstride);
    let src2i = src2.chunks(sstride);
    let mut dstch = dst.chunks_mut(dstride * 2);
    for (s1, s2) in src1i.zip(src2i) {
        let mut d = dstch.next().unwrap();
        let (mut d0, mut d1) = d.split_at_mut(dstride);
        let (b0, b1) = s1.split_at(hw);
        let (b2, b3) = s2.split_at(hw);
        let mut di0 = d0.iter_mut();
        let mut di1 = d1.iter_mut();
        let iterband = b0.iter().zip(b1.iter().zip(b2.iter().zip(b3.iter())));
        for (p0, (p1, (p2, p3))) in iterband {
            recombine_core_16(*p0, *p1, *p2, *p3, &mut di0, &mut di1);
        }
    }
}

How they would fare?

    Runtime reference        PT1.614962959S
    Runtime reference 32bit  PT1.369636641S
    Runtime unsafe           PT1.223157417S
    Runtime unsafe 32bit     PT1.125534521S
    Runtime chunks           PT1.350069795S
    Runtime chunks 32bit     PT1.381841742S
    Runtime zip              PT1.249227707S
    Runtime zip 32bit        PT1.094282423S
    Runtime izip             PT1.366320546S
    Runtime izip 32bit       PT1.208708213S

Pretty well.

Looks like izip is a little more wasteful than zip currently, so looks like we have a winner 🙂

Conclusions

Compared to common imperative programming patterns, using the high level abstractions does lead to a nice speedup: use iterators when you can!
Not all the abstractions cost zero, zip made the overall code faster while izip lead to a speed regression.
Do benchmark your time critical code. nightly has some facility for it BUT it is not great for micro-benchmarks.

Overall I’m enjoying a lot writing code in Rust.

2 thoughts on “Optimizing rust”

Dave says:

August 17, 2017 at 3:03 pm

I wonder how this compares to a C version of the code. Might be hard to compare if recombine_core_16 is non-trivial.

1. lu_zero says:
  
  August 17, 2017 at 7:45 pm
  
  It depends, the discussion went further on reddit and enabling lto changes quite a bit the scenario, rustc with lto enabled generates way-faster-than-clang code for the unsafe variant. I hope to find the time (and the will) to dig into this further later.