# Optimizing Rust

After the post about optimization, Kostya and many commenters (me included) discussed whether there are better ways to optimize that loop without resorting to unsafe code.

Kostya provided a test function and multiple implementations of his own, and I polished and benchmarked the whole thing.

## The code

I put the code in a simple project; initially it was a single `main.rs`, and then it grew a little.

It all started with this function:

```rust
pub fn recombine_plane_reference(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    let mut idx0 = 0;
    let mut idx1 = w / 2;
    let mut idx2 = (h / 2) * sstride;
    let mut idx3 = idx2 + idx1;
    let mut oidx0 = 0;
    let mut oidx1 = dstride;

    for _ in 0..(h / 2) {
        for x in 0..(w / 2) {
            let p0 = src[idx0 + x];
            let p1 = src[idx1 + x];
            let p2 = src[idx2 + x];
            let p3 = src[idx3 + x];
            // Recombine the four band samples into a 2x2 block of pixels.
            let s0 = p0.wrapping_add(p2);
            let d0 = p0.wrapping_sub(p2);
            let s1 = p1.wrapping_add(p3);
            let d1 = p1.wrapping_sub(p3);
            let o0 = s0.wrapping_add(s1).wrapping_add(2);
            let o1 = s0.wrapping_sub(s1).wrapping_add(2);
            let o2 = d0.wrapping_add(d1).wrapping_add(2);
            let o3 = d0.wrapping_sub(d1).wrapping_add(2);
            dst[oidx0 + x * 2 + 0] = clip8(o0.wrapping_shr(2).wrapping_add(128));
            dst[oidx0 + x * 2 + 1] = clip8(o1.wrapping_shr(2).wrapping_add(128));
            dst[oidx1 + x * 2 + 0] = clip8(o2.wrapping_shr(2).wrapping_add(128));
            dst[oidx1 + x * 2 + 1] = clip8(o3.wrapping_shr(2).wrapping_add(128));
        }
        idx0 += sstride;
        idx1 += sstride;
        idx2 += sstride;
        idx3 += sstride;
        oidx0 += dstride * 2;
        oidx1 += dstride * 2;
    }
}
```
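The function calls a `clip8` helper that the post does not show. Judging by the call sites, it presumably saturates the shifted value into the byte range; a minimal sketch (my assumption, not the original helper) could be:

```rust
// Hypothetical reconstruction: clip8 is not shown in the post, but the
// call sites suggest it clamps an i16 value into the 0..=255 range.
fn clip8(v: i16) -> u8 {
    v.max(0).min(255) as u8
}
```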

## Benchmark

Kostya used perf to measure the number of samples taken over a large number of iterations. I wanted to make the benchmark a little more portable, so I used the `PreciseTime` type from the `time` crate to measure something a little more coarse, but good enough for our purposes.

We want to see whether rewriting the loop using unsafe pointers or high-level iterators provides a decent speedup; no need to be overly precise.

NB: I decided not to use the bencher utility provided with nightly Rust, to make the code even easier to use.

```rust
use time::PreciseTime;

fn benchme<F>(name: &str, n: usize, mut f: F)
where
    F: FnMut(),
{
    let start = PreciseTime::now();
    for _ in 0..n {
        f();
    }
    let end = PreciseTime::now();
    println!("Runtime {} {}", name, start.to(end));
}
```
```sh
# cargo run --release
```
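For reference, the same helper can be written against the standard library alone; this is an assumed equivalent using `std::time::Instant` instead of the `time` crate, if you would rather avoid the extra dependency:

```rust
use std::time::Instant;

// Std-only variant of the benchmark helper: Instant gives the same
// coarse wall-clock measurement without any external crate.
fn benchme<F>(name: &str, n: usize, mut f: F)
where
    F: FnMut(),
{
    let start = Instant::now();
    for _ in 0..n {
        f();
    }
    println!("Runtime {} {:?}", name, start.elapsed());
}
```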

## Unsafe code

Both Kostya and I have a C background, so it felt natural for both of us to embrace `unsafe {}` and use raw pointers the way we are used to.

```rust
pub fn recombine_plane_unsafe(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    unsafe {
        let hw = (w / 2) as isize;
        let mut band0 = src.as_ptr();
        let mut band1 = band0.offset(hw);
        let mut band2 = band0.offset(((h / 2) * sstride) as isize);
        let mut band3 = band2.offset(hw);
        let mut dst0 = dst.as_mut_ptr();
        let mut dst1 = dst0.offset(dstride as isize);
        let hh = (h / 2) as isize;
        for _ in 0..hh {
            let mut b0_ptr = band0;
            let mut b1_ptr = band1;
            let mut b2_ptr = band2;
            let mut b3_ptr = band3;
            let mut d0_ptr = dst0;
            let mut d1_ptr = dst1;
            for _ in 0..hw {
                let p0 = *b0_ptr;
                let p1 = *b1_ptr;
                let p2 = *b2_ptr;
                let p3 = *b3_ptr;
                let s0 = p0.wrapping_add(p2);
                let d0 = p0.wrapping_sub(p2);
                let s1 = p1.wrapping_add(p3);
                let d1 = p1.wrapping_sub(p3);
                let o0 = s0.wrapping_add(s1).wrapping_add(2);
                let o1 = s0.wrapping_sub(s1).wrapping_add(2);
                let o2 = d0.wrapping_add(d1).wrapping_add(2);
                let o3 = d0.wrapping_sub(d1).wrapping_add(2);
                *d0_ptr = clip8(o0.wrapping_shr(2).wrapping_add(128));
                *d0_ptr.offset(1) = clip8(o1.wrapping_shr(2).wrapping_add(128));
                *d1_ptr = clip8(o2.wrapping_shr(2).wrapping_add(128));
                *d1_ptr.offset(1) = clip8(o3.wrapping_shr(2).wrapping_add(128));
                b0_ptr = b0_ptr.offset(1);
                b1_ptr = b1_ptr.offset(1);
                b2_ptr = b2_ptr.offset(1);
                b3_ptr = b3_ptr.offset(1);
                d0_ptr = d0_ptr.offset(2);
                d1_ptr = d1_ptr.offset(2);
            }
            band0 = band0.offset(sstride as isize);
            band1 = band1.offset(sstride as isize);
            band2 = band2.offset(sstride as isize);
            band3 = band3.offset(sstride as isize);
            dst0 = dst0.offset((dstride * 2) as isize);
            dst1 = dst1.offset((dstride * 2) as isize);
        }
    }
}
```

The function is faster than baseline:

```
Runtime reference   PT1.598052169S
Runtime unsafe      PT1.222646190S
```

### Explicit upcasts

Kostya noticed that telling Rust to use i32 instead of i16 gave a performance boost.

```
Runtime reference       PT1.601846926S
Runtime reference 32bit PT1.371876242S
Runtime unsafe          PT1.223115917S
Runtime unsafe 32bit    PT1.124667021S
```

I’ll keep both the i16 and i32 variants around to see when the difference matters and when it does not.

Note: making code generic over primitive types is currently pretty painful; hopefully this will improve in the future.
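To illustrate the principle with a toy pair (hypothetical helpers, not from the post): widening to i32 up front lets the arithmetic proceed without per-operation wraparound handling, narrowing only once at the end.

```rust
// Hypothetical i16 vs i32 variants of a 4-sample average, just to show
// the shape of the explicit upcast; neither function is from the post.
fn avg4_16(p: [i16; 4]) -> i16 {
    p[0].wrapping_add(p[1])
        .wrapping_add(p[2])
        .wrapping_add(p[3])
        .wrapping_shr(2)
}

fn avg4_32(p: [i16; 4]) -> i16 {
    // Widen once, compute in i32, narrow once at the end.
    ((p[0] as i32 + p[1] as i32 + p[2] as i32 + p[3] as i32) >> 2) as i16
}
```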

## High level abstractions

Most of the comments to Kostya’s original post were about leveraging the high level abstractions to make the compiler understand the code better.

### Use Iterators

Rust is able to omit the bounds checks if there is a guarantee that the code cannot go outside the array boundaries. Using iterators instead of for loops over externally managed indices should do the trick.
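A toy pair to illustrate the point (not from the post): the indexed form relies on the optimizer to prove or hoist the bounds checks, while the iterator form cannot go out of bounds by construction.

```rust
// Indexing: each v[i] carries a bounds check unless the compiler can
// prove i is in range.
fn sum_indexed(v: &[i16]) -> i32 {
    let mut s = 0;
    for i in 0..v.len() {
        s += v[i] as i32;
    }
    s
}

// Iterator: no index, so no bounds check to eliminate in the first place.
fn sum_iter(v: &[i16]) -> i32 {
    v.iter().map(|&x| x as i32).sum()
}
```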

#### Use `Chunks`

`chunks` and `chunks_mut` take a slice and provide a nice iterator that hands you at-most-N-sized pieces of the input slice.

Since the code works line by line, using them feels quite natural.
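In miniature, this is what `chunks` does with a stride of 2 (the trailing piece may be shorter, hence "at most"):

```rust
// chunks in miniature: stride-2 pieces of a 5-element slice, with a
// shorter tail piece at the end.
fn chunk_demo() -> Vec<Vec<u8>> {
    let data = [1u8, 2, 3, 4, 5];
    data.chunks(2).map(|c| c.to_vec()).collect()
}
```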

#### Use `split_at`

`split_at` and `split_at_mut` give you two independent slices, even mutable ones. The code writes two lines at a time, so being able to mutably access two regions of the frame is a boon.

The (read-only) input is divided into bands and the output is produced two lines at a time. `split_at` is much better than hand-made slicing, and
`split_at_mut` is perfect for writing the even and the odd line at the same time.
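A minimal illustration of `split_at_mut` (toy code, not from the post): two disjoint mutable views of one buffer, usable at the same time.

```rust
// Split a buffer into an "even" and an "odd" line and write to both
// through the two disjoint mutable slices.
fn split_demo() -> [u8; 4] {
    let mut frame = [0u8; 4];
    let (line0, line1) = frame.split_at_mut(2);
    line0[0] = 1;
    line1[0] = 2;
    frame
}
```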

All together:

```rust
pub fn recombine_plane_chunks_32(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    let hw = w / 2;
    let hh = h / 2;
    let (src1, src2) = src.split_at(sstride * hh);
    let mut src1i = src1.chunks(sstride);
    let mut src2i = src2.chunks(sstride);
    let mut dstch = dst.chunks_mut(dstride * 2);
    for _ in 0..hh {
        let s1 = src1i.next().unwrap();
        let s2 = src2i.next().unwrap();
        let d = dstch.next().unwrap();
        let (d0, d1) = d.split_at_mut(dstride);
        let (b0, b1) = s1.split_at(hw);
        let (b2, b3) = s2.split_at(hw);
        let mut di0 = d0.iter_mut();
        let mut di1 = d1.iter_mut();
        let mut bi0 = b0.iter();
        let mut bi1 = b1.iter();
        let mut bi2 = b2.iter();
        let mut bi3 = b3.iter();
        for _ in 0..hw {
            let p0 = bi0.next().unwrap();
            let p1 = bi1.next().unwrap();
            let p2 = bi2.next().unwrap();
            let p3 = bi3.next().unwrap();
            recombine_core_32(*p0, *p1, *p2, *p3, &mut di0, &mut di1);
        }
    }
}
```
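The post never shows `recombine_core_32`. A plausible reconstruction (my assumption, based on the reference loop: widen to i32, recombine the four band samples, write two pixels per line through the iterators) could look like this:

```rust
// Hypothetical reconstruction of recombine_core_32; the original body
// is not shown in the post. One 2x2 recombination step: two pixels go
// to the even line (di0) and two to the odd line (di1).
fn recombine_core_32(
    p0: i16,
    p1: i16,
    p2: i16,
    p3: i16,
    di0: &mut std::slice::IterMut<'_, u8>,
    di1: &mut std::slice::IterMut<'_, u8>,
) {
    let (p0, p1, p2, p3) = (p0 as i32, p1 as i32, p2 as i32, p3 as i32);
    let s0 = p0 + p2;
    let d0 = p0 - p2;
    let s1 = p1 + p3;
    let d1 = p1 - p3;
    // Inline clamp to the byte range, matching what clip8 presumably does.
    let clip8 = |v: i32| v.max(0).min(255) as u8;
    *di0.next().unwrap() = clip8(((s0 + s1 + 2) >> 2) + 128);
    *di0.next().unwrap() = clip8(((s0 - s1 + 2) >> 2) + 128);
    *di1.next().unwrap() = clip8(((d0 + d1 + 2) >> 2) + 128);
    *di1.next().unwrap() = clip8(((d0 - d1 + 2) >> 2) + 128);
}
```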

It is a good improvement over the reference baseline, but still not as fast as unsafe.

```
Runtime reference       PT1.621158410S
Runtime reference 32bit PT1.467441931S
Runtime unsafe          PT1.226046003S
Runtime unsafe 32bit    PT1.126615305S
Runtime chunks          PT1.349947181S
Runtime chunks 32bit    PT1.350027322S
```

### Use of `zip` or `izip`

Using `next().unwrap()` feels clumsy and forces the iterators to be explicitly mutable. The loop can be written in a nicer way using the standard-library `zip` and the `itertools`-provided `izip`.

`zip` works fine for 2 iterators; beyond that you start piling up tuples (so, (many, (tuples, (that, (feel, lisp))))) (or `(feels (lisp, '(so, many, tuples)))` according to a reader). `izip` flattens the result, so it is somewhat nicer.
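The nesting in miniature (standard-library `zip` only; with `itertools` the same loop would destructure a flat `(x, y, z)` via `izip!`):

```rust
// zip chaining: the destructuring pattern has to mirror the nesting of
// the zipped iterators.
fn zip_demo() -> Vec<i32> {
    let a = [1, 2];
    let b = [10, 20];
    let c = [100, 200];
    a.iter()
        .zip(b.iter().zip(c.iter()))
        .map(|(x, (y, z))| x + y + z)
        .collect()
}
```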

```rust
pub fn recombine_plane_zip_16(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    let hw = w / 2;
    let hh = h / 2;
    let (src1, src2) = src.split_at(sstride * hh);
    let src1i = src1.chunks(sstride);
    let src2i = src2.chunks(sstride);
    let mut dstch = dst.chunks_mut(dstride * 2);
    for (s1, s2) in src1i.zip(src2i) {
        let d = dstch.next().unwrap();
        let (d0, d1) = d.split_at_mut(dstride);
        let (b0, b1) = s1.split_at(hw);
        let (b2, b3) = s2.split_at(hw);
        let mut di0 = d0.iter_mut();
        let mut di1 = d1.iter_mut();
        let iterband = b0.iter().zip(b1.iter().zip(b2.iter().zip(b3.iter())));
        for (p0, (p1, (p2, p3))) in iterband {
            recombine_core_16(*p0, *p1, *p2, *p3, &mut di0, &mut di1);
        }
    }
}
```

How do they fare?

```
Runtime reference        PT1.614962959S
Runtime reference 32bit  PT1.369636641S
Runtime unsafe           PT1.223157417S
Runtime unsafe 32bit     PT1.125534521S
Runtime chunks           PT1.350069795S
Runtime chunks 32bit     PT1.381841742S
Runtime zip              PT1.249227707S
Runtime zip 32bit        PT1.094282423S
Runtime izip             PT1.366320546S
Runtime izip 32bit       PT1.208708213S
```

Pretty well.

Looks like `izip` is currently a little more wasteful than `zip`, so it seems we have a winner 🙂

## Conclusions

• Compared to common imperative programming patterns, using the high level abstractions does lead to a nice speedup: use iterators when you can!
• Not all the abstractions cost zero, `zip` made the overall code faster while `izip` lead to a speed regression.
• Do benchmark your time critical code. nightly has some facility for it BUT it is not great for micro-benchmarks.

Overall I’m enjoying a lot writing code in Rust.

## 2 thoughts on “Optimizing Rust”

1. Dave says:

I wonder how this compares to a C version of the code. Might be hard to compare if `recombine_core_16` is non-trivial.

2. lu_zero says:

It depends. The discussion went further on Reddit, and enabling LTO changes the scenario quite a bit: rustc with LTO enabled generates way-faster-than-clang code for the unsafe variant. I hope to find the time (and the will) to dig into this further later.