After the post about optimization, Kostya and many commenters (me included) discussed whether there are better ways to optimize that loop without using unsafe code.
Kostya provided a test function and several implementations of his own, and I polished and benchmarked the whole thing.
The code
I put the code in a simple project; initially it was a single main.rs, and then it grew a little.
It all started with this function:
pub fn recombine_plane_reference(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    let mut idx0 = 0;
    let mut idx1 = w / 2;
    let mut idx2 = (h / 2) * sstride;
    let mut idx3 = idx2 + idx1;
    let mut oidx0 = 0;
    let mut oidx1 = dstride;

    for _ in 0..(h / 2) {
        for x in 0..(w / 2) {
            let p0 = src[idx0 + x];
            let p1 = src[idx1 + x];
            let p2 = src[idx2 + x];
            let p3 = src[idx3 + x];
            let s0 = p0.wrapping_add(p2);
            let d0 = p0.wrapping_sub(p2);
            let s1 = p1.wrapping_add(p3);
            let d1 = p1.wrapping_sub(p3);
            let o0 = s0.wrapping_add(s1).wrapping_add(2);
            let o1 = d0.wrapping_add(d1).wrapping_add(2);
            let o2 = s0.wrapping_sub(s1).wrapping_add(2);
            let o3 = d0.wrapping_sub(d1).wrapping_add(2);
            dst[oidx0 + x * 2 + 0] = clip8(o0.wrapping_shr(2).wrapping_add(128));
            dst[oidx0 + x * 2 + 1] = clip8(o1.wrapping_shr(2).wrapping_add(128));
            dst[oidx1 + x * 2 + 0] = clip8(o2.wrapping_shr(2).wrapping_add(128));
            dst[oidx1 + x * 2 + 1] = clip8(o3.wrapping_shr(2).wrapping_add(128));
        }
        idx0 += sstride;
        idx1 += sstride;
        idx2 += sstride;
        idx3 += sstride;
        oidx0 += dstride * 2;
        oidx1 += dstride * 2;
    }
}
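The snippet calls a clip8 helper that is not shown here; it clamps the recombined value to the 0–255 range of a u8. A minimal sketch, assuming the i16 signature the reference code implies (the actual helper in the project may differ):

#[inline(always)]
fn clip8(a: i16) -> u8 {
    // Clamp to the valid pixel range before narrowing to u8.
    if a < 0 { 0 } else if a > 255 { 255 } else { a as u8 }
}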
Benchmark
Kostya used perf to measure the number of samples taken over a large number of iterations. I wanted to make the benchmark a little more portable, so I used the time::PreciseTime Rust provides to measure something coarser, but good enough for our purposes.
We want to see whether rewriting the loop using unsafe pointers or high-level iterators provides a decent speedup; there is no need to be overly precise.
NB: I decided not to use the bencher utility provided with nightly Rust, to keep the code even easier to use.
use time::PreciseTime; // from the time crate

fn benchme<F>(name: &str, n: usize, mut f: F)
where
    F: FnMut(),
{
    let start = PreciseTime::now();
    for _ in 0..n {
        f();
    }
    let end = PreciseTime::now();
    println!("Runtime {} {}", name, start.to(end));
}
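Calling it is straightforward; a hypothetical invocation (the buffer sizes and iteration count here are made up for illustration) could look like:

let (w, h) = (1280, 720);
let (sstride, dstride) = (w, w);
let src = vec![0i16; sstride * h];
let mut dst = vec![0u8; dstride * h];
// Run the reference implementation n times and print the elapsed time.
benchme("reference", 600, || {
    recombine_plane_reference(&src, sstride, &mut dst, dstride, w, h);
});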
# cargo run --release
Unsafe code
Both Kostya and I have a C background, so it was sort of natural for us to embrace unsafe {}
and use raw pointers the way we are used to.
pub fn recombine_plane_unsafe(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    unsafe {
        let hw = (w / 2) as isize;
        let mut band0 = src.as_ptr();
        let mut band1 = band0.offset(hw);
        let mut band2 = band0.offset(((h / 2) * sstride) as isize);
        let mut band3 = band2.offset(hw);
        let mut dst0 = dst.as_mut_ptr();
        let mut dst1 = dst0.offset(dstride as isize);
        let hh = (h / 2) as isize;
        for _ in 0..hh {
            let mut b0_ptr = band0;
            let mut b1_ptr = band1;
            let mut b2_ptr = band2;
            let mut b3_ptr = band3;
            let mut d0_ptr = dst0;
            let mut d1_ptr = dst1;
            for _ in 0..hw {
                let p0 = *b0_ptr;
                let p1 = *b1_ptr;
                let p2 = *b2_ptr;
                let p3 = *b3_ptr;
                let s0 = p0.wrapping_add(p2);
                let s1 = p1.wrapping_add(p3);
                let d0 = p0.wrapping_sub(p2);
                let d1 = p1.wrapping_sub(p3);
                let o0 = s0.wrapping_add(s1).wrapping_add(2);
                let o1 = d0.wrapping_add(d1).wrapping_add(2);
                let o2 = s0.wrapping_sub(s1).wrapping_add(2);
                let o3 = d0.wrapping_sub(d1).wrapping_add(2);
                *d0_ptr.offset(0) = clip8((o0 >> 2).wrapping_add(128));
                *d0_ptr.offset(1) = clip8((o1 >> 2).wrapping_add(128));
                *d1_ptr.offset(0) = clip8((o2 >> 2).wrapping_add(128));
                *d1_ptr.offset(1) = clip8((o3 >> 2).wrapping_add(128));
                b0_ptr = b0_ptr.offset(1);
                b1_ptr = b1_ptr.offset(1);
                b2_ptr = b2_ptr.offset(1);
                b3_ptr = b3_ptr.offset(1);
                d0_ptr = d0_ptr.offset(2);
                d1_ptr = d1_ptr.offset(2);
            }
            band0 = band0.offset(sstride as isize);
            band1 = band1.offset(sstride as isize);
            band2 = band2.offset(sstride as isize);
            band3 = band3.offset(sstride as isize);
            dst0 = dst0.offset((dstride * 2) as isize);
            dst1 = dst1.offset((dstride * 2) as isize);
        }
    }
}
The function is faster than the baseline:
Runtime reference PT1.598052169S
Runtime unsafe PT1.222646190S
Explicit upcasts
Kostya noticed that telling Rust to use i32 instead of i16 gave a performance boost.
Runtime reference PT1.601846926S
Runtime reference 32bit PT1.371876242S
Runtime unsafe PT1.223115917S
Runtime unsafe 32bit PT1.124667021S
I’ll keep both the i16 and i32 variants around to see when the upcast matters and when it does not.
Note: making code generic over primitive types is currently pretty painful; hopefully this will improve in the future.
High level abstractions
Most of the comments on Kostya’s original post were about leveraging high-level abstractions to make the compiler understand the code better.
Use Iterators
Rust is able to omit the bounds checks if there is a guarantee that the code cannot go out of the array boundaries. Using iterators instead of for loops over external indices should do the trick, as the small illustration below shows.
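As a minimal illustration (the two helper functions here are hypothetical), indexing may incur a bounds check on every access, while iterating cannot go out of bounds:

fn copy_indexed(src: &[u8], dst: &mut [u8]) {
    // Each dst[i] / src[i] access may be bounds-checked.
    for i in 0..src.len() {
        dst[i] = src[i];
    }
}

fn copy_iter(src: &[u8], dst: &mut [u8]) {
    // The zipped iterators cannot step past either slice, so no checks are needed.
    for (d, s) in dst.iter_mut().zip(src.iter()) {
        *d = *s;
    }
}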
Use chunks
chunks and chunks_mut take a slice and provide a nice iterator that yields at-most-N-sized pieces of the input slice.
Since the code works line by line, it is quite natural to use them.
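For instance, on a toy slice:

let line = [1i16, 2, 3, 4, 5];
let mut rows = line.chunks(2);
assert_eq!(rows.next(), Some(&[1, 2][..]));
assert_eq!(rows.next(), Some(&[3, 4][..]));
// The last chunk may be shorter than the requested size.
assert_eq!(rows.next(), Some(&[5][..]));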
Use split_at
split_at and split_at_mut give you two independent slices, even mutable ones. The code writes two lines at a time, so being able to mutably access two regions of the frame is a boon.
The (read-only) input is divided into bands, and the output is produced two lines at a time. split_at is much better than hand-made slicing, and split_at_mut is perfect for writing the even and the odd lines at the same time.
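A small example of split_at_mut, borrowing two halves of the same buffer mutably at once:

let mut frame = [0u8; 8];
// Two non-overlapping mutable views: the borrow checker is satisfied
// because split_at_mut guarantees they cannot alias.
let (even, odd) = frame.split_at_mut(4);
even[0] = 1;
odd[0] = 2;
assert_eq!(frame[0], 1);
assert_eq!(frame[4], 2);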
All together
pub fn recombine_plane_chunks_32(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    let hw = w / 2;
    let hh = h / 2;
    let (src1, src2) = src.split_at(sstride * hh);
    let mut src1i = src1.chunks(sstride);
    let mut src2i = src2.chunks(sstride);
    let mut dstch = dst.chunks_mut(dstride * 2);
    for _ in 0..hh {
        let s1 = src1i.next().unwrap();
        let s2 = src2i.next().unwrap();
        let mut d = dstch.next().unwrap();
        let (mut d0, mut d1) = d.split_at_mut(dstride);
        let (b0, b1) = s1.split_at(hw);
        let (b2, b3) = s2.split_at(hw);
        let mut di0 = d0.iter_mut();
        let mut di1 = d1.iter_mut();
        let mut bi0 = b0.iter();
        let mut bi1 = b1.iter();
        let mut bi2 = b2.iter();
        let mut bi3 = b3.iter();
        for _ in 0..hw {
            let p0 = bi0.next().unwrap();
            let p1 = bi1.next().unwrap();
            let p2 = bi2.next().unwrap();
            let p3 = bi3.next().unwrap();
            recombine_core_32(*p0, *p1, *p2, *p3, &mut di0, &mut di1);
        }
    }
}
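The recombine_core_32 helper is not shown in the post; based on the reference loop body, a sketch of what it might look like (the actual signature in the benchmark project may differ, and recombine_core_16 would be the same without the upcast):

fn recombine_core_32(
    p0: i16,
    p1: i16,
    p2: i16,
    p3: i16,
    di0: &mut std::slice::IterMut<u8>,
    di1: &mut std::slice::IterMut<u8>,
) {
    // Upcast to i32 so the intermediate sums cannot wrap.
    let (p0, p1, p2, p3) = (p0 as i32, p1 as i32, p2 as i32, p3 as i32);
    let s0 = p0 + p2;
    let d0 = p0 - p2;
    let s1 = p1 + p3;
    let d1 = p1 - p3;
    // Write the even line through di0 and the odd line through di1;
    // the narrowing cast back to i16 is safe for sane wavelet input.
    *di0.next().unwrap() = clip8((((s0 + s1 + 2) >> 2) + 128) as i16);
    *di0.next().unwrap() = clip8((((d0 + d1 + 2) >> 2) + 128) as i16);
    *di1.next().unwrap() = clip8((((s0 - s1 + 2) >> 2) + 128) as i16);
    *di1.next().unwrap() = clip8((((d0 - d1 + 2) >> 2) + 128) as i16);
}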
It is a good improvement over the reference baseline, but still not as fast as unsafe.
Runtime reference PT1.621158410S
Runtime reference 32bit PT1.467441931S
Runtime unsafe PT1.226046003S
Runtime unsafe 32bit PT1.126615305S
Runtime chunks PT1.349947181S
Runtime chunks 32bit PT1.350027322S
Use zip or izip
Using next().unwrap() feels clumsy and forces the iterators to be explicitly mutable. The loop can be written in a nicer way using the standard-library zip and the itertools-provided izip.
zip works fine for 2 iterators, but beyond that you start piling up (so, (many, (tuples, (that, (feels, lisp))))) (or (feels (lisp, '(so, many, tuples))) according to a reader). izip flattens the result, so it is somewhat nicer.
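The difference in destructuring is easy to see on a toy example before the real loop (izip! comes from the itertools crate):

use itertools::izip;

let (a, b, c) = ([1, 2], [3, 4], [5, 6]);
// zip nests: each item is (x, (y, z)).
for (x, (y, z)) in a.iter().zip(b.iter().zip(c.iter())) {
    println!("{} {} {}", x, y, z);
}
// izip! flattens: each item is (x, y, z).
for (x, y, z) in izip!(a.iter(), b.iter(), c.iter()) {
    println!("{} {} {}", x, y, z);
}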
pub fn recombine_plane_zip_16(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    let hw = w / 2;
    let hh = h / 2;
    let (src1, src2) = src.split_at(sstride * hh);
    let src1i = src1.chunks(sstride);
    let src2i = src2.chunks(sstride);
    let mut dstch = dst.chunks_mut(dstride * 2);
    for (s1, s2) in src1i.zip(src2i) {
        let mut d = dstch.next().unwrap();
        let (mut d0, mut d1) = d.split_at_mut(dstride);
        let (b0, b1) = s1.split_at(hw);
        let (b2, b3) = s2.split_at(hw);
        let mut di0 = d0.iter_mut();
        let mut di1 = d1.iter_mut();
        let iterband = b0.iter().zip(b1.iter().zip(b2.iter().zip(b3.iter())));
        for (p0, (p1, (p2, p3))) in iterband {
            recombine_core_16(*p0, *p1, *p2, *p3, &mut di0, &mut di1);
        }
    }
}
How would they fare?
Runtime reference PT1.614962959S
Runtime reference 32bit PT1.369636641S
Runtime unsafe PT1.223157417S
Runtime unsafe 32bit PT1.125534521S
Runtime chunks PT1.350069795S
Runtime chunks 32bit PT1.381841742S
Runtime zip PT1.249227707S
Runtime zip 32bit PT1.094282423S
Runtime izip PT1.366320546S
Runtime izip 32bit PT1.208708213S
Pretty well.
It looks like izip is currently a little more wasteful than zip, so it looks like we have a winner 🙂
Conclusions
- Compared to common imperative programming patterns, using high-level abstractions does lead to a nice speedup: use iterators when you can!
- Not all abstractions cost zero: zip made the overall code faster, while izip led to a speed regression.
- Do benchmark your time-critical code. Nightly Rust has some facility for it, BUT it is not great for micro-benchmarks.
Overall, I’m enjoying writing code in Rust a lot.
I wonder how this compares to a C version of the code. Might be hard to compare if recombine_core_16 is non-trivial.
It depends: the discussion went further on reddit, and enabling LTO changes the scenario quite a bit. rustc with LTO enabled generates way-faster-than-clang code for the unsafe variant. I hope to find the time (and the will) to dig into this further later.
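For reference, enabling LTO is a two-line addition to Cargo.toml:

[profile.release]
lto = true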