rust – Luca Barbato

Making and using C-compatible libraries in rust: present and future

Since there are plenty of blog posts about what people would like to have or will implement in rust in 2019, here is mine.

I spent the last few weeks of my spare time making a C-api for rav1e called crav1e. Overall the experience has been a mixed bag and there is a lot of room for improvement.

Ideally I’d like to have by the end of the year something along the lines of:

$ cargo install-library --prefix=/usr --libdir=/usr/lib64 --destdir=/staging/place

So that it would:
– build a valid cdylib+staticlib
– produce a correct header
– produce a correct pkg-config file
– install all of it in the right paths

All of this requiring a quite basic build.rs and, probably, an applet.

What is it all about?

Building and installing shared libraries properly is quite hard, even more so on multiple platforms.

Right now cargo has quite limited install capabilities; some work on extending it is pending, tracked by an open issue and a patch.

Distributions that are experimenting with building everything as a shared library (probably way too early, since the rust-ABI is neither stable nor being stabilized yet) also have those problems.

Why it is important

rust is a pretty good language and has a fairly simple way to interact in both direction with any other language that can produce or consume C-ABI-compatible object code.

This is already quite useful if you want to build a small static archive and link it in your larger application and/or library.

An example of this use-case is librsvg.

Such a heterogeneous environment warrants a modicum of additional machinery and complication.

But if your whole library is written in rust, it is a fairly annoying amount of boilerplate that you would rather avoid.

Current status

If you want to provide C-bindings to your crates you do not have a single perfect solution right now.

What works well already

Currently building the library itself works fine and it is really straightforward:

It is quite easy to mark data types and functions to be C-compatible:

#[repr(C)]
pub struct Foo {
    a: Bar,
    ...
}

#[no_mangle]
pub unsafe extern "C" fn make_foo() -> *mut Foo {
   ...
}

rustc and cargo are aware of different crate-types; selecting staticlib produces a valid library:

[lib]
name = "rav1e"
crate-type = ["staticlib"]

cbindgen can produce a usable C-header from a crate using few lines of build.rs or a stand-alone applet and a toml configuration file.

extern crate cbindgen;

fn main() {
    let crate_dir = std::env::var("CARGO_MANIFEST_DIR").unwrap();
    let header_path: std::path::PathBuf = ["include", "rav1e.h"].iter().collect();

    cbindgen::generate(crate_dir).unwrap().write_to_file(header_path);

    println!("cargo:rerun-if-changed=src/lib.rs");
    println!("cargo:rerun-if-changed=cbindgen.toml");
}

header = "// SPDX-License-Identifier: MIT"
sys_includes = ["stddef.h"]
include_guard = "RAV1E_H"
tab_width = 4
style = "Type"
language = "C"

[parse]
parse_deps = true
include = ['rav1e']
expand = ['rav1e']

[export]
prefix = "Ra"
item_types = ["enums", "structs", "unions", "typedefs", "opaque", "functions"]

[enum]
rename_variants = "ScreamingSnakeCase"
prefix_with_name = true

Now issuing cargo build --release will get you a .h in the include/ dir and a .a library in target/release, so far it is simple enough.

What sort of works

Once you have a static library, you need an external means to track its dependencies.

Back in the old ages there were libtool archives (.la), now we have pkg-config files providing more information and in a format that is way easier to parse and use.

rustc has --print native-static-libs to produce the additional libraries to link, BUT prints it to stderr and only as a side-effect of the actual build process.

My fairly ugly hack has been adding a dummy empty subcrate just to produce the link-line using

cargo rustc -- --print native-static-libs 2>&1| grep native-static-libs | cut -d ':' -f 3

And then generate the .pc file from a template.

This is anything but straightforward and, because of how cargo rustc works, you may end up adding an empty subcrate just to extract this information quickly.
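
The two steps above (pull the link line out of the rustc output, then fill a .pc template) can be sketched in a few lines of rust. This is only an illustration: the function names, the template layout and the placeholder Description text are my assumptions, not what crav1e ships.

```rust
/// Extract the link line from rustc's diagnostic output.
/// Hypothetical helper: it looks for the `native-static-libs:` marker,
/// much like the `grep | cut` pipeline does.
pub fn parse_native_static_libs(rustc_stderr: &str) -> Option<String> {
    rustc_stderr.lines().find_map(|line| {
        let (_, libs) = line.split_once("native-static-libs:")?;
        Some(libs.trim().to_string())
    })
}

/// Fill a minimal pkg-config template (assumed layout).
/// `private_libs` is the link line extracted above.
pub fn render_pc(
    name: &str,
    version: &str,
    prefix: &str,
    libdir: &str,
    private_libs: &str,
) -> String {
    // `${{...}}` is an escaped brace pair: it emits a literal `${...}`
    // that pkg-config expands by itself.
    format!(
        "prefix={prefix}\n\
         libdir={libdir}\n\
         includedir=${{prefix}}/include\n\
         \n\
         Name: {name}\n\
         Description: {name} C library\n\
         Version: {version}\n\
         Libs: -L${{libdir}} -l{name}\n\
         Libs.private: {private_libs}\n\
         Cflags: -I${{includedir}}\n",
        prefix = prefix,
        libdir = libdir,
        name = name,
        version = version,
        private_libs = private_libs
    )
}
```

A build.rs or an applet could feed the captured stderr to the first function and write the result of the second one to `<name>.pc`.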

What is missing

Once you have your library, your header and your pkg-config file, you probably would like to install the library somehow and/or make a package out of it.

cargo install does not currently cover it. It works only for binaries and just binaries alone. It will hopefully change, but right now you just have to pick the external build system you are happy with and hack your way to integrate the steps mentioned above.

For crav1e I ended up hacking a quite crude Makefile.

And with that at least a pure-rust static library can be built and installed with the common:

make DESTDIR=/staging/place prefix=/usr libdir=/usr/lib64
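
An installer written in rust has to reproduce the DESTDIR convention by hand. A minimal sketch, assuming Unix-style paths (the helper name `staged_path` is hypothetical):

```rust
use std::path::{Path, PathBuf};

/// Map an installed path (e.g. /usr/lib64) to its location under DESTDIR.
pub fn staged_path(destdir: &Path, installed: &Path) -> PathBuf {
    // `Path::join` with an absolute right-hand side discards the
    // left-hand side, so the leading "/" has to be stripped first.
    let relative = installed.strip_prefix("/").unwrap_or(installed);
    destdir.join(relative)
}
```

With the values from the make invocation above, the library would be staged in /staging/place/usr/lib64 while the .pc file keeps recording /usr/lib64.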

Dynamic libraries

Given rustc and cargo have the cdylib crate type, one would assume we could just add the type, modify our build-system contraption a little and go our merry way.

Sadly not. On most of the common platforms a dynamic library (or shared object) requires some additional metadata to guide the runtime linker.

The current widespread practice is to use tools such as patchelf or install_name_tool, but this is quite brittle and might require extra tools to be installed.

My plans for 2019

rustc has a means to pass the information to the compile-time linker, but there is no proper way to pass it through cargo. I already tried to provide a solution, but I’ll have to go through the RFC route to make sure there is community consensus and the feature is properly documented.

Since this kind of metadata is platform-specific, it would be better to have this information produced and passed on by something external to the main cargo. Having it as an applet or a build.rs dependency makes it easier to support more platforms little by little and to have overrides without having to go through a main cargo update.
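
To give an idea of what such an external helper would compute, here is a minimal sketch. The `Target` enum and the function are hypothetical; the linker flags themselves are the usual GNU ld `-soname` and Apple ld `-install_name` options, each of which would then be handed to rustc through `-C link-arg=`.

```rust
/// Platforms covered by this sketch (hypothetical, far from complete).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Target {
    Linux,
    MacOs,
}

/// Linker arguments carrying the runtime-linker metadata for a shared library.
pub fn shared_lib_link_args(target: Target, name: &str, major: u32, libdir: &str) -> Vec<String> {
    match target {
        // ELF platforms record the soname in the library itself.
        Target::Linux => vec![format!("-Wl,-soname,lib{}.so.{}", name, major)],
        // Mach-O records the full install name instead.
        Target::MacOs => vec![format!(
            "-Wl,-install_name,{}/lib{}.{}.dylib",
            libdir, name, major
        )],
    }
}
```

Keeping this table outside cargo is what makes it possible to add platforms and overrides without waiting for a cargo release.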

The applet could also take care of properly creating the .pc file and installing everything, since it would have access to all the required information.

Some efforts could be also put on streamlining the process of extracting the library link line for the static data and spare some roundtrips.

I guess that’s all for what I’d really like to have next year in rust and I’m confident I’ll have time to deliver myself 🙂

Altivec and VSX in Rust (part 1)

I’m involved in implementing the Altivec and VSX support on rust stdsimd.

Supporting all the instructions in this language is a HUGE endeavor since for each instruction at least 2 tests have to be written and making functions type-generic gets you to the point of having few pages of implementation (that luckily desugars to the single right instruction and nothing else).

Since I’m doing this mainly for my multimedia needs I have a short list of instructions I find handy to get some code written immediately and today I’ll talk a bit about some of them.

This post is inspired by what Luc did for neon, but I’m using rust instead.

If other people find it useful, I’ll try to write down the remaining instructions.

Permutations

Most if not all the SIMD ISAs have at least one or multiple instructions to shuffle vector elements within a vector or among two.

It is quite common to use those instructions to implement matrix transposes, but that isn’t their only use.

In my toolbox I put vec_perm and vec_xxpermdi since even if the portable stdsimd provides some shuffle support it is quite unwieldy compared to the Altivec native offering.

`vec_perm`: Vector Permute

Since its first iteration Altivec has had a quite amazing instruction called vec_perm or vperm:

    fn vec_perm(a: i8x16, b: i8x16, c: i8x16) -> i8x16 {
        let mut d;
        for i in 0..16 {
            let idx = c[i] & 0xf;
            d[i] = if (c[i] & 0x10) == 0 {
                a[idx]
            } else {
                b[idx]
            };
        }
        d
    }

It is important to notice that the displacement map c is a vector and not a constant. That gives you quite a bit of flexibility in a number of situations.

This instruction is the building block you can use to implement a large deal of common patterns, including some that are also covered by stand-alone instructions e.g.:
– packing/unpacking across lanes as long as you do not have to saturate: vec_pack, vec_unpackh/vec_unpackl
– interleave/merge two vectors: vec_mergel, vec_mergeh
– shift N bytes in a vector from another: vec_sld

It is worth keeping this in mind, since you can always take two permutations and merge them into one. vec_perm itself is pretty fast, and replacing two or more instructions with a single permute can get you a pretty neat speed boost.
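
Merging two permutations only needs a little index arithmetic on the masks. The sketch below is a scalar model on plain byte arrays that follows the pseudocode above (so it ignores endianness and is not the stdsimd implementation); `compose` is a hypothetical helper name and it covers the simple case where the second permute reads only from the result of the first.

```rust
type Bytes16 = [u8; 16];

/// Scalar model of vec_perm: bit 0x10 of each mask byte picks the
/// source vector, the low 4 bits pick the byte inside it.
pub fn vec_perm(a: Bytes16, b: Bytes16, c: Bytes16) -> Bytes16 {
    let mut d = [0u8; 16];
    for i in 0..16 {
        let idx = (c[i] & 0x0f) as usize;
        d[i] = if c[i] & 0x10 == 0 { a[idx] } else { b[idx] };
    }
    d
}

/// Build the mask `m` such that `vec_perm(a, b, m)` equals
/// `vec_perm(t, t, second)` with `t = vec_perm(a, b, first)`.
pub fn compose(first: Bytes16, second: Bytes16) -> Bytes16 {
    let mut m = [0u8; 16];
    for i in 0..16 {
        // Output byte i comes from byte second[i] of t, which in turn
        // was selected by first[second[i]].
        m[i] = first[(second[i] & 0x0f) as usize];
    }
    m
}
```

The composed mask can be computed once, outside the hot loop, so the loop body pays for a single permute.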

`vec_xxpermdi` Vector Permute Doubleword Immediate

Among a good deal of improvements VSX introduced a number of instructions that work on 64bit-elements vectors, among those we have a permute instruction and I found myself using it a lot.

    #[rustc_args_required_const(2)]
    fn vec_xxpermdi(a: i64x2, b: i64x2, c: u8) -> i64x2 {
        match c & 0b11 {
            0b00 => i64x2::new(a[0], b[0]),
            0b01 => i64x2::new(a[1], b[0]),
            0b10 => i64x2::new(a[0], b[1]),
            0b11 => i64x2::new(a[1], b[1]),
            _ => unreachable!(), // c & 0b11 is always in 0..=3
        }
    }

This instruction is surely less flexible than the previous permute but it does not require an additional load.

When working on video codecs it is quite common to deal with blocks of pixels that go from 4×4 up to 64×64. Before vec_xxpermdi the common pattern was:

    #[inline(always)]
    fn store8(dst: &mut [u8x16], v: &[u8x16], i: usize) {
        // Keep the second half of the destination, replace the first 8 bytes.
        let data = dst[i];
        dst[i] = vec_perm(v[i], data, TAKE_THE_FIRST_8);
    }

That implies loading the mask as often as needed, in addition to the destination.

Using vec_xxpermdi avoids the mask load and that usually leads to a quite significant speedup when the actual function is tiny.
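
As a scalar illustration of the same store without a mask, here is a sketch on plain arrays. The selector encoding simply follows the pseudocode above (which half ends up first on real hardware depends on endianness), and `store_first_half` is a hypothetical name.

```rust
/// Scalar model of vec_xxpermdi, following the pseudocode in this post.
pub fn vec_xxpermdi(a: [i64; 2], b: [i64; 2], c: u8) -> [i64; 2] {
    match c & 0b11 {
        0b00 => [a[0], b[0]],
        0b01 => [a[1], b[0]],
        0b10 => [a[0], b[1]],
        _ => [a[1], b[1]],
    }
}

/// Replace the first 8 bytes of the destination and keep the rest.
/// The selector is an immediate, so no mask has to be loaded.
pub fn store_first_half(dst: &mut [i64; 2], v: [i64; 2]) {
    *dst = vec_xxpermdi(v, *dst, 0b10);
}
```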

Mixed Arithmetics

By mixed arithmetics I mean all the instructions that perform multiple vector arithmetic operations in a single step.

The original altivec has the following operations available for the integer types:
– vec_madds
– vec_mladd
– vec_mradds
– vec_msum
– vec_msums
– vec_sum2s
– vec_sum4s
– vec_sums

And the following two for the float type:
– vec_madd
– vec_nmsub

All of them are quite useful and they will all find their way in stdsimd pretty soon.

I’m describing today vec_sums, vec_msums and vec_madds.

They are quite representative and the other instructions are similar in spirit:
– vec_madds, vec_mladd and vec_mradds all compute a lane-wise product, take either the high-order or the low-order part of it and add a third vector returning a vector of the same element size.
– vec_sums, vec_sum2s and vec_sum4s all combine an in-vector sum operation with a sum with another vector.
– vec_msum and vec_msums both compute a sum of products, the intermediates are added together and then added to a wider-element vector.

If there is enough interest and time I can extend this post to cover all of them, for today we’ll go with this approximation.

`vec_sums`: Vector Sum Saturated

Usually SIMD instructions work with two (or 3) vectors and execute the same operation for each vector element.
Sometimes you want to just do operations within the single vector and vec_sums is one of the few instructions that let you do that:

    fn vec_sums(a: i32x4, b: i32x4) -> i32x4 {
        let mut d = i32x4::new(0, 0, 0, 0);

        d[3] = b[3].saturating_add(a[0]).saturating_add(a[1]).saturating_add(a[2]).saturating_add(a[3]);

        d
    }

It returns in the last element of the vector the sum of the vector elements of a and the last element of b.
It is pretty handy when you need to compute an average or similar operations.

It works only with 32bit signed element vectors.
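
A scalar model on plain arrays makes the behaviour easy to try out. It is a sketch that mirrors the pseudocode above, not the stdsimd code, and `average` is a hypothetical helper showing the use case just mentioned.

```rust
/// Scalar model of vec_sums: all of `a` plus the last element of `b`,
/// saturating, stored in the last element of the result.
pub fn vec_sums(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    let total = a.iter().fold(b[3], |acc, &x| acc.saturating_add(x));
    [0, 0, 0, total]
}

/// Average of the four elements, using a zero vector as second operand.
pub fn average(a: [i32; 4]) -> i32 {
    vec_sums(a, [0; 4])[3] / 4
}
```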

`vec_msums`: Vector Multiply Sum Saturated

This instruction sums the 32bit element of the third vector with the two products of the respective 16bit elements of the first two vectors overlapping the element.

It does quite a bit:

    fn vmsumshs(a: i16x8, b: i16x8, c: i32x4) -> i32x4 {
        let d;
        for i in 0..4 {
            let idx = 2 * i;
            let m0 = a[idx] as i32 * b[idx] as i32;
            let m1 = a[idx + 1] as i32 * b[idx + 1] as i32;
            d[i] = c[i].saturating_add(m0).saturating_add(m1);
        }
        d
    }

    fn vmsumuhs(a: u16x8, b: u16x8, c: u32x4) -> u32x4 {
        let d;
        for i in 0..4 {
            let idx = 2 * i;
            let m0 = a[idx] as u32 * b[idx] as u32;
            let m1 = a[idx + 1] as u32 * b[idx + 1] as u32;
            d[i] = c[i].saturating_add(m0).saturating_add(m1);
        }
        d
    }

    ...

    fn vec_msums<T, U>(a: T, b: T, c: U) -> U
    where T: sealed::VectorMultiplySumSaturate<U> {
        a.msums(b, c)
    }

It works only with 16bit elements, signed or unsigned. In order to support that on rust we have to use some creative trait.
It is quite neat if you have to implement some filters.
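
The trait trick can be tried on plain arrays. The sketch below reuses the `sealed::VectorMultiplySumSaturate` name from the snippet above, but the implementations on `[i16; 8]` and `[u16; 8]` are stand-ins of mine for the real vector types.

```rust
mod sealed {
    /// One implementation per supported element type; `U` is the
    /// matching wider accumulator vector.
    pub trait VectorMultiplySumSaturate<U> {
        fn msums(self, b: Self, c: U) -> U;
    }

    impl VectorMultiplySumSaturate<[i32; 4]> for [i16; 8] {
        fn msums(self, b: Self, c: [i32; 4]) -> [i32; 4] {
            let mut d = [0i32; 4];
            for i in 0..4 {
                let m0 = i32::from(self[2 * i]) * i32::from(b[2 * i]);
                let m1 = i32::from(self[2 * i + 1]) * i32::from(b[2 * i + 1]);
                d[i] = c[i].saturating_add(m0).saturating_add(m1);
            }
            d
        }
    }

    impl VectorMultiplySumSaturate<[u32; 4]> for [u16; 8] {
        fn msums(self, b: Self, c: [u32; 4]) -> [u32; 4] {
            let mut d = [0u32; 4];
            for i in 0..4 {
                let m0 = u32::from(self[2 * i]) * u32::from(b[2 * i]);
                let m1 = u32::from(self[2 * i + 1]) * u32::from(b[2 * i + 1]);
                d[i] = c[i].saturating_add(m0).saturating_add(m1);
            }
            d
        }
    }
}

/// Single generic entry point: the argument types pick the implementation.
pub fn vec_msums<T, U>(a: T, b: T, c: U) -> U
where
    T: sealed::VectorMultiplySumSaturate<U>,
{
    a.msums(b, c)
}
```

Since the trait lives in a private module, users can call vec_msums but cannot add implementations for unsupported types.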

`vec_madds`: Vector Multiply Add Saturated

    fn vec_madds(a: i16x8, b: i16x8, c: i16x8) -> i16x8 {
        let d;
        for i in 0..8 {
            let v = (a[i] as i32 * b[i] as i32) >> 15;
            d[i] = (v as i16).saturating_add(c[i]);
        }
        d
    }

Takes the high-order 17bit of the lane-wise product of the first two vectors and adds it to a third one.
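
In practice this is a Q15 fixed-point multiply-add. Here is a scalar sketch on plain arrays; unlike the pseudocode above it adds in 32 bits and clamps afterwards, which is my reading of the instruction and should be checked against the ISA manual.

```rust
/// Scalar model of vec_madds on plain arrays (a sketch, see the caveat above).
pub fn vec_madds(a: [i16; 8], b: [i16; 8], c: [i16; 8]) -> [i16; 8] {
    let mut d = [0i16; 8];
    for i in 0..8 {
        // High-order part of the 32-bit product.
        let high = (i32::from(a[i]) * i32::from(b[i])) >> 15;
        // Widened add, then saturate back to 16 bits.
        let sum = high + i32::from(c[i]);
        d[i] = sum.max(i32::from(i16::MIN)).min(i32::from(i16::MAX)) as i16;
    }
    d
}
```

With 16384 standing for 0.5 in Q15, 0.5 times 0.5 gives 8192, that is 0.25.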

Coming next

Raptor Engineering kindly gave me access to a Power 9 through their Integricloud hosting.

We could run some extensive benchmarks and we found some peculiar behaviour with the C compilers available on the machine and that got me, Luc and Alexandra a little puzzled.

Next time I’ll try to collect in a little more organic way what I randomly put on my twitter as I noticed it.

Rust-av: Rust and Multimedia

Recently I presented my new project at Fosdem and since I was afraid of not having enough time for the questions I trimmed the content to the bare minimum. This blog post should add some more details.

What is it?

Rust-av aims to be a complete multimedia toolkit written in rust.

Rust is a quite promising language that aims to offer high execution speed while granting a number of guarantees on the code behavior that you cannot have in C, C++, Java and so on.

Its zero-cost abstraction feature coupled with the fact that the compiler actively prevents you from committing a large class of mistakes related to memory access seems a perfect match to implement a multimedia toolkit that is easy to use, fast enough and trustworthy.

Why something completely new?

Since rust code can be intermixed with C code, an evolutive approach of replacing little by little small components in a larger project is perfectly feasible, and it is what we are currently trying to do with vlc.

But rust is not just good to write some inner routines so they are fast and correct, its trait system is also quite useful to have a more expressive API.

Most of the multimedia concepts are pretty simple at the high level (e.g. a frame is just a picture or some sound with some timestamp) with an excruciating amount of quirks and details that require your toolkit to make choices for you or make you face a large amount of complexity.

That leads to APIs that are either easy but quite inflexible (and opinionated) or APIs providing all the flexibility, but forcing the user to learn a lot of information in order to achieve what the simpler API would let them implement in a handful of lines of code.

I wanted to leverage Rust to make the low level implementations with fewer bugs and, at the same time, try to provide a better API to use them.

Why now?

Since 2016 I have kept bouncing ideas off Kostya and Geoffroy, but between my work duties and other projects I couldn’t devote enough time to it. Thanks to the Mozilla Open Source Support initiative, which awarded me enough to develop it full time, the project now has some components published already and more will follow during the next months.

Philosophy

I’m trying to leverage the experience I have from contributing to vlc and libav and keep what is working well and try to not make the same mistakes.

Ease of use

I want the whole toolkit to be useful to a wide audience. Developers often fight against the library in order to undo what is happening under the hood or end up vendoring some part of it since they need only a tiny subset of all the features that are provided.

Rust makes it quite natural to split large projects into independent components (called crates) and it is already quite common to have meta-crates re-exporting many smaller crates to provide some uniform access.

The rust-av code, as opposed to the rather monolithic approach taken in Libav, can be reused with the granularity of the bare codec or format:

– Integrating it in a foreign toolkit won’t require undoing what the common utility code does.
– Even when using it through the higher level layers, rust-av won’t force the developer to bring in any unrelated dependencies.
– On the other hand users that enjoy a fully integrated and all-encompassing solution can simply depend on the meta-crates and get the support for everything.

Speed

Multimedia playback boils down to efficiently doing complex computations so that an arbitrarily large amount of data can be rendered within a fraction of a second; real-time multimedia streaming requires compressing an equally large amount of data in the same time.

Speed in multimedia is important.

Rust provides high level idiomatic constructs that surprisingly lead to pretty decent runtime speed. The stdsimd effort and the seamless C ABI support make it easier to leverage the SIMD instructions provided by the recent CPU architectures.

Trustworthy

Traditionally the most effective way to write fast multimedia code has been pairing C and assembly. Sadly the combination makes it quite easy to overlook corner cases and to introduce all kinds of memory hazards (use-after-free, out of bound reads and writes, NULL-dereferences…).

Rust effectively prevents a good deal of those issues at compile time. Since its abstractions usually do not cause slowdowns it is possible to write code that is, arguably, less misleading and as fast.

Structure

The toolkit is composed of multiple, loosely coupled, crates. They can be grouped by level of abstraction.

Essential

av-data: Used by nearly all the other crates, it provides basic data types and a minimal amount of functionality on top of them. It mainly provides the following structs:

Frame: it binds together a time reference and a buffer, representing either a video picture or some audio samples.
Packet: it binds together a time reference and a buffer, containing compressed data.
Value: Simple key value type abstraction, used to pass arbitrary data to the configuration functions.

Core

They provide the basic abstractions (traits) implemented by a specific set of components.

av-format: It provides a set of traits to implement muxers and demuxers and a utility Context to bridge the normal rust I/O Write and Read traits and the actual muxers and demuxers.
av-codec: It provides a set of traits to implement encoders and decoders and a utility Context that wraps.

Utility

They provide building blocks that may be used to implement actual codecs and formats.

av-bitstream: Utility crate to write and read bits and bytes
av-audio: Audio-specific utilities
av-video: Video-specific utilities

Implementation

Actual implementations of codecs and formats; they can be used directly or through the utility Contexts.

The direct usage is suggested only if you are integrating it in larger frameworks that already implement, possibly in different ways, the integration code provided by the Context (e.g. binding it together with the I/O for the formats or internal queues for the codecs).

Higher-level

They provide higher level Contexts to playback or encode data through a simplified interface:

av-player reads bytes through a provided Read and outputs decoded Frames. Under the hood it probes the data, allocates and configures a Demuxer and a Decoder for each stream of interest.
av-encoder consumes Frames and outputs encoded and muxed data through a Write output. It automatically sets up the encoders and the muxer.

Meta-crates

They ease the use in bulk of everything provided by rust-av.

There are 4 crates providing a list of specific components: av-demuxers, av-muxers, av-decoders and av-encoders; and 2 grouping them by type: av-formats and av-codecs.

Their use is suggested when you’d like to support every format and codec available.

So far

All the development happens on the github organization and so far the initial Core and Essential crates are ready to be used.

There is a nom-based matroska demuxer in working condition and some non-native wrappers providing implementations for some decoders and encoders.

Thanks to est31 we have native vorbis support.

I’m working on a native implementation of opus and soon I’ll move to a video codec.

There is a tiny player called avp and an encoder tool (named ave) will appear once the matroska muxer is complete.

What’s missing in rust-av

API-wise, right now rust-av provides only for simple decode and encoding, muxing and demuxing. There are already enough wrapped codecs to let people play with the library and, hopefully, help in polishing it.

For each crate I’m trying to prepare some easy tasks so people willing to contribute to the project can start from them, all help is welcome!

What’s missing in rust

So far my experience with rust has been quite positive, but there are a number of features that are missing or that could be addressed.

– SIMD support is shaping up nicely and it is coming soon.
– The natural fallback, going down to assembly, is available since rust supports the C ABI; inline assembly support, on the other hand, still seems to be pending some discussion before it reaches stable.
– Arbitrarily aligned allocation is a MUST in order to support hardware acceleration, and SIMD usually works better with aligned buffers.
– I’d love to have const generics now; luckily associated constants with traits allow some workarounds that let you specialize by constants (and result in neat speedups).
– I think that focusing a little more on array/slice support would lead to the best gains, since right now there isn’t an equivalent to collect() to fill arrays in an idiomatic way and in multimedia large lookup tables are pretty much a staple.
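
The associated-constant workaround mentioned in the list looks roughly like this. It is a minimal sketch with made-up names (`Block`, `Block4`, `Block8`, `row_sums`), not code from rust-av:

```rust
/// The block size is carried by the type, so each instantiation of a
/// generic function sees it as a compile-time constant.
pub trait Block {
    const SIZE: usize;
}

pub struct Block4;
pub struct Block8;

impl Block for Block4 {
    const SIZE: usize = 4;
}

impl Block for Block8 {
    const SIZE: usize = 8;
}

/// Sum every row of `B::SIZE` pixels; the compiler is free to unroll
/// the inner loop for each block size.
pub fn row_sums<B: Block>(pixels: &[u8]) -> Vec<u32> {
    pixels
        .chunks(B::SIZE)
        .map(|row| row.iter().map(|&p| u32::from(p)).sum::<u32>())
        .collect()
}
```

Calling `row_sums::<Block4>(data)` and `row_sums::<Block8>(data)` produces two specialized copies of the function, which is where the speedup comes from.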

In closing

Rust and Multimedia seem a really good match: in my experience, besides a number of missing features, the language seems quite good for the purpose.

Once I have more native implementations complete I will be able to have better means to evaluate the speed difference from writing the same code in C.

Optimizing rust

After the post about optimization, Kostya and many commenters (me included) discussed whether there are better ways to optimize that loop without using unsafe code.

Kostya provided me with a test function and multiple implementations from him and I polished and benchmarked the whole thing.

The code

I put the code in a simple project, initially it was a simple main.rs and then it grew a little.

It all started with this function:

pub fn recombine_plane_reference(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    let mut idx0 = 0;
    let mut idx1 = w / 2;
    let mut idx2 = (h / 2) * sstride;
    let mut idx3 = idx2 + idx1;
    let mut oidx0 = 0;
    let mut oidx1 = dstride;

    for _ in 0..(h / 2) {
        for x in 0..(w / 2) {
            let p0 = src[idx0 + x];
            let p1 = src[idx1 + x];
            let p2 = src[idx2 + x];
            let p3 = src[idx3 + x];
            let s0 = p0.wrapping_add(p2);
            let d0 = p0.wrapping_sub(p2);
            let s1 = p1.wrapping_add(p3);
            let d1 = p1.wrapping_sub(p3);
            let o0 = s0.wrapping_add(s1).wrapping_add(2);
            let o1 = d0.wrapping_add(d1).wrapping_add(2);
            let o2 = s0.wrapping_sub(s1).wrapping_add(2);
            let o3 = d0.wrapping_sub(d1).wrapping_add(2);
            dst[oidx0 + x * 2 + 0] = clip8(o0.wrapping_shr(2).wrapping_add(128));
            dst[oidx0 + x * 2 + 1] = clip8(o1.wrapping_shr(2).wrapping_add(128));
            dst[oidx1 + x * 2 + 0] = clip8(o2.wrapping_shr(2).wrapping_add(128));
            dst[oidx1 + x * 2 + 1] = clip8(o3.wrapping_shr(2).wrapping_add(128));
        }
        idx0 += sstride;
        idx1 += sstride;
        idx2 += sstride;
        idx3 += sstride;
        oidx0 += dstride * 2;
        oidx1 += dstride * 2;
    }
}
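
The function relies on a clip8 helper that is not shown in this post. A plausible definition, assuming it simply clamps the value to the 8bit range:

```rust
/// Assumed helper: clamp a 16bit intermediate to 0..=255.
pub fn clip8(v: i16) -> u8 {
    v.max(0).min(255) as u8
}
```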

Benchmark

Kostya used perf to measure the number of samples it takes over a large number of iterations. I wanted to make the benchmark a little more portable, so I used PreciseTime from the time crate to measure something a little more coarse, but good enough for our purposes.

We want to see if rewriting the loop using unsafe pointers or using high level iterators provides a decent speedup, no need to be overly precise.

NB: I decided to not use the bencher utility provided with nightly rust to make the code even easier to use.

fn benchme<F>(name: &str, n: usize, mut f: F)
    where F : FnMut() {
    let start = PreciseTime::now();
    for _ in 0..n {
        f();
    }
    let end = PreciseTime::now();
    println!("Runtime {} {}", name, start.to(end));
}

# cargo run --release
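
The same helper can be written with the standard library alone. This variant is a sketch of mine, not the code used for the numbers in this post; it uses std::time::Instant and also returns the elapsed time.

```rust
use std::time::{Duration, Instant};

/// Run `f` n times and report the total wall-clock time.
pub fn benchme<F: FnMut()>(name: &str, n: usize, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..n {
        f();
    }
    let elapsed = start.elapsed();
    // Duration has no Display impl, so it is printed with {:?}.
    println!("Runtime {} {:?}", name, elapsed);
    elapsed
}
```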

Unsafe code

Both Kostya and I have a C background, so for him (and for me) it was sort of natural to embrace unsafe {} and use raw pointers like we are used to.

pub fn recombine_plane_unsafe(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    unsafe {
        let hw = (w / 2) as isize;
        let mut band0 = src.as_ptr();
        let mut band1 = band0.offset(hw);
        let mut band2 = band0.offset(((h / 2) * sstride) as isize);
        let mut band3 = band2.offset(hw);
        let mut dst0 = dst.as_mut_ptr();
        let mut dst1 = dst0.offset(dstride as isize);
        let hh = (h / 2) as isize;
        for _ in 0..hh {
            let mut b0_ptr = band0;
            let mut b1_ptr = band1;
            let mut b2_ptr = band2;
            let mut b3_ptr = band3;
            let mut d0_ptr = dst0;
            let mut d1_ptr = dst1;
            for _ in 0..hw {
                let p0 = *b0_ptr;
                let p1 = *b1_ptr;
                let p2 = *b2_ptr;
                let p3 = *b3_ptr;
                let s0 = p0.wrapping_add(p2);
                let s1 = p1.wrapping_add(p3);
                let d0 = p0.wrapping_sub(p2);
                let d1 = p1.wrapping_sub(p3);
                let o0 = s0.wrapping_add(s1).wrapping_add(2);
                let o1 = d0.wrapping_add(d1).wrapping_add(2);
                let o2 = s0.wrapping_sub(s1).wrapping_add(2);
                let o3 = d0.wrapping_sub(d1).wrapping_add(2);
                *d0_ptr.offset(0) = clip8((o0 >> 2).wrapping_add(128));
                *d0_ptr.offset(1) = clip8((o1 >> 2).wrapping_add(128));
                *d1_ptr.offset(0) = clip8((o2 >> 2).wrapping_add(128));
                *d1_ptr.offset(1) = clip8((o3 >> 2).wrapping_add(128));
                b0_ptr = b0_ptr.offset(1);
                b1_ptr = b1_ptr.offset(1);
                b2_ptr = b2_ptr.offset(1);
                b3_ptr = b3_ptr.offset(1);
                d0_ptr = d0_ptr.offset(2);
                d1_ptr = d1_ptr.offset(2);
            }
            band0 = band0.offset(sstride as isize);
            band1 = band1.offset(sstride as isize);
            band2 = band2.offset(sstride as isize);
            band3 = band3.offset(sstride as isize);
            dst0 = dst0.offset((dstride * 2) as isize);
            dst1 = dst1.offset((dstride * 2) as isize);
        }
    }
}

The function is faster than baseline:

    Runtime reference   PT1.598052169S
    Runtime unsafe      PT1.222646190S

Explicit upcasts

Kostya noticed that telling rust to use i32 instead of i16 gave some performance boost.

    Runtime reference       PT1.601846926S
    Runtime reference 32bit PT1.371876242S
    Runtime unsafe          PT1.223115917S
    Runtime unsafe 32bit    PT1.124667021S

I’ll keep variants between i16 and i32 to see when it is important and when it is not.

Note: Making code generic over primitive types is currently pretty painful and hopefully will be fixed in the future.

High level abstractions

Most of the comments to Kostya’s original post were about leveraging the high level abstractions to make the compiler understand the code better.

Use Iterators

Rust is able to omit the bounds checks if there is a guarantee that the code cannot go out of the array boundaries. Using Iterators instead of for loops over an external variable should do the trick.
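
A toy comparison of the two styles, unrelated to the recombine function (the names are made up for the example). Whether the checks really disappear depends on the compiler version, so it is still worth measuring.

```rust
/// Indexed version: every a[i], b[i] and out[i] is a candidate for a
/// bounds check, and it panics if `a` or `b` is shorter than `out`.
pub fn add_indexed(a: &[i16], b: &[i16], out: &mut [i16]) {
    for i in 0..out.len() {
        out[i] = a[i].wrapping_add(b[i]);
    }
}

/// Iterator version: the zip stops at the shortest slice, so there is
/// no index that could go out of range.
pub fn add_iter(a: &[i16], b: &[i16], out: &mut [i16]) {
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = x.wrapping_add(y);
    }
}
```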

Use `Chunks`

chunks and chunks_mut take a slice and provide a nice iterator that gets you at-most-N-sized pieces of the input slice.

Since the code works line by line, it is sort of natural to use it.

Use `split_at`

split_at and split_at_mut get you independent slices, even mutable ones. The code is writing two lines at a time, so having the ability to access two regions of the frame mutably is a boon.

The (read-only) input is divided in bands and the output is produced 2 lines at a time. split_at is much better than using hand-made slicing and split_at_mut is perfect to write the even and the odd line at the same time.
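
A toy example of the two calls working together (`fill_line_pairs` is a made-up function that just writes the line number into each line):

```rust
/// Walk `dst` two lines at a time and write the even and the odd line
/// of each pair through two independent mutable slices.
/// `stride` must not be zero, otherwise chunks_mut panics.
pub fn fill_line_pairs(dst: &mut [u8], stride: usize) {
    for (pair, rows) in dst.chunks_mut(stride * 2).enumerate() {
        if rows.len() < stride * 2 {
            break; // trailing partial pair, left untouched
        }
        let (even, odd) = rows.split_at_mut(stride);
        for (e, o) in even.iter_mut().zip(odd.iter_mut()) {
            *e = (pair * 2) as u8;
            *o = (pair * 2 + 1) as u8;
        }
    }
}
```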

All together

pub fn recombine_plane_chunks_32(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    let hw = w / 2;
    let hh = h / 2;
    let (src1, src2) = src.split_at(sstride * hh);
    let mut src1i = src1.chunks(sstride);
    let mut src2i = src2.chunks(sstride);
    let mut dstch = dst.chunks_mut(dstride * 2);
    for _ in 0..hh {
        let s1 = src1i.next().unwrap();
        let s2 = src2i.next().unwrap();
        let mut d = dstch.next().unwrap();
        let (mut d0, mut d1) = d.split_at_mut(dstride);
        let (b0, b1) = s1.split_at(hw);
        let (b2, b3) = s2.split_at(hw);
        let mut di0 = d0.iter_mut();
        let mut di1 = d1.iter_mut();
        let mut bi0 = b0.iter();
        let mut bi1 = b1.iter();
        let mut bi2 = b2.iter();
        let mut bi3 = b3.iter();
        for _ in 0..hw {
            let p0 = bi0.next().unwrap();
            let p1 = bi1.next().unwrap();
            let p2 = bi2.next().unwrap();
            let p3 = bi3.next().unwrap();
            recombine_core_32(*p0, *p1, *p2, *p3, &mut di0, &mut di1);
        }
    }
}
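
The recombine_core_32 helper is not shown in this post. Going by the call site and the reference function, it could look like the sketch below; the signature is inferred, `clip8_i32` is a hypothetical name, and the arithmetic is done in i32 so it does not wrap at 16 bits like the reference does.

```rust
use std::slice::IterMut;

/// Assumed helper: clamp a 32bit intermediate to 0..=255.
fn clip8_i32(v: i32) -> u8 {
    v.max(0).min(255) as u8
}

/// Possible shape of the per-sample kernel: combine one sample from each
/// band and write two pixels on each of the two output lines.
pub fn recombine_core_32(
    p0: i16,
    p1: i16,
    p2: i16,
    p3: i16,
    di0: &mut IterMut<u8>,
    di1: &mut IterMut<u8>,
) {
    // The explicit upcast: everything below is plain i32 arithmetic.
    let (p0, p1, p2, p3) = (i32::from(p0), i32::from(p1), i32::from(p2), i32::from(p3));
    let (s0, d0) = (p0 + p2, p0 - p2);
    let (s1, d1) = (p1 + p3, p1 - p3);
    let o0 = s0 + s1 + 2;
    let o1 = d0 + d1 + 2;
    let o2 = s0 - s1 + 2;
    let o3 = d0 - d1 + 2;
    *di0.next().unwrap() = clip8_i32((o0 >> 2) + 128);
    *di0.next().unwrap() = clip8_i32((o1 >> 2) + 128);
    *di1.next().unwrap() = clip8_i32((o2 >> 2) + 128);
    *di1.next().unwrap() = clip8_i32((o3 >> 2) + 128);
}
```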

It is a good improvement over the reference baseline, but still not as fast as unsafe.

    Runtime reference       PT1.621158410S
    Runtime reference 32bit PT1.467441931S
    Runtime unsafe          PT1.226046003S
    Runtime unsafe 32bit    PT1.126615305S
    Runtime chunks          PT1.349947181S
    Runtime chunks 32bit    PT1.350027322S

Use of `zip` or `izip`

Using next().unwrap() feels clumsy and forces the iterator to be explicitly mutable. The loop can be written in a nicer way using the system provided zip and the itertools-provided izip.

zip works fine for 2 iterators, then you start piling up (so, (many, (tuples, (that, (feels, lisp))))) (or (feels (lisp, '(so, many, tuples))) according to a reader). izip flattens the result so it is sort of nicer.

pub fn recombine_plane_zip_16(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    let hw = w / 2;
    let hh = h / 2;
    let (src1, src2) = src.split_at(sstride * hh);
    let src1i = src1.chunks(sstride);
    let src2i = src2.chunks(sstride);
    let mut dstch = dst.chunks_mut(dstride * 2);
    for (s1, s2) in src1i.zip(src2i) {
        let mut d = dstch.next().unwrap();
        let (mut d0, mut d1) = d.split_at_mut(dstride);
        let (b0, b1) = s1.split_at(hw);
        let (b2, b3) = s2.split_at(hw);
        let mut di0 = d0.iter_mut();
        let mut di1 = d1.iter_mut();
        let iterband = b0.iter().zip(b1.iter().zip(b2.iter().zip(b3.iter())));
        for (p0, (p1, (p2, p3))) in iterband {
            recombine_core_16(*p0, *p1, *p2, *p3, &mut di0, &mut di1);
        }
    }
}

How would they fare?

    Runtime reference        PT1.614962959S
    Runtime reference 32bit  PT1.369636641S
    Runtime unsafe           PT1.223157417S
    Runtime unsafe 32bit     PT1.125534521S
    Runtime chunks           PT1.350069795S
    Runtime chunks 32bit     PT1.381841742S
    Runtime zip              PT1.249227707S
    Runtime zip 32bit        PT1.094282423S
    Runtime izip             PT1.366320546S
    Runtime izip 32bit       PT1.208708213S

Pretty well.

Looks like izip is a little more wasteful than zip currently, so looks like we have a winner 🙂

Conclusions

Compared to common imperative programming patterns, using the high level abstractions does lead to a nice speedup: use iterators when you can!
Not all the abstractions cost zero: zip made the overall code faster while izip led to a speed regression.
Do benchmark your time critical code. nightly has some facility for it BUT it is not great for micro-benchmarks.

Overall I’m enjoying writing code in Rust a lot.