Building crates so they look like C(ABI) Libraries

I presented cargo-c at the rustlab 2019, here is a longer followup of this.

Mixing Rust and C

One of the best selling point for rust is being highly interoperable with the C-ABI, in addition to safety, speed and its amazing community.

This comes really handy when you have well optimized hand-crafted asm kernels you’d like to use as-they-are:

  • They are small and with a clear interface, usually strict boundaries on what they read/write by their own nature.
  • You’d basically rewrite them as they are using some inline assembly for dubious gains.
  • Both cc-rs and nasm-rs make the process of building and linking relatively painless.

Also, if you plan to integrate in a foreign language project some rust component, it is quite straightforward to link the staticlib produced by cargo in your main project.

If you have a pure-rust crate and you want to export it to the world as if it were a normal C (shared/dynamic) library, it gets quite gory.

Well behaved C-API Library structure

Usually when you want to use a C-library in your own project you should expect it to provide the following:

  • A header file, telling the compiler which symbols it should expect
  • A static library
  • A dynamic library
  • A pkg-config file giving you direction on where to find the header and what you need to pass to the linker to correctly link the library, being it static or dynamic

Header file

In C you usually keep a list of function prototypes and type definitions in a separate file and then embed it in your source file to let the compiler know what to expect.

Since you rely on a quite simple preprocessor to do that you have to be careful about adding guards so the file does not get included more than once and, in order to avoid clashes you install it in a subdirectory of your include dir.

Since the location of the header could be not part of the default search path, you store this information in pkg-config usually.

Static Libraries

Static libraries are quite simple in concept (and execution):

  • they are an archive of object code files.
  • the linker simply reads them as it would read just produced .os and link everything together.

There is a pitfall though:

  • In some platforms even if you want to make a fully static binary you end up dynamically linking some system library for a number of reasons.

    The worst offenders are the pthread libraries and in some cases the compiler builtins (e.g. libgcc_s)

  • The information on what they are is usually not known

rustc comes to the rescue with --print native-static-libs, it isn’t the best example of integration since it’s a string produced on stderr and it behaves as a side-effect of the actual building, but it is still a good step in the right direction.

pkg-config is the de-facto standard way to preserve the information and have the build systems know about it (I guess you are seeing a pattern now).

Dynamic Libraries

A shared or dynamic library is a specially crafted lump of executable code that gets linked to the binary as it is being executed.
The advantages compared to statically linking everything are mainly two:

  • Sparing disk space: since without link-time pruning you end up carrying multiple copies of the same library with every binary using it.
  • Safer and simpler updates: If you need to update say, openssl, you do that once compared to updating the 100+ consumers of it existing in your system.

There is some inherent complexity and constraints in order to get this feature right, the most problematic one is ABI stability:

  • The dynamic linker needs to find the symbols the binary expects and have them with the correct size
  • If you change the in-memory layout of a struct or how the function names are represented you should make so the linker is aware.

Usually that means that depending on your platform you have some versioning information you should provide when you are preparing your library. This can be as simple as telling the compile-time linker to embed the version information (e.g. Mach-O dylib or ELF) in the library or as complex as crafting a version script.

Compared to crafting a staticlib it there are more moving parts and platform-specific knowledge.

Sadly in this case rustc does not provide any help for now: even if the C-ABI is stable and set in stone, the rust mangling strategy is not finalized yet, and it is a large part of being ABI stable, so the work on fully supporting dynamic libraries is yet to be completed.

Dynamic libraries in most platforms have a mean to store which other dynamic libraries they reliy on and which are the paths in which to look for. When the information is incomplete, or you are storing the library in a non-standard path, pkg-config comes to the rescue again, helpfully storing the information for you.

Pkg-config

It is your single point of truth as long your build system supports it and the libraries you want to use craft it properly.
It simplifies a lot your life if you want to keep around multiple versions of a library or you are doing non-system packaging (e.g.: Homebrew or Gentoo Prefix).
Beside the search path, link line and dependency information I mentioned above, it also stores the library version and inter-library compatibility relationships.
If you are publishing a C-library and you aren’t providing a .pc file, please consider doing it.

Producing a C-compatible library out of a crate

I explained what we are expected to produce, now let see what we can do on the rust side:

  • We need to export C-ABI-compatible symbols, that means we have to:
  • Decorate the data types we want to export with #[repr(C)]
  • Decorate the functions with #[no_mangle] and prefix them with export "C"
  • Tell rustc the crate type is both staticlib and cdylib
  • Pass rustc the platform-correct link line so the library produced has the right information inside.
    > NOTE: In some platforms beside the version information also the install path must be encoded in the library.
  • Generate the header file so that the C compiler knows about them.
  • Produce a pkg-config file with the correct information

    NOTE: It requires knowing where the whole lot will be eventually installed.

cargo does not support installing libraries at all (since for now rust dynamic libraries should not be used at all) so we are a bit on our own.

For rav1e I did that the hard way and then I came up an easy way for you to use (and that I used for doing the same again with lewton spending about 1/2 day instead of several ~~weeks~~months).

The hard way

As seen in crav1e, you can explore the history there.

It isn’t the fully hard way since before cargo-c there was already nice tools to avoid some time consuming tasks: cbindgen.
In a terse summary what I had to do was:

  • Come up with an external build system since cargo itself cannot install anything nor have direct knowledge of the install path information. I used Make since it is simple and sufficiently widespread, anything richer would probably get in the way and be more time consuming to set up.
  • Figure out how to extract the information provided in Cargo.toml so I have it at Makefile level. I gave up and duplicated it since parsing toml or json is pointlessly complicated for a prototype.
  • Write down the platform-specific logic on how to build (and install) the libraries. It ended up living in the build.rs and the Makefile. Thanks again to Derek for taking care of the Windows-specific details.
  • Use cbindgen to generate the C header (And in the process smooth some of its rough edges
  • Since we already have a build system add more targets for testing and continuous integration purpose.

If you do not want to use cargo-c I spun away the cdylib-link line logic in a stand alone crate so you can use it in your build.rs.

The easier way

Using a Makefile and a separate crate with a customized build.rs works fine and keeps the developers that care just about writing in rust fully shielded from the gory details and contraptions presented above.

But it comes with some additional churn:

  • Keeping the API in sync
  • Duplicate the release work
  • Have the users confused on where to report the issues or where to find the actual sources. (The users tend to miss the information presented in the obvious places such as the README way too often)

So to try to minimize it I came up with a cargo applet that provides two subcommands:

  • cbuild to build the libraries, the .pc file and header.
  • cinstall to install the whole lot, if already built or to build and then install it.

They are two subcommands since it is quite common to build as user and then install as root. If you are using rustup and root does not have cargo you can get away with using --destdir and then sudo install or craft your local package if your distribution provides a mean to do that.

All I mentioned in the hard way happens under the hood and, beside bugs in the current implementation, you should be completely oblivious of the details.

Using cargo-c

As seen in lewton and rav1e.

  • Create a capi.rs with the C-API you want to expose and use #[cfg(cargo_c)] to hide it when you build a normal rust library.
  • Make sure you have a lib target and if you are using a workspace the first member is the crate you want to export, that means that you might have to add a "." member at the start of the list.
  • Remember to add a cbindgen.toml and fill it with at least the include guard and probably you want to set the language to C (it defaults to C++)
  • Once you are happy with the result update your documentation to tell the user to install cargo-c and do cargo cinstall --prefix=/usr --destdir=/tmp/some-place or something along those lines.

Coming next

cargo-c is a young project and far from being complete even if it is functional for my needs.

Help in improving it is welcome, there are plenty of rough edges and bugs to find and squash.

Thanks

Thanks to est31 and sdroege for the in-depth review in #rust-av and kodabb for the last minute edits.

Using rav1e – from your code

AV1, Rav1e, Crav1e, an intro

(this article is also available on my dev.to profile, I might use it more often since wordpress is pretty horrible at managing markdown.)

AV1 is a modern video codec brought to you by an alliance of many different bigger and smaller players in the multimedia field.
I’m part of the VideoLan organization and I spent quite a bit of time on this codec lately.


rav1e: The safest and fastest AV1 encoder, built by many volunteers and Mozilla/Xiph developers.
It is written in rust and strives to provide good speed, quality and stay maintainable.


crav1e: A companion crate, written by yours truly, that provides a C-API, so the encoder can be used by C libraries and programs.


This article will just give a quick overview of the API available right now and it is mainly to help people start using it and hopefully report issues and problem.

Rav1e API

The current API is built around the following 4 structs and 1 enum:

  • struct Frame: The raw pixel data
  • struct Packet: The encoded bitstream
  • struct Config: The encoder configuration
  • struct Context: The encoder state

  • enum EncoderStatus: Fatal and non-fatal condition returned by the Contextmethods.

Config

The Config struct currently is simply constructed.

    struct Config {
        enc: EncoderConfig,
        threads: usize,
    }

The EncoderConfig stores all the settings that have an impact to the actual bitstream while settings such as threads are kept outside.

    let mut enc = EncoderConfig::with_speed_preset(speed);
    enc.width = w;
    enc.height = h;
    enc.bit_depth = 8;
    let cfg = Config { enc, threads: 0 };

NOTE: Some of the fields above may be shuffled around until the API is marked as stable.

Methods

new_context
    let cfg = Config { enc, threads: 0 };
    let ctx: Context<u8> = cfg.new_context();

It produces a new encoding context. Where bit_depth is 8, it is possible to use an optimized u8 codepath, otherwise u16 must be used.

Context

It is produced by Config::new_context, its implementation details are hidden.

Methods

The Context can be grouped into essential, optional and convenience.

    // Essential API
    pub fn send_frame<F>(&mut self, frame: F) -> Result<(), EncoderStatus>
      where F: Into<Option<Arc<Frame<T>>>>, T: Pixel;
    pub fn receive_packet(&mut self) -> Result<Packet<T>, EncoderStatus>;

The encoder works by processing each Frame fed through send_frame and producing each Packet that can be retrieved by receive_packet.

    // Optional API
    pub fn container_sequence_header(&mut self) -> Vec<u8>;
    pub fn get_first_pass_data(&self) -> &FirstPassData;

Depending on the container format, the AV1 Sequence Header could be stored in the extradata. container_sequence_header produces the data pre-formatted to be simply stored in mkv or mp4.

rav1e supports multi-pass encoding and the encoding data from the first pass can be retrieved by calling get_first_pass_data.

    // Convenience shortcuts
    pub fn new_frame(&self) -> Arc<Frame<T>>;
    pub fn set_limit(&mut self, limit: u64);
    pub fn flush(&mut self) {
  • new_frame() produces a frame according to the dimension and pixel format information in the Context.
  • flush() is functionally equivalent to call send_frame(None).
  • set_limit()is functionally equivalent to call flush()once limit frames are sent to the encoder.

Workflow

The workflow is the following:

  1. Setup:
    • Prepare a Config
    • Call new_context from the Config to produce a Context
  2. Encode loop:
    • Pull each Packet using receive_packet.
    • If receive_packet returns EncoderStatus::NeedMoreData
      • Feed each Frame to the Context using send_frame
  3. Complete the encoding
    • Issue a flush() to encode each pending Frame in a final Packet.
    • Call receive_packet until EncoderStatus::LimitReached is returned.

Crav1e API

The crav1e API provides the same structures and features beside few key differences:

  • The Frame, Config, and Context structs are opaque.
typedef struct RaConfig RaConfig;
typedef struct RaContext RaContext;
typedef struct RaFrame RaFrame;
  • The Packet struct exposed is much simpler than the rav1e original.
typedef struct {
    const uint8_t *data;
    size_t len;
    uint64_t number;
    RaFrameType frame_type;
} RaPacket;
  • The EncoderStatus includes a Success condition.
typedef enum {
    RA_ENCODER_STATUS_SUCCESS = 0,
    RA_ENCODER_STATUS_NEED_MORE_DATA,
    RA_ENCODER_STATUS_ENOUGH_DATA,
    RA_ENCODER_STATUS_LIMIT_REACHED,
    RA_ENCODER_STATUS_FAILURE = -1,
} RaEncoderStatus;

RaConfig

Since the configuration is opaque there are a number of functions to assemble it:

  • rav1e_config_default allocates a default configuration.
  • rav1e_config_parse and rav1e_config_parse_int set a specific value for a specific field selected by a key string.
  • rav1e_config_set_${field} are specialized setters for complex information such as the color description.
RaConfig *rav1e_config_default(void);

/**
 * Set a configuration parameter using its key and value as string.
 * Available keys and values
 * - "quantizer": 0-255, default 100
 * - "speed": 0-10, default 3
 * - "tune": "psnr"-"psychovisual", default "psnr"
 * Return a negative value on error or 0.
 */
int rav1e_config_parse(RaConfig *cfg, const char *key, const char *value);

/**
 * Set a configuration parameter using its key and value as integer.
 * Available keys and values are the same as rav1e_config_parse()
 * Return a negative value on error or 0.
 */
int rav1e_config_parse_int(RaConfig *cfg, const char *key, int value);

/**
 * Set color properties of the stream.
 * Supported values are defined by the enum types
 * RaMatrixCoefficients, RaColorPrimaries, and RaTransferCharacteristics
 * respectively.
 * Return a negative value on error or 0.
 */
int rav1e_config_set_color_description(RaConfig *cfg,
                                       RaMatrixCoefficients matrix,
                                       RaColorPrimaries primaries,
                                       RaTransferCharacteristics transfer);

/**
 * Set the content light level information for HDR10 streams.
 * Return a negative value on error or 0.
 */
int rav1e_config_set_content_light(RaConfig *cfg,
                                   uint16_t max_content_light_level,
                                   uint16_t max_frame_average_light_level);

/**
 * Set the mastering display information for HDR10 streams.
 * primaries and white_point arguments are RaPoint, containing 0.16 fixed point values.
 * max_luminance is a 24.8 fixed point value.
 * min_luminance is a 18.14 fixed point value.
 * Returns a negative value on error or 0.
 */
int rav1e_config_set_mastering_display(RaConfig *cfg,
                                       RaPoint primaries[3],
                                       RaPoint white_point,
                                       uint32_t max_luminance,
                                       uint32_t min_luminance);

void rav1e_config_unref(RaConfig *cfg);

The bare minimum setup code is the following:

    int ret = -1;
    RaConfig *rac = rav1e_config_default();
    if (!rac) {
        printf("Unable to initialize\n");
        goto clean;
    }

    ret = rav1e_config_parse_int(rac, "width", 64);
    if (ret < 0) {
        printf("Unable to configure width\n");
        goto clean;
    }

    ret = rav1e_config_parse_int(rac, "height", 96);
    if (ret < 0) {
        printf("Unable to configure height\n");
        goto clean;
    }

    ret = rav1e_config_parse_int(rac, "speed", 9);
    if (ret < 0) {
        printf("Unable to configure speed\n");
        goto clean;
    }

RaContext

As per the rav1e API, the context structure is produced from a configuration and the same send-receive model is used.
The convenience methods aren’t exposed and the frame allocation function is actually essential.

// Essential API
RaContext *rav1e_context_new(const RaConfig *cfg);
void rav1e_context_unref(RaContext *ctx);

RaEncoderStatus rav1e_send_frame(RaContext *ctx, const RaFrame *frame);
RaEncoderStatus rav1e_receive_packet(RaContext *ctx, RaPacket **pkt);
// Optional API
uint8_t *rav1e_container_sequence_header(RaContext *ctx, size_t *buf_size);
void rav1e_container_sequence_header_unref(uint8_t *sequence);

RaFrame

Since the frame structure is opaque in C, we have the following functions to create, fill and dispose of the frames.

RaFrame *rav1e_frame_new(const RaContext *ctx);
void rav1e_frame_unref(RaFrame *frame);

/**
 * Fill a frame plane
 * Currently the frame contains 3 planes, the first is luminance followed by
 * chrominance.
 * The data is copied and this function has to be called for each plane.
 * frame: A frame provided by rav1e_frame_new()
 * plane: The index of the plane starting from 0
 * data: The data to be copied
 * data_len: Lenght of the buffer
 * stride: Plane line in bytes, including padding
 * bytewidth: Number of bytes per component, either 1 or 2
 */
void rav1e_frame_fill_plane(RaFrame *frame,
                            int plane,
                            const uint8_t *data,
                            size_t data_len,
                            ptrdiff_t stride,
                            int bytewidth);

RaEncoderStatus

The encoder status enum is returned by the rav1e_send_frame and rav1e_receive_packet and it is possible already to arbitrarily query the context for its status.

RaEncoderStatus rav1e_last_status(const RaContext *ctx);

To simulate the rust Debug functionality a to_str function is provided.

char *rav1e_status_to_str(RaEncoderStatus status);

Workflow

The C API workflow is similar to the Rust one, albeit a little more verbose due to the error checking.

    RaContext *rax = rav1e_context_new(rac);
    if (!rax) {
        printf("Unable to allocate a new context\n");
        goto clean;
    }
    RaFrame *f = rav1e_frame_new(rax);
    if (!f) {
        printf("Unable to allocate a new frame\n");
        goto clean;
    }
while (keep_going(i)){
     RaPacket *p;
     ret = rav1e_receive_packet(rax, &p);
     if (ret < 0) {
         printf("Unable to receive packet %d\n", i);
         goto clean;
     } else if (ret == RA_ENCODER_STATUS_SUCCESS) {
         printf("Packet %"PRIu64"\n", p->number);
         do_something_with(p);
         rav1e_packet_unref(p);
         i++;
     } else if (ret == RA_ENCODER_STATUS_NEED_MORE_DATA) {
         RaFrame *f = get_frame_by_some_mean(rax);
         ret = rav1e_send_frame(rax, f);
         if (ret < 0) {
            printf("Unable to send frame %d\n", i);
            goto clean;
        } else if (ret > 0) {
        // Cannot happen in normal conditions
            printf("Unable to append frame %d to the internal queue\n", i);
            abort();
        }
     } else if (ret == RA_ENCODER_STATUS_LIMIT_REACHED) {
         printf("Limit reached\n");
         break;
     }
}

In closing

This article was mainly a good excuse to try dev.to and see write down some notes and clarify my ideas on what had been done API-wise so far and what I should change and improve.

If you managed to read till here, your feedback is really welcome, please feel free to comment, try the software and open issues for crav1e and rav1e.

Coming next

  • Working crav1e got me to see what’s good and what is lacking in the c-interoperability story of rust, now that this landed I can start crafting and publishing better tools for it and maybe I’ll talk more about it here.
  • Soon rav1e will get more threading-oriented features, some benchmarking experiments will happen soon.

Thanks

  • Special thanks to Derek and Vittorio spent lots of time integrating crav1e in larger software and gave precious feedback in what was missing and broken in the initial iterations.
  • Thanks to David for the review and editorial work.
  • Also thanks to Matteo for introducing me to dev.to.

Altivec and VSX in Rust (part 1)

I’m involved in implementing the Altivec and VSX support on rust stdsimd.

Supporting all the instructions in this language is a HUGE endeavor since for each instruction at least 2 tests have to be written and making functions type-generic gets you to the point of having few pages of implementation (that luckily desugars to the single right instruction and nothing else).

Since I’m doing this mainly for my multimedia needs I have a short list of instructions I find handy to get some code written immediately and today I’ll talk a bit about some of them.

This post is inspired by what Luc did for neon, but I’m using rust instead.

If other people find it useful, I’ll try to write down the remaining istructions.

Permutations

Most if not all the SIMD ISAs have at least one or multiple instructions to shuffle vector elements within a vector or among two.

It is quite common to use those instructions to implement matrix transposes, but it isn’t its only use.

In my toolbox I put vec_perm and vec_xxpermdi since even if the portable stdsimd provides some shuffle support it is quite unwieldy compared to the Altivec native offering.

vec_perm: Vector Permute

Since it first iteration Altivec had a quite amazing instruction called vec_perm or vperm:

    fn vec_perm(a: i8x16, b: i8x16, c: i8x16) -> i8x16 {
        let mut d;
        for i in 0..16 {
            let idx = c[i] & 0xf;
            d[i] = if (c[i] & 0x10) == 0 {
                a[idx]
            } else {
                b[idx]
            };
        }
        d
    }

It is important to notice that the displacement map c is a vector and not a constant. That gives you quite a bit of flexibility in a number of situations.

This instruction is the building block you can use to implement a large deal of common patterns, including some that are also covered by stand-alone instructions e.g.:
– packing/unpacking across lanes as long you do not have to saturate: vec_pack, vec_unpackh/vec_unpackl
– interleave/merge two vectors: vec_mergel, vec_mergeh
– shift N bytes in a vector from another: vec_sld

It can be important to recall this since you could always take two permutations and make one, vec_perm itself is pretty fast and replacing two or more instructions with a single permute can get you a pretty neat speed boost.

vec_xxpermdi Vector Permute Doubleword Immediate

Among a good deal of improvements VSX introduced a number of instructions that work on 64bit-elements vectors, among those we have a permute instruction and I found myself using it a lot.

    #[rustc_args_required_const(2)]
    fn vec_xxpermdi(a: i64x2, b: i64x2, c: u8) -> i64x2 {
        match c & 0b11 {
            0b00 => i64x2::new(a[0], b[0]);
            0b01 => i64x2::new(a[1], b[0]);
            0b10 => i64x2::new(a[0], b[1]);
            0b11 => i64x2::new(a[1], b[1]);
        }
    }

This instruction is surely less flexible than the previous permute but it does not require an additional load.

When working on video codecs is quite common to deal with blocks of pixels that go from 4×4 up to 64×64, before vec_xxpermdi the common pattern was:

    #[inline(always)]
    fn store8(dst: &mut [u8x16], v: &[u8x16]) {
        let data = dst[i];
        dst[i] = vec_perm(v, data, TAKE_THE_FIRST_8);
    }

That implies to load the mask as often as needed as long as the destination.

Using vec_xxpermdi avoids the mask load and that usually leads to a quite significative speedup when the actual function is tiny.

Mixed Arithmetics

With mixed arithmetics I consider all the instructions that do in a single step multiple vector arithmetics.

The original altivec has the following operations available for the integer types:
vec_madds
vec_mladd
vec_mradds
vec_msum
vec_msums
vec_sum2s
vec_sum4s
vec_sums

And the following two for the float type:
vec_madd
vec_nmsub

All of them are quite useful and they will all find their way in stdsimd pretty soon.

I’m describing today vec_sums, vec_msums and vec_madds.

They are quite representative and the other instructions are similar in spirit:
vec_madds, vec_mladd and vec_mradds all compute a lane-wise product, take either the high-order or the low-order part of it
and add a third vector returning a vector of the same element size.
vec_sums, vec_sum2s and vec_sum4s all combine an in-vector sum operation with a sum with another vector.
vec_msum and vec_msums both compute a sum of products, the intermediates are added together and then added to a wider-element
vector.

If there is enough interest and time I can extend this post to cover all of them, for today we’ll go with this approximation.

vec_sums: Vector Sum Saturated

Usually SIMD instruction work with two (or 3) vectors and execute the same operation for each vector element.
Sometimes you want to just do operations within the single vector and vec_sums is one of the few instructions that let you do that:

    fn vec_sums(a: i32x4, b: i32x4) -> i32x4 {
        let d = i32x4::new(0, 0, 0, 0);

        d[3] = b[3].saturating_add(a[0]).saturating_add(a[1]).saturating_add(a[2]).saturating_add(a[3]);

        d
    }

It returns in the last element of the vector the sum of the vector elements of a and the last element of b.
It is pretty handy when you need to compute an average or similar operations.

It works only with 32bit signed element vectors.

vec_msums: Vector Multiply Sum Saturated

This instruction sums the 32bit element of the third vector with the two products of the respective 16bit
elements of the first two vectors overlapping the element.

It does quite a bit:

    fn vmsumshs(a: i16x8, b: i16x8, c: i32x4) -> i32x4 {
        let d;
        for i in 0..4 {
            let idx = 2 * i;
            let m0 = a[idx] as i32 * b[idx] as i32;
            let m1 = a[idx + 1] as i32 * b[idx + 1] as i32;
            d[i] = c[i].saturating_add(m0).saturating_add(m1);
        }
        d
    }

    fn vmsumuhs(a: u16x8, b: u16x8, c: u32x4) -> u32x4 {
        let d;
        for i in 0..4 {
            let idx = 2 * i;
            let m0 = a[idx] as u32 * b[idx] as u32;
            let m1 = a[idx + 1] as u32 * b[idx + 1] as u32;
            d[i] = c[i].saturating_add(m0).saturating_add(m1);
        }
        d
    }

    ...

    fn vec_msums<T, U>(a: T, b: T, c: U) -> U
    where T: sealed::VectorMultiplySumSaturate<U> {
        a.msums(b, c)
    }

It works only with 16bit elements, signed or unsigned. In order to support that on rust we have to use some creative trait.
It is quite neat if you have to implement some filters.

vec_madds: Vector Multiply Add Saturated

    fn vec_madds(a: i16x8, b: i16x8, c: i16x8) -> i16x8 {
        let d;
        for i in 0..8 {
            let v = (a[i] as i32 * b[i] as i32) >> 15;
            d[i] = (v as i16).saturating_add(c[i]);
        }
        d
    }

Takes the high-order 17bit of the lane-wise product of the first two vectors and adds it to a third one.

Coming next

Raptor Enginering kindly gave me access to a Power 9 through their Integricloud hosting.

We could run some extensive benchmarks and we found some peculiar behaviour with the C compilers available on the machine and that got me, Luc and Alexandra a little puzzled.

Next time I’ll try to collect in a little more organic way what I randomly put on my twitter as I noticed it.