rav1e-0.1.0 Made in Tokyo

Intro

AV1 is a modern video codec brought to you by an alliance of many different bigger and smaller players in the multimedia field.
I’m part of the VideoLan organization and I spent quite a bit of time on this codec lately.


rav1e: The safest and fastest AV1 encoder, built by many volunteers and Mozilla/Xiph developers.
It is written in rust and strives to provide good speed, quality and stay maintainable.


Made in Tokyo

The first official release of rav1e happened during the Video Dev Days 2019 in Tokyo.
Since it was originally presented during Video Dev Days 2017 it felt the right thing to do.

What is inside

Since last time I blogged about it, there are few changes:

crav1e is no more

The C-API now is part of rav1e itself and everything is built by cargo-c.

cargo install cargo-c
cargo cinstall --destdir=/tmp/staging
sudo cp /tmp/staging/* /

Thats’s all you need to use rav1e from C.

New API features

Keyframe placement

The rust API now let you override the keyframe placement

  let cfg = Config::default();
  let mut ctx: Context<u8> = cfg.new_context().unwrap();
  let f1 = ctx.new_frame();
  let f2 = f1.clone();
  let info = FrameParameters {
    frame_type_override: FrameTypeOverride::Key
};

  // Send the plain frame data
  ctx.send_frame(f1)?;
  // Send the data and the per-frame parameters
  // In this case the frame is forced to be a keyframe.
  ctx.send_frame((f2, info))?;
  // Flush the encoder, it is equivalent to a call to `flush()`
  ctx.send_frame(None)?;

Multipass rate control

It is possible to use Context::twopass_out() to feed back the rate control information that Context::twopass_in() can produce.

The API is intentionally opaque to the point you deal with pre-serialized data and not structured information.

Config validation

We added Config::validate() to make sure the settings are correct and return a detailed error (InvalidConfig) if that’s not the case.

Speed

We overall made it a lot faster:

15:23 < koda> hey guys, did an encode with the new 0.1.0, 20 hours for 8 minutes, down from 32 hours using a two month old build
15:23 < koda> congrats (and thanks) for making these speed improvements

And we are still working to speed it up a lot. The current weekly snapshot is an additional 20-25% faster compared to 0.1.0.

NOTE: rav1e is still resources conscious, so it will not use all the threads and memory available. This makes it good if you want to encode multiple videos in parallel, but we will work on adding additional parallelism so even the single-video scenario is covered better.

What’s next

Ideally a 0.2.0 will appear the early December, it will contain many speed improvements, lots of bugfixes (in particular docs.rs will serve the documentation) and possibly a large boost in single-pass quality if what Tim and Guillaume are working on will land in time.

For those that want to try it in Gentoo a live package is available.

Building crates so they look like C(ABI) Libraries

I presented cargo-c at the rustlab 2019, here is a longer followup of this.

Mixing Rust and C

One of the best selling point for rust is being highly interoperable with the C-ABI, in addition to safety, speed and its amazing community.

This comes really handy when you have well optimized hand-crafted asm kernels you’d like to use as-they-are:

  • They are small and with a clear interface, usually strict boundaries on what they read/write by their own nature.
  • You’d basically rewrite them as they are using some inline assembly for dubious gains.
  • Both cc-rs and nasm-rs make the process of building and linking relatively painless.

Also, if you plan to integrate in a foreign language project some rust component, it is quite straightforward to link the staticlib produced by cargo in your main project.

If you have a pure-rust crate and you want to export it to the world as if it were a normal C (shared/dynamic) library, it gets quite gory.

Well behaved C-API Library structure

Usually when you want to use a C-library in your own project you should expect it to provide the following:

  • A header file, telling the compiler which symbols it should expect
  • A static library
  • A dynamic library
  • A pkg-config file giving you direction on where to find the header and what you need to pass to the linker to correctly link the library, being it static or dynamic

Header file

In C you usually keep a list of function prototypes and type definitions in a separate file and then embed it in your source file to let the compiler know what to expect.

Since you rely on a quite simple preprocessor to do that you have to be careful about adding guards so the file does not get included more than once and, in order to avoid clashes you install it in a subdirectory of your include dir.

Since the location of the header could be not part of the default search path, you store this information in pkg-config usually.

Static Libraries

Static libraries are quite simple in concept (and execution):

  • they are an archive of object code files.
  • the linker simply reads them as it would read just produced .os and link everything together.

There is a pitfall though:

  • In some platforms even if you want to make a fully static binary you end up dynamically linking some system library for a number of reasons.

    The worst offenders are the pthread libraries and in some cases the compiler builtins (e.g. libgcc_s)

  • The information on what they are is usually not known

rustc comes to the rescue with --print native-static-libs, it isn’t the best example of integration since it’s a string produced on stderr and it behaves as a side-effect of the actual building, but it is still a good step in the right direction.

pkg-config is the de-facto standard way to preserve the information and have the build systems know about it (I guess you are seeing a pattern now).

Dynamic Libraries

A shared or dynamic library is a specially crafted lump of executable code that gets linked to the binary as it is being executed.
The advantages compared to statically linking everything are mainly two:

  • Sparing disk space: since without link-time pruning you end up carrying multiple copies of the same library with every binary using it.
  • Safer and simpler updates: If you need to update say, openssl, you do that once compared to updating the 100+ consumers of it existing in your system.

There is some inherent complexity and constraints in order to get this feature right, the most problematic one is ABI stability:

  • The dynamic linker needs to find the symbols the binary expects and have them with the correct size
  • If you change the in-memory layout of a struct or how the function names are represented you should make so the linker is aware.

Usually that means that depending on your platform you have some versioning information you should provide when you are preparing your library. This can be as simple as telling the compile-time linker to embed the version information (e.g. Mach-O dylib or ELF) in the library or as complex as crafting a version script.

Compared to crafting a staticlib it there are more moving parts and platform-specific knowledge.

Sadly in this case rustc does not provide any help for now: even if the C-ABI is stable and set in stone, the rust mangling strategy is not finalized yet, and it is a large part of being ABI stable, so the work on fully supporting dynamic libraries is yet to be completed.

Dynamic libraries in most platforms have a mean to store which other dynamic libraries they reliy on and which are the paths in which to look for. When the information is incomplete, or you are storing the library in a non-standard path, pkg-config comes to the rescue again, helpfully storing the information for you.

Pkg-config

It is your single point of truth as long your build system supports it and the libraries you want to use craft it properly.
It simplifies a lot your life if you want to keep around multiple versions of a library or you are doing non-system packaging (e.g.: Homebrew or Gentoo Prefix).
Beside the search path, link line and dependency information I mentioned above, it also stores the library version and inter-library compatibility relationships.
If you are publishing a C-library and you aren’t providing a .pc file, please consider doing it.

Producing a C-compatible library out of a crate

I explained what we are expected to produce, now let see what we can do on the rust side:

  • We need to export C-ABI-compatible symbols, that means we have to:
  • Decorate the data types we want to export with #[repr(C)]
  • Decorate the functions with #[no_mangle] and prefix them with export "C"
  • Tell rustc the crate type is both staticlib and cdylib
  • Pass rustc the platform-correct link line so the library produced has the right information inside.
    > NOTE: In some platforms beside the version information also the install path must be encoded in the library.
  • Generate the header file so that the C compiler knows about them.
  • Produce a pkg-config file with the correct information

    NOTE: It requires knowing where the whole lot will be eventually installed.

cargo does not support installing libraries at all (since for now rust dynamic libraries should not be used at all) so we are a bit on our own.

For rav1e I did that the hard way and then I came up an easy way for you to use (and that I used for doing the same again with lewton spending about 1/2 day instead of several ~~weeks~~months).

The hard way

As seen in crav1e, you can explore the history there.

It isn’t the fully hard way since before cargo-c there was already nice tools to avoid some time consuming tasks: cbindgen.
In a terse summary what I had to do was:

  • Come up with an external build system since cargo itself cannot install anything nor have direct knowledge of the install path information. I used Make since it is simple and sufficiently widespread, anything richer would probably get in the way and be more time consuming to set up.
  • Figure out how to extract the information provided in Cargo.toml so I have it at Makefile level. I gave up and duplicated it since parsing toml or json is pointlessly complicated for a prototype.
  • Write down the platform-specific logic on how to build (and install) the libraries. It ended up living in the build.rs and the Makefile. Thanks again to Derek for taking care of the Windows-specific details.
  • Use cbindgen to generate the C header (And in the process smooth some of its rough edges
  • Since we already have a build system add more targets for testing and continuous integration purpose.

If you do not want to use cargo-c I spun away the cdylib-link line logic in a stand alone crate so you can use it in your build.rs.

The easier way

Using a Makefile and a separate crate with a customized build.rs works fine and keeps the developers that care just about writing in rust fully shielded from the gory details and contraptions presented above.

But it comes with some additional churn:

  • Keeping the API in sync
  • Duplicate the release work
  • Have the users confused on where to report the issues or where to find the actual sources. (The users tend to miss the information presented in the obvious places such as the README way too often)

So to try to minimize it I came up with a cargo applet that provides two subcommands:

  • cbuild to build the libraries, the .pc file and header.
  • cinstall to install the whole lot, if already built or to build and then install it.

They are two subcommands since it is quite common to build as user and then install as root. If you are using rustup and root does not have cargo you can get away with using --destdir and then sudo install or craft your local package if your distribution provides a mean to do that.

All I mentioned in the hard way happens under the hood and, beside bugs in the current implementation, you should be completely oblivious of the details.

Using cargo-c

As seen in lewton and rav1e.

  • Create a capi.rs with the C-API you want to expose and use #[cfg(cargo_c)] to hide it when you build a normal rust library.
  • Make sure you have a lib target and if you are using a workspace the first member is the crate you want to export, that means that you might have to add a "." member at the start of the list.
  • Remember to add a cbindgen.toml and fill it with at least the include guard and probably you want to set the language to C (it defaults to C++)
  • Once you are happy with the result update your documentation to tell the user to install cargo-c and do cargo cinstall --prefix=/usr --destdir=/tmp/some-place or something along those lines.

Coming next

cargo-c is a young project and far from being complete even if it is functional for my needs.

Help in improving it is welcome, there are plenty of rough edges and bugs to find and squash.

Thanks

Thanks to est31 and sdroege for the in-depth review in #rust-av and kodabb for the last minute edits.

Using Wireguard

wireguard is a modern, secure and fast vpn tunnel that is extremely simple to setup and works already nearly everywhere.

Since I spent a little bet to play with it because this looked quite interesting, I thought of writing a small tutorial.

I normally use Gentoo (and macos) so this guide is about Gentoo.

General concepts

Wireguard sets up peers identified by an public key and manages a virtual network interface and the routing across them (optionally).

The server is just a peer that knows about loots of peers while a client knows how to directly reach the server and that’s it.

Setting up in Gentoo

Wireguard on Linux is implemented as a kernel module.

So in general you have to build the module and the userspace tools (wg).
If you want to have some advanced feature make sure that your kernel has the following settings:

IP_ADVANCED_ROUTER
IP_MULTIPLE_TABLES
NETFILTER_XT_MARK

After that using emerge will get you all you need:

$ emerge wireguard

Tools

The default distribution of tools come with the wg command and an helper script called wg-quick that makes easier to bring up and down the virtual network interface.

wg help
Usage: wg <cmd> [<args>]

Available subcommands:
  show: Shows the current configuration and device information
  showconf: Shows the current configuration of a given WireGuard interface, for use with `setconf'
  set: Change the current configuration, add peers, remove peers, or change peers
  setconf: Applies a configuration file to a WireGuard interface
  addconf: Appends a configuration file to a WireGuard interface
  genkey: Generates a new private key and writes it to stdout
  genpsk: Generates a new preshared key and writes it to stdout
  pubkey: Reads a private key from stdin and writes a public key to stdout
You may pass `--help' to any of these subcommands to view usage.
Usage: wg-quick [ up | down | save | strip ] [ CONFIG_FILE | INTERFACE ]

  CONFIG_FILE is a configuration file, whose filename is the interface name
  followed by `.conf'. Otherwise, INTERFACE is an interface name, with
  configuration found at /etc/wireguard/INTERFACE.conf. It is to be readable
  by wg(8)'s `setconf' sub-command, with the exception of the following additions
  to the [Interface] section, which are handled by wg-quick:

  - Address: may be specified one or more times and contains one or more
    IP addresses (with an optional CIDR mask) to be set for the interface.
  - DNS: an optional DNS server to use while the device is up.
  - MTU: an optional MTU for the interface; if unspecified, auto-calculated.
  - Table: an optional routing table to which routes will be added; if
    unspecified or `auto', the default table is used. If `off', no routes
    are added.
  - PreUp, PostUp, PreDown, PostDown: script snippets which will be executed
    by bash(1) at the corresponding phases of the link, most commonly used
    to configure DNS. The string `%i' is expanded to INTERFACE.
  - SaveConfig: if set to `true', the configuration is saved from the current
    state of the interface upon shutdown.

See wg-quick(8) for more info and examples.

Creating a configuration

Wireguard is quite straightforward, you can either prepare a configuration with your favourite text editor or generate one by setting by hand the virtual network device and then saving the result wg showconf presents.

A configuration file then can be augmented with wg-quick-specific options (such as Address) or just passed to wg setconf while the other networking details are managed by your usual tools (e.g. ip).

Create your keys

The first step is to create the public-private key pair that identifies your peer.

  • wg genkey generates a private key for you.
  • You feed it to wg pubkey to have your public key.

In a single line:

$ wg genkey | tee privkey | wg pubkey > pubkey

Prepare a configuration file

Both wg-quick and wg setconf use an ini-like configuration file.

If you put it in /etc/wireguard/${ifname}.conf then wg-quick would just need the interface name and would look it up for you.

The minimum configuration needs an [Interface] and a [Peer] set.
You may add additional peers later.
A server would specify its ListenPort and identify the peers by their PublicKey.

[Interface]
Address = 192.168.2.1/24
ListenPort = 51820
PrivateKey = <key>

[Peer]
PublicKey = <key>
AllowedIPs = 192.168.2.2/32

A client would have a peer with an EndPoint defined and optionally not specify the ListenPort in its interface description.

[Interface]
PrivateKey = <key>
Address = 192.168.2.2/24

[Peer]
PublicKey = <key>
AllowedIPs = 192.168.2.0/24
Endpoint = <ip>:<port>

The AllowedIPs mask let you specify how much you want to route over the vpn.
By setting 0.0.0.0/0 you tell you want to route ALL the traffic through it.

NOTE: Address is a wg-quick-specific option.

Using a configuration

wg-quick is really simple to use, assuming you have created /etc/wireguard/wg0.conf:

$ wg-quick up wg0
$ wg-quick down wg0

If you are using netifrc from version 0.6.1 wireguard is supported and you can have a configuration such as:

config_wg0="192.168.2.4/24"
wireguard_wg0="/etc/wireguard/wg0.conf"

With the wg0.conf file like the above but stripped of the wg-quick-specific options.

Summing up

Wireguard is a breeze to set up compared to nearly all the other vpn solutions.

Non-linux systems can currently use a go implementation and in the future a rust implementation (help welcome).

Android and macos have already some pretty front-ends that make the setup easy even on those platforms.

I hope you enjoyed it 🙂

Using rav1e – from your code

AV1, Rav1e, Crav1e, an intro

(this article is also available on my dev.to profile, I might use it more often since wordpress is pretty horrible at managing markdown.)

AV1 is a modern video codec brought to you by an alliance of many different bigger and smaller players in the multimedia field.
I’m part of the VideoLan organization and I spent quite a bit of time on this codec lately.


rav1e: The safest and fastest AV1 encoder, built by many volunteers and Mozilla/Xiph developers.
It is written in rust and strives to provide good speed, quality and stay maintainable.


crav1e: A companion crate, written by yours truly, that provides a C-API, so the encoder can be used by C libraries and programs.


This article will just give a quick overview of the API available right now and it is mainly to help people start using it and hopefully report issues and problem.

Rav1e API

The current API is built around the following 4 structs and 1 enum:

  • struct Frame: The raw pixel data
  • struct Packet: The encoded bitstream
  • struct Config: The encoder configuration
  • struct Context: The encoder state

  • enum EncoderStatus: Fatal and non-fatal condition returned by the Contextmethods.

Config

The Config struct currently is simply constructed.

    struct Config {
        enc: EncoderConfig,
        threads: usize,
    }

The EncoderConfig stores all the settings that have an impact to the actual bitstream while settings such as threads are kept outside.

    let mut enc = EncoderConfig::with_speed_preset(speed);
    enc.width = w;
    enc.height = h;
    enc.bit_depth = 8;
    let cfg = Config { enc, threads: 0 };

NOTE: Some of the fields above may be shuffled around until the API is marked as stable.

Methods

new_context
    let cfg = Config { enc, threads: 0 };
    let ctx: Context<u8> = cfg.new_context();

It produces a new encoding context. Where bit_depth is 8, it is possible to use an optimized u8 codepath, otherwise u16 must be used.

Context

It is produced by Config::new_context, its implementation details are hidden.

Methods

The Context can be grouped into essential, optional and convenience.

    // Essential API
    pub fn send_frame<F>(&mut self, frame: F) -> Result<(), EncoderStatus>
      where F: Into<Option<Arc<Frame<T>>>>, T: Pixel;
    pub fn receive_packet(&mut self) -> Result<Packet<T>, EncoderStatus>;

The encoder works by processing each Frame fed through send_frame and producing each Packet that can be retrieved by receive_packet.

    // Optional API
    pub fn container_sequence_header(&mut self) -> Vec<u8>;
    pub fn get_first_pass_data(&self) -> &FirstPassData;

Depending on the container format, the AV1 Sequence Header could be stored in the extradata. container_sequence_header produces the data pre-formatted to be simply stored in mkv or mp4.

rav1e supports multi-pass encoding and the encoding data from the first pass can be retrieved by calling get_first_pass_data.

    // Convenience shortcuts
    pub fn new_frame(&self) -> Arc<Frame<T>>;
    pub fn set_limit(&mut self, limit: u64);
    pub fn flush(&mut self) {
  • new_frame() produces a frame according to the dimension and pixel format information in the Context.
  • flush() is functionally equivalent to call send_frame(None).
  • set_limit()is functionally equivalent to call flush()once limit frames are sent to the encoder.

Workflow

The workflow is the following:

  1. Setup:
    • Prepare a Config
    • Call new_context from the Config to produce a Context
  2. Encode loop:
    • Pull each Packet using receive_packet.
    • If receive_packet returns EncoderStatus::NeedMoreData
      • Feed each Frame to the Context using send_frame
  3. Complete the encoding
    • Issue a flush() to encode each pending Frame in a final Packet.
    • Call receive_packet until EncoderStatus::LimitReached is returned.

Crav1e API

The crav1e API provides the same structures and features beside few key differences:

  • The Frame, Config, and Context structs are opaque.
typedef struct RaConfig RaConfig;
typedef struct RaContext RaContext;
typedef struct RaFrame RaFrame;
  • The Packet struct exposed is much simpler than the rav1e original.
typedef struct {
    const uint8_t *data;
    size_t len;
    uint64_t number;
    RaFrameType frame_type;
} RaPacket;
  • The EncoderStatus includes a Success condition.
typedef enum {
    RA_ENCODER_STATUS_SUCCESS = 0,
    RA_ENCODER_STATUS_NEED_MORE_DATA,
    RA_ENCODER_STATUS_ENOUGH_DATA,
    RA_ENCODER_STATUS_LIMIT_REACHED,
    RA_ENCODER_STATUS_FAILURE = -1,
} RaEncoderStatus;

RaConfig

Since the configuration is opaque there are a number of functions to assemble it:

  • rav1e_config_default allocates a default configuration.
  • rav1e_config_parse and rav1e_config_parse_int set a specific value for a specific field selected by a key string.
  • rav1e_config_set_${field} are specialized setters for complex information such as the color description.
RaConfig *rav1e_config_default(void);

/**
 * Set a configuration parameter using its key and value as string.
 * Available keys and values
 * - "quantizer": 0-255, default 100
 * - "speed": 0-10, default 3
 * - "tune": "psnr"-"psychovisual", default "psnr"
 * Return a negative value on error or 0.
 */
int rav1e_config_parse(RaConfig *cfg, const char *key, const char *value);

/**
 * Set a configuration parameter using its key and value as integer.
 * Available keys and values are the same as rav1e_config_parse()
 * Return a negative value on error or 0.
 */
int rav1e_config_parse_int(RaConfig *cfg, const char *key, int value);

/**
 * Set color properties of the stream.
 * Supported values are defined by the enum types
 * RaMatrixCoefficients, RaColorPrimaries, and RaTransferCharacteristics
 * respectively.
 * Return a negative value on error or 0.
 */
int rav1e_config_set_color_description(RaConfig *cfg,
                                       RaMatrixCoefficients matrix,
                                       RaColorPrimaries primaries,
                                       RaTransferCharacteristics transfer);

/**
 * Set the content light level information for HDR10 streams.
 * Return a negative value on error or 0.
 */
int rav1e_config_set_content_light(RaConfig *cfg,
                                   uint16_t max_content_light_level,
                                   uint16_t max_frame_average_light_level);

/**
 * Set the mastering display information for HDR10 streams.
 * primaries and white_point arguments are RaPoint, containing 0.16 fixed point values.
 * max_luminance is a 24.8 fixed point value.
 * min_luminance is a 18.14 fixed point value.
 * Returns a negative value on error or 0.
 */
int rav1e_config_set_mastering_display(RaConfig *cfg,
                                       RaPoint primaries[3],
                                       RaPoint white_point,
                                       uint32_t max_luminance,
                                       uint32_t min_luminance);

void rav1e_config_unref(RaConfig *cfg);

The bare minimum setup code is the following:

    int ret = -1;
    RaConfig *rac = rav1e_config_default();
    if (!rac) {
        printf("Unable to initialize\n");
        goto clean;
    }

    ret = rav1e_config_parse_int(rac, "width", 64);
    if (ret < 0) {
        printf("Unable to configure width\n");
        goto clean;
    }

    ret = rav1e_config_parse_int(rac, "height", 96);
    if (ret < 0) {
        printf("Unable to configure height\n");
        goto clean;
    }

    ret = rav1e_config_parse_int(rac, "speed", 9);
    if (ret < 0) {
        printf("Unable to configure speed\n");
        goto clean;
    }

RaContext

As per the rav1e API, the context structure is produced from a configuration and the same send-receive model is used.
The convenience methods aren’t exposed and the frame allocation function is actually essential.

// Essential API
RaContext *rav1e_context_new(const RaConfig *cfg);
void rav1e_context_unref(RaContext *ctx);

RaEncoderStatus rav1e_send_frame(RaContext *ctx, const RaFrame *frame);
RaEncoderStatus rav1e_receive_packet(RaContext *ctx, RaPacket **pkt);
// Optional API
uint8_t *rav1e_container_sequence_header(RaContext *ctx, size_t *buf_size);
void rav1e_container_sequence_header_unref(uint8_t *sequence);

RaFrame

Since the frame structure is opaque in C, we have the following functions to create, fill and dispose of the frames.

RaFrame *rav1e_frame_new(const RaContext *ctx);
void rav1e_frame_unref(RaFrame *frame);

/**
 * Fill a frame plane
 * Currently the frame contains 3 planes, the first is luminance followed by
 * chrominance.
 * The data is copied and this function has to be called for each plane.
 * frame: A frame provided by rav1e_frame_new()
 * plane: The index of the plane starting from 0
 * data: The data to be copied
 * data_len: Lenght of the buffer
 * stride: Plane line in bytes, including padding
 * bytewidth: Number of bytes per component, either 1 or 2
 */
void rav1e_frame_fill_plane(RaFrame *frame,
                            int plane,
                            const uint8_t *data,
                            size_t data_len,
                            ptrdiff_t stride,
                            int bytewidth);

RaEncoderStatus

The encoder status enum is returned by the rav1e_send_frame and rav1e_receive_packet and it is possible already to arbitrarily query the context for its status.

RaEncoderStatus rav1e_last_status(const RaContext *ctx);

To simulate the rust Debug functionality a to_str function is provided.

char *rav1e_status_to_str(RaEncoderStatus status);

Workflow

The C API workflow is similar to the Rust one, albeit a little more verbose due to the error checking.

    RaContext *rax = rav1e_context_new(rac);
    if (!rax) {
        printf("Unable to allocate a new context\n");
        goto clean;
    }
    RaFrame *f = rav1e_frame_new(rax);
    if (!f) {
        printf("Unable to allocate a new frame\n");
        goto clean;
    }
while (keep_going(i)){
     RaPacket *p;
     ret = rav1e_receive_packet(rax, &p);
     if (ret < 0) {
         printf("Unable to receive packet %d\n", i);
         goto clean;
     } else if (ret == RA_ENCODER_STATUS_SUCCESS) {
         printf("Packet %"PRIu64"\n", p->number);
         do_something_with(p);
         rav1e_packet_unref(p);
         i++;
     } else if (ret == RA_ENCODER_STATUS_NEED_MORE_DATA) {
         RaFrame *f = get_frame_by_some_mean(rax);
         ret = rav1e_send_frame(rax, f);
         if (ret < 0) {
            printf("Unable to send frame %d\n", i);
            goto clean;
        } else if (ret > 0) {
        // Cannot happen in normal conditions
            printf("Unable to append frame %d to the internal queue\n", i);
            abort();
        }
     } else if (ret == RA_ENCODER_STATUS_LIMIT_REACHED) {
         printf("Limit reached\n");
         break;
     }
}

In closing

This article was mainly a good excuse to try dev.to and see write down some notes and clarify my ideas on what had been done API-wise so far and what I should change and improve.

If you managed to read till here, your feedback is really welcome, please feel free to comment, try the software and open issues for crav1e and rav1e.

Coming next

  • Working crav1e got me to see what’s good and what is lacking in the c-interoperability story of rust, now that this landed I can start crafting and publishing better tools for it and maybe I’ll talk more about it here.
  • Soon rav1e will get more threading-oriented features, some benchmarking experiments will happen soon.

Thanks

  • Special thanks to Derek and Vittorio spent lots of time integrating crav1e in larger software and gave precious feedback in what was missing and broken in the initial iterations.
  • Thanks to David for the review and editorial work.
  • Also thanks to Matteo for introducing me to dev.to.

Making and using C-compatible libraries in rust: present and future

Since there are plenty of blogposts about what people would like to have or will implement in rust in 2019 here is mine.

I spent the last few weeks of my spare time making a C-api for rav1e called crav1e, overall the experience had been a mixed bag and there is large space for improvement.

Ideally I’d like to have by the end of the year something along the lines of:

$ cargo install-library --prefix=/usr --libdir=/usr/lib64 --destdir=/staging/place

So that it would:
– build a valid cdylib+staticlib
– produce a correct header
– produce a correct pkg-config file
– install all of it in the right paths

All of this requiring a quite basic build.rs and, probably, an applet.

What is it all about?

Building and installing properly shared libraries is quite hard, even more on multiple platforms.

Right now cargo has quite limited install capabilities with some work pending on extending it and has an open issue and a patch.

Distributions that, probably way too early since the rust-ABI is not stable nor being stabilized yet, are experimenting in building everything as shared library also have those problems.

Why it is important

rust is a pretty good language and has a fairly simple way to interact in both direction with any other language that can produce or consume C-ABI-compatible object code.

This is already quite useful if you want to build a small static archive and link it in your larger application and/or library.

An example of this use-case is librsvg.

Such heterogeneous environment warrant for a modicum of additional machinery and complication.

But if your whole library is written in rust, it is a fairly annoying amount of boilerplate that you would rather avoid.

Current status

If you want to provide C-bindings to your crates you do not have a single perfect solution right now.

What works well already

Currently building the library itself works fine and it is really straightforward:

  • It is quite easy to mark data types and functions to be C-compatible:
#[repr(C)]
pub struct Foo {
    a: Bar,
    ...
}

#[no_mangle]
pub unsafe extern "C" fn make_foo() -> *mut Foo {
   ...
}
  • rustc and cargo are aware of different crate-types, selecting staticlib produces a valid library
[lib]
name = "rav1e"
crate-type = ["staticlib"]
  • cbindgen can produce a usable C-header from a crate using few lines of build.rs or a stand-alone applet and a toml configuration file.
extern crate cbindgen;

fn main() {
    let crate_dir = std::env::var("CARGO_MANIFEST_DIR").unwrap();
    let header_path: std::path::PathBuf = ["include", "rav1e.h"].iter().collect();

    cbindgen::generate(crate_dir).unwrap().write_to_file(header_path);

    println!("cargo:rerun-if-changed=src/lib.rs");
    println!("cargo:rerun-if-changed=cbindgen.toml");
}
header = "// SPDX-License-Identifier: MIT"
sys_includes = ["stddef.h"]
include_guard = "RAV1E_H"
tab_width = 4
style = "Type"
language = "C"

[parse]
parse_deps = true
include = ['rav1e']
expand = ['rav1e']

[export]
prefix = "Ra"
item_types = ["enums", "structs", "unions", "typedefs", "opaque", "functions"]

[enum]
rename_variants = "ScreamingSnakeCase"
prefix_with_name = true

Now issuing cargo build --release will get you a .h in the include/ dir and a .a library in target/release, so far it is simple enough.

What sort of works

Once have a static library, you need an external mean to track what are its dependencies.

Back in the old ages there were libtool archives (.la), now we have pkg-config files providing more information and in a format that is way easier to parse and use.

rustc has --print native-static-libs to produce the additional libraries to link, BUT prints it to stderr and only as a side-effect of the actual build process.

My, fairly ugly, hack had been adding a dummy empty subcrate just to produce the link-line using

cargo rustc -- --print native-static-libs 2>&1| grep native-static-libs | cut -d ':' -f 3

And then generate the .pc file from a template.

This is anything but straightforward and because how cargo rustc works, you may end up adding an empty subcrate just to extract this information quickly.

What is missing

Once you have your library, your header and your pkg-config file, you probably would like to install the library somehow and/or make a package out of it.

cargo install does not currently cover it. It works only for binaries and just binaries alone. It will hopefully change, but right now you just have to pick the external build system you are happy with and hack your way to integrate the steps mentioned above.

For crav1e I ended up hacking a quite crude Makefile.

And with that at least a pure-rust static library can be built and installed with the common:

make DESTDIR=/staging/place prefix=/usr libdir=/usr/lib64

Dynamic libraries

Given rustc and cargo have the cdylib crate type, one would assume we could just add the type, modify our build-system contraption a little and go our merry way.

Sadly not. A dynamic library (or shared object) requires in most of the common platform some additional metadata to guide the runtime linker.

The current widespread practice is to use tools such as patchelf or install_name_tool, but it is quite brittle and might require tools.

My plans for the 2019

rustc has a mean to pass the information to the compile-time linker but there is no proper way to pass it in cargo, I already tried to provide a solution, but I’ll have to go through the rfc route to make sure there is community consensus and the feature is properly documented.

Since kind of metadata is platform-specific so would be better to have this information produced and passed on by something external to the main cargo. Having it as applet or a build.rs dependency makes easier to support more platforms little by little and have overrides without having to go through a main cargo update.

The applet could also take care of properly create the .pc file and installing since it would have access to all the required information.

Some efforts could be also put on streamlining the process of extracting the library link line for the static data and spare some roundtrips.

I guess that’s all for what I’d really like to have next year in rust and I’m confident I’ll have time to deliver myself 🙂

rav1e and crav1e – A fast and safe AV1 encoder – Some HowTo

Over the year I contributed to an AV1 encoder written in rust.

Here a small tutorial about what is available right now, there is still lots to do, but I think we could enjoy more user-feedback (and possibly also some help).

Setting up

Install the rust toolchain

If you do not have rust installed, it is quite simple to get a full environment using rustup

$ curl https://sh.rustup.rs -sSf | sh
# Answer the questions asked and make sure you source the `.profile` file created.
$ source ~/.profile

Install cmake, perl and nasm

rav1e uses libaom for testing and and on x86/x86_64 some components have SIMD variants written directly using nasm.

You may follow the instructions, or just install:
nasm (version 2.13 or better)
perl (any recent perl5)
cmake (any recent version)

Once you have those dependencies in you are set.

Building rav1e

We use cargo, so the process is straightforward:

## Pull in the customized libaom if you want to run all the tests
$ git submodule update --init

## Build everything
$ cargo build --release

## Test to make sure everything works as intended
$ cargo test --features decode_test --release

## Install rav1e
$ cargo install

Using rav1e

Right now rav1e has a quite simple interface:

rav1e 0.1.0
AV1 video encoder

USAGE:
    rav1e [OPTIONS]  --output 

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -I, --keyint     Keyframe interval [default: 30]
    -l, --limit                  Maximum number of frames to encode [default: 0]
        --low_latency      low latency mode. true or false [default: true]
    -o, --output                Compressed AV1 in IVF video output
        --quantizer                 Quantizer (0-255) [default: 100]
    -r 
    -s, --speed                  Speed level (0(slow)-10(fast)) [default: 3]
        --tune                    Quality tuning (Will enforce partition sizes >= 8x8) [default: psnr]  [possible
                                        values: Psnr, Psychovisual]

ARGS:
        Uncompressed YUV4MPEG2 video input

It accepts y4m raw source and produces ivf files.

You can configure the encoder by setting the speed and quantizer levels.

The low_latency flag can be turned off to run some additional analysis over a set of frames and have additional quality gains.

Crav1e

While ave and gst-rs will use the rav1e crate directly, there are a number of software such as handbrake or vlc that would be much happier to consume a C API.

Thanks to the staticlib target and cbindgen is quite easy to produce a C-ABI library and its matching header.

Setup

crav1e is built using cargo, so nothing special is needed right now beside nasm if you are building it on x86/x86_64.

Build the library

This step is completely straightforward, you can build it as release:

$ cargo build --release

or as debug

$ cargo build

It will produce a target/release/librav1e.a or a target/debug/librav1e.a.
The C header will be in include/rav1e.h.

Try the example code

I provided a quite minimal sample case.

cc -Wall c-examples/simple_encoding.c -Ltarget/release/ -lrav1e -Iinclude/ -o c-examples/simple_encoding
./c-examples/simple_encoding

If it builds and runs correctly you are set.

Manually copy the .a and the .h

Currently cargo install does not work for our purposes, but it will change in the future.

$ cp target/release/librav1e.a /usr/local/lib
$ cp include/rav1e.h /usr/local/include/

Missing pieces

Right now crav1e works well enough but there are few shortcomings I’m trying to address.

Shared library support

The cdylib target does exist and produce a nearly usable library but there are some issues with soname support. I’m trying to address them with upstream, but it might take some time.

Meanwhile some people suggest to use patchelf or similar tools to fix the library after the fact.

Install target

cargo is generally awesome, but sadly its support for installing arbitrary files to arbitrary paths is limited, luckily there are people proposing solutions.

pkg-config file generation

I consider a library not proper if a .pc file is not provided with it.

Right now there are means to extract the information need to build a pkg-config file, but there isn’t a simple way to do it.

$ cargo rustc -- --print native-static-libs

Provides what is needed for Libs.private, ideally it should be created as part of the install step since you need to know the prefix, libdir and includedir paths.

Coming next

Probably the next blog post will be about my efforts to make cargo able to produce proper cdylib or something quite different.

PS: If somebody feels to help me with matroska in AV1 would be great 🙂

Gentoo on Integricloud

Integricloud gave me access to their infrastructure to track some issues on ppc64 and ppc64le.

Since some of the issues are related to the compilers, I obviously installed Gentoo on it and in the process I started to fix some issues with catalyst to get a working install media, but that’s for another blogpost.

Today I’m just giving a walk-through on how to get a ppc64le (and ppc64 soon) VM up and running.

Preparation

Read this and get your install media available to your instance.

Install Media

I’m using the Gentoo installcd I’m currently refining.

Booting

You have to append console=hvc0 to your boot command, the boot process might figure it out for you on newer install medias (I still have to send patches to update livecd-tools)

Network configuration

You have to manually setup the network.
You can use ifconfig and route or ip as you like, refer to your instance setup for the parameters.

ifconfig enp0s0 ${ip}/16
route add -net default gw ${gw}
echo "nameserver 8.8.8.8" > /etc/resolv.conf
ip a add ${ip}/16 dev enp0s0
ip l set enp0s0 up
ip r add default via ${gw}
echo "nameserver 8.8.8.8" > /etc/resolv.conf

Disk Setup

OpenFirmware seems to like gpt much better:

parted /dev/sda mklabel gpt

You may use fdisk to create:
– a PowerPC PrEP boot partition of 8M
– root partition with the remaining space

Device     Start      End  Sectors Size Type
/dev/sda1   2048    18431    16384   8M PowerPC PReP boot
/dev/sda2  18432 33554654 33536223  16G Linux filesystem

I’m using btrfs and zstd-compress /usr/portage and /usr/src/.

mkfs.btrfs /dev/sda2

Initial setup

It is pretty much the usual.

mount /dev/sda2 /mnt/gentoo
cd /mnt/gentoo
wget https://dev.gentoo.org/~mattst88/ppc-stages/stage3-ppc64le-20180810.tar.xz
tar -xpf stage3-ppc64le-20180810.tar.xz
mount -o bind /dev dev
mount -t devpts devpts dev/pts
mount -t proc proc proc
mount -t sysfs sys sys
cp /etc/resolv.conf etc
chroot .

You just have to emerge grub and gentoo-sources, I diverge from the defconfig by making btrfs builtin.

My /etc/portage/make.conf:

CFLAGS="-O3 -mcpu=power9 -pipe"
# WARNING: Changing your CHOST is not something that should be done lightly.
# Please consult https://wiki.gentoo.org/wiki/Changing_the_CHOST_variable beforee
 changing.
CHOST="powerpc64le-unknown-linux-gnu"

# NOTE: This stage was built with the bindist Use flag enabled
PORTDIR="/usr/portage"
DISTDIR="/usr/portage/distfiles"
PKGDIR="/usr/portage/packages"

USE="ibm altivec vsx"

# This sets the language of build output to English.
# Please keep this setting intact when reporting bugs.
LC_MESSAGES=C
ACCEPT_KEYWORDS=~ppc64

MAKEOPTS="-j4 -l6"
EMERGE_DEFAULT_OPTS="--jobs 10 --load-average 6 "

My minimal set of packages I need before booting:

emerge grub gentoo-sources vim btrfs-progs openssh

NOTE: You want to emerge again openssh and make sure bindist is not in your USE.

Kernel & Bootloader

cd /usr/src/linux
make defconfig
make menuconfig # I want btrfs builtin so I can avoid a initrd
make -j 10 all && make install && make modules_install
grub-install /dev/sda1
grub-mkconfig -o /boot/grub/grub.cfg

NOTE: make sure you pass /dev/sda1 otherwise grub will happily assume OpenFirmware knows about btrfs and just point it to your directory.
That’s not the case unfortunately.

Networking

I’m using netifrc and I’m using the eth0-naming-convention.

touch /etc/udev/rules.d/80-net-name-slot.rules
ln -sf /etc/init.d/net.{lo,eth0}
echo -e "config_eth0=\"${ip}/16\"\nroutes_eth0="default via ${gw}\"\ndns_servers_eth0=\"8.8.8.8\"" > /etc/conf.d/net

Password and SSH

Even if the mticlient is quite nice, you would rather use ssh as much as you could.

passwd 
rc-update add sshd default

Finishing touches

Right now sysvinit does not add the hvc0 console as it should due to a profile quirk, for now check /etc/inittab and in case add:

echo 'hvc0:2345:respawn:/sbin/agetty -L 9600 hvc0' >> /etc/inittab

Add your user and add your ssh key and you are ready to use your new system!

Altivec and VSX in Rust (part 1)

I’m involved in implementing the Altivec and VSX support on rust stdsimd.

Supporting all the instructions in this language is a HUGE endeavor since for each instruction at least 2 tests have to be written and making functions type-generic gets you to the point of having few pages of implementation (that luckily desugars to the single right instruction and nothing else).

Since I’m doing this mainly for my multimedia needs I have a short list of instructions I find handy to get some code written immediately and today I’ll talk a bit about some of them.

This post is inspired by what Luc did for neon, but I’m using rust instead.

If other people find it useful, I’ll try to write down the remaining istructions.

Permutations

Most if not all the SIMD ISAs have at least one or multiple instructions to shuffle vector elements within a vector or among two.

It is quite common to use those instructions to implement matrix transposes, but it isn’t its only use.

In my toolbox I put vec_perm and vec_xxpermdi since even if the portable stdsimd provides some shuffle support it is quite unwieldy compared to the Altivec native offering.

vec_perm: Vector Permute

Since it first iteration Altivec had a quite amazing instruction called vec_perm or vperm:

    fn vec_perm(a: i8x16, b: i8x16, c: i8x16) -> i8x16 {
        let mut d;
        for i in 0..16 {
            let idx = c[i] & 0xf;
            d[i] = if (c[i] & 0x10) == 0 {
                a[idx]
            } else {
                b[idx]
            };
        }
        d
    }

It is important to notice that the displacement map c is a vector and not a constant. That gives you quite a bit of flexibility in a number of situations.

This instruction is the building block you can use to implement a large deal of common patterns, including some that are also covered by stand-alone instructions e.g.:
– packing/unpacking across lanes as long you do not have to saturate: vec_pack, vec_unpackh/vec_unpackl
– interleave/merge two vectors: vec_mergel, vec_mergeh
– shift N bytes in a vector from another: vec_sld

It can be important to recall this since you could always take two permutations and make one, vec_perm itself is pretty fast and replacing two or more instructions with a single permute can get you a pretty neat speed boost.

vec_xxpermdi Vector Permute Doubleword Immediate

Among a good deal of improvements VSX introduced a number of instructions that work on 64bit-elements vectors, among those we have a permute instruction and I found myself using it a lot.

    #[rustc_args_required_const(2)]
    fn vec_xxpermdi(a: i64x2, b: i64x2, c: u8) -> i64x2 {
        match c & 0b11 {
            0b00 => i64x2::new(a[0], b[0]);
            0b01 => i64x2::new(a[1], b[0]);
            0b10 => i64x2::new(a[0], b[1]);
            0b11 => i64x2::new(a[1], b[1]);
        }
    }

This instruction is surely less flexible than the previous permute but it does not require an additional load.

When working on video codecs is quite common to deal with blocks of pixels that go from 4×4 up to 64×64, before vec_xxpermdi the common pattern was:

    #[inline(always)]
    fn store8(dst: &mut [u8x16], v: &[u8x16]) {
        let data = dst[i];
        dst[i] = vec_perm(v, data, TAKE_THE_FIRST_8);
    }

That implies to load the mask as often as needed as long as the destination.

Using vec_xxpermdi avoids the mask load and that usually leads to a quite significative speedup when the actual function is tiny.

Mixed Arithmetics

With mixed arithmetics I consider all the instructions that do in a single step multiple vector arithmetics.

The original altivec has the following operations available for the integer types:
vec_madds
vec_mladd
vec_mradds
vec_msum
vec_msums
vec_sum2s
vec_sum4s
vec_sums

And the following two for the float type:
vec_madd
vec_nmsub

All of them are quite useful and they will all find their way in stdsimd pretty soon.

I’m describing today vec_sums, vec_msums and vec_madds.

They are quite representative and the other instructions are similar in spirit:
vec_madds, vec_mladd and vec_mradds all compute a lane-wise product, take either the high-order or the low-order part of it
and add a third vector returning a vector of the same element size.
vec_sums, vec_sum2s and vec_sum4s all combine an in-vector sum operation with a sum with another vector.
vec_msum and vec_msums both compute a sum of products, the intermediates are added together and then added to a wider-element
vector.

If there is enough interest and time I can extend this post to cover all of them, for today we’ll go with this approximation.

vec_sums: Vector Sum Saturated

Usually SIMD instruction work with two (or 3) vectors and execute the same operation for each vector element.
Sometimes you want to just do operations within the single vector and vec_sums is one of the few instructions that let you do that:

    fn vec_sums(a: i32x4, b: i32x4) -> i32x4 {
        let d = i32x4::new(0, 0, 0, 0);

        d[3] = b[3].saturating_add(a[0]).saturating_add(a[1]).saturating_add(a[2]).saturating_add(a[3]);

        d
    }

It returns in the last element of the vector the sum of the vector elements of a and the last element of b.
It is pretty handy when you need to compute an average or similar operations.

It works only with 32bit signed element vectors.

vec_msums: Vector Multiply Sum Saturated

This instruction sums the 32bit element of the third vector with the two products of the respective 16bit
elements of the first two vectors overlapping the element.

It does quite a bit:

    fn vmsumshs(a: i16x8, b: i16x8, c: i32x4) -> i32x4 {
        let d;
        for i in 0..4 {
            let idx = 2 * i;
            let m0 = a[idx] as i32 * b[idx] as i32;
            let m1 = a[idx + 1] as i32 * b[idx + 1] as i32;
            d[i] = c[i].saturating_add(m0).saturating_add(m1);
        }
        d
    }

    fn vmsumuhs(a: u16x8, b: u16x8, c: u32x4) -> u32x4 {
        let d;
        for i in 0..4 {
            let idx = 2 * i;
            let m0 = a[idx] as u32 * b[idx] as u32;
            let m1 = a[idx + 1] as u32 * b[idx + 1] as u32;
            d[i] = c[i].saturating_add(m0).saturating_add(m1);
        }
        d
    }

    ...

    fn vec_msums<T, U>(a: T, b: T, c: U) -> U
    where T: sealed::VectorMultiplySumSaturate<U> {
        a.msums(b, c)
    }

It works only with 16bit elements, signed or unsigned. In order to support that on rust we have to use some creative trait.
It is quite neat if you have to implement some filters.

vec_madds: Vector Multiply Add Saturated

    fn vec_madds(a: i16x8, b: i16x8, c: i16x8) -> i16x8 {
        let d;
        for i in 0..8 {
            let v = (a[i] as i32 * b[i] as i32) >> 15;
            d[i] = (v as i16).saturating_add(c[i]);
        }
        d
    }

Takes the high-order 17bit of the lane-wise product of the first two vectors and adds it to a third one.

Coming next

Raptor Enginering kindly gave me access to a Power 9 through their Integricloud hosting.

We could run some extensive benchmarks and we found some peculiar behaviour with the C compilers available on the machine and that got me, Luc and Alexandra a little puzzled.

Next time I’ll try to collect in a little more organic way what I randomly put on my twitter as I noticed it.

Video Compression Bounty Hunters

In this post, we (Luca Barbato and Luc Trudeau) joined forces to talk about the awesome work we’ve been doing on Altivec/VSX optimizations for the libvpx library, you can read it here or on Luc’s medium.

Both of us where in Brussels for FOSDEM 2018, Luca presented his work on rust-av and Luc was there to hack on rav1e – an experimental AV1 video encoder in Rust.

Luca joined the rav1e team and helped give hints about how to effectively leverage rust. Together, we worked on AV1 intra prediction code, among the other things.

Luc Trudeau: I was finishing up my work on Chroma from Luma in AV1, and wanted to stay involved in royalty free open source video codecs. When Luca talked to me about libvpx bounties on Bountysource, I was immediately intrigued.

Luca Barbato: Luc just finished implementing the Neon version of his CfL work and I wondered how that code could work using VSX. I prepared some of the machinery that was missing in libaom and Luc tried his hand on Altivec. We still had some pending libvpx work sponsored by IBM and I asked him if he wanted to join in.

What’s libvpx?

For those less familiar, libvpx is the official Google implementation of the VP9 video format. VP9 is most notably used in YouTube and Netflix. VP9 playback is available on some browsers including Chrome, Edge and Firefox and also on Android devices, covering the 75.31% of the global user base.

Ref: caniuse.com VP9 support in browsers.

Why use VP9, when the de facto video format is H.264/AVC?

Because VP9 is royalty free and the bandwidth savings are substantial when compared to H.264 when playback is available (an estimated 3.3B devices support VP9). In other words, having VP9 as a secondary codec can pay for itself in bandwidth savings by not having to send H.264 to most users.

Ref: Netflix VP9 compression analysis.

Why care about libvpx on Power?

Dynamic adaptive streaming formats like HLS and MPEG DASH have completely changed the game of streaming video over the internet. Streaming hardware and custom multimedia servers are being replaced by web servers.

From the servers’ perspective streaming video is akin to serving small videos files; lots of small video files! To cover all clients and most network conditions a considerable amount of video files must be encoded, stored and distributed.

Things are changing fast and while the total cost of ownership of video content for previous generation video formats, like H.264, was mostly made up of bandwidth and hosting, encoding costs are growing with more complex video formats like HEVC and VP9.

This complexity is reported to have grown exponentially with the upcoming AV1 video format. A video format, built on the libVPX code base, by the Alliance for Open Media, of which IBM is a founding member.

Ref: Facebook’s AV1 complexity analysis

At the same time, IBM and its partners in the OpenPower Foundation are releasing some very impressive hardware with the new Power9 processor line up. Big Iron Power9 systems, like the Talos II from Raptor Computing Systems and the collaboration between Google and Rackspace on Zaius/Barreleye servers, are ideal solutions to the tackle the growing complexity of video format encoding.

However, these awesome machines are currently at a disadvantage when encoding video. Without the platform specific optimizations, that their competitors enjoy, the Power9 architecture can’t be fully utilized. This is clearly illustrated in the x264 benchmark released in a recent Phoronix article.

Ref: Phoronix x264 server benchmark.

Thanks to the optimization bounties sponsored by IBM, we are hard at work bridging the gap in libvpx.

Optimization bounties?

Just like bug bounty programs, optimization make for great bounties. Companies that see benefit in platform specific optimizations for video codecs can sponsor our bounties on the Bountysource platform.

Multiple companies can sponsor the same bounty, thus sharing cost of more important bounties. Furthermore, bounties are a minimal risk investment for sponsors, as they are only paid out when the work is completed (and peer reviewed by libvpx maintainers)

Not only is the Bountysource platform a win for companies that directly benefit from the bounties they are sponsoring, it’s also a win for developers (like us) who can get paid to work on free and open source projects that we are passionate about. Optimization bounties are a source of sustainability in the free and open source software ecosystem.

How do you choose bounties?

Since we’re a small team of bounty hunters (Luca Barbato, Alexandra Hájková, Rafael de Lucena Valle and Luc Trudeau), we need to play it smart and maximize the impact of our work. We’ve identified two common use cases related to streaming on the Power architecture: YouTube-like encodes and real time (a.k.a. low latency) encodes.

By profiling libvpx under these conditions, we can determine the key functions to optimize. The following charts show the percentage of time spent the in top 20 functions of the libvpx encoder (without Altivec/VSX optimisations) on a Power8 system, for both YouTube-like and real time settings.

It’s interesting to see that the top 20 functions make up about 80% of the encoding time. That’s similar in spirit to the Pareto principle, in that we don’t have to optimize the whole encoder to make the Power architecture competitive for video encoding.

We see a similar distribution between YouTube-like encoding settings and real time video encoding. In other words, optimization bounties for libvpx benefit both Video on Demand (VOD) and live broadcast services.

We add bounties on the Bountysource platform around common themed functions like: convolution, sum of absolute differences (SAD), variance, etc. Companies interested in libvpx optimization can go and fund these bounties.

What’s the impact of this project so far?

So far, we delivered multiple libvpx bounties including:

  • Convolution
  • Sum of absolute differences (SAD)
  • Quantization
  • Inverse transforms
  • Intra prediction
  • etc.

To see the benefit of our work, we compiled the latest version of libVPX with and without VSX optimizations and ran it on a Power8 machine. Note that the C compiled versions can produce Altivec/VSX code via auto vectorization. The results, in frames per minutes, are shown below for both YouTube-like encoding and Real time encoding.

Our current VSX optimizations give approximately a 40% and 30% boost in encoding speed for YouTube-like and real time encoding respectively. Encoding speed increases in the range of 10 to 14 frames per minute can considerably reduce cloud encoding costs for Power architecture users.

In the context of real time encoding, the time saved by the platform optimization can be put to good use to improve compression efficiency. Concretely, a real time encoder will encode in real time speed, but speeding up the encoders allows for operators to increase the number of coding tools, resulting in better quality for the viewers and bandwidth savings for operators.

What’s next?

We’re energized by the impact that our small team of bounty hunters is having on libvpx performance for the Power architecture and we wanted to share it in this blog post. We look forward to getting even more performance from libvpx on the Power architecture. Expect considerable performance improvement for the Power architecture in the next libvpx release (1.8).

As IBM targets its Power9 line of systems at heavy cloud computations, it seems natural to also aim all that power at tackling the growing costs of AV1 encodes. This won’t happen without platform specific optimizations and the time to start is now; as the AV1 format is being finalized, everyone is still in the early phases of optimization. We are currently working with our sponsors to set up AV1 bounties, so stay tuned for an upcoming post.

Rust-av: Rust and Multimedia

Recently I presented my new project at Fosdem and since I was afraid of not having enough time for the questions I trimmed the content to the bare minimum. This blog post should add some more details.

What is it?

Rust-av aims to be a complete multimedia toolkit written in rust.

Rust is a quite promising language that aims to offer high execution speed while granting a number of warranties on the code behavior that you cannot have in C, C++, Java and so on.

Its zero-cost abstraction feature coupled with the fact that the compiler actively prevents you from committing a large class of mistakes related to memory access seems a perfect match to implement a multimedia toolkit that is easy to use, fast enough and trustworthy.

Why something completely new?

Since rust code can be intermixed with C code, an evolutive approach of replacing little by little small components in a larger project is perfectly feasible, and it is what we are currently trying to do with vlc.

But rust is not just good to write some inner routines so they are fast and correct, its trait system is also quite useful to have a more expressive API.

Most of the multimedia concepts are pretty simple at the high level (e.g frame is just a picture or some sound with some timestamp) with an excruciating amount of quirk and details that require your toolkit to make choices for you or make you face a large amount of complexity.

That leads to API that are either easy but quite inflexible (and opinionated) or API providing all the flexibility, but forcing the user to have to learn a lot of information in order to achieve what the simpler API would let you implement in an handful of lines of code.

I wanted to leverage Rust to make the low level implementations with less bugs and, at the same time, try to provide a better API to use it.

Why now?

Since 2016 I kept bouncing ideas with Kostya and Geoffroy but between my work duties and other projects I couldn’t devote enough time to it. Thanks to the Mozilla Open Source Support initiative that awarded me with enough to develop it full time, now the project has already some components published and more will follow during the next months.

Philosophy

I’m trying to leverage the experience I have from contributing to vlc and libav and keep what is working well and try to not make the same mistakes.

Ease of use

I want that the whole toolkit to be useful to a wide audience. Developers often fight against the library in order to undo what is happening under the hood or end up vendoring some part of it since they need only a tiny subset of all the features that are provided.

Rust makes quite natural split large projects in independent components (called crates) and it is already quite common to have meta-crates re-exporting many smaller crates to provide some uniform access.

The rust-av code, as opposed to the rather monolithic approach taken in Libav, can be reused with the granularity of the bare codec or format:

  • Integrating it in a foreign toolkit won’t require to undo what the common utility code does.
  • Even when using it through the higher level layers, rust-av won’t force the developer to bring in any unrelated dependencies.
  • On the other hand users that enjoy a fully integrated and all-encompassing solution can simply depend on the meta-crates and get the support for everything.

Speed

Multimedia playback boils down to efficiently do complex computation so an arbitrary large amount of data can be rendered within a fraction of second, multimedia real time streaming requires to compress an equally large amount of data in the same time.

Speed in multimedia is important.

Rust provides high level idiomatic constructs that surprisingly lead to pretty decent runtime speed. The stdsimd effort and the seamless C ABI support make easier to leverage the SIMD instructions provided by the recent CPU architectures.

Trustworthy

Traditionally the most effective way to write fast multimedia code had been pairing C and assembly. Sadly the combination makes quite easy to overlook corner cases and have any kind of memory hazards (use-after-free, out of bound reads and writes, NULL-dereferences…).

Rust effectively prevents a good deal of those issues at compile time. Since its abstractions usually do not cause slowdowns it is possible to write code that is, arguably, less misleading and as fast.

Structure

The toolkit is composed of multiple, loosely coupled, crates. They can be grouped by level of abstraction.

Essential

av-data: Used by nearly all the other crates it provides basic data types and a minimal amount of functionality on top of it. It provides the following structs mainly:

  • Frame: it binds together a time reference and a buffer, representing either a video picture or some audio samples.
  • Packet: it bind together a time reference and a buffer, containing compressed data.
  • Value: Simple key value type abstraction, used to pass arbitrary data to the configuration functions.

Core

They provide the basic abstraction (traits) implemented by specific set of components.

  • av-format: It provides a set of traits to implement muxers and demuxers and an utility Context to bridge the normal rust I/O Write and Read traits and the actual muxers and demuxers.
  • av-codec: It provides a set of traits to implement encoders and decoders and an utility Context that wraps.

Utility

They provide building blocks that may be used to implement actual codecs and formats.

  • av-bitstream: Utility crate to write and read bits and bytes
  • av-audio: Audio-specific utilities
  • av-video: Video-specific utilities

Implementation

Actual implementations of codec and format, they can be used directly or through the utility Contexts.

The direct usage is suggested only if you are integrating it in larger frameworks that already implement, possibly in different ways, the integration code provided by the Context (e.g. binding it together with the I/O for the formats or internal queues for the codecs).

Higher-level

They provide higher level Contexts to playback or encode data through a simplified interface:

  • av-player reads bytes though a provided Read and outputs decoded Frames. Under the hood it probes the data, allocates and configures a Demuxer and a Decoder for each stream of interest.
  • av-encoder consumes Frames and outputs encoded and muxed data through a Write output. It automatically setup the encoders and the muxer.

Meta-crates

They ease the use in bulk of everything provided by rust-av.

There are 4 crates providing a list of specific components: av-demuxers, av-muxers, av-decoders and av-encoders; and 2 grouping them by type: av-formats and av-codecs.

Their use is suggested when you’d like to support every format and codec available.

So far

All the development happens on the github organization and so far the initial Core and Essential crates are ready to be used.

There is a nom-based matroska demuxer in working condition and some non-native wrappers providing implementations for some decoders and encoders.

Thanks to est31 we have native vorbis support.

I’m working on a native implementation of opus and soon I’ll move to a video codec.

There is a tiny player called avp and an encoder tool (named ave) will appear once the matroska muxer is complete.

What’s missing in rust-av

API-wise, right now rust-av provides only for simple decode and encoding, muxing and demuxing. There are already enough wrapped codecs to let people play with the library and, hopefully, help in polishing it.

For each crate I’m trying to prepare some easy tasks so people willing to contribute to the project can start from them, all help is welcome!

What’s missing in rust

So far my experience with rust had been quite positive, but there are a number of features that are missing or that could be addressed.

  • SIMD support is shaping up nicely and it is coming soon.
  • The natural fallback, going down to assembly, is available since rust supports the C ABI, inline assembly support on the other hand seems that is still pending some discussion before it reaches stable.
  • Arbitrarily aligned allocation is a MUST in order to support hardware acceleration and SIMD works usually better with aligned buffers.
  • I’d love to have const generics now, luckily associated constants with traits allow some workarounds that let you specialize by constants (and result in neat speedups).
  • I think that focusing a little more on array/slice support would lead to the best gains, since right now there isn’t an equivalent to collect() to fill arrays in an idiomatic way and in multimedia large lookup tables are pretty much a staple.

In closing

Rust and Multimedia seem a really good match, in my experience beside a number of missing features the language seems quite good for the purpose.

Once I have more native implementations complete I will be able to have better means to evaluate the speed difference from writing the same code in C.