Using rav1e – from your code

AV1, Rav1e, Crav1e, an intro

(this article is also available on my dev.to profile, I might use it more often since wordpress is pretty horrible at managing markdown.)

AV1 is a modern video codec brought to you by an alliance of many different bigger and smaller players in the multimedia field.
I’m part of the VideoLAN organization and I spent quite a bit of time on this codec lately.

rav1e: The safest and fastest AV1 encoder, built by many volunteers and Mozilla/Xiph developers.
It is written in Rust and strives to provide good speed and quality while staying maintainable.

crav1e: A companion crate, written by yours truly, that provides a C-API, so the encoder can be used by C libraries and programs.

This article gives a quick overview of the API available right now; it is mainly meant to help people start using it and, hopefully, report issues and problems.

Rav1e API

The current API is built around the following 4 structs and 1 enum:

struct Frame: The raw pixel data
struct Packet: The encoded bitstream
struct Config: The encoder configuration
struct Context: The encoder state

enum EncoderStatus: Fatal and non-fatal conditions returned by the Context methods.

Config

The Config struct is currently simple to construct.

    struct Config {
        enc: EncoderConfig,
        threads: usize,
    }

The EncoderConfig stores all the settings that have an impact on the actual bitstream, while settings such as threads are kept outside.

    let mut enc = EncoderConfig::with_speed_preset(speed);
    enc.width = w;
    enc.height = h;
    enc.bit_depth = 8;
    let cfg = Config { enc, threads: 0 };

NOTE: Some of the fields above may be shuffled around until the API is marked as stable.

Methods

new_context

    let cfg = Config { enc, threads: 0 };
    let ctx: Context<u8> = cfg.new_context();

It produces a new encoding context. When bit_depth is 8, it is possible to use an optimized u8 codepath; otherwise u16 must be used.
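
The u8/u16 split can be pictured with a small generic sketch. The `Sample` trait and both helper functions below are hypothetical names made up for illustration; they are not part of rav1e, which has its own `Pixel` trait.

```rust
/// Hypothetical stand-in for a pixel trait; this is not rav1e's own `Pixel`.
trait Sample: Copy + Default {}
impl Sample for u8 {}
impl Sample for u16 {}

/// Allocate one zeroed luma plane of type `T` and report its size in bytes.
fn luma_bytes<T: Sample>(width: usize, height: usize) -> usize {
    let plane: Vec<T> = vec![T::default(); width * height];
    plane.len() * std::mem::size_of::<T>()
}

/// Pick the sample type from the configured bit depth:
/// 8-bit content takes the `u8` path, anything deeper takes `u16`.
fn luma_bytes_for_depth(bit_depth: usize, width: usize, height: usize) -> usize {
    if bit_depth == 8 {
        luma_bytes::<u8>(width, height)
    } else {
        luma_bytes::<u16>(width, height)
    }
}
```

For a 64x96 frame this gives 6144 bytes of luma at 8 bits and 12288 bytes at 10 bits, which is one reason the u8 path is worth keeping separate.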

Context

It is produced by Config::new_context; its implementation details are hidden.

Methods

The Context methods can be grouped into essential, optional and convenience.

    // Essential API
    pub fn send_frame<F>(&mut self, frame: F) -> Result<(), EncoderStatus>
      where F: Into<Option<Arc<Frame<T>>>>, T: Pixel;
    pub fn receive_packet(&mut self) -> Result<Packet<T>, EncoderStatus>;

The encoder works by processing each Frame fed through send_frame and producing each Packet that can be retrieved by receive_packet.

    // Optional API
    pub fn container_sequence_header(&mut self) -> Vec<u8>;
    pub fn get_first_pass_data(&self) -> &FirstPassData;

Depending on the container format, the AV1 Sequence Header could be stored in the extradata. container_sequence_header produces the data pre-formatted to be simply stored in mkv or mp4.

rav1e supports multi-pass encoding and the encoding data from the first pass can be retrieved by calling get_first_pass_data.

    // Convenience shortcuts
    pub fn new_frame(&self) -> Arc<Frame<T>>;
    pub fn set_limit(&mut self, limit: u64);
    pub fn flush(&mut self);

new_frame() produces a frame according to the dimension and pixel format information in the Context.
flush() is functionally equivalent to calling send_frame(None).
set_limit() is functionally equivalent to calling flush() once limit frames are sent to the encoder.

Workflow

The workflow is the following:

Setup:
- Prepare a Config
- Call new_context from the Config to produce a Context
Encode loop:
- Pull each Packet using receive_packet.
- If receive_packet returns EncoderStatus::NeedMoreData
  - Feed each Frame to the Context using send_frame
Complete the encoding:
- Issue a flush() to encode each pending Frame in a final Packet.
- Call receive_packet until EncoderStatus::LimitReached is returned.
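
The steps above can be sketched with a toy model of the send/receive loop. Everything here (`ToyContext`, `ToyStatus`, `encode_all`) is a made-up stand-in that only mimics the control flow; it does not call rav1e, and a "frame" is just a number that comes back unchanged as a "packet".

```rust
use std::collections::VecDeque;

/// Toy stand-in for two of the statuses named in the article.
#[derive(Debug, PartialEq)]
enum ToyStatus {
    NeedMoreData,
    LimitReached,
}

/// Toy encoder: queued frames come back, in order, as packets.
struct ToyContext {
    queue: VecDeque<u32>,
    flushed: bool,
}

impl ToyContext {
    fn new() -> Self {
        ToyContext { queue: VecDeque::new(), flushed: false }
    }

    /// Passing `None` marks the end of the input, like `flush()`.
    fn send_frame(&mut self, frame: Option<u32>) {
        match frame {
            Some(f) => self.queue.push_back(f),
            None => self.flushed = true,
        }
    }

    fn receive_packet(&mut self) -> Result<u32, ToyStatus> {
        match self.queue.pop_front() {
            Some(f) => Ok(f),
            None if self.flushed => Err(ToyStatus::LimitReached),
            None => Err(ToyStatus::NeedMoreData),
        }
    }
}

/// Pull packets, feed a frame whenever the encoder asks for data,
/// flush when the input runs out, and stop at `LimitReached`.
fn encode_all(frames: &[u32]) -> Vec<u32> {
    let mut ctx = ToyContext::new();
    let mut input = frames.iter().copied();
    let mut packets = Vec::new();
    loop {
        match ctx.receive_packet() {
            Ok(packet) => packets.push(packet),
            // `input.next()` yields `None` once the input is exhausted,
            // which doubles as the flush.
            Err(ToyStatus::NeedMoreData) => ctx.send_frame(input.next()),
            Err(ToyStatus::LimitReached) => break,
        }
    }
    packets
}
```

The point of the pull-first shape is that the caller never has to guess how many frames the encoder wants to buffer: it only feeds input when asked.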

Crav1e API

The crav1e API provides the same structures and features, besides a few key differences:

The Frame, Config, and Context structs are opaque.

typedef struct RaConfig RaConfig;
typedef struct RaContext RaContext;
typedef struct RaFrame RaFrame;

The Packet struct exposed is much simpler than the rav1e original.

typedef struct {
    const uint8_t *data;
    size_t len;
    uint64_t number;
    RaFrameType frame_type;
} RaPacket;

The EncoderStatus includes a Success condition.

typedef enum {
    RA_ENCODER_STATUS_SUCCESS = 0,
    RA_ENCODER_STATUS_NEED_MORE_DATA,
    RA_ENCODER_STATUS_ENOUGH_DATA,
    RA_ENCODER_STATUS_LIMIT_REACHED,
    RA_ENCODER_STATUS_FAILURE = -1,
} RaEncoderStatus;
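
The extra Success value exists because C has no equivalent of Rust's Result. A minimal sketch of how a Result could be flattened into the integer codes of the enum above follows; `Status` and `to_c_status` are illustrative names, not crav1e's actual implementation. The numeric values follow from C's enum rules: the entries after SUCCESS = 0 count up from 1, and FAILURE is explicitly -1.

```rust
/// Illustrative mirror of the non-success conditions listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    NeedMoreData,
    EnoughData,
    LimitReached,
    Failure,
}

/// Flatten a `Result` into the integer codes of the C enum:
/// 0 for success, positive for non-fatal conditions, negative for failure.
fn to_c_status(result: Result<(), Status>) -> i32 {
    match result {
        Ok(()) => 0,
        Err(Status::NeedMoreData) => 1,
        Err(Status::EnoughData) => 2,
        Err(Status::LimitReached) => 3,
        Err(Status::Failure) => -1,
    }
}
```

This sign convention lets C callers test `ret < 0` for fatal errors and `ret > 0` for conditions they can react to.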

RaConfig

Since the configuration is opaque there are a number of functions to assemble it:

rav1e_config_default allocates a default configuration.
rav1e_config_parse and rav1e_config_parse_int set a specific value for a specific field selected by a key string.
rav1e_config_set_${field} are specialized setters for complex information such as the color description.

RaConfig *rav1e_config_default(void);

/**
 * Set a configuration parameter using its key and value as string.
 * Available keys and values
 * - "quantizer": 0-255, default 100
 * - "speed": 0-10, default 3
 * - "tune": "psnr"-"psychovisual", default "psnr"
 * Return a negative value on error or 0.
 */
int rav1e_config_parse(RaConfig *cfg, const char *key, const char *value);

/**
 * Set a configuration parameter using its key and value as integer.
 * Available keys and values are the same as rav1e_config_parse()
 * Return a negative value on error or 0.
 */
int rav1e_config_parse_int(RaConfig *cfg, const char *key, int value);

/**
 * Set color properties of the stream.
 * Supported values are defined by the enum types
 * RaMatrixCoefficients, RaColorPrimaries, and RaTransferCharacteristics
 * respectively.
 * Return a negative value on error or 0.
 */
int rav1e_config_set_color_description(RaConfig *cfg,
                                       RaMatrixCoefficients matrix,
                                       RaColorPrimaries primaries,
                                       RaTransferCharacteristics transfer);

/**
 * Set the content light level information for HDR10 streams.
 * Return a negative value on error or 0.
 */
int rav1e_config_set_content_light(RaConfig *cfg,
                                   uint16_t max_content_light_level,
                                   uint16_t max_frame_average_light_level);

/**
 * Set the mastering display information for HDR10 streams.
 * primaries and white_point arguments are RaPoint, containing 0.16 fixed point values.
 * max_luminance is a 24.8 fixed point value.
 * min_luminance is a 18.14 fixed point value.
 * Returns a negative value on error or 0.
 */
int rav1e_config_set_mastering_display(RaConfig *cfg,
                                       RaPoint primaries[3],
                                       RaPoint white_point,
                                       uint32_t max_luminance,
                                       uint32_t min_luminance);

void rav1e_config_unref(RaConfig *cfg);
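
The fixed point formats mentioned for the mastering display can be produced from ordinary floating point values by scaling with 2 to the power of the fractional bit count. The helpers below are a sketch with made-up names; the 16-bit storage for the 0.16 values and the sample numbers used in the note that follows are assumptions for illustration.

```rust
/// Convert a non-negative real value to unsigned fixed point with
/// `frac_bits` fractional bits, rounding to the nearest step.
fn to_fixed(value: f64, frac_bits: u32) -> u32 {
    (value * f64::from(1u32 << frac_bits)).round() as u32
}

/// 0.16 format: all 16 bits are fractional, so 1.0 itself is not
/// representable and the result is clamped to the largest code.
fn chromaticity_0_16(value: f64) -> u16 {
    to_fixed(value, 16).min(u32::from(u16::MAX)) as u16
}

/// 24.8 format, used for the maximum luminance.
fn max_luminance_24_8(value: f64) -> u32 {
    to_fixed(value, 8)
}

/// 18.14 format, used for the minimum luminance.
fn min_luminance_18_14(value: f64) -> u32 {
    to_fixed(value, 14)
}
```

For example, the D65 white point (0.3127, 0.3290) becomes (20493, 21561), a maximum luminance of 1000 becomes 256000, and a minimum luminance of 0.005 becomes 82.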

The bare minimum setup code is the following:

    int ret = -1;
    RaConfig *rac = rav1e_config_default();
    if (!rac) {
        printf("Unable to initialize\n");
        goto clean;
    }

    ret = rav1e_config_parse_int(rac, "width", 64);
    if (ret < 0) {
        printf("Unable to configure width\n");
        goto clean;
    }

    ret = rav1e_config_parse_int(rac, "height", 96);
    if (ret < 0) {
        printf("Unable to configure height\n");
        goto clean;
    }

    ret = rav1e_config_parse_int(rac, "speed", 9);
    if (ret < 0) {
        printf("Unable to configure speed\n");
        goto clean;
    }

RaContext

As per the rav1e API, the context structure is produced from a configuration and the same send-receive model is used.
The convenience methods aren’t exposed and the frame allocation function is actually essential.

// Essential API
RaContext *rav1e_context_new(const RaConfig *cfg);
void rav1e_context_unref(RaContext *ctx);

RaEncoderStatus rav1e_send_frame(RaContext *ctx, const RaFrame *frame);
RaEncoderStatus rav1e_receive_packet(RaContext *ctx, RaPacket **pkt);

// Optional API
uint8_t *rav1e_container_sequence_header(RaContext *ctx, size_t *buf_size);
void rav1e_container_sequence_header_unref(uint8_t *sequence);

RaFrame

Since the frame structure is opaque in C, we have the following functions to create, fill and dispose of the frames.

RaFrame *rav1e_frame_new(const RaContext *ctx);
void rav1e_frame_unref(RaFrame *frame);

/**
 * Fill a frame plane
 * Currently the frame contains 3 planes, the first is luminance followed by
 * chrominance.
 * The data is copied and this function has to be called for each plane.
 * frame: A frame provided by rav1e_frame_new()
 * plane: The index of the plane starting from 0
 * data: The data to be copied
 * data_len: Length of the buffer
 * stride: Plane line in bytes, including padding
 * bytewidth: Number of bytes per component, either 1 or 2
 */
void rav1e_frame_fill_plane(RaFrame *frame,
                            int plane,
                            const uint8_t *data,
                            size_t data_len,
                            ptrdiff_t stride,
                            int bytewidth);
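
To get the data_len and stride arguments right, it helps to work out the layout of each plane first. The sketch below assumes 4:2:0 chroma subsampling and tightly packed rows (no padding); `PlaneLayout` and `planes_420` are hypothetical helpers, not part of either API.

```rust
/// Byte layout of one tightly packed plane (no padding between rows).
#[derive(Debug, PartialEq)]
struct PlaneLayout {
    stride: usize,   // bytes per row
    data_len: usize, // bytes in the whole plane
}

/// Layout of the three planes of a 4:2:0 frame: luma first, then the
/// two chroma planes. `bytewidth` is 1 for 8-bit samples, 2 otherwise.
fn planes_420(width: usize, height: usize, bytewidth: usize) -> [PlaneLayout; 3] {
    // Chroma is halved in both directions, rounding up for odd sizes.
    let chroma_w = (width + 1) / 2;
    let chroma_h = (height + 1) / 2;
    let plane = |w: usize, h: usize| PlaneLayout {
        stride: w * bytewidth,
        data_len: w * bytewidth * h,
    };
    [
        plane(width, height),
        plane(chroma_w, chroma_h),
        plane(chroma_w, chroma_h),
    ]
}
```

For the 64x96, 8-bit setup used earlier, the luma plane has a stride of 64 and 6144 bytes, and each chroma plane has a stride of 32 and 1536 bytes.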

RaEncoderStatus

The encoder status enum is returned by rav1e_send_frame and rav1e_receive_packet, and it is also possible to query the context for its last status at any time.

RaEncoderStatus rav1e_last_status(const RaContext *ctx);

To simulate the Rust Debug functionality, a to_str function is provided.

char *rav1e_status_to_str(RaEncoderStatus status);

Workflow

The C API workflow is similar to the Rust one, albeit a little more verbose due to the error checking.

    RaContext *rax = rav1e_context_new(rac);
    if (!rax) {
        printf("Unable to allocate a new context\n");
        goto clean;
    }

    RaFrame *f = rav1e_frame_new(rax);
    if (!f) {
        printf("Unable to allocate a new frame\n");
        goto clean;
    }

    while (keep_going(i)) {
        RaPacket *p;
        ret = rav1e_receive_packet(rax, &p);
        if (ret < 0) {
            printf("Unable to receive packet %d\n", i);
            goto clean;
        } else if (ret == RA_ENCODER_STATUS_SUCCESS) {
            printf("Packet %"PRIu64"\n", p->number);
            do_something_with(p);
            rav1e_packet_unref(p);
            i++;
        } else if (ret == RA_ENCODER_STATUS_NEED_MORE_DATA) {
            RaFrame *f = get_frame_by_some_mean(rax);
            ret = rav1e_send_frame(rax, f);
            if (ret < 0) {
                printf("Unable to send frame %d\n", i);
                goto clean;
            } else if (ret > 0) {
                // Cannot happen in normal conditions
                printf("Unable to append frame %d to the internal queue\n", i);
                abort();
            }
        } else if (ret == RA_ENCODER_STATUS_LIMIT_REACHED) {
            printf("Limit reached\n");
            break;
        }
    }

In closing

This article was mainly a good excuse to try dev.to, write down some notes, and clarify my ideas on what had been done API-wise so far and what I should change and improve.

If you managed to read this far, your feedback is really welcome: please feel free to comment, try the software, and open issues for crav1e and rav1e.

Coming next

Working on crav1e got me to see what’s good and what is lacking in the C-interoperability story of Rust. Now that this has landed, I can start crafting and publishing better tools for it, and maybe I’ll talk more about it here.
rav1e will soon get more threading-oriented features, and some benchmarking experiments will follow.

Thanks

Special thanks to Derek and Vittorio, who spent lots of time integrating crav1e in larger software and gave precious feedback on what was missing and broken in the initial iterations.
Thanks to David for the review and editorial work.
Also thanks to Matteo for introducing me to dev.to.

Video Compression Bounty Hunters

In this post, we (Luca Barbato and Luc Trudeau) joined forces to talk about the awesome work we’ve been doing on Altivec/VSX optimizations for the libvpx library. You can read it here or on Luc’s Medium.

Both of us were in Brussels for FOSDEM 2018: Luca presented his work on rust-av and Luc was there to hack on rav1e, an experimental AV1 video encoder in Rust.

Luca joined the rav1e team and helped give hints about how to effectively leverage Rust. Together, we worked on AV1 intra prediction code, among other things.

Luc Trudeau: I was finishing up my work on Chroma from Luma in AV1, and wanted to stay involved in royalty free open source video codecs. When Luca talked to me about libvpx bounties on Bountysource, I was immediately intrigued.

Luca Barbato: Luc had just finished implementing the Neon version of his CfL work and I wondered how that code could work using VSX. I prepared some of the machinery that was missing in libaom and Luc tried his hand at Altivec. We still had some pending libvpx work sponsored by IBM and I asked him if he wanted to join in.

What’s libvpx?

For those less familiar, libvpx is the official Google implementation of the VP9 video format. VP9 is most notably used by YouTube and Netflix. VP9 playback is available in some browsers, including Chrome, Edge and Firefox, and also on Android devices, covering 75.31% of the global user base.

Ref: caniuse.com VP9 support in browsers.

Why use VP9, when the de facto video format is H.264/AVC?

Because VP9 is royalty free and the bandwidth savings are substantial when compared to H.264 when playback is available (an estimated 3.3B devices support VP9). In other words, having VP9 as a secondary codec can pay for itself in bandwidth savings by not having to send H.264 to most users.

Ref: Netflix VP9 compression analysis.

Why care about libvpx on Power?

Dynamic adaptive streaming formats like HLS and MPEG DASH have completely changed the game of streaming video over the internet. Streaming hardware and custom multimedia servers are being replaced by web servers.

From the servers’ perspective, streaming video is akin to serving small video files; lots of small video files! To cover all clients and most network conditions, a considerable number of video files must be encoded, stored and distributed.

Things are changing fast and while the total cost of ownership of video content for previous generation video formats, like H.264, was mostly made up of bandwidth and hosting, encoding costs are growing with more complex video formats like HEVC and VP9.

This complexity is reported to have grown exponentially with the upcoming AV1 video format, which is built on the libvpx code base by the Alliance for Open Media, of which IBM is a founding member.

Ref: Facebook’s AV1 complexity analysis

At the same time, IBM and its partners in the OpenPower Foundation are releasing some very impressive hardware with the new Power9 processor lineup. Big Iron Power9 systems, like the Talos II from Raptor Computing Systems and the collaboration between Google and Rackspace on Zaius/Barreleye servers, are ideal solutions to tackle the growing complexity of video format encoding.

However, these awesome machines are currently at a disadvantage when encoding video. Without the platform-specific optimizations that their competitors enjoy, the Power9 architecture can’t be fully utilized. This is clearly illustrated in the x264 benchmark released in a recent Phoronix article.

Ref: Phoronix x264 server benchmark.

Thanks to the optimization bounties sponsored by IBM, we are hard at work bridging the gap in libvpx.

Optimization bounties?

Just like bug bounty programs, optimizations make for great bounties. Companies that see benefit in platform-specific optimizations for video codecs can sponsor our bounties on the Bountysource platform.

Multiple companies can sponsor the same bounty, thus sharing the cost of more important bounties. Furthermore, bounties are a minimal risk investment for sponsors, as they are only paid out when the work is completed (and peer reviewed by libvpx maintainers).

Not only is the Bountysource platform a win for companies that directly benefit from the bounties they are sponsoring, it’s also a win for developers (like us) who can get paid to work on free and open source projects that we are passionate about. Optimization bounties are a source of sustainability in the free and open source software ecosystem.

How do you choose bounties?

Since we’re a small team of bounty hunters (Luca Barbato, Alexandra Hájková, Rafael de Lucena Valle and Luc Trudeau), we need to play it smart and maximize the impact of our work. We’ve identified two common use cases related to streaming on the Power architecture: YouTube-like encodes and real time (a.k.a. low latency) encodes.

By profiling libvpx under these conditions, we can determine the key functions to optimize. The following charts show the percentage of time spent in the top 20 functions of the libvpx encoder (without Altivec/VSX optimisations) on a Power8 system, for both YouTube-like and real time settings.

It’s interesting to see that the top 20 functions make up about 80% of the encoding time. That’s similar in spirit to the Pareto principle, in that we don’t have to optimize the whole encoder to make the Power architecture competitive for video encoding.
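
Amdahl's law gives a feel for what that 80% means. The sketch below is a generic illustration, not a measurement from our profiles; the function name is ours.

```rust
/// Overall speedup when a fraction of the run time is accelerated by
/// `kernel_speedup` and the rest is left untouched (Amdahl's law).
fn overall_speedup(optimized_fraction: f64, kernel_speedup: f64) -> f64 {
    1.0 / ((1.0 - optimized_fraction) + optimized_fraction / kernel_speedup)
}
```

With 80% of the time in optimized functions, doubling their speed makes the whole encoder about 1.67 times faster, and the ceiling is 5 times however fast those functions become.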

We see a similar distribution between YouTube-like encoding settings and real time video encoding. In other words, optimization bounties for libvpx benefit both Video on Demand (VOD) and live broadcast services.

We add bounties on the Bountysource platform around common themed functions like: convolution, sum of absolute differences (SAD), variance, etc. Companies interested in libvpx optimization can go and fund these bounties.
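
For readers who have not met these kernels, here is what SAD computes, written as a plain scalar sketch. This is not libvpx code; the function name and signature are made up for illustration. The real kernels do the same arithmetic on fixed block sizes with vector instructions.

```rust
/// Sum of absolute differences between two `width` x `height` blocks of
/// 8-bit samples. Each block lives inside a larger buffer, so rows are
/// `src_stride` and `ref_stride` samples apart.
fn sad(
    src: &[u8],
    src_stride: usize,
    reference: &[u8],
    ref_stride: usize,
    width: usize,
    height: usize,
) -> u32 {
    let mut total = 0u32;
    for row in 0..height {
        let s = &src[row * src_stride..row * src_stride + width];
        let r = &reference[row * ref_stride..row * ref_stride + width];
        // Widen each difference to u32 before accumulating.
        total += s
            .iter()
            .zip(r)
            .map(|(&a, &b)| u32::from(a.abs_diff(b)))
            .sum::<u32>();
    }
    total
}
```

The encoder evaluates this for a very large number of candidate blocks during motion search, which is why it shows up near the top of the profiles.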

What’s the impact of this project so far?

So far, we delivered multiple libvpx bounties including:

Convolution
Sum of absolute differences (SAD)
Quantization
Inverse transforms
Intra prediction
etc.

To see the benefit of our work, we compiled the latest version of libvpx with and without VSX optimizations and ran it on a Power8 machine. Note that the C compiled versions can produce Altivec/VSX code via auto vectorization. The results, in frames per minute, are shown below for both YouTube-like encoding and real time encoding.

Our current VSX optimizations give approximately a 40% and 30% boost in encoding speed for YouTube-like and real time encoding respectively. Encoding speed increases in the range of 10 to 14 frames per minute can considerably reduce cloud encoding costs for Power architecture users.
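
To translate a speed boost into machine time: encoding the same content p percent faster takes 1/(1 + p/100) of the original time. A small sketch of that arithmetic, with a helper name of our own choosing:

```rust
/// Percentage of encode time saved when the encoder runs
/// `speed_boost_percent` percent faster on the same input.
fn time_saved_percent(speed_boost_percent: f64) -> f64 {
    let factor = 1.0 + speed_boost_percent / 100.0;
    (1.0 - 1.0 / factor) * 100.0
}
```

A 40% boost therefore trims roughly 29% of the encode time, and a 30% boost roughly 23%, assuming cost scales linearly with machine time.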

In the context of real time encoding, the time saved by the platform optimization can be put to good use to improve compression efficiency. Concretely, a real time encoder always has to keep up with real time speed, but a faster encoder lets operators enable more coding tools within the same time budget, resulting in better quality for the viewers and bandwidth savings for operators.

What’s next?

We’re energized by the impact that our small team of bounty hunters is having on libvpx performance for the Power architecture and we wanted to share it in this blog post. We look forward to getting even more performance from libvpx on the Power architecture. Expect considerable performance improvement for the Power architecture in the next libvpx release (1.8).

As IBM targets its Power9 line of systems at heavy cloud computations, it seems natural to also aim all that power at tackling the growing costs of AV1 encodes. This won’t happen without platform specific optimizations and the time to start is now; as the AV1 format is being finalized, everyone is still in the early phases of optimization. We are currently working with our sponsors to set up AV1 bounties, so stay tuned for an upcoming post.