rav1e and crav1e – A fast and safe AV1 encoder – Some HowTo

Over the past year I have contributed to rav1e, an AV1 encoder written in Rust.

Here is a small tutorial about what is available right now. There is still lots to do, but I think we could use more user feedback (and possibly also some help).

Setting up

Install the rust toolchain

If you do not have Rust installed, it is quite simple to get a full environment using rustup:

$ curl https://sh.rustup.rs -sSf | sh
# Answer the questions asked and make sure you source the `.profile` file created.
$ source ~/.profile

Install cmake, perl and nasm

rav1e uses libaom for testing, and on x86/x86_64 some components have SIMD variants written directly in nasm.

You may follow the instructions, or just install:
nasm (version 2.13 or better)
perl (any recent perl5)
cmake (any recent version)

Once you have those dependencies installed you are set.

Building rav1e

We use cargo, so the process is straightforward:

## Pull in the customized libaom if you want to run all the tests
$ git submodule update --init

## Build everything
$ cargo build --release

## Test to make sure everything works as intended
$ cargo test --features decode_test --release

## Install rav1e (recent cargo versions want the path spelled out)
$ cargo install --path .

Using rav1e

Right now rav1e has a quite simple interface:

rav1e 0.1.0
AV1 video encoder

USAGE:
    rav1e [OPTIONS] <INPUT> --output <OUTPUT>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -I, --keyint         Keyframe interval [default: 30]
    -l, --limit          Maximum number of frames to encode [default: 0]
        --low_latency    low latency mode. true or false [default: true]
    -o, --output         Compressed AV1 in IVF video output
        --quantizer      Quantizer (0-255) [default: 100]
    -r
    -s, --speed          Speed level (0(slow)-10(fast)) [default: 3]
        --tune           Quality tuning (Will enforce partition sizes >= 8x8) [default: psnr]  [possible
                         values: Psnr, Psychovisual]

ARGS:
    <INPUT>    Uncompressed YUV4MPEG2 video input

It accepts y4m raw source and produces ivf files.

You can configure the encoder by setting the speed and quantizer levels.

The low_latency flag can be turned off to run some additional analysis over a set of frames and have additional quality gains.

Crav1e

While ave and gst-rs will use the rav1e crate directly, there is plenty of software, such as HandBrake or VLC, that would be much happier to consume a C API.

Thanks to the staticlib target and cbindgen it is quite easy to produce a C-ABI library and its matching header.

Setup

crav1e is built using cargo, so nothing special is needed right now besides nasm if you are building it on x86/x86_64.

Build the library

This step is completely straightforward. You can build it as release:

$ cargo build --release

or as debug

$ cargo build

It will produce a target/release/librav1e.a or a target/debug/librav1e.a.
The C header will be in include/rav1e.h.

Try the example code

I provided a quite minimal sample case.

cc -Wall c-examples/simple_encoding.c -Ltarget/release/ -lrav1e -Iinclude/ -o c-examples/simple_encoding
./c-examples/simple_encoding

If it builds and runs correctly you are set.

Manually copy the .a and the .h

Currently cargo install does not work for our purposes, but it will change in the future.

$ cp target/release/librav1e.a /usr/local/lib
$ cp include/rav1e.h /usr/local/include/

Missing pieces

Right now crav1e works well enough, but there are a few shortcomings I’m trying to address.

Shared library support

The cdylib target does exist and produces a nearly usable library, but there are some issues with soname support. I’m trying to address them with upstream, but it might take some time.

Meanwhile some people suggest using patchelf or similar tools to fix the library after the fact.

Install target

cargo is generally awesome, but sadly its support for installing arbitrary files to arbitrary paths is limited. Luckily there are people proposing solutions.

pkg-config file generation

I consider a library not proper if a .pc file is not provided with it.

Right now there are means to extract the information needed to build a pkg-config file, but there isn’t a simple way to do it.

$ cargo rustc -- --print native-static-libs

This provides what is needed for Libs.private. Ideally the file should be created as part of the install step, since you need to know the prefix, libdir and includedir paths.

Coming next

Probably the next blog post will be about my efforts to make cargo able to produce proper cdylib or something quite different.

PS: If somebody feels like helping me with AV1 in Matroska, that would be great 🙂

Contributing to libvpx

Recently I started to write the PowerPC/VSX support for libvpx. Alexandra will help as well.

Every open source project has its own rules. I found the choices taken in libvpx interesting enough to write them down (and possibly help newcomers with some shortcuts).

Overview

Coding style

The coding style is strongly enforced: the CI system will bounce your code if it doesn’t adhere to the style.

This constraint is enforced through a clang-format ruleset.

If you are using vim, this makes your life easier, otherwise the git integration comes in handy.

Otherwise:

# clang-format -i what/I/m/working/on.c

Works no matter what.

Testing

New code should have its testcase, if it isn’t already covered.

Libvpx uses gtest and has quite decent test coverage. A full run of the tests can take a large chunk of time, so if you are working on specific code (e.g. dsp functions), it is easy to run only the tests you care about like this:

# ./test_libvpx --gtest_filter="*pattern*with*globs"

The current system does not double as benchmarking tool, so you are quite on your own if you are trying to speed up some parts.

Adding brand new tests is more annoying than it should be since gtest is quite bloated; updating a test to cover a variant is quite painless though.

Submitting patches

Libvpx uses gerrit and jenkins in a setup that makes them almost painless and has a detailed guide on how to register and fill in some forms to make the Google lawyers happy.

Gerrit and Jenkins defaults are quite clunky, so the libvpx maintainers definitely invested some time to get them in better shape.

Once you have registered and set the hook to tag your commits, sending a set boils down to:

# git push https://chromium-review.googlesource.com/webm/libvpx HEAD:refs/for/master

Interaction

Comments and reports end up in your mailbox, spamming it a lot (expect about 5-6 emails per commit). You have to use the web interface to have decent interaction, and luckily PolyGerrit isn’t terrible at all (just make sure your replies get sent, since it has the tendency of keeping them in draft mode).

TL;DR

  • read this
  • install clang-format, including the git integration
  • be ready to make changes in test/*.cc and cope with gtest verbosity.
  • be ready to receive tons of email and use your web browser

broken-endian

You wrote your code, you wrote the tests and everything seems to be working.

Then somebody runs your code on a big-endian machine and reports that EVERYTHING is broken.

Usually most data is serialized to disk or wire as big-endian, while most CPUs do their computation in little-endian (with MIPS and PowerPC as rare exceptions). If you assume the relationship between the data on the wire and the data in the CPU registers is always the same, you are bound to have problems. It gets even worse if you decide to write the data to disk as little-endian because swapping from CPU to disk feels slow: you are doing it wrong.

Checklist

The problem is mainly while reading or writing:

  • Sometimes it feels simpler to copy over some packed structure using the equivalent of read(fd, &my_struct, sizeof(struct)). If the struct contains anything other than byte-sized variables it won’t work, so it is safe to say it won’t work at all. It gets even worse if you forgot to mark the structure as packed.
  • Writing has the same issue: never try to directly write a structure, or even 16-bit integers, without making sure you get the expected endianness right.
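
A minimal sketch of the safer approach the checklist points at: assemble and split integers one byte at a time, so the code behaves the same on any host. The record layout and every name below are made up for illustration.

```c
#include <stdint.h>

/* Read big-endian integers byte by byte: the result does not depend on the
 * byte order of the machine running the code. */
uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

void write_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)v;
}

void write_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Hypothetical record: in memory the compiler may pad it, on the wire it is
 * exactly 6 bytes (16-bit tag, then 32-bit size, both big-endian). */
struct record {
    uint16_t tag;
    uint32_t size;
};
#define RECORD_WIRE_SIZE 6

/* Serialize field by field instead of writing the struct as-is. */
void record_serialize(uint8_t out[RECORD_WIRE_SIZE], const struct record *r)
{
    write_be16(out, r->tag);
    write_be32(out + 2, r->size);
}

/* Parse field by field instead of reading straight into the struct. */
void record_parse(struct record *r, const uint8_t in[RECORD_WIRE_SIZE])
{
    r->tag  = read_be16(in);
    r->size = read_be32(in + 2);
}
```

The struct never touches the disk directly, so padding and host byte order stop mattering.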

Mini-post written to recall what not to do (more examples later).

Rethinking AVFormat – part 1

Container formats should be just a boring application of serialization of multiple arrays of (timestamp, binary blob) tuples.

Instead there are tons of implementation details, and there are fun and exceedingly annoying means to lose your sanity.

This is yet another post about APIs; you can see others here and here.

Current Status

In Libav we have libavformat taking care of general I/O, Muxing, Demuxing.

This blog post will not cover the additional grouping given by Programs, Chapters and such to not make the whole article huge and just focus on the basics.

I/O

The AVIO abstraction provides a means to uniformly access content stored in files, available as remote streams (e.g. served through http or rtmp) or through custom implementations.

This part of the API is tightly coupled with the Muxer and Demuxer implementation.

It uses the common Context pattern you can find across the rest of Libav, with a few twists:

  • The protocol handler can be guessed using the url provided, e.g. file:///tmp/foo.
  • The functions that allocate a context take an extra parameter besides the usual options AVDictionary, in the form of a callback function.
  • You can create your own custom protocol easily.

int avio_open2(AVIOContext **s, const char *url, int flags, const AVIOInterruptCB *int_cb, AVDictionary **options);

AVIOContext *avio_alloc_context(unsigned char *buffer, int buffer_size, int write_flag, void *opaque,
                                int(*read_packet)(void *opaque, uint8_t *buf, int buf_size),
                                int(*write_packet)(void *opaque, uint8_t *buf, int buf_size),
                                int64_t(*seek)(void *opaque, int64_t offset, int whence));

int avio_closep(AVIOContext **s);
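
To give an idea of what a custom implementation looks like, here is a rough sketch of read and seek callbacks with the same shape as the read_packet and seek parameters above, backed by a plain memory region. The mem_reader struct and the function names are made up for the example; it only uses the C library, and it does not handle any library-specific whence values or error codes.

```c
#include <stdint.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */
#include <string.h>

/* Hypothetical opaque state: a read-only memory region and a cursor. */
struct mem_reader {
    const uint8_t *data;
    size_t size;
    size_t pos;
};

/* Copy up to buf_size bytes and return how many were copied.
 * This sketch returns 0 once the data is exhausted; a real callback would
 * return whatever end-of-file code the library expects. */
int mem_read_packet(void *opaque, uint8_t *buf, int buf_size)
{
    struct mem_reader *r = opaque;
    size_t left = r->size - r->pos;
    size_t n;

    if (buf_size <= 0)
        return 0;
    n = (size_t)buf_size < left ? (size_t)buf_size : left;
    memcpy(buf, r->data + r->pos, n);
    r->pos += n;
    return (int)n;
}

/* Move the cursor and return the new position, or -1 if out of range. */
int64_t mem_seek(void *opaque, int64_t offset, int whence)
{
    struct mem_reader *r = opaque;
    int64_t base, target;

    switch (whence) {
    case SEEK_SET: base = 0;                break;
    case SEEK_CUR: base = (int64_t)r->pos;  break;
    case SEEK_END: base = (int64_t)r->size; break;
    default:       return -1;
    }
    target = base + offset;
    if (target < 0 || target > (int64_t)r->size)
        return -1;
    r->pos = (size_t)target;
    return target;
}
```

A pointer to the mem_reader would be passed as the opaque argument, and the two functions as the read_packet and seek callbacks.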

The API tries to mimic the C stdio, plus lots of API sugar.

# core functions
int avio_read(AVIOContext *s, unsigned char *buf, int size);
void avio_write(AVIOContext *s, const unsigned char *buf, int size);
int64_t avio_seek(AVIOContext *s, int64_t offset, int whence);


# simple integer readers
int          avio_r8  (AVIOContext *s);
uint64_t     avio_rb64(AVIOContext *s);
uint64_t     avio_rl64(AVIOContext *s);
unsigned int avio_rb16(AVIOContext *s);
unsigned int avio_rb24(AVIOContext *s);
unsigned int avio_rb32(AVIOContext *s);
unsigned int avio_rl16(AVIOContext *s);
unsigned int avio_rl24(AVIOContext *s);
unsigned int avio_rl32(AVIOContext *s);

# simple integer writers
void avio_w8(AVIOContext *s, int b);
void avio_wb16(AVIOContext *s, unsigned int val);
void avio_wb24(AVIOContext *s, unsigned int val);
void avio_wb32(AVIOContext *s, unsigned int val);
void avio_wb64(AVIOContext *s, uint64_t val);
void avio_wl16(AVIOContext *s, unsigned int val);
void avio_wl24(AVIOContext *s, unsigned int val);
void avio_wl32(AVIOContext *s, unsigned int val);
void avio_wl64(AVIOContext *s, uint64_t val);


# utf8 and utf16 strings
int avio_get_str(AVIOContext *pb, int maxlen, char *buf, int buflen);

int avio_get_str16le(AVIOContext *pb, int maxlen, char *buf, int buflen);
int avio_get_str16be(AVIOContext *pb, int maxlen, char *buf, int buflen);

int avio_put_str(AVIOContext *s, const char *str);

int avio_put_str16le(AVIOContext *s, const char *str);

... (and more) ...

Buffering

All the functions use an intermediate buffer to back reads and writes. The buffer can be explicitly flushed, or it gets flushed automatically once a request would end outside it.

void avio_flush(AVIOContext *s);

A special kind of AVIOContext is a dynamic write buffer: it extends on demand and can be used to build complex recurring patterns once and write them as many times as needed.

int avio_open_dyn_buf(AVIOContext **s);

int avio_close_dyn_buf(AVIOContext *s, uint8_t **pbuffer);
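
The idea behind a buffer that extends on demand can be sketched in plain C as follows. This is not how Libav implements it; dyn_buf and dyn_buf_write are hypothetical names, and the doubling growth policy is just one common choice.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable write buffer. Zero-initialize it before use. */
struct dyn_buf {
    uint8_t *data;
    size_t size;      /* bytes written so far */
    size_t capacity;  /* bytes allocated */
    int error;        /* set once an allocation fails, never cleared */
};

/* Append len bytes, doubling the allocation until they fit. */
void dyn_buf_write(struct dyn_buf *b, const void *src, size_t len)
{
    if (b->error || len == 0)
        return;
    if (len > b->capacity - b->size) {
        size_t cap = b->capacity ? b->capacity : 64;
        uint8_t *p;

        while (cap - b->size < len) {
            if (cap > SIZE_MAX / 2) {
                b->error = 1;
                return;
            }
            cap *= 2;
        }
        p = realloc(b->data, cap);
        if (!p) {
            b->error = 1;
            return;
        }
        b->data = p;
        b->capacity = cap;
    }
    memcpy(b->data + b->size, src, len);
    b->size += len;
}
```

Once the pattern is built, data and size can be written out as many times as needed, and the caller frees data when done.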

Error handling

An I/O layer has to take into account the fact that the resource being read or written could abruptly disappear or suddenly slow down. This is valid for both local and remote resources.

The internal buffer allocation might fail.

A seek too far could lead to the end of file.

The AVIO approach to errors is quite simplistic:
– A write can silently fail.
– A failing read just returns 0-ed buffer or value.
– All the functions set the error field or the eof_reached field.

It is up to the user to decide when to check for I/O problems, or to leverage the AVIOInterruptCB to implement timeouts or other means to interrupt a read or a write that otherwise would just quietly block till it is completed.
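
The pattern described above (reads that never fail loudly, plus a flag to check later) can be sketched like this. byte_reader, br_r8, br_rb16 and parse_size are hypothetical names for the example, not part of AVIO.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reader with a sticky end-of-file flag. */
struct byte_reader {
    const uint8_t *data;
    size_t size, pos;
    int eof_reached;
};

/* A failing read just returns 0 and records the problem. */
unsigned int br_r8(struct byte_reader *r)
{
    if (r->pos >= r->size) {
        r->eof_reached = 1;
        return 0;
    }
    return r->data[r->pos++];
}

unsigned int br_rb16(struct byte_reader *r)
{
    unsigned int v = br_r8(r) << 8;
    v |= br_r8(r);
    return v;
}

/* Parse a made-up header (16-bit width, 16-bit height) and check the
 * flag once at the end instead of after every single read. */
int parse_size(struct byte_reader *r, unsigned int *w, unsigned int *h)
{
    *w = br_rb16(r);
    *h = br_rb16(r);
    return r->eof_reached ? -1 : 0;
}
```

The parsing code stays compact, at the price of possibly working on zeroed values until the flag is finally checked.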

Demuxing (and Probing)

The AVFormat part taking care of input streams can be split in three: Probing the data to guess the right demuxer, the actual Demuxing and, optionally, parsing the demuxed data to fit it in packets containing the information needed by the decoder to decode a frame of video or a matching amount of audio samples. Later I call this a frame-worth amount of data, and I call the process chopping amorphous data streams. It is a colorful expression, but it represents the endeavor quite well.

Probing

The Probe functions take an arbitrarily big chunk of data (stored in an AVProbeData struct) and figure out which demuxer should be able to actually parse it correctly.

As a rule of thumb probes need to be fast: all of them have to be run over the data at least once, and possibly multiple times, since when the result is not really conclusive, increasing the amount of data and trying again is an option.

AVInputFormat *av_probe_input_format2(AVProbeData *pd, int is_opened,
                                      int *score_max);
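
As an illustration of what a single probe does, here is a sketch that scores a buffer as IVF (the format rav1e writes). It assumes the usual IVF header layout: the DKIF signature, then a little-endian 16-bit version and a 16-bit header length. The function name and the score scale are made up for the example and are not the library's own.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PROBE_SCORE_MAX 100  /* hypothetical scale for this sketch */

/* Return a confidence score: 0 means "surely not IVF". */
int ivf_probe(const uint8_t *buf, size_t size)
{
    unsigned int version, header_len;

    if (size < 8)
        return 0;                     /* not enough data to tell */
    if (memcmp(buf, "DKIF", 4) != 0)
        return 0;                     /* wrong signature */

    version    = buf[4] | (buf[5] << 8);
    header_len = buf[6] | (buf[7] << 8);

    /* Signature plus the expected values: as sure as this probe can be. */
    if (version == 0 && header_len == 32)
        return PROBE_SCORE_MAX;

    /* Signature matches but the rest looks unusual: be less confident. */
    return PROBE_SCORE_MAX / 2;
}
```

It looks at eight bytes and never allocates, which is the kind of cost a probe should have when dozens of them run over the same data.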

A helper function to probe from an AVIOContext and get the possible input format is provided.

int av_probe_input_buffer(AVIOContext *pb, AVInputFormat **fmt,
                          const char *filename, void *logctx,
                          unsigned int offset, unsigned int max_probe_size);

It is used internally by avformat_open_input to automatically figure out the demuxer to use, and it might look a little confusing.

Demuxing

Once the input format is either guessed or selected, the actual demuxing conceptually just provides AVPackets as they are parsed. You might want to reposition within the stream at random times (the infamous seeking, which opens yet another can of worms).

int avformat_open_input(AVFormatContext **ps, const char *filename,
                        AVInputFormat *fmt, AVDictionary **options);

int av_read_frame(AVFormatContext *s, AVPacket *pkt);

void avformat_close_input(AVFormatContext **ps);

Figuring out the data inside the format

Some container formats keep the information regarding their contents in a global header at the start of the file; others, which could have new data streams appearing at random times, do not.

Since there is no easy means to figure out which kind of data they are storing, the only safe way is to try to decode some packets in order to know which kind of data is available. This is what avformat_find_stream_info does.

int avformat_find_stream_info(AVFormatContext *ic, AVDictionary **options);

The apparently simple function does a lot of work behind the scenes: it demuxes and decodes a settable number of packets before giving up and keeps all of them in an internal queue so that they will be available for demuxing even if the input stream is not seekable.

Getting the data outside

Containers such as MPEG PS mux data in small fixed-sized chunks, while usually muxers and decoders expect to receive AVPackets containing enough data to produce a frame.

Specific parsers can be inserted automatically to take an amorphous stream of demuxed data and chop out of it AVPackets containing the expected amount of data.

This happens usually automatically so the user does not have to care about it as long as the codec parser is present.

Timestamps

The multimedia data is expected to carry a timestamp so that video frames and audio frames (and subtitles) can be presented at the same time.

Some containers directly provide such timestamps; others do not, requiring some amount of guesswork through heuristics that might or might not work depending on the codec at hand.

For example, if the container is supposed to not allow variable frame rate, the implicit timestamp for video can be deduced from the frame number. This might not work as expected if the codec uses B-frames and requires some form of reordering.

This part in Libav is sort of hidden and often causes a number of problems.
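
The constant frame rate case boils down to a bit of arithmetic. A minimal sketch, with a made-up rational struct and function name, and assuming positive values small enough not to overflow 64 bits:

```c
#include <stdint.h>

/* Hypothetical rational number, e.g. 30000/1001 frames per second. */
struct rational {
    int num, den;
};

/* Timestamp of frame n, expressed in time_base units, assuming a constant
 * frame rate: seconds = n * den / num, ticks = seconds / time_base. */
int64_t implicit_pts(int64_t n, struct rational frame_rate,
                     struct rational time_base)
{
    return n * frame_rate.den * time_base.den /
           ((int64_t)frame_rate.num * time_base.num);
}
```

With B-frames the packets are stored in decoding order, so counting packets this way yields the decoding order rather than the presentation order, which is where the guesswork starts.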

Seeking

Seeking is quite a different and large can of worms.

Ideally seeking just sets the AVIOContext to a certain position and the demuxer keeps working from there.

int av_seek_frame(AVFormatContext *s, int stream_index,
                  int64_t timestamp, int flags);

Depending on the container format and the codec picking the correct byte offset from the user provided timestamp can be incredibly simple or really complex, with various degrees of precision.

Some formats provide a precise index, so a plain lookup is enough. A dichotomic search looking for the closest I-frame is the common case, and in the worst situation a linear search might be required.

In some cases auxiliary indexes are built to speed up seeking within previously parsed areas.
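
When an index of keyframes is available (provided by the container or built while parsing), the dichotomic search mentioned above could look like this sketch. index_entry and index_search are hypothetical names, and the index is assumed to be sorted by timestamp.

```c
#include <stdint.h>

/* Hypothetical index entry: where a keyframe with a given timestamp starts. */
struct index_entry {
    int64_t timestamp;
    int64_t pos;       /* byte offset in the file */
};

/* Return the position in idx of the last entry whose timestamp is not
 * after target, or -1 if every entry is later than target. */
int index_search(const struct index_entry *idx, int n, int64_t target)
{
    int lo = 0, hi = n;

    /* Invariant: entries before lo are <= target, entries from hi are > target. */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;

        if (idx[mid].timestamp <= target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo - 1;
}
```

The demuxer would then move the I/O position to the pos of the entry found and resume reading packets from there.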

Seeking is not fun at the demuxer level and gets even worse at the codec level if the data provided is not the one expected.

Muxing

Muxing is sort of simpler than demuxing. The output format is always known and the data always comes in AVPackets matching a frame-worth of raw data and possibly sporting correct timestamps.

API-wise it expects an AVFormatContext with the oformat set to the correct AVOutputFormat and the pb set with an allocated AVIOContext and populated AVStreams.

Once the AVFormatContext is configured it is possible to write the packets. First the global header should be written, then as many packets as needed are muxed, interleaving audio and video so that demuxing and seeking work correctly.

int avformat_write_header(AVFormatContext *s, AVDictionary **options);

int av_interleaved_write_frame(AVFormatContext *s, AVPacket *pkt);

int av_write_trailer(AVFormatContext *s);

Bitstream filtering

Some codecs have multiple possible representations, e.g. H264 has the AVCC bitstream format and the Annex B bitstream format. Some containers support both, others expect only one or the other. Currently the correct converter from one bitstream to another must be inserted manually.
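
To show what such a converter does, here is a sketch of the AVCC to Annex B direction for a single packet. It assumes the length prefix is 4 bytes (the size is actually declared in the stream configuration), in which case each prefix can be overwritten in place by the 00 00 00 01 start code. The function name is made up, and the parameter sets stored in the global header are not handled here.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert a packet of length-prefixed NAL units to start-code-prefixed
 * ones, in place. Returns 0 on success, -1 if the packet is malformed
 * (in which case the buffer is left untouched). */
int avcc_to_annexb(uint8_t *buf, size_t size)
{
    size_t pos;

    /* First pass: validate every length before changing anything. */
    for (pos = 0; pos < size;) {
        uint32_t len;

        if (size - pos < 4)
            return -1;
        len = ((uint32_t)buf[pos] << 24) | ((uint32_t)buf[pos + 1] << 16) |
              ((uint32_t)buf[pos + 2] << 8) | (uint32_t)buf[pos + 3];
        if (len > size - pos - 4)
            return -1;
        pos += 4 + (size_t)len;
    }

    /* Second pass: replace each 4-byte length with a start code. */
    for (pos = 0; pos < size;) {
        uint32_t len = ((uint32_t)buf[pos] << 24) | ((uint32_t)buf[pos + 1] << 16) |
                       ((uint32_t)buf[pos + 2] << 8) | (uint32_t)buf[pos + 3];

        buf[pos] = 0;
        buf[pos + 1] = 0;
        buf[pos + 2] = 0;
        buf[pos + 3] = 1;
        pos += 4 + (size_t)len;
    }
    return 0;
}
```

The opposite direction is more work, since the start codes have to be found by scanning the data and the output may need a different size.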

Packet interleaving

Certain container formats have quite peculiar muxing rules. This is normally hidden from the user, but in certain cases being able to override it is a boon.
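
The simplest interleaving rule is "write next whichever packet comes first in time", which requires comparing timestamps expressed in different time bases. A rough sketch with made-up names, assuming positive denominators and values small enough for the products to fit in 64 bits:

```c
#include <stdint.h>

/* Hypothetical time base: one tick lasts num/den seconds. */
struct time_base {
    int num, den;
};

/* Compare two timestamps in different time bases by cross-multiplying.
 * Returns -1, 0 or 1 if a is before, at the same time as, or after b. */
int compare_ts(int64_t ts_a, struct time_base tb_a,
               int64_t ts_b, struct time_base tb_b)
{
    int64_t a = ts_a * tb_a.num * tb_b.den;
    int64_t b = ts_b * tb_b.num * tb_a.den;

    return (a > b) - (a < b);
}

/* Pick which of two pending packets should be written next:
 * 0 for stream A, 1 for stream B. Ties go to stream A. */
int next_stream(int64_t ts_a, struct time_base tb_a,
                int64_t ts_b, struct time_base tb_b)
{
    return compare_ts(ts_a, tb_a, ts_b, tb_b) > 0;
}
```

Containers with peculiar rules deviate from this plain ordering, which is why being able to override it matters.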

Shortcomings summary

In the next post I will explain how I would improve the situation; today’s post is mainly to introduce the structure of AVFormat and start explaining what should be fixed. Here is a short list of what I’d like to fix sooner rather than later.

Non-uniform API

  • There is quite a mixture of av_ and avformat_ namespaces.
  • The muxing and demuxing APIs are sufficiently confusing (and surely I should complete my avformat_open_output to reduce the boilerplate)

Abstractions Leaking the wrong way

  • The demuxing side automagically inserts parsers to chop data streams in a frame-worth amount of data while the muxing side would just fail if the bitstream provided is not matching the one required by the container format.
  • There is quite a bit of hidden magic happening in avformat_find_stream_info, and just recently we added options to at least flush the buffer it keeps to probe for codecs. Having a better function and a better means to control this kind of internal buffer would surely be appreciated by users that need to keep the latency low.
  • There is no good means to be notified if the number of streams changes (new streams found or old streams disappearing).

Bad implementations

  • The old muxers sometimes do not even use the now-available internals (e.g. the interleaver helpers) but implement internally queues and logic that should be now common and shared across all the muxers.
  • While AVCodec has (now) a quite uniform means to slice bytes and bits, avformat is not leveraging it besides a few places.

PS: Kostya prefers to provide both amorphous stream and chopped packets. It makes sense since you might have some codec you cannot parse but you can sort of safely remux if the container is the same.
For the common case I’d rather suggest using a set of functions that always insert parsers when they can, both when demuxing and when muxing, and providing another set of functions to get arbitrary lumps of stream as provided by the container format.

Bridging Markdown to sphinx

One of my annoying itches is documentation.

I like sphinx a lot as a toolchain, but the underlying rst has a quite steep learning curve and it is outright ugly to write in many common situations.

I like kramdown a lot as a syntax, but sadly it is Ruby-only, and the Markdown implementations for Python usually have a good number of shortcomings. The most annoying one is the lack of a full AST for the extensions, which makes proper translations quite a pain (e.g. moin-2 markdown can’t use the extensions supported by the original since they get mangled badly during the process of node matching)…

Enter CommonMark

CommonMark is a cooperative effort to actually build a proper specification of the ubiquitous markdown syntax. The implementations usually provide a full AST and the python one (derived from the javascript one) is quite easy to understand, fast enough and easy to extend.

I know that somebody else already tried to bridge docutils and markdown; sadly parsley is a tad slow for the purpose. I gutted the original markdown parser and wired in commonmark-py. The result is a decently fast implementation that maps most of the core syntax to the docutils AST and thus makes it possible to write in markdown and get it converted using the docutils output formats.

What’s left

ReStructuredText is much richer than the CommonMark core; at the very least I should complete my work to support attributes so the manpage output would work mostly as intended.

The directive system is quite different from the one currently discussed, and that will cause a good deal of headaches when mapping the sphinx extension to document the function parameters.

As usual, help is welcome!

PowerPC is back (and little endian)

Yesterday I fixed my first PowerPC issue in ages. It is an endianness issue and, funnily enough, it is on the little-endian flavour of the architecture.

PowerPC

I have some ties with this architecture: my interest in it (and in Altivec/VMX in particular) is what made me start contributing to MPlayer while fixing issues on Gentoo. From there I went on to hack on the FFmpeg of the time, meet the VLC people, decide to part ways with Michael Niedermayer and, with the other main contributors of FFmpeg, create Libav. That was quite a long time ago.

Big endian, Little Endian

It is a bit surprising that IBM decided to use little endian (since big endian is MUCH nicer for I/O processing such as networking) but they might have their reasons.

PowerPC traditionally had always been both-endian, with the ability to switch on the fly between the two (this made having foreign-endian simulators slightly less annoying to manage), but the main endianness had always been big.

This brings us to a quite interesting problem: some, if not most, of the PowerPC code had been written thinking in big-endian. Luckily, since most of the code was written using C intrinsics (bless whoever made the Altivec intrinsics not as terrible as the other ones around), it won’t be that hard to recycle most of it.

More will follow.

lldb: how to botch the user interface

Recently I had to spend some time developing on MacOSX. Gentoo-Prefix is sadly getting less and less useful until we make clang a first class citizen (people proposing a GSoC for it are welcome!), so I’m forced to use what’s provided by Xcode. All in all I do like most of the new toolchain a lot: clang instead of an ancient gcc-4.2, a brand new ld replacing a stale binutils. Only lldb is not good.

Clang

clang is wonderful for developing: it is arguably fast at building and the generated code isn’t that bad, except when you are using asan and it miscompiles… (reported to the asan developers, they will have a look; gcc-asan works as expected).

The warning reporting is probably one of the features I miss the most in other compilers; that’s why I added it to cparser and I’m looking forward to moving to gcc-4.9.

All in all clang developers increased the usability of the compiler and made the other projects improve as well, competition in opensource does work.

ld

The linker is again different from the usual binutils, normally you do not notice it but with the new xcode you have to face it since some projects will have problems finding symbols. Again the reporting is quite good, not stellar as clang’s but when the missing symbols are C++ it does a better job than stock binutils in telling you what’s missing from where.

lldb

The new debugger probably isn’t really ready for the prime time. gdb gets its share of complaints about some of its quirks (the macros system is quite minimal and the python interface is good, but not documented as it should), but it is really effective and fast to use.

lldb is not. Almost every command that is a single statement in gdb, and can be shortened to a single letter, takes two statements in lldb, usually with a compulsory option.

Setting breakpoints, watchers, moving through frames; everything gets more cumbersome to use.

The reporting is a little more confusing and the error messages can be misleading. And since you might use the tool while under pressure (e.g. there is a last second bug found before a main release), you want to be as quick as possible.

While debugging some VDA hwaccel improvements for libav I got to spend quite a bit of time tracking why a pointer gets nulled.

The watchpoint I set to figure it out triggered at random times in the innards of the OS X memory management, and I couldn’t actually see when or how the pointer got nulled.

I ended up writing a dummy hwaccel accessing the same fields on Linux, ran it through gdb and discovered the actual problem in … 10 minutes, code and reboots included.

I do hope we’ll see a better interface for lldb and further improvements on gdb (and hopefully combinations such as clang + gdb and gcc + lldb will work better).