rav1e and crav1e – A fast and safe AV1 encoder – Some HowTo

Over the year I contributed to an AV1 encoder written in rust.

Here a small tutorial about what is available right now, there is still lots to do, but I think we could enjoy more user-feedback (and possibly also some help).

Setting up

Install the rust toolchain

If you do not have rust installed, it is quite simple to get a full environment using rustup

$ curl https://sh.rustup.rs -sSf | sh
# Answer the questions asked and make sure you source the `.profile` file created.
$ source ~/.profile

Install cmake, perl and nasm

rav1e uses libaom for testing and and on x86/x86_64 some components have SIMD variants written directly using nasm.

You may follow the instructions, or just install:
nasm (version 2.13 or better)
perl (any recent perl5)
cmake (any recent version)

Once you have those dependencies in you are set.

Building rav1e

We use cargo, so the process is straightforward:

## Pull in the customized libaom if you want to run all the tests
$ git submodule update --init

## Build everything
$ cargo build --release

## Test to make sure everything works as intended
$ cargo test --features decode_test --release

## Install rav1e
$ cargo install

Using rav1e

Right now rav1e has a quite simple interface:

rav1e 0.1.0
AV1 video encoder

USAGE:
    rav1e [OPTIONS]  --output 

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -I, --keyint     Keyframe interval [default: 30]
    -l, --limit                  Maximum number of frames to encode [default: 0]
        --low_latency      low latency mode. true or false [default: true]
    -o, --output                Compressed AV1 in IVF video output
        --quantizer                 Quantizer (0-255) [default: 100]
    -r 
    -s, --speed                  Speed level (0(slow)-10(fast)) [default: 3]
        --tune                    Quality tuning (Will enforce partition sizes >= 8x8) [default: psnr]  [possible
                                        values: Psnr, Psychovisual]

ARGS:
        Uncompressed YUV4MPEG2 video input

It accepts y4m raw source and produces ivf files.

You can configure the encoder by setting the speed and quantizer levels.

The low_latency flag can be turned off to run some additional analysis over a set of frames and have additional quality gains.

Crav1e

While ave and gst-rs will use the rav1e crate directly, there are a number of software such as handbrake or vlc that would be much happier to consume a C API.

Thanks to the staticlib target and cbindgen is quite easy to produce a C-ABI library and its matching header.

Setup

crav1e is built using cargo, so nothing special is needed right now beside nasm if you are building it on x86/x86_64.

Build the library

This step is completely straightforward, you can build it as release:

$ cargo build --release

or as debug

$ cargo build

It will produce a target/release/librav1e.a or a target/debug/librav1e.a.
The C header will be in include/rav1e.h.

Try the example code

I provided a quite minimal sample case.

cc -Wall c-examples/simple_encoding.c -Ltarget/release/ -lrav1e -Iinclude/ -o c-examples/simple_encoding
./c-examples/simple_encoding

If it builds and runs correctly you are set.

Manually copy the .a and the .h

Currently cargo install does not work for our purposes, but it will change in the future.

$ cp target/release/librav1e.a /usr/local/lib
$ cp include/rav1e.h /usr/local/include/

Missing pieces

Right now crav1e works well enough but there are few shortcomings I’m trying to address.

Shared library support

The cdylib target does exist and produce a nearly usable library but there are some issues with soname support. I’m trying to address them with upstream, but it might take some time.

Meanwhile some people suggest to use patchelf or similar tools to fix the library after the fact.

Install target

cargo is generally awesome, but sadly its support for installing arbitrary files to arbitrary paths is limited, luckily there are people proposing solutions.

pkg-config file generation

I consider a library not proper if a .pc file is not provided with it.

Right now there are means to extract the information need to build a pkg-config file, but there isn’t a simple way to do it.

$ cargo rustc -- --print native-static-libs

Provides what is needed for Libs.private, ideally it should be created as part of the install step since you need to know the prefix, libdir and includedir paths.

Coming next

Probably the next blog post will be about my efforts to make cargo able to produce proper cdylib or something quite different.

PS: If somebody feels to help me with matroska in AV1 would be great 🙂

Gentoo on Integricloud

Integricloud gave me access to their infrastructure to track some issues on ppc64 and ppc64le.

Since some of the issues are related to the compilers, I obviously installed Gentoo on it and in the process I started to fix some issues with catalyst to get a working install media, but that’s for another blogpost.

Today I’m just giving a walk-through on how to get a ppc64le (and ppc64 soon) VM up and running.

Preparation

Read this and get your install media available to your instance.

Install Media

I’m using the Gentoo installcd I’m currently refining.

Booting

You have to append console=hvc0 to your boot command, the boot process might figure it out for you on newer install medias (I still have to send patches to update livecd-tools)

Network configuration

You have to manually setup the network.
You can use ifconfig and route or ip as you like, refer to your instance setup for the parameters.

ifconfig enp0s0 ${ip}/16
route add -net default gw ${gw}
echo "nameserver 8.8.8.8" > /etc/resolv.conf
ip a add ${ip}/16 dev enp0s0
ip l set enp0s0 up
ip r add default via ${gw}
echo "nameserver 8.8.8.8" > /etc/resolv.conf

Disk Setup

OpenFirmware seems to like gpt much better:

parted /dev/sda mklabel gpt

You may use fdisk to create:
– a PowerPC PrEP boot partition of 8M
– root partition with the remaining space

Device     Start      End  Sectors Size Type
/dev/sda1   2048    18431    16384   8M PowerPC PReP boot
/dev/sda2  18432 33554654 33536223  16G Linux filesystem

I’m using btrfs and zstd-compress /usr/portage and /usr/src/.

mkfs.btrfs /dev/sda2

Initial setup

It is pretty much the usual.

mount /dev/sda2 /mnt/gentoo
cd /mnt/gentoo
wget https://dev.gentoo.org/~mattst88/ppc-stages/stage3-ppc64le-20180810.tar.xz
tar -xpf stage3-ppc64le-20180810.tar.xz
mount -o bind /dev dev
mount -t devpts devpts dev/pts
mount -t proc proc proc
mount -t sysfs sys sys
cp /etc/resolv.conf etc
chroot .

You just have to emerge grub and gentoo-sources, I diverge from the defconfig by making btrfs builtin.

My /etc/portage/make.conf:

CFLAGS="-O3 -mcpu=power9 -pipe"
# WARNING: Changing your CHOST is not something that should be done lightly.
# Please consult https://wiki.gentoo.org/wiki/Changing_the_CHOST_variable beforee
 changing.
CHOST="powerpc64le-unknown-linux-gnu"

# NOTE: This stage was built with the bindist Use flag enabled
PORTDIR="/usr/portage"
DISTDIR="/usr/portage/distfiles"
PKGDIR="/usr/portage/packages"

USE="ibm altivec vsx"

# This sets the language of build output to English.
# Please keep this setting intact when reporting bugs.
LC_MESSAGES=C
ACCEPT_KEYWORDS=~ppc64

MAKEOPTS="-j4 -l6"
EMERGE_DEFAULT_OPTS="--jobs 10 --load-average 6 "

My minimal set of packages I need before booting:

emerge grub gentoo-sources vim btrfs-progs openssh

NOTE: You want to emerge again openssh and make sure bindist is not in your USE.

Kernel & Bootloader

cd /usr/src/linux
make defconfig
make menuconfig # I want btrfs builtin so I can avoid a initrd
make -j 10 all && make install && make modules_install
grub-install /dev/sda1
grub-mkconfig -o /boot/grub/grub.cfg

NOTE: make sure you pass /dev/sda1 otherwise grub will happily assume OpenFirmware knows about btrfs and just point it to your directory.
That’s not the case unfortunately.

Networking

I’m using netifrc and I’m using the eth0-naming-convention.

touch /etc/udev/rules.d/80-net-name-slot.rules
ln -sf /etc/init.d/net.{lo,eth0}
echo -e "config_eth0=\"${ip}/16\"\nroutes_eth0="default via ${gw}\"\ndns_servers_eth0=\"8.8.8.8\"" > /etc/conf.d/net

Password and SSH

Even if the mticlient is quite nice, you would rather use ssh as much as you could.

passwd 
rc-update add sshd default

Finishing touches

Right now sysvinit does not add the hvc0 console as it should due to a profile quirk, for now check /etc/inittab and in case add:

echo 'hvc0:2345:respawn:/sbin/agetty -L 9600 hvc0' >> /etc/inittab

Add your user and add your ssh key and you are ready to use your new system!

Altivec and VSX in Rust (part 1)

I’m involved in implementing the Altivec and VSX support on rust stdsimd.

Supporting all the instructions in this language is a HUGE endeavor since for each instruction at least 2 tests have to be written and making functions type-generic gets you to the point of having few pages of implementation (that luckily desugars to the single right instruction and nothing else).

Since I’m doing this mainly for my multimedia needs I have a short list of instructions I find handy to get some code written immediately and today I’ll talk a bit about some of them.

This post is inspired by what Luc did for neon, but I’m using rust instead.

If other people find it useful, I’ll try to write down the remaining istructions.

Permutations

Most if not all the SIMD ISAs have at least one or multiple instructions to shuffle vector elements within a vector or among two.

It is quite common to use those instructions to implement matrix transposes, but it isn’t its only use.

In my toolbox I put vec_perm and vec_xxpermdi since even if the portable stdsimd provides some shuffle support it is quite unwieldy compared to the Altivec native offering.

vec_perm: Vector Permute

Since it first iteration Altivec had a quite amazing instruction called vec_perm or vperm:

    fn vec_perm(a: i8x16, b: i8x16, c: i8x16) -> i8x16 {
        let mut d;
        for i in 0..16 {
            let idx = c[i] & 0xf;
            d[i] = if (c[i] & 0x10) == 0 {
                a[idx]
            } else {
                b[idx]
            };
        }
        d
    }

It is important to notice that the displacement map c is a vector and not a constant. That gives you quite a bit of flexibility in a number of situations.

This instruction is the building block you can use to implement a large deal of common patterns, including some that are also covered by stand-alone instructions e.g.:
– packing/unpacking across lanes as long you do not have to saturate: vec_pack, vec_unpackh/vec_unpackl
– interleave/merge two vectors: vec_mergel, vec_mergeh
– shift N bytes in a vector from another: vec_sld

It can be important to recall this since you could always take two permutations and make one, vec_perm itself is pretty fast and replacing two or more instructions with a single permute can get you a pretty neat speed boost.

vec_xxpermdi Vector Permute Doubleword Immediate

Among a good deal of improvements VSX introduced a number of instructions that work on 64bit-elements vectors, among those we have a permute instruction and I found myself using it a lot.

    #[rustc_args_required_const(2)]
    fn vec_xxpermdi(a: i64x2, b: i64x2, c: u8) -> i64x2 {
        match c & 0b11 {
            0b00 => i64x2::new(a[0], b[0]);
            0b01 => i64x2::new(a[1], b[0]);
            0b10 => i64x2::new(a[0], b[1]);
            0b11 => i64x2::new(a[1], b[1]);
        }
    }

This instruction is surely less flexible than the previous permute but it does not require an additional load.

When working on video codecs is quite common to deal with blocks of pixels that go from 4×4 up to 64×64, before vec_xxpermdi the common pattern was:

    #[inline(always)]
    fn store8(dst: &mut [u8x16], v: &[u8x16]) {
        let data = dst[i];
        dst[i] = vec_perm(v, data, TAKE_THE_FIRST_8);
    }

That implies to load the mask as often as needed as long as the destination.

Using vec_xxpermdi avoids the mask load and that usually leads to a quite significative speedup when the actual function is tiny.

Mixed Arithmetics

With mixed arithmetics I consider all the instructions that do in a single step multiple vector arithmetics.

The original altivec has the following operations available for the integer types:
vec_madds
vec_mladd
vec_mradds
vec_msum
vec_msums
vec_sum2s
vec_sum4s
vec_sums

And the following two for the float type:
vec_madd
vec_nmsub

All of them are quite useful and they will all find their way in stdsimd pretty soon.

I’m describing today vec_sums, vec_msums and vec_madds.

They are quite representative and the other instructions are similar in spirit:
vec_madds, vec_mladd and vec_mradds all compute a lane-wise product, take either the high-order or the low-order part of it
and add a third vector returning a vector of the same element size.
vec_sums, vec_sum2s and vec_sum4s all combine an in-vector sum operation with a sum with another vector.
vec_msum and vec_msums both compute a sum of products, the intermediates are added together and then added to a wider-element
vector.

If there is enough interest and time I can extend this post to cover all of them, for today we’ll go with this approximation.

vec_sums: Vector Sum Saturated

Usually SIMD instruction work with two (or 3) vectors and execute the same operation for each vector element.
Sometimes you want to just do operations within the single vector and vec_sums is one of the few instructions that let you do that:

    fn vec_sums(a: i32x4, b: i32x4) -> i32x4 {
        let d = i32x4::new(0, 0, 0, 0);

        d[3] = b[3].saturating_add(a[0]).saturating_add(a[1]).saturating_add(a[2]).saturating_add(a[3]);

        d
    }

It returns in the last element of the vector the sum of the vector elements of a and the last element of b.
It is pretty handy when you need to compute an average or similar operations.

It works only with 32bit signed element vectors.

vec_msums: Vector Multiply Sum Saturated

This instruction sums the 32bit element of the third vector with the two products of the respective 16bit
elements of the first two vectors overlapping the element.

It does quite a bit:

    fn vmsumshs(a: i16x8, b: i16x8, c: i32x4) -> i32x4 {
        let d;
        for i in 0..4 {
            let idx = 2 * i;
            let m0 = a[idx] as i32 * b[idx] as i32;
            let m1 = a[idx + 1] as i32 * b[idx + 1] as i32;
            d[i] = c[i].saturating_add(m0).saturating_add(m1);
        }
        d
    }

    fn vmsumuhs(a: u16x8, b: u16x8, c: u32x4) -> u32x4 {
        let d;
        for i in 0..4 {
            let idx = 2 * i;
            let m0 = a[idx] as u32 * b[idx] as u32;
            let m1 = a[idx + 1] as u32 * b[idx + 1] as u32;
            d[i] = c[i].saturating_add(m0).saturating_add(m1);
        }
        d
    }

    ...

    fn vec_msums<T, U>(a: T, b: T, c: U) -> U
    where T: sealed::VectorMultiplySumSaturate<U> {
        a.msums(b, c)
    }

It works only with 16bit elements, signed or unsigned. In order to support that on rust we have to use some creative trait.
It is quite neat if you have to implement some filters.

vec_madds: Vector Multiply Add Saturated

    fn vec_madds(a: i16x8, b: i16x8, c: i16x8) -> i16x8 {
        let d;
        for i in 0..8 {
            let v = (a[i] as i32 * b[i] as i32) >> 15;
            d[i] = (v as i16).saturating_add(c[i]);
        }
        d
    }

Takes the high-order 17bit of the lane-wise product of the first two vectors and adds it to a third one.

Coming next

Raptor Enginering kindly gave me access to a Power 9 through their Integricloud hosting.

We could run some extensive benchmarks and we found some peculiar behaviour with the C compilers available on the machine and that got me, Luc and Alexandra a little puzzled.

Next time I’ll try to collect in a little more organic way what I randomly put on my twitter as I noticed it.

Video Compression Bounty Hunters

In this post, we (Luca Barbato and Luc Trudeau) joined forces to talk about the awesome work we’ve been doing on Altivec/VSX optimizations for the libvpx library, you can read it here or on Luc’s medium.

Both of us where in Brussels for FOSDEM 2018, Luca presented his work on rust-av and Luc was there to hack on rav1e – an experimental AV1 video encoder in Rust.

Luca joined the rav1e team and helped give hints about how to effectively leverage rust. Together, we worked on AV1 intra prediction code, among the other things.

Luc Trudeau: I was finishing up my work on Chroma from Luma in AV1, and wanted to stay involved in royalty free open source video codecs. When Luca talked to me about libvpx bounties on Bountysource, I was immediately intrigued.

Luca Barbato: Luc just finished implementing the Neon version of his CfL work and I wondered how that code could work using VSX. I prepared some of the machinery that was missing in libaom and Luc tried his hand on Altivec. We still had some pending libvpx work sponsored by IBM and I asked him if he wanted to join in.

What’s libvpx?

For those less familiar, libvpx is the official Google implementation of the VP9 video format. VP9 is most notably used in YouTube and Netflix. VP9 playback is available on some browsers including Chrome, Edge and Firefox and also on Android devices, covering the 75.31% of the global user base.

Ref: caniuse.com VP9 support in browsers.

Why use VP9, when the de facto video format is H.264/AVC?

Because VP9 is royalty free and the bandwidth savings are substantial when compared to H.264 when playback is available (an estimated 3.3B devices support VP9). In other words, having VP9 as a secondary codec can pay for itself in bandwidth savings by not having to send H.264 to most users.

Ref: Netflix VP9 compression analysis.

Why care about libvpx on Power?

Dynamic adaptive streaming formats like HLS and MPEG DASH have completely changed the game of streaming video over the internet. Streaming hardware and custom multimedia servers are being replaced by web servers.

From the servers’ perspective streaming video is akin to serving small videos files; lots of small video files! To cover all clients and most network conditions a considerable amount of video files must be encoded, stored and distributed.

Things are changing fast and while the total cost of ownership of video content for previous generation video formats, like H.264, was mostly made up of bandwidth and hosting, encoding costs are growing with more complex video formats like HEVC and VP9.

This complexity is reported to have grown exponentially with the upcoming AV1 video format. A video format, built on the libVPX code base, by the Alliance for Open Media, of which IBM is a founding member.

Ref: Facebook’s AV1 complexity analysis

At the same time, IBM and its partners in the OpenPower Foundation are releasing some very impressive hardware with the new Power9 processor line up. Big Iron Power9 systems, like the Talos II from Raptor Computing Systems and the collaboration between Google and Rackspace on Zaius/Barreleye servers, are ideal solutions to the tackle the growing complexity of video format encoding.

However, these awesome machines are currently at a disadvantage when encoding video. Without the platform specific optimizations, that their competitors enjoy, the Power9 architecture can’t be fully utilized. This is clearly illustrated in the x264 benchmark released in a recent Phoronix article.

Ref: Phoronix x264 server benchmark.

Thanks to the optimization bounties sponsored by IBM, we are hard at work bridging the gap in libvpx.

Optimization bounties?

Just like bug bounty programs, optimization make for great bounties. Companies that see benefit in platform specific optimizations for video codecs can sponsor our bounties on the Bountysource platform.

Multiple companies can sponsor the same bounty, thus sharing cost of more important bounties. Furthermore, bounties are a minimal risk investment for sponsors, as they are only paid out when the work is completed (and peer reviewed by libvpx maintainers)

Not only is the Bountysource platform a win for companies that directly benefit from the bounties they are sponsoring, it’s also a win for developers (like us) who can get paid to work on free and open source projects that we are passionate about. Optimization bounties are a source of sustainability in the free and open source software ecosystem.

How do you choose bounties?

Since we’re a small team of bounty hunters (Luca Barbato, Alexandra Hájková, Rafael de Lucena Valle and Luc Trudeau), we need to play it smart and maximize the impact of our work. We’ve identified two common use cases related to streaming on the Power architecture: YouTube-like encodes and real time (a.k.a. low latency) encodes.

By profiling libvpx under these conditions, we can determine the key functions to optimize. The following charts show the percentage of time spent the in top 20 functions of the libvpx encoder (without Altivec/VSX optimisations) on a Power8 system, for both YouTube-like and real time settings.

It’s interesting to see that the top 20 functions make up about 80% of the encoding time. That’s similar in spirit to the Pareto principle, in that we don’t have to optimize the whole encoder to make the Power architecture competitive for video encoding.

We see a similar distribution between YouTube-like encoding settings and real time video encoding. In other words, optimization bounties for libvpx benefit both Video on Demand (VOD) and live broadcast services.

We add bounties on the Bountysource platform around common themed functions like: convolution, sum of absolute differences (SAD), variance, etc. Companies interested in libvpx optimization can go and fund these bounties.

What’s the impact of this project so far?

So far, we delivered multiple libvpx bounties including:

  • Convolution
  • Sum of absolute differences (SAD)
  • Quantization
  • Inverse transforms
  • Intra prediction
  • etc.

To see the benefit of our work, we compiled the latest version of libVPX with and without VSX optimizations and ran it on a Power8 machine. Note that the C compiled versions can produce Altivec/VSX code via auto vectorization. The results, in frames per minutes, are shown below for both YouTube-like encoding and Real time encoding.

Our current VSX optimizations give approximately a 40% and 30% boost in encoding speed for YouTube-like and real time encoding respectively. Encoding speed increases in the range of 10 to 14 frames per minute can considerably reduce cloud encoding costs for Power architecture users.

In the context of real time encoding, the time saved by the platform optimization can be put to good use to improve compression efficiency. Concretely, a real time encoder will encode in real time speed, but speeding up the encoders allows for operators to increase the number of coding tools, resulting in better quality for the viewers and bandwidth savings for operators.

What’s next?

We’re energized by the impact that our small team of bounty hunters is having on libvpx performance for the Power architecture and we wanted to share it in this blog post. We look forward to getting even more performance from libvpx on the Power architecture. Expect considerable performance improvement for the Power architecture in the next libvpx release (1.8).

As IBM targets its Power9 line of systems at heavy cloud computations, it seems natural to also aim all that power at tackling the growing costs of AV1 encodes. This won’t happen without platform specific optimizations and the time to start is now; as the AV1 format is being finalized, everyone is still in the early phases of optimization. We are currently working with our sponsors to set up AV1 bounties, so stay tuned for an upcoming post.

Rust-av: Rust and Multimedia

Recently I presented my new project at Fosdem and since I was afraid of not having enough time for the questions I trimmed the content to the bare minimum. This blog post should add some more details.

What is it?

Rust-av aims to be a complete multimedia toolkit written in rust.

Rust is a quite promising language that aims to offer high execution speed while granting a number of warranties on the code behavior that you cannot have in C, C++, Java and so on.

Its zero-cost abstraction feature coupled with the fact that the compiler actively prevents you from committing a large class of mistakes related to memory access seems a perfect match to implement a multimedia toolkit that is easy to use, fast enough and trustworthy.

Why something completely new?

Since rust code can be intermixed with C code, an evolutive approach of replacing little by little small components in a larger project is perfectly feasible, and it is what we are currently trying to do with vlc.

But rust is not just good to write some inner routines so they are fast and correct, its trait system is also quite useful to have a more expressive API.

Most of the multimedia concepts are pretty simple at the high level (e.g frame is just a picture or some sound with some timestamp) with an excruciating amount of quirk and details that require your toolkit to make choices for you or make you face a large amount of complexity.

That leads to API that are either easy but quite inflexible (and opinionated) or API providing all the flexibility, but forcing the user to have to learn a lot of information in order to achieve what the simpler API would let you implement in an handful of lines of code.

I wanted to leverage Rust to make the low level implementations with less bugs and, at the same time, try to provide a better API to use it.

Why now?

Since 2016 I kept bouncing ideas with Kostya and Geoffroy but between my work duties and other projects I couldn’t devote enough time to it. Thanks to the Mozilla Open Source Support initiative that awarded me with enough to develop it full time, now the project has already some components published and more will follow during the next months.

Philosophy

I’m trying to leverage the experience I have from contributing to vlc and libav and keep what is working well and try to not make the same mistakes.

Ease of use

I want that the whole toolkit to be useful to a wide audience. Developers often fight against the library in order to undo what is happening under the hood or end up vendoring some part of it since they need only a tiny subset of all the features that are provided.

Rust makes quite natural split large projects in independent components (called crates) and it is already quite common to have meta-crates re-exporting many smaller crates to provide some uniform access.

The rust-av code, as opposed to the rather monolithic approach taken in Libav, can be reused with the granularity of the bare codec or format:

  • Integrating it in a foreign toolkit won’t require to undo what the common utility code does.
  • Even when using it through the higher level layers, rust-av won’t force the developer to bring in any unrelated dependencies.
  • On the other hand users that enjoy a fully integrated and all-encompassing solution can simply depend on the meta-crates and get the support for everything.

Speed

Multimedia playback boils down to efficiently do complex computation so an arbitrary large amount of data can be rendered within a fraction of second, multimedia real time streaming requires to compress an equally large amount of data in the same time.

Speed in multimedia is important.

Rust provides high level idiomatic constructs that surprisingly lead to pretty decent runtime speed. The stdsimd effort and the seamless C ABI support make easier to leverage the SIMD instructions provided by the recent CPU architectures.

Trustworthy

Traditionally the most effective way to write fast multimedia code had been pairing C and assembly. Sadly the combination makes quite easy to overlook corner cases and have any kind of memory hazards (use-after-free, out of bound reads and writes, NULL-dereferences…).

Rust effectively prevents a good deal of those issues at compile time. Since its abstractions usually do not cause slowdowns it is possible to write code that is, arguably, less misleading and as fast.

Structure

The toolkit is composed of multiple, loosely coupled, crates. They can be grouped by level of abstraction.

Essential

av-data: Used by nearly all the other crates it provides basic data types and a minimal amount of functionality on top of it. It provides the following structs mainly:

  • Frame: it binds together a time reference and a buffer, representing either a video picture or some audio samples.
  • Packet: it bind together a time reference and a buffer, containing compressed data.
  • Value: Simple key value type abstraction, used to pass arbitrary data to the configuration functions.

Core

They provide the basic abstraction (traits) implemented by specific set of components.

  • av-format: It provides a set of traits to implement muxers and demuxers and an utility Context to bridge the normal rust I/O Write and Read traits and the actual muxers and demuxers.
  • av-codec: It provides a set of traits to implement encoders and decoders and an utility Context that wraps.

Utility

They provide building blocks that may be used to implement actual codecs and formats.

  • av-bitstream: Utility crate to write and read bits and bytes
  • av-audio: Audio-specific utilities
  • av-video: Video-specific utilities

Implementation

Actual implementations of codec and format, they can be used directly or through the utility Contexts.

The direct usage is suggested only if you are integrating it in larger frameworks that already implement, possibly in different ways, the integration code provided by the Context (e.g. binding it together with the I/O for the formats or internal queues for the codecs).

Higher-level

They provide higher level Contexts to playback or encode data through a simplified interface:

  • av-player reads bytes though a provided Read and outputs decoded Frames. Under the hood it probes the data, allocates and configures a Demuxer and a Decoder for each stream of interest.
  • av-encoder consumes Frames and outputs encoded and muxed data through a Write output. It automatically setup the encoders and the muxer.

Meta-crates

They ease the use in bulk of everything provided by rust-av.

There are 4 crates providing a list of specific components: av-demuxers, av-muxers, av-decoders and av-encoders; and 2 grouping them by type: av-formats and av-codecs.

Their use is suggested when you’d like to support every format and codec available.

So far

All the development happens on the github organization and so far the initial Core and Essential crates are ready to be used.

There is a nom-based matroska demuxer in working condition and some non-native wrappers providing implementations for some decoders and encoders.

Thanks to est31 we have native vorbis support.

I’m working on a native implementation of opus and soon I’ll move to a video codec.

There is a tiny player called avp and an encoder tool (named ave) will appear once the matroska muxer is complete.

What’s missing in rust-av

API-wise, right now rust-av provides only for simple decode and encoding, muxing and demuxing. There are already enough wrapped codecs to let people play with the library and, hopefully, help in polishing it.

For each crate I’m trying to prepare some easy tasks so people willing to contribute to the project can start from them, all help is welcome!

What’s missing in rust

So far my experience with rust had been quite positive, but there are a number of features that are missing or that could be addressed.

  • SIMD support is shaping up nicely and it is coming soon.
  • The natural fallback, going down to assembly, is available since rust supports the C ABI, inline assembly support on the other hand seems that is still pending some discussion before it reaches stable.
  • Arbitrarily aligned allocation is a MUST in order to support hardware acceleration and SIMD works usually better with aligned buffers.
  • I’d love to have const generics now, luckily associated constants with traits allow some workarounds that let you specialize by constants (and result in neat speedups).
  • I think that focusing a little more on array/slice support would lead to the best gains, since right now there isn’t an equivalent to collect() to fill arrays in an idiomatic way and in multimedia large lookup tables are pretty much a staple.

In closing

Rust and Multimedia seem a really good match, in my experience beside a number of missing features the language seems quite good for the purpose.

Once I have more native implementations complete I will be able to have better means to evaluate the speed difference from writing the same code in C.

Optimizing rust

After the post about optimization, Kostya and many commenters (me included) discussed a bit about if there are better ways to optimize that loop without using unsafe code.

Kostya provided me with a test function and multiple implementations from him and I polished and benchmarked the whole thing.

The code

I put the code in a simple project, initially it was a simple main.rs and then it grew a little.

All it started with this function:

pub fn recombine_plane_reference(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    let mut idx0 = 0;
    let mut idx1 = w / 2;
    let mut idx2 = (h / 2) * sstride;
    let mut idx3 = idx2 + idx1;
    let mut oidx0 = 0;
    let mut oidx1 = dstride;

    for _ in 0..(h / 2) {
        for x in 0..(w / 2) {
            let p0 = src[idx0 + x];
            let p1 = src[idx1 + x];
            let p2 = src[idx2 + x];
            let p3 = src[idx3 + x];
            let s0 = p0.wrapping_add(p2);
            let d0 = p0.wrapping_sub(p2);
            let s1 = p1.wrapping_add(p3);
            let d1 = p1.wrapping_sub(p3);
            let o0 = s0.wrapping_add(s1).wrapping_add(2);
            let o1 = d0.wrapping_add(d1).wrapping_add(2);
            let o2 = s0.wrapping_sub(s1).wrapping_add(2);
            let o3 = d0.wrapping_sub(d1).wrapping_add(2);
            dst[oidx0 + x * 2 + 0] = clip8(o0.wrapping_shr(2).wrapping_add(128));
            dst[oidx0 + x * 2 + 1] = clip8(o1.wrapping_shr(2).wrapping_add(128));
            dst[oidx1 + x * 2 + 0] = clip8(o2.wrapping_shr(2).wrapping_add(128));
            dst[oidx1 + x * 2 + 1] = clip8(o3.wrapping_shr(2).wrapping_add(128));
        }
        idx0 += sstride;
        idx1 += sstride;
        idx2 += sstride;
        idx3 += sstride;
        oidx0 += dstride * 2;
        oidx1 += dstride * 2;
    }
}

Benchmark

Kostya used perf to measure the number of samples it takes over a large number of iterations, I wanted to make the benchmark a little more portable so I used the time::PreciseTime Rust provides to measure something a little more coarse, but good enough for our purposes.

We want to see if rewriting the loop using unsafe pointers or using high level iterators provides a decent speedup, no need to be overly precise.

NB: I decided to not use the bencher utility provided with nightly rust to make the code even easier to use.

+fn benchme<F>(name: &str, n: usize, mut f: F)
+    where F : FnMut() {
+    let start = PreciseTime::now();
+    for _ in 0..n {
+        f();
+    }
+    let end = PreciseTime::now();
+    println!("Runtime {} {}", name, start.to(end));
+}
# cargo run --release

Unsafe code

Both me and Kostya have a C background so for him (and for me), was sort of natural embracing unsafe {} and use the raw pointers like we are used to.

pub fn recombine_plane_unsafe(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    unsafe {
        let hw = (w / 2) as isize;
        let mut band0 = src.as_ptr();
        let mut band1 = band0.offset(hw);
        let mut band2 = band0.offset(((h / 2) * sstride) as isize);
        let mut band3 = band2.offset(hw);
        let mut dst0 = dst.as_mut_ptr();
        let mut dst1 = dst0.offset(dstride as isize);
        let hh = (h / 2) as isize;
        for _ in 0..hh {
            let mut b0_ptr = band0;
            let mut b1_ptr = band1;
            let mut b2_ptr = band2;
            let mut b3_ptr = band3;
            let mut d0_ptr = dst0;
            let mut d1_ptr = dst1;
            for _ in 0..hw {
                let p0 = *b0_ptr;
                let p1 = *b1_ptr;
                let p2 = *b2_ptr;
                let p3 = *b3_ptr;
                let s0 = p0.wrapping_add(p2);
                let s1 = p1.wrapping_add(p3);
                let d0 = p0.wrapping_sub(p2);
                let d1 = p1.wrapping_sub(p3);
                let o0 = s0.wrapping_add(s1).wrapping_add(2);
                let o1 = d0.wrapping_add(d1).wrapping_add(2);
                let o2 = s0.wrapping_sub(s1).wrapping_add(2);
                let o3 = d0.wrapping_sub(d1).wrapping_add(2);
                *d0_ptr.offset(0) = clip8((o0 >> 2).wrapping_add(128));
                *d0_ptr.offset(1) = clip8((o1 >> 2).wrapping_add(128));
                *d1_ptr.offset(0) = clip8((o2 >> 2).wrapping_add(128));
                *d1_ptr.offset(1) = clip8((o3 >> 2).wrapping_add(128));
                b0_ptr = b0_ptr.offset(1);
                b1_ptr = b1_ptr.offset(1);
                b2_ptr = b2_ptr.offset(1);
                b3_ptr = b3_ptr.offset(1);
                d0_ptr = d0_ptr.offset(2);
                d1_ptr = d1_ptr.offset(2);
            }
            band0 = band0.offset(sstride as isize);
            band1 = band1.offset(sstride as isize);
            band2 = band2.offset(sstride as isize);
            band3 = band3.offset(sstride as isize);
            dst0 = dst0.offset((dstride * 2) as isize);
            dst1 = dst1.offset((dstride * 2) as isize);
        }
    }
}

The function is faster than baseline:

    Runtime reference   PT1.598052169S
    Runtime unsafe      PT1.222646190S

Explicit upcasts

Kostya noticed that telling rust to use i32 instead of i16 gave some performance boost.

    Runtime reference       PT1.601846926S
    Runtime reference 32bit PT1.371876242S
    Runtime unsafe          PT1.223115917S
    Runtime unsafe 32bit    PT1.124667021S

I’ll keep variants between i16 and i32 to see when it is important and when it is not.

Note: Making code generic over primitive types is currently pretty painful and hopefully will be fixed in the future.

High level abstractions

Most of the comments to Kostya’s original post were about leveraging the high level abstractions to make the compiler understand the code better.

Use Iterators

Rust is able to omit the bound checks if there is a warranty that the code cannot go out of the array boundaries. Using Iterators instead of for loops over an external variables should do the trick.

Use Chunks

chunks and chunks_mut take a slice and provides a nice iterator that gets you at-most-N-sized pieces of the input slice.

Since that the code works by line it is sort of natural to use it.

Use split_at

split_at and split_at_mut get you independent slices, even mutable. The code is writing two lines at time so having the ability to access mutably two regions of the frame is a boon.

The (read-only) input is divided in bands and the output produced is 2 lines at time. split_at is much better than using hand-made slicing and
split_at_mut is perfect to write at the same time the even and the odd line.

All together

pub fn recombine_plane_chunks_32(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    let hw = w / 2;
    let hh = h / 2;
    let (src1, src2) = src.split_at(sstride * hh);
    let mut src1i = src1.chunks(sstride);
    let mut src2i = src2.chunks(sstride);
    let mut dstch = dst.chunks_mut(dstride * 2);
    for _ in 0..hh {
        let s1 = src1i.next().unwrap();
        let s2 = src2i.next().unwrap();
        let mut d = dstch.next().unwrap();
        let (mut d0, mut d1) = d.split_at_mut(dstride);
        let (b0, b1) = s1.split_at(hw);
        let (b2, b3) = s2.split_at(hw);
        let mut di0 = d0.iter_mut();
        let mut di1 = d1.iter_mut();
        let mut bi0 = b0.iter();
        let mut bi1 = b1.iter();
        let mut bi2 = b2.iter();
        let mut bi3 = b3.iter();
        for _ in 0..hw {
            let p0 = bi0.next().unwrap();
            let p1 = bi1.next().unwrap();
            let p2 = bi2.next().unwrap();
            let p3 = bi3.next().unwrap();
            recombine_core_32(*p0, *p1, *p2, *p3, &mut di0, &mut di1);
        }
    }
}

It is a good improvement over the reference baseline, but still not as fast as unsafe.

    Runtime reference       PT1.621158410S
    Runtime reference 32bit PT1.467441931S
    Runtime unsafe          PT1.226046003S
    Runtime unsafe 32bit    PT1.126615305S
    Runtime chunks          PT1.349947181S
    Runtime chunks 32bit    PT1.350027322S

Use of zip or izip

Using next().unwrap() feels clumsy and force the iterator to be explicitly mutable. The loop can be written in a nicer way using the system provided zip and the itertools-provided izip.

zip works fine for 2 iterators, then you start piling up (so, (many, (tuples, (that, (feels, lisp))))) (or (feels (lisp, '(so, many, tuples))) according to a reader). izip flattens the result so it is sort of nicers.

pub fn recombine_plane_zip_16(
    src: &[i16],
    sstride: usize,
    dst: &mut [u8],
    dstride: usize,
    w: usize,
    h: usize,
) {
    let hw = w / 2;
    let hh = h / 2;
    let (src1, src2) = src.split_at(sstride * hh);
    let src1i = src1.chunks(sstride);
    let src2i = src2.chunks(sstride);
    let mut dstch = dst.chunks_mut(dstride * 2);
    for (s1, s2) in src1i.zip(src2i) {
        let mut d = dstch.next().unwrap();
        let (mut d0, mut d1) = d.split_at_mut(dstride);
        let (b0, b1) = s1.split_at(hw);
        let (b2, b3) = s2.split_at(hw);
        let mut di0 = d0.iter_mut();
        let mut di1 = d1.iter_mut();
        let iterband = b0.iter().zip(b1.iter().zip(b2.iter().zip(b3.iter())));
        for (p0, (p1, (p2, p3))) in iterband {
            recombine_core_16(*p0, *p1, *p2, *p3, &mut di0, &mut di1);
        }
    }
}

How they would fare?

    Runtime reference        PT1.614962959S
    Runtime reference 32bit  PT1.369636641S
    Runtime unsafe           PT1.223157417S
    Runtime unsafe 32bit     PT1.125534521S
    Runtime chunks           PT1.350069795S
    Runtime chunks 32bit     PT1.381841742S
    Runtime zip              PT1.249227707S
    Runtime zip 32bit        PT1.094282423S
    Runtime izip             PT1.366320546S
    Runtime izip 32bit       PT1.208708213S

Pretty well.

Looks like izip is a little more wasteful than zip currently, so looks like we have a winner 🙂

Conclusions

  • Compared to common imperative programming patterns, using the high level abstractions does lead to a nice speedup: use iterators when you can!
  • Not all the abstractions cost zero, zip made the overall code faster while izip lead to a speed regression.
  • Do benchmark your time critical code. nightly has some facility for it BUT it is not great for micro-benchmarks.

Overall I’m enjoying a lot writing code in Rust.

Contributing to x264

Another project I contribute to is x264. As per the previous post on the topic I’ll try to summarize how things work.

Overview

Coding style

x264 has a coding style and on review you will get asked to follow it, sadly the sources do not contain a document about it, you have to look around the code and match what is there.

Testing

x264 has an amazing test harness that doubles as benchmark harness to add support for additional architecture-specific optimizations. checkasm is really good in helping you write new code for this kind of purpose and make sure it is really faster.

It is automatic to use if you are adding a function already implemented in other architectures, if you want to extend the coverage for something new it is moderately difficult, mainly because you have to read the code since no documentation is available otherwise.

Submitting patches

Submitting code to x264 requires you to sign a cla, the process is sort of manual and involves exchanging emails with the person in charge to provide and collect the cla pdf once you signed it.

Once you are done on that you should rebase your changes over the sandbox branch, that’s somehow similar to the next branch on other projects and send them to the developer mailing list using git send-email.

Interaction

The review process will happen in the mailing list and you are supposed to be subscribed to it and interact with the reviewers there.

TL;DR

  • Mimic the coding style used in the project and hope you get it right
  • Peruse checkasm to make sure what you are doing works as intended and it is fast
  • Subscribe to the developer mailing list and learn how to use git send-email.
  • Be patient and wait for review comments in the mail box.

Contributing to libvpx

Recently I started to write the PowerPC/VSX support for libvpx, Alexandra will help as well.

Every open source project has its own rules, I found the choices taken in Libvpx interesting enough to write them down (and possibly help newcomers with some shortcuts).

Overview

Coding style

The coding style is strongly enforced, the CI system will bounce your code if it doesn’t adhere to the style.

This constraint is enforced through a clang-format ruleset.

If you are using vim, this makes your life easier, otherwise the git integration comes handy.

Otherwise:

# clang-format -i what/I/m/working/on.c

Works no matter what.

Testing

New code should have its testcase, if it isn’t already covered.

Libvpx uses gtest and it has a quite decent test coverage. A full run of the tests can take a large chunk of time, if you are working on specific code (e.g. dsp functions), is easy to run only the tests you care about like this:

# ./test_libvpx --gtest_filter="*pattern*with*globs"

The current system does not double as benchmarking tool, so you are quite on your own if you are trying to speed up some parts.

Adding brand new tests more annoying than it should since gtest is quite bloated, updating a test to cover a variant is quite painless though.

Submitting patches

Libvpx uses gerrit and jenkins in a setup that makes them almost painless and has a detailed guide on how to register and fill in some forms to make the Google lawyers happy.

Gerrit and Jenkins defaults are quite clunky, so they Libvpx maintainer definitely invested some time to get them in a better shape.

Once you registered and set the hook to tag your commits sending a set boils down to:

# git push https://chromium-review.googlesource.com/webm/libvpx HEAD:refs/for/master

Interaction

Comments and reports end up in your mailbox, spamming it a lot (expect about 5-6 emails per commit). You have to use the web interface to have decent interaction and luckily PolyGerrit isn’t terrible at all (just make sure your replies gets sent since it has the tendency of keeping them in draft mode).

TL;DR

  • read this
  • install clang-format, including the git integration
  • be ready to make changes in test/*.cc and cope with gtest verboseness.
  • be ready to receive tons of email and use your web browser

Intel MediaSDK mini-walkthrough

Using hwaccel

Had been a while since I mentioned the topic and we made a huge progress on this field.

Currently with Libav12 we already have nice support for multiple different hardware support for decoding, scaling, deinterlacing and encoding.

The whole thing works nicely but it isn’t foolproof yet so I’ll start describing how to setup and use it for some common tasks.

This post will be about Intel MediaSDK, the next post will be about NVIDIA Video Codec SDK.

Setup

Prerequisites

  • A machine with QSV hardware, Haswell, Skylake or better.
  • The ability to compile your own kernel and modules
  • The MediaSDK mfx_dispatch

It works nicely both on Linux and Windows. If you happen to have other platforms feel free to contact Intel and let them know, they’ll be delighted.

Installation

The MediaSDK comes with either the usual Windows setup binary or a Linux bash script that tries its best to install the prerequisites.

# tar -xvf MediaServerStudioEssentials2017.tar.gz
MediaServerStudioEssentials2017/
MediaServerStudioEssentials2017/Intel(R)_Media_Server_Studio_EULA.pdf
MediaServerStudioEssentials2017/MediaSamples_Linux_2017.tar.gz
MediaServerStudioEssentials2017/intel_sdk_for_opencl_2016_6.2.0.1760_x64.tgz
MediaServerStudioEssentials2017/site_license_materials.txt
MediaServerStudioEssentials2017/third_party_programs.txt
MediaServerStudioEssentials2017/redist.txt
MediaServerStudioEssentials2017/FEI2017-16.5.tar.gz
MediaServerStudioEssentials2017/SDK2017Production16.5.tar.gz
MediaServerStudioEssentials2017/media_server_studio_essentials_release_notes.pdf

Focus on SDK2017Production16.5.tar.gz.

tar -xvf SDK2017Production16.5.tar.gz
SDK2017Production16.5/
SDK2017Production16.5/Generic/
SDK2017Production16.5/Generic/intel-opencl-16.5-55964.x86_64.tar.xz.sig
SDK2017Production16.5/Generic/intel-opencl-devel-16.5-55964.x86_64.tar.xz.sig
SDK2017Production16.5/Generic/intel-opencl-devel-16.5-55964.x86_64.tar.xz
SDK2017Production16.5/Generic/intel-linux-media_generic_16.5-55964_64bit.tar.gz
SDK2017Production16.5/Generic/intel-opencl-16.5-55964.x86_64.tar.xz
SDK2017Production16.5/Generic/vpg_ocl_linux_rpmdeb.public
SDK2017Production16.5/media_server_studio_getting_started_guide.pdf
SDK2017Production16.5/intel-opencl-16.5-release-notes.pdf
SDK2017Production16.5/intel-opencl-16.5-installation.pdf
SDK2017Production16.5/CentOS/
SDK2017Production16.5/CentOS/libva-1.67.0.pre1-55964.el7.x86_64.rpm
SDK2017Production16.5/CentOS/libdrm-devel-2.4.66-55964.el7.x86_64.rpm
SDK2017Production16.5/CentOS/intel-linux-media-devel-16.5-55964.el7.x86_64.rpm
SDK2017Production16.5/CentOS/intel-i915-firmware-16.5-55964.el7.x86_64.rpm
SDK2017Production16.5/CentOS/install_scripts_centos_16.5-55964.tar.gz
SDK2017Production16.5/CentOS/intel-opencl-devel-16.5-55964.x86_64.rpm
SDK2017Production16.5/CentOS/ukmd-kmod-16.5-55964.el7.src.rpm
SDK2017Production16.5/CentOS/libdrm-2.4.66-55964.el7.x86_64.rpm
SDK2017Production16.5/CentOS/libva-utils-1.67.0.pre1-55964.el7.x86_64.rpm
SDK2017Production16.5/CentOS/intel-linux-media-16.5-55964.el7.x86_64.rpm
SDK2017Production16.5/CentOS/kmod-ukmd-16.5-55964.el7.x86_64.rpm
SDK2017Production16.5/CentOS/intel-opencl-16.5-55964.x86_64.rpm
SDK2017Production16.5/CentOS/libva-devel-1.67.0.pre1-55964.el7.x86_64.rpm
SDK2017Production16.5/CentOS/drm-utils-2.4.66-55964.el7.x86_64.rpm
SDK2017Production16.5/CentOS/MediaSamples_Linux_bin-16.5-55964.tar.gz
SDK2017Production16.5/CentOS/vpg_ocl_linux_rpmdeb.public
SDK2017Production16.5/media_server_studio_sdk_release_notes.pdf

Libraries

The MediaSDK leverages libva to access the hardware together with an highly extended DRI kernel module.
They support CentOS with rpms and all the other distros with a tarball.

BEWARE: if you use the installer script the custom libva would override your system one, you might not want that.

I’m using Gentoo so it is intel-linux-media_generic_16.5-55964_64bit.tar.gz for me.

The bits of this tarball you really want to install in the system no matter what is the firmware:

./lib/firmware/i915/skl_dmc_ver1_26.bin

If you are afraid of adding custom stuff on your system I advise to offset the whole installation and then override the LD paths to use that only for Libav.

BEWARE: you must use the custom iHD libva driver with the custom i915 kernel module.

If you want to install using the provided script on Gentoo you should first emerge lsb-release.

emerge lsb-release
bash install_media.sh
source /etc/profile.d/*.sh
echo /opt/intel/mediasdk/lib64/ >> /etc/ld.so.conf.d/intel-msdk.conf
ldconfig

Kernel Modules

The patchset resides in:

opt/intel/mediasdk/opensource/patches/kmd/4.4/intel-kernel-patches.tar.bz2

The current set is 143 patches against linux 4.4, trying to apply on a more recent kernel requires patience and care.

The 4.4.27 works almost fine (even btrfs does not seem to have many horrible bugs).

Libav

In order to use the Media SDK with Libav you should use the mfx_dispatch from yours truly since it provides a default for Linux so it behaves in an uniform way compared to Windows.

Building the dispatcher

It is a standard autotools package.

git clone git://github.com/lu-zero/mfx_dispatch
cd mfx_dispatch
autoreconf -ifv
./configure --prefix=/some/where
make -j 8
make install

Building Libav

If you want to use the advanced hwcontext features on Linux you must enable both the vaapi and the mfx support.

git clone git://github.com/libav/libav
cd libav
export PKG_CONFIG_PATH=/some/where/lib/pkg-config
./configure --enable-libmfx --enable-vaapi --prefix=/that/you/like
make -j 8
make install

Troubleshooting

Media SDK is sort of temperamental and the setup process requires manual tweaking so the odds of having to do debug and investigate are high.

If something misbehave here is a checklist:
– Make sure you are using the right kernel and you are loading the module.

uname -a
lsmod
dmesg
  • Make sure libva is the correct one and it is loading the right thing.
vainfo
strace -e open ./avconv -c:v h264_qsv -i test.h264 -f null -
  • Make sure you aren’t using the wrong ratecontrol or not passing all the parameters required
./avconv -v verbose -filter_complex testsrc -c:v h264_qsv {ratecontrol params omitted} out.mkv

See below for some examples of working rate-control settings.
– Use the MediaSDK examples provided with the distribution to confirm that everything works in case the SDK is more recent than the updates.

Usage

The Media SDK support in Libav covers decoding, encoding, scaling and deinterlacing.

Decoding is straightforward, the rest has still quite a bit of rough edges and this blog post had been written mainly to explain them.

Currently the most interesting format supported are h264 and hevc, but even other formats such as vp8 and vc1 are supported.

./avconv -codecs | grep qsv

Decoding

The decoders can output directly to system memory and can be used as normal decoders and feed a software implementation just fine.

./avconv -c:v h264_qsv -i input.h264 -c:v av1 output.mkv

Or they can decode to opaque (gpu backed) buffers so further processing can happen

./avconv -hwaccel qsv -c:v h264_qsv -vf deinterlace_qsv,hwdownload,format=nv12 -c:v x265 output.mov

NOTICE: you have to explicitly pass the filterchain hwdownload,format=nv12 not have mysterious failures.

Encoding

The encoders are almost as straightforward beside the fact that the MediaSDK provides multiple rate-control systems and they do require explicit parameters to work.

./avconv -i input.mkv -c:v h264_qsv -q 20 output.mkv

Failing to set the nominal framerate or the bitrate would make the look-ahead rate control not happy at all.

Rate controls

The rate control is one of the most rough edges of the current MediaSDK support, most of them do require a nominal frame rate and that requires an explicit -r to be passed.

There isn’t a default bitrate so also -b:v should be passed if you want to use a rate-control that has a bitrate target.

Is it possible to use a look-ahead rate-control aiming to a quality metric passing -global_quality -la_depth.

The full list is documented.

Transcoding

It is possible to have a full hardware transcoding pipeline with Media SDK.

Deinterlacing

./avconv -hwaccel qsv -c:v h264_qsv -i input.mkv -vf deinterlace_qsv -c:v h264_qsv -r 25 -b:v 2M output.mov

Scaling

./avconv -hwaccel qsv -c:v h264_qsv -i input.mkv -vf scale_qsv=640:480 -c:v h264_qsv -r 25 -b:v 2M -la_depth 10 output.mov

Both at the same time

./avconv -hwaccel qsv -c:v h264_qsv -i input.mkv -vf deinterlace_qsv,scale_qsv=640:480 -c:v h264_qsv -r 25 -b:v 2M -la_depth 10 output.mov

Hardware filtering caveats

The hardware filtering system is quite new and introducing it shown a number of shortcomings in the Libavfilter architecture regarding format autonegotiation so for hybrid pipelines (those that do not keep using hardware frames all over) it is necessary to explicitly call for hwupload and hwdownload explictitly in such ways:

./avconv -hwaccel qsv -c:v h264_qsv -i in.mkv -vf deinterlace_qsv,hwdownload,format=nv12 -c:v vp9 out.mkv

Future for MediaSDK in Libav

The Media SDK supports already a good number of interesting codecs (h264, hevc, vp8/vp9) and Intel seems to be quite receptive regarding what codecs support.
The Libav support for it will improve over time as we improve the hardware acceleration support in the filtering layer and we make the libmfx interface richer.

We’d need more people testing and helping us to figure use-cases and corner-cases that hadn’t been thought of yet, your feedback is important!

AVScale – part1

swscale is one of the most annoying part of Libav, after a couple of years since the initial blueprint we have something almost functional you can play with.

Colorspace conversion and Scaling

Before delving in the library architecture and the outher API probably might be good to make a extra quick summary of what this library is about.

Most multimedia concepts are more or less intuitive:
encoding is taking some data (e.g. video frames, audio samples) and compress it by leaving out unimportant details
muxing is the act of storing such compressed data and timestamps so that audio and video can play back in sync
demuxing is getting back the compressed data with the timing information stored in the container format
decoding inflates somehow the data so that video frames can be rendered on screen and the audio played on the speakers

After the decoding step would seem that all the hard work is done, but since there isn’t a single way to store video pixels or audio samples you need to process them so they work with your output devices.

That process is usually called resampling for audio and for video we have colorspace conversion to change the pixel information and scaling to change the amount of pixels in the image.

Today I’ll introduce you to the new library for colorspace conversion and scaling we are working on.

AVScale

The library aims to be as simple as possible and hide all the gory details from the user, you won’t need to figure the heads and tails of functions with a quite large amount of arguments nor special-purpose functions.

The API itself is modelled after avresample and approaches the problem of conversion and scaling in a way quite different from swscale, following the same design of NAScale.

Everything is a Kernel

One of the key concept of AVScale is that the conversion chain is assembled out of different components, separating the concerns.

Those components are called kernels.

The kernels can be conceptually divided in two kinds:
Conversion kernels, taking an input in a certain format and providing an output in another (e.g. rgb2yuv) without changing any other property.
Process kernels, modifying the data while keeping the format itself unchanged (e.g. scale)

This pipeline approach gets great flexibility and helps code reuse.

The most common use-cases (such as scaling without conversion or conversion with out scaling) can be faster than solutions trying to merge together scaling and conversion in a single step.

API

AVScale works with two kind of structures:
AVPixelFormaton: A full description of the pixel format
AVFrame: The frame data, its dimension and a reference to its format details (aka AVPixelFormaton)

The library will have an AVOption-based system to tune specific options (e.g. selecting the scaling algorithm).

For now only avscale_config and avscale_convert_frame are implemented.

So if the input and output are pre-determined the context can be configured like this:

AVScaleContext *ctx = avscale_alloc_context();

if (!ctx)
    ...

ret = avscale_config(ctx, out, in);
if (ret < 0)
    ...

But you can skip it and scale and/or convert from a input to an output like this:

AVScaleContext *ctx = avscale_alloc_context();

if (!ctx)
    ...

ret = avscale_convert_frame(ctx, out, in);
if (ret < 0)
    ...

avscale_free(&ctx);

The context gets lazily configured on the first call.

Notice that avscale_free() takes a pointer to a pointer, to make sure the context pointer does not stay dangling.

As said the API is really simple and essential.

Help welcome!

Kostya kindly provided an initial proof of concept and me, Vittorio and Anton prepared this preview on the spare time. There is plenty left to do, if you like the idea (since many kept telling they would love a swscale replacement) we even have a fundraiser.