Colorspace Conversion and Scaling
Before delving into the library architecture and the outer API, it is probably worth giving a quick summary of what this library is about.
Most multimedia concepts are more or less intuitive:
– encoding is taking some data (e.g. video frames, audio samples) and compressing it by leaving out unimportant details
– muxing is the act of storing such compressed data and timestamps so that audio and video can play back in sync
– demuxing is getting back the compressed data with the timing information stored in the container format
– decoding somehow inflates the data so that video frames can be rendered on screen and the audio played on the speakers
After the decoding step it would seem that all the hard work is done, but since there isn’t a single way to store video pixels or audio samples, you need to process them so they work with your output devices.
That process is usually called resampling for audio; for video we have colorspace conversion, to change the pixel information, and scaling, to change the amount of pixels in the image.
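As a tiny taste of what a colorspace conversion involves, here is the per-pixel math for computing the luma (Y) component from RGB using the BT.601 weights. This is just an illustration of the concept, not AVScale code; the function name is made up.

```c
#include <stdint.h>

/* Compute the luma (Y) component from 8-bit RGB using BT.601 weights:
 * Y = 0.299 R + 0.587 G + 0.114 B, computed in fixed point (scaled by 256).
 * Illustrative only, not part of the AVScale API. */
static uint8_t rgb_to_luma_bt601(uint8_t r, uint8_t g, uint8_t b)
{
    /* 77 + 150 + 29 == 256, so white maps back to 255 */
    return (uint8_t)((77 * r + 150 * g + 29 * b) >> 8);
}
```

A full conversion does this (and more) for every pixel of every frame, which is why the layout of the pixel data matters so much.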
Today I’ll introduce you to the new library for colorspace conversion and scaling we are working on.
The library aims to be as simple as possible and to hide all the gory details from the user: you won’t need to figure out the heads and tails of functions with a rather large number of arguments, nor of special-purpose functions.
Everything is a Kernel
One of the key concepts of AVScale is that the conversion chain is assembled out of different components, separating the concerns.
Those components are called kernels.
The kernels can be conceptually divided into two kinds:
– Conversion kernels, taking an input in a certain format and providing an output in another (e.g. rgb2yuv) without changing any other property.
– Process kernels, modifying the data while keeping the format itself unchanged (e.g. scale).
This pipeline approach provides great flexibility and helps code reuse.
The most common use cases (such as scaling without conversion, or conversion without scaling) can be faster than solutions that try to merge scaling and conversion into a single step.
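The kernel idea can be sketched like this: every stage shares one signature, and a chain is just an ordered list of stages. All the names and types below are illustrative assumptions, not the actual AVScale internals.

```c
/* Hypothetical sketch of the kernel concept; not the real AVScale API. */
typedef struct Frame {
    int width, height;
    const char *format;   /* e.g. "rgb" or "yuv" */
} Frame;

typedef void (*kernel_fn)(Frame *frame);

/* Conversion kernel: changes the format, nothing else. */
static void rgb2yuv(Frame *f) { f->format = "yuv"; }

/* Process kernel: changes the data (here, the dimensions), format untouched. */
static void scale_half(Frame *f) { f->width /= 2; f->height /= 2; }

/* A conversion chain is just the kernels run in order. */
static void run_chain(Frame *f, kernel_fn *chain, int n)
{
    for (int i = 0; i < n; i++)
        chain[i](f);
}
```

Running the chain { rgb2yuv, scale_half } on a 1920×1080 RGB frame produces a 960×540 YUV frame; dropping one kernel from the chain gives you scaling-only or conversion-only for free, which is where the code reuse comes from.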
AVScale works with two kinds of structures:
– AVPixelFormaton: A full description of the pixel format
– AVFrame: The frame data, its dimension and a reference to its format details (aka AVPixelFormaton)
The library will have an AVOption-based system to tune specific options (e.g. selecting the scaling algorithm).
For now only avscale_config and avscale_convert_frame are implemented.
So if the input and output are predetermined, the context can be configured like this:
    AVScaleContext *ctx = avscale_alloc_context();
    if (!ctx)
        ...
    ret = avscale_config(ctx, out, in);
    if (ret < 0)
        ...
But you can skip that step and scale and/or convert from an input to an output directly, like this:
    AVScaleContext *ctx = avscale_alloc_context();
    if (!ctx)
        ...
    ret = avscale_convert_frame(ctx, out, in);
    if (ret < 0)
        ...
    avscale_free(&ctx);
The context gets lazily configured on the first call.
avscale_free() takes a pointer to a pointer, to make sure the context pointer does not stay dangling.
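The pointer-to-pointer pattern behind avscale_free() can be sketched in isolation; the struct and function names here are made up for illustration.

```c
#include <stdlib.h>

/* Illustration of the free(&ptr) pattern: taking a pointer to the caller's
 * pointer lets the function NULL it out, so the caller cannot accidentally
 * reuse a dangling context. Names are illustrative, not the AVScale API. */
typedef struct Context { int configured; } Context;

static void context_free(Context **pctx)
{
    if (!pctx || !*pctx)
        return;               /* safe to call on NULL, like av_*_free() */
    free(*pctx);
    *pctx = NULL;             /* caller's pointer no longer dangles */
}
```

After context_free(&ctx) the caller's ctx is NULL, and a second call is a harmless no-op, which makes cleanup paths much simpler to write.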
As said, the API is really simple and essential.
Kostya kindly provided an initial proof of concept, and Vittorio, Anton, and I prepared this preview in our spare time. There is plenty left to do; if you like the idea (many people kept telling us they would love a swscale replacement), we even have a fundraiser.