AVScale – part1

swscale is one of the most annoying parts of Libav. A couple of years after the initial blueprint we finally have something almost functional you can play with.

Colorspace conversion and Scaling

Before delving into the library architecture and the outer API, it is probably good to give a quick summary of what this library is about.

Most multimedia concepts are more or less intuitive:
  • Encoding: taking some data (e.g. video frames, audio samples) and compressing it by leaving out unimportant details.
  • Muxing: storing such compressed data along with timestamps so that audio and video can play back in sync.
  • Demuxing: getting back the compressed data together with the timing information stored in the container format.
  • Decoding: inflating the data somehow so that video frames can be rendered on screen and the audio played on the speakers.

After the decoding step it would seem that all the hard work is done, but since there isn’t a single way to store video pixels or audio samples, you need to process them so they work with your output devices.

That process is usually called resampling for audio; for video we have colorspace conversion, to change the pixel information, and scaling, to change the number of pixels in the image.

Today I’ll introduce you to the new library for colorspace conversion and scaling we are working on.

AVScale

The library aims to be as simple as possible and to hide all the gory details from the user: you won’t need to make heads or tails of functions with a large number of arguments, nor of special-purpose functions.

The API itself is modelled after avresample and approaches the problem of conversion and scaling in a way quite different from swscale, following the same design of NAScale.

Everything is a Kernel

One of the key concepts of AVScale is that the conversion chain is assembled out of different components, separating the concerns.

Those components are called kernels.

The kernels can be conceptually divided into two kinds:
  • Conversion kernels, taking an input in a certain format and providing an output in another (e.g. rgb2yuv) without changing any other property.
  • Process kernels, modifying the data while keeping the format itself unchanged (e.g. scale).

This pipeline approach gives great flexibility and helps code reuse.

The most common use-cases (such as scaling without conversion or conversion without scaling) can be faster than solutions that try to merge scaling and conversion into a single step.

API

AVScale works with two kinds of structures:
  • AVPixelFormaton: a full description of the pixel format
  • AVFrame: the frame data, its dimensions and a reference to its format details (aka AVPixelFormaton)

The library will have an AVOption-based system to tune specific options (e.g. selecting the scaling algorithm).
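
Purely as an illustration, assuming the context ends up exposing its options through the usual AVOption machinery, tuning could look something like the following (the "scaler" option name is made up for this sketch):

// Hypothetical: "scaler" is not a real option, it only illustrates the idea
ret = av_opt_set(ctx, "scaler", "bicubic", AV_OPT_SEARCH_CHILDREN);
if (ret < 0)
    ...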

For now only avscale_config and avscale_convert_frame are implemented.

So if the input and output are pre-determined the context can be configured like this:

AVScaleContext *ctx = avscale_alloc_context();

if (!ctx)
    ...

ret = avscale_config(ctx, out, in);
if (ret < 0)
    ...

But you can skip it and scale and/or convert from an input to an output like this:

AVScaleContext *ctx = avscale_alloc_context();

if (!ctx)
    ...

ret = avscale_convert_frame(ctx, out, in);
if (ret < 0)
    ...

avscale_free(&ctx);

The context gets lazily configured on the first call.

Notice that avscale_free() takes a pointer to a pointer, to make sure the context pointer does not stay dangling.

As said the API is really simple and essential.

Help welcome!

Kostya kindly provided an initial proof of concept, and Vittorio, Anton and I prepared this preview in our spare time. There is plenty left to do; if you like the idea (many people kept telling us they would love a swscale replacement) we even have a fundraiser.

New AVCodec API

Another week, another API landed in the tree, and since I spent some time drafting it, I guess I should describe how to use what is implemented now. This is part I.

What is here now

Between theory and practice there is a bit of discussion and, obviously, the (lack of) time to implement, so here is what is different from what I drafted originally:

  • Function Names: push got renamed to send and pull got renamed to receive.
  • No separate functions to probe the process state: need_data and have_data are not here.
  • No codecs have been ported to the new API yet, so no actual asynchronicity for now.
  • Subtitles aren’t supported yet.

New API

There are just 4 new functions replacing both audio-specific and video-specific ones:

// Decode
int avcodec_send_packet(AVCodecContext *avctx, const AVPacket *avpkt);
int avcodec_receive_frame(AVCodecContext *avctx, AVFrame *frame);

// Encode
int avcodec_send_frame(AVCodecContext *avctx, const AVFrame *frame);
int avcodec_receive_packet(AVCodecContext *avctx, AVPacket *avpkt);

The workflow is sort of simple:
– You setup the decoder or the encoder as usual
– You feed data using the avcodec_send_* functions until you get an AVERROR(EAGAIN), which signals that the internal input buffer is full.
– You get the data back using the matching avcodec_receive_* function until you get an AVERROR(EAGAIN), signalling that the internal output buffer is empty.
– Once you are done feeding data you have to pass a NULL to signal the end of stream.
– You can keep calling the avcodec_receive_* function until you get AVERROR_EOF.
– You free the contexts as usual.

Decoding examples

Setup

The setup uses the usual avcodec_open2.

    ...

    c = avcodec_alloc_context3(codec);

    ret = avcodec_open2(c, codec, &opts);
    if (ret < 0)
        ...

Simple decoding loop

People using the old API usually have some kind of simple loop like

while (get_packet(pkt)) {
    ret = avcodec_decode_video2(c, picture, &got_picture, pkt);
    if (ret < 0) {
        ...
    }
    if (got_picture) {
        ...
    }
}

The old functions can be replaced by calling something like the following.

// The flush packet is a non-NULL packet with size 0 and data NULL
int decode(AVCodecContext *avctx, AVFrame *frame, int *got_frame, AVPacket *pkt)
{
    int ret;

    *got_frame = 0;

    if (pkt) {
        ret = avcodec_send_packet(avctx, pkt);
        // In particular, we don't expect AVERROR(EAGAIN), because we read all
        // decoded frames with avcodec_receive_frame() until done.
        if (ret < 0)
            return ret == AVERROR_EOF ? 0 : ret;
    }

    ret = avcodec_receive_frame(avctx, frame);
    if (ret < 0 && ret != AVERROR(EAGAIN) && ret != AVERROR_EOF)
        return ret;
    if (ret >= 0)
        *got_frame = 1;

    return 0;
}
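
For reference, a sketch of how the old loop above maps onto this wrapper, including the draining step with the flush packet mentioned in the comment; get_packet() and the frame handling remain placeholders as before:

while (get_packet(pkt)) {
    ret = decode(c, picture, &got_picture, pkt);
    if (ret < 0) {
        ...
    }
    if (got_picture) {
        ...
    }
}

// Drain: send the flush packet once, then keep calling with NULL
// until no more frames are returned.
pkt->data = NULL;
pkt->size = 0;
ret = decode(c, picture, &got_picture, pkt);
while (ret >= 0 && got_picture) {
    ...
    ret = decode(c, picture, &got_picture, NULL);
}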

Callback approach

Since the new API will output multiple frames in certain situations, it is better to process them as they are produced.

// return 0 on success, negative on error
typedef int (*process_frame_cb)(void *ctx, AVFrame *frame);

int decode(AVCodecContext *avctx, AVPacket *pkt,
           process_frame_cb cb, void *priv)
{
    AVFrame *frame = av_frame_alloc();
    int ret;

    ret = avcodec_send_packet(avctx, pkt);
    // Again EAGAIN is not expected
    if (ret < 0)
        goto out;

    while (!ret) {
        ret = avcodec_receive_frame(avctx, frame);
        if (!ret)
            ret = cb(priv, frame);
    }

out:
    av_frame_free(&frame);
    if (ret == AVERROR(EAGAIN))
        return 0;
    return ret;
}

Separated threads

The new API makes it sort of easy to split the workload in two separate threads.

// Assume we have context with a mutex, a condition variable and the AVCodecContext


// Feeding loop
{
    AVPacket *pkt = NULL;

    while ((ret = get_packet(ctx, pkt)) >= 0) {
        pthread_mutex_lock(&ctx->lock);

        ret = avcodec_send_packet(avctx, pkt);
        if (!ret) {
            pthread_cond_signal(&ctx->cond);
        } else if (ret == AVERROR(EAGAIN)) {
            // Signal the draining loop
            pthread_cond_signal(&ctx->cond);
            // Wait here
            pthread_cond_wait(&ctx->cond, &ctx->lock);
        } else if (ret < 0)
            goto out;

        pthread_mutex_unlock(&ctx->lock);
    }

    pthread_mutex_lock(&ctx->lock);
    ret = avcodec_send_packet(avctx, NULL);

    pthread_cond_signal(&ctx->cond);

out:
    pthread_mutex_unlock(&ctx->lock);
    return ret;
}

// Draining loop
{
    AVFrame *frame = av_frame_alloc();

    while (!done) {
        pthread_mutex_lock(&ctx->lock);

        ret = avcodec_receive_frame(avctx, frame);
        if (!ret) {
            pthread_cond_signal(&ctx->cond);
        } else if (ret == AVERROR(EAGAIN)) {
            // Signal the feeding loop
            pthread_cond_signal(&ctx->cond);
            // Wait
            pthread_cond_wait(&ctx->cond, &ctx->lock);
        } else if (ret < 0)
            goto out;

        pthread_mutex_unlock(&ctx->lock);

        if (!ret) {
            do_something(frame);
        }
    }

out:
    pthread_mutex_unlock(&ctx->lock);
    return ret;
}

It isn’t as neat as having all this abstracted away, but it is mostly workable.

Encoding Examples

Simple encoding loop

Some compatibility with the old API can be achieved using something along the lines of:

int encode(AVCodecContext *avctx, AVPacket *pkt, int *got_packet, AVFrame *frame)
{
    int ret;

    *got_packet = 0;

    ret = avcodec_send_frame(avctx, frame);
    if (ret < 0)
        return ret;

    ret = avcodec_receive_packet(avctx, pkt);
    if (!ret)
        *got_packet = 1;
    if (ret == AVERROR(EAGAIN))
        return 0;

    return ret;
}

Callback approach

Since multiple output packets could be produced for each input frame, it is better to loop over the output as soon as possible.

// return 0 on success, negative on error
typedef int (*process_packet_cb)(void *ctx, AVPacket *pkt);

int encode(AVCodecContext *avctx, AVFrame *frame,
           process_packet_cb cb, void *priv)
{
    AVPacket *pkt = av_packet_alloc();
    int ret;

    ret = avcodec_send_frame(avctx, frame);
    if (ret < 0)
        goto out;

    while (!ret) {
        ret = avcodec_receive_packet(avctx, pkt);
        if (!ret)
            ret = cb(priv, pkt);
    }

out:
    av_packet_free(&pkt);
    if (ret == AVERROR(EAGAIN))
        return 0;
    return ret;
}

The I/O should happen in a different thread when possible so the callback should just enqueue the packets.
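
A sketch of such a callback, assuming some user-provided thread-safe queue; PacketQueue and queue_push() here are hypothetical, not part of the API:

static int enqueue_packet(void *priv, AVPacket *pkt)
{
    PacketQueue *q = priv;            // hypothetical user queue type
    AVPacket *copy = av_packet_alloc();
    int ret;

    if (!copy)
        return AVERROR(ENOMEM);

    // Take our own reference, no data copy involved
    ret = av_packet_ref(copy, pkt);
    if (ret < 0) {
        av_packet_free(&copy);
        return ret;
    }

    // The muxing thread drains the queue and frees the packets
    return queue_push(q, copy);       // hypothetical, must be thread-safe
}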

Coming Next

This post is long enough so the next one might involve converting a codec to the new API.

Bitstream Filtering

Last weekend, after a few months of work, the new bitstream filter API finally landed.

Bitstream filters

In Libav it is possible to manipulate raw and encoded data in many ways, the most common being:

  • Demuxing: extracting single data packets and their timing information
  • Decoding: converting the compressed data packets into raw video or audio frames
  • Encoding: converting the raw multimedia information into a compressed form
  • Muxing: storing the compressed information along with timing and additional information.

Bitstream filtering is somewhat less considered, even though bitstream filters are widely used under the hood to demux and mux many common formats.

It could be considered an optional final demuxing or initial muxing step, since it works on encoded data and its main purpose is to reformat the data so it can be accepted by decoders consuming only a specific serialization of the many supported (e.g. the HEVC QSV decoder), or so it can be correctly muxed into a container format that stores only a specific kind.

In Libav this kind of reformatting normally happens automatically, with the annoying exception of MPEGTS muxing.

New API

The new API is modeled on the pull/push paradigm I described for AVCodec before; it works on AVPackets and has the following concrete implementation:

// Query
const AVBitStreamFilter *av_bsf_next(void **opaque);
const AVBitStreamFilter *av_bsf_get_by_name(const char *name);

// Setup
int av_bsf_alloc(const AVBitStreamFilter *filter, AVBSFContext **ctx);
int av_bsf_init(AVBSFContext *ctx);

// Usage
int av_bsf_send_packet(AVBSFContext *ctx, AVPacket *pkt);
int av_bsf_receive_packet(AVBSFContext *ctx, AVPacket *pkt);

// Cleanup
void av_bsf_free(AVBSFContext **ctx);

In order to use a bsf you need to:

  • Look up its AVBitStreamFilter definition using a query function.
  • Set up a specific AVBSFContext, by allocating, configuring and then initializing it.
  • Feed the input using the av_bsf_send_packet function and get the processed output, once it is ready, using av_bsf_receive_packet.
  • Once you are done, av_bsf_free cleans up the memory used for the context and the internal buffers.

Query

You can enumerate the available filters

void *state = NULL;

const AVBitStreamFilter *bsf;

while ((bsf = av_bsf_next(&state))) {
    av_log(NULL, AV_LOG_INFO, "%s\n", bsf->name);
}

or directly pick the one you need by name:

const AVBitStreamFilter *bsf = av_bsf_get_by_name("hevc_mp4toannexb");

Setup

A bsf may use some codec parameters and time_base and provide updated ones.

AVBSFContext *ctx;

ret = av_bsf_alloc(bsf, &ctx);
if (ret < 0)
    return ret;

ret = avcodec_parameters_copy(ctx->par_in, in->codecpar);
if (ret < 0)
    goto fail;

ctx->time_base_in = in->time_base;

ret = av_bsf_init(ctx);
if (ret < 0)
    goto fail;

ret = avcodec_parameters_copy(out->codecpar, ctx->par_out);
if (ret < 0)
    goto fail;

out->time_base = ctx->time_base_out;

Usage

Multiple AVPackets may be consumed before an AVPacket is emitted or multiple AVPackets may be produced out of a single input one.

AVPacket *pkt;

while (got_new_packet(&pkt)) {
    ret = av_bsf_send_packet(ctx, pkt);
    if (ret < 0)
        goto fail;

    while ((ret = av_bsf_receive_packet(ctx, pkt)) == 0) {
        yield_packet(pkt);
    }

    if (ret == AVERROR(EAGAIN))
        continue;
    if (ret == AVERROR_EOF)
        goto end;
    if (ret < 0)
        goto fail;
}

// Flush
ret = av_bsf_send_packet(ctx, NULL);
if (ret < 0)
    goto fail;

while ((ret = av_bsf_receive_packet(ctx, pkt)) == 0) {
    yield_packet(pkt);
}

if (ret != AVERROR_EOF)
    goto fail;

In order to signal the end of stream a NULL pkt should be fed to send_packet.

Cleanup

The cleanup function matches the av_freep signature so it takes the address of the AVBSFContext pointer.

    av_bsf_free(&ctx);

All the memory is freed and the ctx pointer is set to NULL.

Coming Soon

Hopefully next I’ll document the new HWAccel layer that has already landed, and some other API that I discussed with Kostya before.
Sadly my blog-time (and spare time in general) shrunk a lot in the past months, so he rightfully blamed me a lot.

Deprecating AVPicture

In Libav we try to clean up the API and make it more regular; this is one of the possibly many articles I will write about APIs, this time about deprecating a relic from the past and why we are doing it.

AVPicture

This struct used to store image data using data pointers and linesizes. It comes from the far past and it looks like this:

typedef struct AVPicture {
    uint8_t *data[AV_NUM_DATA_POINTERS];
    int linesize[AV_NUM_DATA_POINTERS];
} AVPicture;

Once the AVFrame was introduced, it was made to alias to it, and for some time the two structures were actually defined sharing the common initial fields through a macro.

The AVFrame then evolved to store both audio and image data, to use AVBuffer so as not to do needless copies, and to provide more useful information (e.g. the actual data format). Now it looks like:

typedef struct AVFrame {
    uint8_t *data[AV_NUM_DATA_POINTERS];
    int linesize[AV_NUM_DATA_POINTERS];

    uint8_t **extended_data;

    int width, height;

    int nb_samples;

    int format;

    int key_frame;

    enum AVPictureType pict_type;

    AVRational sample_aspect_ratio;

    int64_t pts;

    ...
} AVFrame;

The image-data manipulation functions moved to the av_image namespace and use data and linesize pointers directly, while the equivalent avpicture functions became wrappers over them.

int avpicture_fill(AVPicture *picture, uint8_t *ptr,
                   enum AVPixelFormat pix_fmt, int width, int height)
{
    return av_image_fill_arrays(picture->data, picture->linesize,
                                ptr, pix_fmt, width, height, 1);
}

int avpicture_layout(const AVPicture* src, enum AVPixelFormat pix_fmt,
                     int width, int height,
                     unsigned char *dest, int dest_size)
{
    return av_image_copy_to_buffer(dest, dest_size,
                                   src->data, src->linesize,
                                   pix_fmt, width, height, 1);
}

...

It is also used in the subtitle abstraction:

typedef struct AVSubtitleRect {
    int x, y, w, h;
    int nb_colors;

    AVPicture pict;
    enum AVSubtitleType type;

    char *text;
    char *ass;
    int flags;
} AVSubtitleRect;

And it is used to crudely pass an AVFrame from the decoder level to the muxer level for certain rawvideo muxers, by doing something such as:

    pkt.data   = (uint8_t *)frame;
    pkt.size   =  sizeof(AVPicture);

AVPicture problems

In the codebase its remaining usage is dubious at best:

AVFrame as AVPicture

In some codecs the AVFrames produced or consumed are cast to AVPicture and passed to avpicture functions instead
of using the av_image functions directly.

AVSubtitleRect

For the subtitle codecs, accessing the Rect data requires a pointless indirection, usually something like:

    wrap3 = rect->pict.linesize[0];
    p = rect->pict.data[0];
    pal = (const uint32_t *)rect->pict.data[1];  /* Now in YCrCb! */

AVFMT_RAWPICTURE

Copying memory from one buffer to another when it can be avoided is considered a major sin (“memcpy is murder”) since it is a costly operation in itself and it usually invalidates the cache if we are talking about large buffers.

Certain muxers for rawvideo, try to spare a memcpy and thus avoid a “murder” by not copying the AVFrame data to the AVPacket.

The idea in itself is simple enough: store the AVFrame pointer as if it pointed to a flat array, consider the data size to be the AVPicture size and hope that the data pointed to by the AVFrame remains valid while the AVPacket is consumed.

Simple and faulty: with the AVFrame ref-counted API codecs may use a Pool of AVFrames and reuse them.
It can lead to surprising results because the buffer gets updated before the AVPacket is actually written.
If the frame referenced changes dimensions or gets deallocated it could even lead to crashes.

Definitely not a great idea.

Solutions

Vittorio, wm4 and I worked together to fix the problems. Radically.

AVFrame as AVPicture

The av_image functions are now used when needed.
Some pointless copies got replaced by av_frame_ref, leading to less memory usage and simpler code.

No AVPictures are left in the video codecs.

AVSubtitle

The AVSubtitleRect is updated to have simple data and linesize fields and each codec is updated to keep the AVPicture and the new fields in sync during the deprecation window.

The code is already a little easier to follow now.

AVFMT_RAWPICTURE

Just dropping the “feature” would be a problem since those muxers are widely used in FATE and the time the additional copy takes adds up to quite a lot. Your regression test must be as quick as possible.

I wrote a safer wrapper pseudo-codec that leverages the fact that both AVPacket and AVFrame use a ref-counted system:

  • The AVPacket takes the AVFrame and increases its ref-count by 1.
  • The AVFrame is then stored in the data field and wrapped in a custom AVBuffer.
  • That AVBuffer destructor callback unrefs the frame.

This way the AVFrame data won’t change until the AVPacket gets destroyed.
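
A minimal sketch of the idea (not the actual code that landed): the packet keeps a cloned frame alive through a custom AVBuffer whose destructor unrefs it.

// Destructor for the custom AVBuffer: the "data" is really an AVFrame
static void wrapped_frame_free(void *opaque, uint8_t *data)
{
    AVFrame *frame = (AVFrame *)data;
    av_frame_free(&frame);
}

int wrap_frame_in_packet(AVPacket *pkt, const AVFrame *src)
{
    // Cloning only takes a new reference on the underlying buffers
    AVFrame *frame = av_frame_clone(src);
    if (!frame)
        return AVERROR(ENOMEM);

    pkt->buf = av_buffer_create((uint8_t *)frame, sizeof(*frame),
                                wrapped_frame_free, NULL,
                                AV_BUFFER_FLAG_READONLY);
    if (!pkt->buf) {
        av_frame_free(&frame);
        return AVERROR(ENOMEM);
    }

    pkt->data = (uint8_t *)frame;
    pkt->size = sizeof(*frame);

    return 0;
}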

Goodbye AVPicture

With release 14 the AVPicture struct will be removed completely from Libav; people using it outside Libav should consider moving to the full AVFrame (and leveraging the additional features it provides) or to the av_image functions directly.

Decoupling an API – Part II

During the VDD we had lots of discussions and I enjoyed reviewing the initial NihAV implementation. Kostya already wrote some more about the decoupled API that I described at high level here.

This article is about some possible implementation details, at least another will follow.

The new API requires some additional data structures: mainly something to keep the data that is being consumed/produced, additional implementation callbacks in AVCodec and possibly a means to skip the queuing up completely.

Data Structures

AVPacketQueue and AVFrameQueue

In the previous post I took some kind of Queue for granted.

Ideally the API for it could be really simple:

typedef struct AVPacketQueue AVPacketQueue;

AVPacketQueue *av_packet_queue_alloc(int size);
int av_packet_queue_put(AVPacketQueue *q, AVPacket *pkt);
int av_packet_queue_get(AVPacketQueue *q, AVPacket *pkt);
int av_packet_queue_size(AVPacketQueue *q);
void av_packet_queue_free(AVPacketQueue **q);

typedef struct AVFrameQueue AVFrameQueue;

AVFrameQueue *av_frame_queue_alloc(int size);
int av_frame_queue_put(AVFrameQueue *q, AVFrame *frame);
int av_frame_queue_get(AVFrameQueue *q, AVFrame *frame);
int av_frame_queue_size(AVFrameQueue *q);
void av_frame_queue_free(AVFrameQueue **q);

Internally it leverages the ref-counted API (av_packet_move_ref and av_frame_move_ref) and any data structure that fits the queue usage. It will be used in a multi-threaded scenario, so a form of lock has to be fit into it.
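
A minimal sketch of how the put side could look internally, assuming a fixed-size ring of AVPackets guarded by a pthread mutex/condition pair; the AVPacketQueue layout here is purely hypothetical:

struct AVPacketQueue {
    AVPacket *pkts;            // ring buffer of "size" packets
    int size, head, count;
    pthread_mutex_t lock;
    pthread_cond_t cond;
};

int av_packet_queue_put(AVPacketQueue *q, AVPacket *pkt)
{
    pthread_mutex_lock(&q->lock);

    // Block while the queue is full
    while (q->count == q->size)
        pthread_cond_wait(&q->cond, &q->lock);

    // Move the reference into the slot, no data copy involved
    av_packet_move_ref(&q->pkts[(q->head + q->count) % q->size], pkt);
    q->count++;

    pthread_cond_signal(&q->cond);
    pthread_mutex_unlock(&q->lock);

    return 0;
}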

We have already something specific for AVPlay, using a simple Linked List and a FIFO for some other components that have a near-constant maximum number of items (e.g. avconv, NVENC, QSV).

Possibly a tree could also be used to implement something such as av_packet_queue_insert_by_pts and have some form of reordering happen on the fly. I’m not a fan of it, but I’m sure someone will come up with the idea.

The Queues are part of AVCodecContext.

typedef struct AVCodecContext {
    // ...

    AVPacketQueue *packet_queue;
    AVFrameQueue *frame_queue;

    // ...
} AVCodecContext;

Implementation Callbacks

In Libav the AVCodec struct describes some specific codec features (such as the supported framerates) and holds the actual codec implementation through callbacks such as init, decode/encode2, flush and close.
The new model obviously requires additional callbacks.

Once the data is in a queue it is ready to be processed, the actual decoding or encoding can happen in multiple places, for example:

  • In avcodec_*_push or avcodec_*_pull, once there is enough data. In that case the remaining functions are glorified proxies for the matching queue function.
  • somewhere else such as a separate thread that is started on avcodec_open or the first avcodec_decode_push and is eventually stopped once the context related to it is freed by avcodec_close. This is what happens under the hood when you have certain hardware acceleration.

Common

typedef struct AVCodec {
    // ... previous fields
    int (*need_data)(AVCodecContext *avctx);
    int (*has_data)(AVCodecContext *avctx);
    // ...
} AVCodec;

Those are used by both the encoder and decoder, some implementations such as QSV have functions that can be used to probe the internal state in this regard.

Decoding

typedef struct AVCodec {
    // ... previous fields
    int (*decode_push)(AVCodecContext *avctx, AVPacket *packet);
    int (*decode_pull)(AVCodecContext *avctx, AVFrame *frame);
    // ...
} AVCodec;

Those two functions can take a portion of the work the current decode function does, for example:
– the initial parsing and dispatch to a worker thread can happen in the _push.
– reordering and blocking until there is data to output can happen on _pull.

Assuming the reordering does not happen outside the pull callback in some generic code.

Encoding

typedef struct AVCodec {
    // ... previous fields
    int (*encode_push)(AVCodecContext *avctx, AVFrame *frame);
    int (*encode_pull)(AVCodecContext *avctx, AVPacket *packet);
} AVCodec;

As with the decoding callbacks, the encode2 workload is split: the _push function might just keep queuing up until there are enough frames to complete the initial analysis, while, for single-threaded encoding, the rest of the work happens in the _pull.

Yielding data directly

So far the API mainly keeps some queues filled and lets some magic happen under the hood; let’s see some usage examples first:

Simple Usage

Let’s expand the last example from the previous post: register callbacks to pull/push the data and have some simple loops.

Decoding

typedef struct DecodeCallback {
    int (*pull_packet)(void *priv, AVPacket *pkt);
    int (*push_frame)(void *priv, AVFrame *frame);
    void *priv_data_pull, *priv_data_push;
} DecodeCallback;

Two pointers since you pull from a demuxer+parser and you push to a splitter+muxer.

int decode_loop(AVCodecContext *avctx, DecodeCallback *cb)
{
    AVPacket *pkt  = av_packet_alloc();
    AVFrame *frame = av_frame_alloc();
    int ret;
    while ((ret = avcodec_decode_need_data(avctx)) > 0) {
        ret = cb->pull_packet(cb->priv_data_pull, pkt);
        if (ret < 0)
            goto end;
        ret = avcodec_decode_push(avctx, pkt);
        if (ret < 0)
            goto end;
    }
    while ((ret = avcodec_decode_have_data(avctx)) > 0) {
        ret = avcodec_decode_pull(avctx, frame);
        if (ret < 0)
            goto end;
        ret = cb->push_frame(cb->priv_data_push, frame);
        if (ret < 0)
            goto end;
    }

end:
    av_frame_free(&frame);
    av_packet_free(&pkt);
    return ret;
}

Encoding

For encoding something quite similar can be done:

typedef struct EncodeCallback {
    int (*pull_frame)(void *priv, AVFrame *frame);
    int (*push_packet)(void *priv, AVPacket *packet);
    void *priv_data_push, *priv_data_pull;
} EncodeCallback;

The loop is exactly the same beside the data types swapped.

int encode_loop(AVCodecContext *avctx, EncodeCallback *cb)
{
    AVPacket *pkt  = av_packet_alloc();
    AVFrame *frame = av_frame_alloc();
    int ret;
    while ((ret = avcodec_encode_need_data(avctx)) > 0) {
        ret = cb->pull_frame(cb->priv_data_pull, frame);
        if (ret < 0)
            goto end;
        ret = avcodec_encode_push(avctx, frame);
        if (ret < 0)
            goto end;
    }
    while ((ret = avcodec_encode_have_data(avctx)) > 0) {
        ret = avcodec_encode_pull(avctx, pkt);
        if (ret < 0)
            goto end;
        ret = cb->push_packet(cb->priv_data_push, pkt);
        if (ret < 0)
            goto end;
    }

end:
    av_frame_free(&frame);
    av_packet_free(&pkt);
    return ret;
}

Transcoding

Transcoding, the naive way, could be something such as

int transcode(AVFormatContext *mux,
              AVFormatContext *dem,
              AVCodecContext *enc,
              AVCodecContext *dec)
{
    DecodeCallback dcb = {
        get_packet,
        av_frame_queue_put,
        dem, enc->frame_queue };
    EncodeCallback ecb = {
        av_frame_queue_get,
        push_packet,
        enc->frame_queue, mux };
    int ret;

    do {
        if ((ret = decode_loop(dec, &dcb)) > 0)
            ret = encode_loop(enc, &ecb);
    } while (ret > 0);

    return ret;
}

One loop feeds the other through the queue. get_packet and push_packet are demuxing and muxing functions; they might end up being two other queue functions once the AVFormat layer gets a similar overhaul.

Advanced usage

From the examples above you may notice that in some situations you could possibly do better;
all the loops pull data from a queue and push it immediately to another:

  • why not feed the right queue immediately once you have the data ready?
  • why not do some processing before feeding the decoded data to the encoder, such as converting the pixel format?

Here are some additional structures and functions to enable advanced users:

typedef struct AVFrameCallback {
    int (*yield)(void *priv, AVFrame *frame);
    void *priv_data;
} AVFrameCallback;

typedef struct AVPacketCallback {
    int (*yield)(void *priv, AVPacket *pkt);
    void *priv_data;
} AVPacketCallback;

typedef struct AVCodecContext {
    // ...

    AVFrameCallback *frame_cb;
    AVPacketCallback *packet_cb;

    // ...
} AVCodecContext;

int av_frame_yield(AVFrameCallback *cb, AVFrame *frame)
{
    return cb->yield(cb->priv_data, frame);
}

int av_packet_yield(AVPacketCallback *cb, AVPacket *packet)
{
    return cb->yield(cb->priv_data, packet);
}

Instead of using the Queue API directly, it would be possible to use yield functions and give the user a means to override them.

Some API sugar could be something along the lines of this:

int avcodec_decode_yield(AVCodecContext *avctx, AVFrame *frame)
{
    int ret;

    if (avctx->frame_cb) {
        ret = av_frame_yield(avctx->frame_cb, frame);
    } else {
        ret = av_frame_queue_put(avctx->frame_queue, frame);
    }

    return ret;
}

Whenever a frame (or a packet) is ready it could be passed immediately to another function; depending on your threading model and CPU it might be much more efficient to skip some enqueuing+dequeuing steps, such as feeding directly some user queue that uses different datatypes.

This approach might work well even internally to insert bitstream reformatters after the encoding or before the decoding.

Open problems

The callback system is quite powerful but you have at least a couple of issues to take care of:
– Error reporting: when something goes wrong, how do we notify what broke?
– Error recovery: how much does the user have to undo to fall back properly?

Probably this part should be kept for later, since there is already a huge amount of work.

What’s next

Muxing and demuxing

Ideally the container format layer should receive the same kind of overhaul. I’m not even halfway through documenting what should
change, but from this blog post you might guess the kind of changes. Spoiler: the I/O layer gets spun off into a separate library.

Proof of Concept

Soon^WNot so late I’ll complete a PoC out of this and possibly hack avplay so that it uses either QSV or VideoToolbox as a test case (depending on which operating system I’m playing with when I start); I’ll probably see soon which are the limitations of this approach.

If you like the ideas posted above or you want to discuss them more, you can join the Libav irc channel or mailing list to discuss and help.

Rethinking AVFormat – part 1

Container formats should be just a boring application of serialization: multiple arrays of (timestamp, binary blob) tuples.

Instead there are tons of implementation details and there are fun
and exceedingly annoying means to lose your sanity.

This post is yet another post about APIs; you can see others here and here.

Current Status

In Libav we have libavformat taking care of general I/O, Muxing, Demuxing.

This blog post will not cover the additional grouping given by Programs, Chapters and such, in order not to make the whole article huge, and will just focus on the basics.

I/O

The AVIO abstraction provides a means to uniformly access content stored in files, available as remote streams (e.g. served through http or rtmp), or provided through custom implementations.

This part of the API is tightly coupled with the Muxer and Demuxer implementations.

It uses the common Context pattern you can find across the rest of Libav, with some twists:

  • The protocol handler can be guessed using the url provided, e.g. file:///tmp/foo.
  • The functions that allocate a context take, beside the usual options AVDictionary, an extra parameter in the form of a callback function.
  • You can create your own custom protocol easily.

int avio_open2(AVIOContext **s, const char *url, int flags, const AVIOInterruptCB *int_cb, AVDictionary **options);

AVIOContext *avio_alloc_context(unsigned char *buffer, int buffer_size, int write_flag, void *opaque,
                                int(*read_packet)(void *opaque, uint8_t *buf, int buf_size),
                                int(*write_packet)(void *opaque, uint8_t *buf, int buf_size),
                                int64_t(*seek)(void *opaque, int64_t offset, int whence))
int avio_closep(AVIOContext **s);

The api tries to mimic the C stdio plus lots of API sugar.

# core functions
int avio_read(AVIOContext *s, unsigned char *buf, int size);
void avio_write(AVIOContext *s, const unsigned char *buf, int size);
int64_t avio_seek(AVIOContext *s, int64_t offset, int whence);


# simple integer readers
int          avio_r8  (AVIOContext *s);
uint64_t     avio_rb64(AVIOContext *s);
uint64_t     avio_rl64(AVIOContext *s);
unsigned int avio_rb16(AVIOContext *s);
unsigned int avio_rb24(AVIOContext *s);
unsigned int avio_rb32(AVIOContext *s);
unsigned int avio_rl16(AVIOContext *s);
unsigned int avio_rl24(AVIOContext *s);
unsigned int avio_rl32(AVIOContext *s);

# simple integer writers
void avio_w8(AVIOContext *s, int b);
void avio_wb16(AVIOContext *s, unsigned int val);
void avio_wb24(AVIOContext *s, unsigned int val);
void avio_wb32(AVIOContext *s, unsigned int val);
void avio_wb64(AVIOContext *s, uint64_t val);
void avio_wl16(AVIOContext *s, unsigned int val);
void avio_wl24(AVIOContext *s, unsigned int val);
void avio_wl32(AVIOContext *s, unsigned int val);
void avio_wl64(AVIOContext *s, uint64_t val);


# utf8 and utf16 strings
int avio_get_str(AVIOContext *pb, int maxlen, char *buf, int buflen);

int avio_get_str16le(AVIOContext *pb, int maxlen, char *buf, int buflen);
int avio_get_str16be(AVIOContext *pb, int maxlen, char *buf, int buflen);

int avio_put_str(AVIOContext *s, const char *str);

int avio_put_str16le(AVIOContext *s, const char *str);

... (and more) ...

Buffering

All the functions use an intermediate buffer to back reads and writes; the buffer can be explicitly flushed, or it gets flushed automatically once a request would fall outside of it.

void avio_flush(AVIOContext *s);

A special kind of AVIOContext is the dynamic write buffer: it extends on demand and can be used to build complex recurring patterns once and write them as many times as needed.

int avio_open_dyn_buf(AVIOContext **s);

int avio_close_dyn_buf(AVIOContext *s, uint8_t **pbuffer);
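
A short sketch of the dynamic buffer in action; out stands for whatever AVIOContext the pattern should eventually be written to, and the values written are arbitrary examples:

AVIOContext *dyn;
uint8_t *buf;
int size, ret;

ret = avio_open_dyn_buf(&dyn);
if (ret < 0)
    return ret;

// Build the recurring pattern once...
avio_wb32(dyn, 0xCAFEBABE);            // arbitrary example value
avio_put_str(dyn, "recurring header");

// ...then grab it and reuse it as many times as needed.
size = avio_close_dyn_buf(dyn, &buf);
avio_write(out, buf, size);
av_free(buf);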

Error handling

An I/O layer has to take into account the fact that the resource being read or written could abruptly disappear or suddenly slow down. This is valid for both local and remote resources.

The internal buffer allocation might fail.

A seek too far could lead to the end of file.

The AVIO approach to errors is quite simplistic:
– A write can silently fail.
– A failing read just returns a zeroed buffer or value.
– All the functions set the error field or the eof_reached field.

It is up to the user to decide when to check for I/O problems, or to leverage the AVIOInterruptCB to implement timeouts or other means to interrupt a read or a write that would otherwise just quietly block until it completes.
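
A sketch of a crude timeout through AVIOInterruptCB: the callback is polled during blocking operations and a non-zero return aborts them. The deadline handling here is just an example:

// Returns non-zero once the deadline has passed, aborting the operation
static int interrupt_cb(void *opaque)
{
    int64_t *deadline = opaque;
    return av_gettime() > *deadline;
}

...

int64_t deadline = av_gettime() + 5 * INT64_C(1000000);  // 5 seconds from now
AVIOInterruptCB int_cb = { interrupt_cb, &deadline };
AVIOContext *s = NULL;

ret = avio_open2(&s, "http://example.com/stream", AVIO_FLAG_READ,
                 &int_cb, NULL);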

Demuxing (and Probing)

The AVFormat part taking care of input streams can be split in three: probing the data to guess the right demuxer, the actual demuxing, and optionally parsing the demuxed data to fit it into packets containing the information needed by the decoder to decode a frame of video or a matching amount of audio samples. Later I call that a frame-worth amount of data, and I call this process chopping amorphous data streams. It is a colorful expression, but it represents the endeavor quite well.

Probing

The Probe functions take an arbitrary big chunk of data (stored in a AVProbeData struct) and figure out which demuxer should be able to actually parse it correctly.

As a rule of thumb, probes need to be fast, since all of them have to be run over the data at least once and possibly multiple times: if the result is not really conclusive, increasing the amount of data and trying again is an option.

AVInputFormat *av_probe_input_format2(AVProbeData *pd, int is_opened,
                                      int *score_max);

A helper function to probe from an AVIOContext and get the possible input format is provided.

int av_probe_input_buffer(AVIOContext *pb, AVInputFormat **fmt,
                          const char *filename, void *logctx,
                          unsigned int offset, unsigned int max_probe_size);

It is used internally by avformat_open_input to automatically figure out the demuxer to use, and it might look a little confusing.

Demuxing

Once the input format is either guessed or selected, the actual demuxing conceptually is just providing AVPackets
as they are parsed. You might want to reposition within the stream at random times (the infamous seeking, opening yet another can of worms).

int avformat_open_input(AVFormatContext **ps, const char *filename,
                        AVInputFormat *fmt, AVDictionary **options);

int av_read_frame(AVFormatContext *s, AVPacket *pkt);

void avformat_close_input(AVFormatContext **ps);
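
A minimal sketch of how these calls fit together; process_packet() is a placeholder for whatever consumes the demuxed data:

AVFormatContext *fmt = NULL;
AVPacket pkt;
int ret;

ret = avformat_open_input(&fmt, "input.mkv", NULL, NULL);
if (ret < 0)
    return ret;

// avformat_find_stream_info(), covered below, would usually be called here

av_init_packet(&pkt);
while ((ret = av_read_frame(fmt, &pkt)) >= 0) {
    // pkt.stream_index tells which AVStream the packet belongs to
    process_packet(&pkt);
    av_packet_unref(&pkt);
}

avformat_close_input(&fmt);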

Figuring out the data inside the format

Some container formats keep the information regarding their contents in a global header at the start of the file, other, that could have new data streams appearing at random times, do not.

Since there is no easy means to figure out which kind of data they are storing, the only safe way is to try to decode some packets in order to know which kind of data is available; this is what avformat_find_stream_info does.

int avformat_find_stream_info(AVFormatContext *ic, AVDictionary **options);

The apparently simple function does a lot of work behind the scenes: it demuxes and decodes a settable number of packets before giving up and keeps all of them in an internal queue so that they will be available for demuxing even if the input stream is not seekable.

Getting the data outside

Containers such as MPEG PS mux data in small fixed-size chunks,
while muxers and decoders usually expect to receive AVPackets containing enough data to produce a frame.

Specific parsers can be inserted automatically to take the amorphous stream of demuxed data and chop AVPackets containing the expected amount of data out of it.

This happens usually automatically so the user does not have to care about it as long as the codec parser is present.

Timestamps

The multimedia data is expected to carry a timestamp so that video frames and audio frames (and subtitles) can be presented at the same time.

Some containers do provide such timestamps directly, others do not, requiring some amount of guesswork by heuristics that might or might not work depending on the codec at hand.

For example, if the container is supposed to not allow variable frame rate, the implicit time stamp for video can be deduced from the frame number. This might not work as expected if the codec uses B-frames and requires some form
of reordering.

This part in Libav is sort of hidden and often causes a number of problems.

Seeking

Seeking is quite a different and large can of worms.

Ideally seeking just sets the AVIOContext to a certain position and the demuxer keeps working from there.

int av_seek_frame(AVFormatContext *s, int stream_index,
                  int64_t timestamp, int flags);

Depending on the container format and the codec, picking the correct byte offset from the user-provided timestamp can be incredibly simple or really complex, with various degrees of precision.
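
For example, seeking stream 0 to the 10 second mark, asking for the closest preceding keyframe, could look like this sketch (s is an already opened AVFormatContext):

// Convert 10 seconds from AV_TIME_BASE units to the stream time base
int64_t ts = av_rescale_q(10 * AV_TIME_BASE, AV_TIME_BASE_Q,
                          s->streams[0]->time_base);

ret = av_seek_frame(s, 0, ts, AVSEEK_FLAG_BACKWARD);
if (ret < 0)
    ...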

Some formats provide a precise index, so a plain lookup is enough; a dichotomic search looking for the closest I-frame is the common case, and in the worst situation a linear search might be required.

In some cases auxiliary indexes are built to speed up seeking within previously parsed areas.

Seeking is not fun at the demuxer level and gets even worse at the codec level if the data provided is not the one expected.

Muxing

Muxing is sort of simpler than demuxing. The output format is always known and the data always come in AVPackets matching a frame-worth of raw data and possibly sporting correct timestamps.

API-wise it expects an AVFormatContext with the oformat set to the correct AVOutputFormat and the pb
set with an allocated AVIOContext and populated AVStreams.

Once the AVFormatContext is configured it is possible to write the packets. First the global header should be written; then as many packets as needed are muxed, interleaving audio and video so that demuxing and seeking work correctly.

int avformat_write_header(AVFormatContext *s, AVDictionary **options);

int av_interleaved_write_frame(AVFormatContext *s, AVPacket *pkt);

int av_write_trailer(AVFormatContext *s);
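
A minimal sketch of the muxing side; get_encoded_packet() is a placeholder for whatever produces the already encoded AVPackets:

// s: an AVFormatContext configured as described above
ret = avformat_write_header(s, &options);
if (ret < 0)
    return ret;

while (get_encoded_packet(&pkt) >= 0) {
    ret = av_interleaved_write_frame(s, &pkt);
    if (ret < 0)
        return ret;
}

return av_write_trailer(s);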

Bitstream filtering

Some codecs have multiple possible representations, e.g. H264 has the AVCC bitstream format and the Annex B bitstream format. Some containers support both, others expect only one or the other. Currently the correct converter from one bitstream format to another must be inserted manually.

Packet interleaving

Certain container formats have quite peculiar muxing rules. This is normally hidden from the user, in certain cases being able to override it is a boon.

Shortcomings summary

In the next post I will explain how I would improve the situation; today’s post is mainly to introduce the structure of AVFormat and start explaining what should be fixed. Here is a short list of what I’d like to fix sooner rather than later.

Non-uniform API

  • There is quite a mixture of av_ and avformat_ namespaces.
  • The muxing and demuxing APIs are sufficiently confusing (and surely I should complete my avformat_open_output to reduce the boilerplate)

Abstractions Leaking the wrong way

  • The demuxing side automagically inserts parsers to chop data streams in a frame-worth amount of data while the muxing side would just fail if the bitstream provided is not matching the one required by the container format.
  • There is quite a bit of hidden magic happening in avformat_find_stream_info, and just recently we added options to at least flush the buffer it keeps to probe for codecs. Having a better function and a better means to control this kind of internal buffer would surely be appreciated by users that need to keep the latency low.
  • There is no good means to be notified if the number of streams changes (new streams found or old streams disappearing).

Bad implementations

  • The old muxers sometimes do not even use the now-available internals (e.g. the interleaver helpers) but internally implement queues and logic that should by now be common and shared across all the muxers.
  • While AVCodec now has quite a uniform means to slice bytes and bits, avformat is not leveraging it besides a few places.

PS: Kostya prefers to provide both the amorphous stream and the chopped packets. It makes sense, since you might have some codec you cannot parse but that you can sort of safely remux if the container is the same.
For the common case I’d rather suggest using a set of functions that always insert parsers when they can, for both demuxing and muxing, and providing another set of functions to get arbitrary lumps of stream as provided by the container format.

Splitting a library – hashes

libavutil contains a lot that is common to the other libraries that compose Libav. It has grown a lot over the years and it’s time to consider splitting it.

Monolithic vs Modular

There will always be some discussion on which approach is globally better.
– Jumbling everything together so you have everything there no matter what: you have your super hammer supporting screws, bolts, nuts and nails.
– Keeping the tools in separate boxes so you carry only the set of spanners you need, when you need it.

For software libraries you have this kind of problem all the time and at multiple levels:
– Do you want to have a single huge header file with every function your library provides, or a set of them organized to keep related functions together?
– Do you want to link a single library, or have the concerns split into multiple ones so you do not have to carry lots of stuff you do not use (storage and memory are still important in some applications)?

Usually modularity comes with the price of additional initial effort (you have to think about what you are going to use a little harder) and maintenance (which library should I update?).

This blog post is about trying to group and split the bunch of unrelated functions present in a library, and trying to get a better API for some of them.

Libavutil

The Libav libraries are written mainly in C and assembly, and they focus a lot on being portable. Libavutil is the foundation.

It contains all the code that is common across libraries from the basics such as memory management to higher level data structures, to video and audio-specific basic manipulation and hashes, cryptographic primitives and lossless compressors.

A lot indeed.

Problems

Irregular Mushroom-API

Some of the highest-level parts of the library appeared little by little: first you need md5 and you add it, then it is aes, then you want lzo. All the crypto and hash components expose direct functions for that specific algorithm, making those components non-optional even if you do not need them.

# libavutil/aes.h
struct AVAES;

struct AVAES *av_aes_alloc(void);

int av_aes_init(struct AVAES *a, const uint8_t *key, int key_bits, int decrypt);
void av_aes_crypt(struct AVAES *a, uint8_t *dst, const uint8_t *src, int count, uint8_t *iv, int decrypt);

# libavutil/xtea.h
typedef struct AVXTEA {
    uint32_t key[16];
} AVXTEA;

void av_xtea_init(struct AVXTEA *ctx, const uint8_t key[16]);
void av_xtea_crypt(struct AVXTEA *ctx, uint8_t *dst, const uint8_t *src,
                   int count, uint8_t *iv, int decrypt);

# libavutil/sha.h
struct AVSHA;

struct AVSHA *av_sha_alloc(void);
int av_sha_init(struct AVSHA* context, int bits);
void av_sha_update(struct AVSHA* context, const uint8_t* data, unsigned int len);
void av_sha_final(struct AVSHA* context, uint8_t *digest);

# libavutil/md5.h
struct AVMD5;

struct AVMD5 *av_md5_alloc(void);
void av_md5_init(struct AVMD5 *ctx);
void av_md5_update(struct AVMD5 *ctx, const uint8_t *src, const int len);
void av_md5_final(struct AVMD5 *ctx, uint8_t *dst);
void av_md5_sum(uint8_t *dst, const uint8_t *src, const int len);

As you might notice, it ended up with lots and lots of exposed, similar-but-non-uniform APIs popping out.

And if having a couple of hashes always around was acceptable, it becomes not so nice when you have more to add.

Right now libavutil exposes 50 separate headers.

Extending it is painful now

Since we already have that many different components inside it, you think twice about adding more stuff (if you are careful and caring); Libav is fairly modular and people do appreciate that.

In my wishlist I have few items such as getting more decompressors natively implemented.

Every new API is a burden to maintain (if you care about legacy and you keep maintaining and releasing your older software), so adding or exposing more is always something you should consider carefully.

Abstracting some details always helps: think about what the API would look like if each of the supported codecs had its own exposed, non-uniform set of decoding functions.

Ideal structure

Ideally I’d have the following layout:
– libavutil: basic memory abstraction, error, logs and not much else
– libavdata: basic data structures, including refcounted buffers, dictionaries, trees and such
– libavmedia: audio samples, pixel formats, metadata, frames, packets, side data types.
– libavhash: hashes such md5, sha and such
– libavcomp: compressors such as lzo
– libavcrypto: aes, blowfish and such

API

I have already described my ideal API for the codecs; today I’ll detail the hashes.

As seen above, it is common to have init, update and final functions, plus an optional utility function sum (or calc) that takes a whole buffer and returns the hash.

typedef struct AVHashLibrary AVHashLibrary;
typedef struct AVHash AVHash;
typedef struct AVHashContext AVHashContext;

int av_hash_register_all(AVHashLibrary *hashes);

const AVHash *av_hash_get(AVHashLibrary *hashes, const char *name);

AVHashContext *avhash_open(AVHash *hash, AVDictionary *opts);

int av_hash_update(AVHashContext *ctx, const uint8_t *src, const int len);

uint8_t *av_hash_final(AVHashContext *ctx, int *len);

uint8_t *av_hash_sum(AVHashContext *ctx, const uint8_t *src, const uint64_t src_len, int *out_len);

void avhash_close(AVHashContext *hash);

The structures are fully opaque, the AVHashLibrary contains the list of available hashes and possibly some additional hidden state. In Libav we are trying to remove all the global variables so the list of hashes is explicit.

The register_all function just populates the list of hashes and possibly creates accessory lookup tables when needed.

The get call lets you look up the hash by name; additional ones can be made to look it up by id.

The open function takes a dictionary for hash-specific configuration.

The update and final functions let you calculate the hash incrementally; the sum function is a simple utility that takes a full buffer (whose size is assumed to fit in an uint64_t) and produces the hash.
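
A sketch of how the proposed API above could be used; none of these functions exist in libavutil today, and the AVHashLibrary setup is left out:

AVHashContext *ctx;
uint8_t *digest;
int digest_len;

// hashes: an already populated AVHashLibrary; error checks omitted
ctx = avhash_open(av_hash_get(hashes, "md5"), NULL);

av_hash_update(ctx, buf, buf_len);        // buf/buf_len: the data to hash
digest = av_hash_final(ctx, &digest_len);

avhash_close(ctx);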

The NihAV from Kostya hopefully will have a similar API with TypeLibrary, Type and TypeContext structs.

Decoupling an API

This weekend on #libav-devel we discussed again a bit about the problems with the current core avcodec api.

Current situation

Decoding

We have 3 decoding functions for each of the supported kind of media types: Audio, Video and Subtitles.

Subtitles are already a sore thumb since they do not use AVFrame but a specialized structure; let’s ignore them for now. Audio and Video share pretty much the same signature:

int avcodec_decode_something(AVCodecContext *avctx, AVFrame *f, int *got_frame, AVPacket *p)

It takes a context pointer containing the decoder state, consumes a demuxed packet and optionally outputs a decoded frame containing raw data in a certain format (audio samples, a video frame).

The usage model is quite simple: it takes packets and, whenever it has enough encoded data to emit a frame, it emits one; the got_frame pointer signals whether a frame is ready or more data is needed.

Problem:

What if 1 AVPacket is nearly always enough to output 2 or more frames of raw data?

This happens with MVC and other real-world scenarios.

In general our current API cannot cope with it cleanly.

While working with the MediaSDK interface from Intel and now with MMAL for the Raspberry Pi, similar problems arose due to the natural parallelism the underlying hardware has.

Encoding

We again have 3 functions; again Subtitles are somewhat different, while Audio and Video are nicely uniform.

int avcodec_encode_something(AVCodecContext *avctx, AVPacket *p, const AVFrame *f, int *got_packet)

It is pretty much the dual of the decoding function: the context pointer is the same, a frame of raw data goes in and a packet of encoded data comes out. Again we have a pointer to signal whether we had enough data and an encoded packet has been output.

Problem:

Again we might get multiple AVPackets produced out of a single AVFrame fed in.

This happens when the HEVC “workaround” to encode interlaced content makes the encoder output the two separate fields as separate encoded frames.

Again, the API cannot cope with it cleanly and threaded or otherwise parallel encoding fit the model just barely.

Decoupling the process

To fix this issue (and make our users’ life simpler) the idea is to split the data-feeding function from the one actually providing the processed data.

int avcodec_decode_push(AVCodecContext *avctx, AVPacket *packet);
int avcodec_decode_pull(AVCodecContext *avctx, AVFrame *frame);
int avcodec_decode_need_data(AVCodecContext *avctx);
int avcodec_decode_have_data(AVCodecContext *avctx);
int avcodec_encode_push(AVCodecContext *avctx, AVFrame *frame);
int avcodec_encode_pull(AVCodecContext *avctx, AVPacket *packet);
int avcodec_encode_need_data(AVCodecContext *avctx);
int avcodec_encode_have_data(AVCodecContext *avctx);

From a single function we now have 4, so why is this simpler?

The current workflow is more or less like

while (get_packet_from_demuxer(&pkt)) {
    ret = avcodec_decode_something(avctx, frame, &got_frame, pkt);
    if (got_frame) {
        render_frame(frame);
    }
    if (ret < 0) {
        manage_error(ret);
    }
}

get_packet_from_demuxer() is a function that dequeues the encoded data from some queue or directly calls the demuxer (beware: having your I/O-intensive demuxer function block your CPU-intensive decoding function isn’t nice); render_frame() likewise is either something directly talking to some kind of I/O subsystem, or something enqueuing the data so that the actual rendering (including format conversion, overlaying and scaling) happens in another thread.

The new API makes it much easier to keep the multiple areas of concern separated, so they won’t trip over each other, while the casual user would have something like

while (ret >= 0) {
    while ((ret = avcodec_decode_need_data(avctx)) > 0) {
        ret = get_packet_from_demuxer(&pkt);
        if (ret < 0)
           ...
        ret = avcodec_decode_push(avctx, &pkt);
        if (ret < 0)
           ...
    }
    while ((ret = avcodec_decode_have_data(avctx)) > 0) {
        ret = avcodec_decode_pull(avctx, frame);
        if (ret < 0)
           ...
        render_frame(frame);
    }
}

That has probably few more lines.

Asyncronous API

Since the decoupled API is that simple, it is possible to craft something more immediate for the casual user.

typedef struct AVCodecDecodeCallback {
    int (*pull_packet)(void *priv, AVPacket *pkt);
    int (*push_frame)(void *priv, AVFrame *frame);
    void *priv_data;
} AVCodecDecodeCallback;

int avcodec_register_decode_callbacks(AVCodecContext *avctx, AVCodecDecodeCallback *cb);

int avcodec_decode_loop(AVCodecContext *avctx)
{
    AVCodecDecodeCallback *cb = avctx->cb;
    int ret;
    while ((ret = avcodec_decode_need_data(avctx)) > 0) {
        ret = cb->pull_packet(cb->priv_data, &pkt);
        if (ret < 0)
            return ret;
        ret = avcodec_decode_push(avctx, &pkt);
        if (ret < 0)
            return ret;
    }
    while ((ret = avcodec_decode_have_data(avctx)) > 0) {
        ret = avcodec_decode_pull(avctx, frame);
        if (ret < 0)
            return ret;
        ret = cb->push_frame(cb->priv_data, frame);
    }
    return ret;
}

So the actual minimum decoding loop can be just 2 calls:

ret = avcodec_register_decode_callbacks(avctx, cb);
if (ret < 0)
   ...
while ((ret = avcodec_decode_loop(avctx)) >= 0);

Cute, isn’t it?

Theory is simple …

… the practice not so much:
– there are plenty of implementation issues to take into account.
– LOTS of tedious work converting all the codecs to the new API.
– lots of details to iron out (e.g. should have_data() and need_data() block or not?)

We did radical overhauls before, such as introducing reference-counted AVFrames thanks to Anton, so we aren’t much scared of reshaping and cleaning the codebase once more.

If you like the ideas posted above or you want to discuss them more, you can join the Libav irc channel or mailing list to discuss and help.

Hardware acceleration in Libav

Multimedia formats require lots of computation power since they use fairly complex mathematical transformations. Usually most of them can be implemented efficiently in silicon, requiring orders of magnitude less power to run while remaining quite fast to execute.

Hardware acceleration

Most current platforms, be they desktop, mobile or server, sport some kind of hardware unit to offload decoding and encoding of multimedia formats.

They are usually accessed through platform-specific APIs; sometimes the API is even codec-specific, making the whole implementation experience quite painful and very time consuming.

Depending on the specific hwaccel implementation, it could be bound to the GPU and use GPU memory, thus requiring non-system memory to be managed in specific ways. This adds a burden for those who would just like some quick gain, while opening a world of interesting optimization possibilities such as zero-copy processing pipelines for transcoding, or in-GPU pixel-format conversion, scaling and blending.

There are some generic wrappers such as vdpau, vaapi, dxva2 and vt that abstract some of the related complexity by providing a more uniform interface, but usually there is a need for a proper (and possibly near-transparent) fallback for the situations in which the hardware cannot really manage an advanced codec profile. So just leveraging the generic abstractions solves only part of the problem; as far as decoding goes, it provides a large performance boost while requiring some effort in managing the non-system memory.

For most users the learning curve is too steep for it to be really useful.

Libav and hwaccel

Hardware acceleration support happened to be implemented more or less around a number of specific implementations (with a quite non-uniform approach: vaapi had some hooks in the codecs, vdpau had full decoders until it was ported to the same interface and made way nicer to use thanks to Rémi), requiring quite a lot of backend-specific boilerplate code to set up the implementation-specific context and then manage the opaque buffers the decoder outputs.

High level functionality

The hwaccel infrastructure is currently focused on the following items:
– the fallback from hardware to software should be as seamless as possible
– basic hardware decoders must be taken in account (e.g. for h264 some accept single NAL units and can’t parse the bitstream on their own)
– the user must have a means to control the context setup and the full memory management

In order to do that, the normal software decoder is used to parse the bitstream and, depending on whether the hwaccel is enabled or not, route the parsed data to the software or the hardware decoder; the output frames are then managed by the decoder frame-reordering functionality, if present.

This way falling back, even from one specific hwaccel to another, is sort of simple at least conceptually: every time a new extradata appears it is parsed and fed to the first hwaccel setup code; if it is not supported, optional fallback hwaccels can try, and eventually the software decoder is picked.

The decoded frames, no matter whether they are opaque hw-specific GPU memory or normal system memory, go through the same codepath, and the user has a means to set up the video rendering pipeline to take this into account.

Limit of the infrastructure

All software evolves; since the minimalistic approach to hardware acceleration requires a huge amount of boilerplate code and deep knowledge of the bitstream formats in use, the new APIs for the new hardware tried to improve on that and be easier to manage for less savvy users.

APIs such as mfx try to abstract everything away and just require the input bitstream, so the hardware input buffers can be filled and frames are produced, in presentation order, once the (parallel) decoding process yields. It is in pull mode instead of push: when more data is needed it gets requested, and when a frame is ready you get notified.

For a user most of the headaches related to frame management and elementary stream parsing are gone, or almost gone, since some formats have multiple elementary stream representations and only a single one is supported…

For me (and then Anton), having an API like that poses multiple problems:
– having to feed a bitstream requires constructing it back from the software parsing; this is not terrible, it had already been done for vda.
– the decoder wants to get the data only when the hardware buffers require more, and that would require at least a queue.
– the frames output are already in presentation order, requiring the frame reordering logic to be bypassed.

Fitting a new-style hardware acceleration API in Libav

I tried the following approaches:
– Consider libmfx a normal third-party decoder.
  Anton complained about completely removing the ability to use hardware memory.
– Implement additional hooks to keep the hwaccel interface but avoid the bitstream parsing and frame reordering.
  Anton and others complained about preventing the transparent fallback.

Then I let Anton try for himself, and in the end we agreed that the best option would be to provide an interim solution that lets the user manage the memory, and then try to complete hwaccel2 at a later time.

More on the topic later 🙂