During the VDD we had lots of discussions and I enjoyed reviewing the initial NihAV implementation. Kostya already wrote some more about the decoupled API that I described at a high level here.
This article is about some possible implementation details; at least one more will follow.
The new API requires some additional data structures: mainly something to hold the data being consumed and produced, additional implementation callbacks in AVCodec, and possibly a means to skip the queuing up completely.
Data Structures
AVPacketQueue and AVFrameQueue
In the previous post I took for granted some kind of queue. Ideally its API could be really simple:
typedef struct AVPacketQueue AVPacketQueue;

AVPacketQueue *av_packet_queue_alloc(int size);
int av_packet_queue_put(AVPacketQueue *q, AVPacket *pkt);
int av_packet_queue_get(AVPacketQueue *q, AVPacket *pkt);
int av_packet_queue_size(AVPacketQueue *q);
void av_packet_queue_free(AVPacketQueue **q);
typedef struct AVFrameQueue AVFrameQueue;

AVFrameQueue *av_frame_queue_alloc(int size);
int av_frame_queue_put(AVFrameQueue *q, AVFrame *frame);
int av_frame_queue_get(AVFrameQueue *q, AVFrame *frame);
int av_frame_queue_size(AVFrameQueue *q);
void av_frame_queue_free(AVFrameQueue **q);
Internally it leverages the ref-counted API (av_packet_move_ref and av_frame_move_ref) and any data structure that fits the queue usage. It will be used in a multi-thread scenario, so a form of lock has to fit into it as well.
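As an illustration, here is a minimal sketch of the packet variant, assuming a fixed-size ring of blank packets guarded by a pthread mutex; the struct layout is not meant to be the final one:

#include <pthread.h>
#include <libavcodec/avcodec.h>

struct AVPacketQueue {
    AVPacket *pkt;               // preallocated ring of blank packets
    int size, head, count;
    pthread_mutex_t lock;
};

int av_packet_queue_put(AVPacketQueue *q, AVPacket *pkt)
{
    pthread_mutex_lock(&q->lock);
    if (q->count == q->size) {
        pthread_mutex_unlock(&q->lock);
        return AVERROR(EAGAIN);  // full: the consumer should pull first
    }
    // Move the reference instead of copying the payload
    av_packet_move_ref(&q->pkt[(q->head + q->count++) % q->size], pkt);
    pthread_mutex_unlock(&q->lock);
    return 0;
}

int av_packet_queue_get(AVPacketQueue *q, AVPacket *pkt)
{
    pthread_mutex_lock(&q->lock);
    if (!q->count) {
        pthread_mutex_unlock(&q->lock);
        return AVERROR(EAGAIN);  // empty: the producer should push first
    }
    av_packet_move_ref(pkt, &q->pkt[q->head]);
    q->head = (q->head + 1) % q->size;
    q->count--;
    pthread_mutex_unlock(&q->lock);
    return 0;
}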
We already have something specific for AVPlay, using a simple linked list, and a FIFO for some other components that have a near-constant maximum number of items (e.g. avconv, NVENC, QSV).
A tree could possibly be used as well, to implement something such as av_packet_queue_insert_by_pts and have some form of reordering happen on the fly. I'm not a fan of it, but I'm sure someone will come up with the idea.
The queues are part of AVCodecContext.
typedef struct AVCodecContext {
    // ...
    AVPacketQueue *packet_queue;
    AVFrameQueue *frame_queue;
    // ...
} AVCodecContext;
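For instance avcodec_open2() could allocate them right before calling the codec init; the sizes below are just guesses, they would really depend on the codec:

// Hypothetical setup in avcodec_open2(); the packet queue size is a
// made-up default, the frame queue would likely track the reorder depth.
avctx->packet_queue = av_packet_queue_alloc(16);
avctx->frame_queue  = av_frame_queue_alloc(avctx->has_b_frames + 1);
if (!avctx->packet_queue || !avctx->frame_queue)
    return AVERROR(ENOMEM);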
Implementation Callbacks
In Libav the AVCodec struct describes some specific codec features (such as the supported framerates) and holds the actual codec implementation through callbacks such as init, decode/encode2, flush and close.
The new model obviously requires additional callbacks.
Once the data is in a queue it is ready to be processed; the actual decoding or encoding can happen in multiple places, for example:
- In avcodec_*_push or avcodec_*_pull, once there is enough data. In that case the remaining functions are glorified proxies for the matching queue function.
- Somewhere else, such as a separate thread that is started on avcodec_open or on the first avcodec_decode_push and is eventually stopped once the context related to it is freed by avcodec_close. This is what happens under the hood when you have certain hardware acceleration (a sketch of this variant follows).
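As a rough sketch of that second variant, here is what such a worker could look like; decoder_should_stop() and decode_one() are hypothetical helpers standing in for the real start/stop plumbing and the per-packet decoding:

static void *decode_worker(void *opaque)
{
    AVCodecContext *avctx = opaque;
    AVFrame *frame = av_frame_alloc();
    AVPacket pkt;

    while (!decoder_should_stop(avctx)) {    // hypothetical stop flag
        // Drain the input side filled by avcodec_decode_push()
        if (av_packet_queue_get(avctx->packet_queue, &pkt) < 0)
            continue;                        // nothing queued yet
        // decode_one() stands in for the actual decoding work
        if (decode_one(avctx, frame, &pkt) >= 0)
            av_frame_queue_put(avctx->frame_queue, frame);
        av_packet_unref(&pkt);
    }
    av_frame_free(&frame);
    return NULL;
}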
Common
typedef struct AVCodec {
    // ... previous fields
    int (*need_data)(AVCodecContext *avctx);
    int (*has_data)(AVCodecContext *avctx);
    // ...
} AVCodec;
Those are used by both the encoder and the decoder; some implementations such as QSV have functions that can be used to probe the internal state in this regard.
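A sketch of how the public probing functions might use them, falling back on the queue fill level when a codec does not provide its own probe (the fallback policy here is just my guess; the encode side would mirror these):

int avcodec_decode_need_data(AVCodecContext *avctx)
{
    // Prefer the codec-specific probe when available (e.g. QSV)
    if (avctx->codec->need_data)
        return avctx->codec->need_data(avctx);
    // Otherwise ask for more input while the packet queue is empty
    return av_packet_queue_size(avctx->packet_queue) == 0;
}

int avcodec_decode_have_data(AVCodecContext *avctx)
{
    if (avctx->codec->has_data)
        return avctx->codec->has_data(avctx);
    // There is something to pull if the frame queue is not empty
    return av_frame_queue_size(avctx->frame_queue) > 0;
}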
Decoding
typedef struct AVCodec {
    // ... previous fields
    int (*decode_push)(AVCodecContext *avctx, AVPacket *packet);
    int (*decode_pull)(AVCodecContext *avctx, AVFrame *frame);
    // ...
} AVCodec;
Those two functions can take over a portion of the work the current decode function does, for example:
– the initial parsing and the dispatch to a worker thread can happen in the _push;
– the reordering and the blocking until there is data to output can happen in the _pull.
This assumes the reordering does not happen outside the pull callback in some generic code.
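In the simple single-threaded case the generic entry points could be little more than guarded calls into those callbacks, something such as:

int avcodec_decode_push(AVCodecContext *avctx, AVPacket *packet)
{
    if (!avcodec_decode_need_data(avctx))
        return AVERROR(EAGAIN);  // caller should pull first
    // Initial parsing or dispatch to a worker happens in here
    return avctx->codec->decode_push(avctx, packet);
}

int avcodec_decode_pull(AVCodecContext *avctx, AVFrame *frame)
{
    if (!avcodec_decode_have_data(avctx))
        return AVERROR(EAGAIN);  // caller should push first
    // Reordering and waiting for output happen in here
    return avctx->codec->decode_pull(avctx, frame);
}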
Encoding
typedef struct AVCodec {
    // ... previous fields
    int (*encode_push)(AVCodecContext *avctx, AVFrame *frame);
    int (*encode_pull)(AVCodecContext *avctx, AVPacket *packet);
} AVCodec;
As with the decoding callbacks, the encode2 workload is split: the _push function might just keep queuing up frames until there are enough to complete the initial analysis, while, for single-threaded encoding, the rest of the work happens in the _pull.
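As a sketch of a single-threaded encoder under this model, _push could only accumulate while _pull encodes one frame at a time; LOOKAHEAD and encode_frame() are made up for the example (and a real implementation would need a separate flushing path at EOF):

#define LOOKAHEAD 8  // made-up analysis window

static int enc_push(AVCodecContext *avctx, AVFrame *frame)
{
    // Just keep buffering until the analysis window is filled
    return av_frame_queue_put(avctx->frame_queue, frame);
}

static int enc_pull(AVCodecContext *avctx, AVPacket *packet)
{
    AVFrame *frame = av_frame_alloc();
    int ret;

    if (!frame)
        return AVERROR(ENOMEM);
    if (av_frame_queue_size(avctx->frame_queue) < LOOKAHEAD) {
        av_frame_free(&frame);
        return AVERROR(EAGAIN);              // not enough frames yet
    }
    ret = av_frame_queue_get(avctx->frame_queue, frame);
    if (ret >= 0)
        ret = encode_frame(avctx, packet, frame);  // hypothetical helper
    av_frame_free(&frame);
    return ret;
}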
Yielding data directly
So far the API mainly keeps some queues filled and lets some magic happen under the hood; let's see some usage examples first:
Simple Usage
Let’s expand the last example from the previous post: register callbacks to pull/push the data and have some simple loops.
Decoding
typedef struct DecodeCallback {
    int (*pull_packet)(void *priv, AVPacket *pkt);
    int (*push_frame)(void *priv, AVFrame *frame);
    void *priv_data_pull, *priv_data_push;
} DecodeCallback;
Two private-data pointers, since you pull from a demuxer+parser and you push to a splitter+muxer.
int decode_loop(AVCodecContext *avctx, DecodeCallback *cb)
{
    AVPacket *pkt  = av_packet_alloc();
    AVFrame *frame = av_frame_alloc();
    int ret;

    while ((ret = avcodec_decode_need_data(avctx)) > 0) {
        ret = cb->pull_packet(cb->priv_data_pull, pkt);
        if (ret < 0)
            goto end;
        ret = avcodec_decode_push(avctx, pkt);
        if (ret < 0)
            goto end;
    }
    while ((ret = avcodec_decode_have_data(avctx)) > 0) {
        ret = avcodec_decode_pull(avctx, frame);
        if (ret < 0)
            goto end;
        ret = cb->push_frame(cb->priv_data_push, frame);
        if (ret < 0)
            goto end;
    }
end:
    av_frame_free(&frame);
    av_packet_free(&pkt);
    return ret;
}
Encoding
For encoding something quite similar can be done:
typedef struct EncodeCallback {
    int (*pull_frame)(void *priv, AVFrame *frame);
    int (*push_packet)(void *priv, AVPacket *packet);
    void *priv_data_pull, *priv_data_push;
} EncodeCallback;
The loop is exactly the same besides the swapped data types.
int encode_loop(AVCodecContext *avctx, EncodeCallback *cb)
{
    AVPacket *pkt  = av_packet_alloc();
    AVFrame *frame = av_frame_alloc();
    int ret;

    while ((ret = avcodec_encode_need_data(avctx)) > 0) {
        ret = cb->pull_frame(cb->priv_data_pull, frame);
        if (ret < 0)
            goto end;
        ret = avcodec_encode_push(avctx, frame);
        if (ret < 0)
            goto end;
    }
    while ((ret = avcodec_encode_have_data(avctx)) > 0) {
        ret = avcodec_encode_pull(avctx, pkt);
        if (ret < 0)
            goto end;
        ret = cb->push_packet(cb->priv_data_push, pkt);
        if (ret < 0)
            goto end;
    }
end:
    av_frame_free(&frame);
    av_packet_free(&pkt);
    return ret;
}
Transcoding
Transcoding, done the naive way, could be something such as:
int transcode(AVFormatContext *mux, AVFormatContext *dem,
              AVCodecContext *enc, AVCodecContext *dec)
{
    DecodeCallback dcb = { get_packet, av_frame_queue_put,
                           dem, enc->frame_queue };
    EncodeCallback ecb = { av_frame_queue_get, push_packet,
                           enc->frame_queue, mux };
    int ret;

    do {
        if ((ret = decode_loop(dec, &dcb)) > 0)
            ret = encode_loop(enc, &ecb);
    } while (ret > 0);

    return ret;
}
One loop feeds the other through the queue. get_packet and push_packet are demuxing and muxing functions; they might end up being two more queue functions once the AVFormat layer gets a similar overhaul.
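As a sketch, assuming the current lavf entry points, they could be as thin as:

static int get_packet(void *priv, AVPacket *pkt)
{
    // priv is the demuxer AVFormatContext
    return av_read_frame(priv, pkt);
}

static int push_packet(void *priv, AVPacket *pkt)
{
    // priv is the muxer AVFormatContext
    return av_interleaved_write_frame(priv, pkt);
}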
Advanced usage
From the examples above you might notice that in some situations you could possibly do better; all the loops pull data from one queue just to push it immediately to another:
- why not feed the right queue directly once you have the data ready?
- why not do some processing before feeding the decoded data to the encoder, such as converting the pixel format?
Here are some additional structures and functions to enable advanced users:
typedef struct AVFrameCallback {
    int (*yield)(void *priv, AVFrame *frame);
    void *priv_data;
} AVFrameCallback;

typedef struct AVPacketCallback {
    int (*yield)(void *priv, AVPacket *pkt);
    void *priv_data;
} AVPacketCallback;

typedef struct AVCodecContext {
    // ...
    AVFrameCallback *frame_cb;
    AVPacketCallback *packet_cb;
    // ...
} AVCodecContext;

int av_frame_yield(AVFrameCallback *cb, AVFrame *frame)
{
    return cb->yield(cb->priv_data, frame);
}

int av_packet_yield(AVPacketCallback *cb, AVPacket *packet)
{
    return cb->yield(cb->priv_data, packet);
}
Instead of using the queue API directly, it would be possible to use the yield functions and give the user a means to override them.
Some API sugar could be something along the lines of this:
int avcodec_decode_yield(AVCodecContext *avctx, AVFrame *frame)
{
    int ret;

    if (avctx->frame_cb)
        ret = av_frame_yield(avctx->frame_cb, frame);
    else
        ret = av_frame_queue_put(avctx->frame_queue, frame);

    return ret;
}
Whenever a frame (or a packet) is ready, it could be passed immediately to another function. Depending on your threading model and CPU, it might be much more efficient to skip some enqueuing+dequeuing steps, such as by feeding directly some user queue that uses different data types.
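For example, a hypothetical feed_encoder override could route the decoded frames straight into the encoder input queue:

static int feed_encoder(void *priv, AVFrame *frame)
{
    AVCodecContext *enc = priv;
    // Skip the intermediate loop: decoded frames land directly
    // in the encoder input queue
    return av_frame_queue_put(enc->frame_queue, frame);
}

// Hooked up once, e.g. right after avcodec_open():
AVFrameCallback cb = { feed_encoder, enc };
dec->frame_cb = &cb;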
This approach might work well even internally, to insert bitstream reformatters after the encoding or before the decoding.
Open problems
The callback system is quite powerful but you have at least a couple of issues to take care of:
– Error reporting: when something goes wrong, how do we notify what broke?
– Error recovery: how much does the user have to undo to fall back properly?
Probably this part should be kept for later, since there is already a huge amount of work.
What’s next
Muxing and demuxing
Ideally the container format layer should receive the same kind of overhaul. I'm not even halfway through documenting what should change, but from this blog post you might guess the kind of changes. Spoiler: the I/O layer gets spun off into a separate library.
Proof of Concept
Soon^WNot so late I'll complete a PoC out of this and possibly hack avplay so that it uses either QSV or videotoolbox as a test case (depending on which operating system I'm playing with when I start); I'll probably soon see the limitations of this approach.
Mind you, I've written about a different system (and a different implementation). Try writing an example of muxing the output of two demuxers into one file (and don't forget you can have several streams there). I suspect you'll end up with a fragile web of callbacks.
I wouldn’t use the callbacks to do that.
As said, it might be beneficial just in some specific situations; anything more would just be problematic.
Fragile probably describes it better, since in this kind of setup reporting errors and reacting to them gets hairy.