This weekend on #libav-devel we once again discussed the problems with the current core avcodec API.
Current situation
Decoding
We have 3 decoding functions, one for each supported media type: audio, video and subtitles.
Subtitles already stick out like a sore thumb since they do not use AVFrame but a specialized structure; let's ignore them for now. Audio and video share pretty much the same signature:
int avcodec_decode_something(AVCodecContext *avctx, AVFrame *f, int *got_frame, AVPacket *p)
It takes a context pointer containing the decoder state, consumes a demuxed packet and optionally outputs a decoded frame containing raw data in a certain format (audio samples, a video frame).
The usage model is quite simple: the function takes packets, and whenever it has enough encoded data to emit a frame, it emits one. The got_frame pointer signals whether a frame is ready or more data is needed.
Problem:
What if 1 AVPacket is nearly always enough to output 2 or more frames of raw data?
This happens with MVC and other real-world scenarios.
In general our current API cannot cope with it cleanly.
While working with the MediaSDK interface from Intel, and now with MMAL for the Raspberry Pi, similar problems arose due to the natural parallelism of the underlying hardware.
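To make the problem concrete, here is a minimal toy model of a one-packet-in, one-frame-out call. Everything in it (ToyPacket, ToyFrame, ToyDecoder, toy_decode(), toy_run_legacy_loop()) is a hypothetical stand-in invented for illustration, not a libavcodec type or function. It only assumes that every packet carries two frames, as in the MVC case above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are NOT libavcodec types. */
typedef struct ToyPacket { int first_id; int nframes; } ToyPacket;
typedef struct ToyFrame  { int id; } ToyFrame;

#define TOY_QUEUE_MAX 16

typedef struct ToyDecoder {
    ToyFrame pending[TOY_QUEUE_MAX]; /* decoded frames not yet handed out */
    size_t head, count;
} ToyDecoder;

/* Shaped like avcodec_decode_something(): at most ONE frame leaves per call,
 * no matter how many frames the packet contained. */
int toy_decode(ToyDecoder *d, ToyFrame *out, int *got_frame, const ToyPacket *pkt)
{
    *got_frame = 0;
    if (pkt) {
        assert(pkt->nframes >= 0);
        if (d->count + (size_t)pkt->nframes > TOY_QUEUE_MAX)
            return -1; /* internal buffer overflow */
        for (int i = 0; i < pkt->nframes; i++) {
            size_t tail = (d->head + d->count) % TOY_QUEUE_MAX;
            d->pending[tail].id = pkt->first_id + i;
            d->count++;
        }
    }
    if (d->count > 0) {
        *out = d->pending[d->head];
        d->head = (d->head + 1) % TOY_QUEUE_MAX;
        d->count--;
        *got_frame = 1;
    }
    return 0;
}

/* The classic loop: one call per packet. Returns the number of frames the
 * caller actually received after feeding npackets two-frame packets. */
int toy_run_legacy_loop(ToyDecoder *d, int npackets)
{
    int received = 0;
    for (int i = 0; i < npackets; i++) {
        ToyPacket pkt = { 2 * i, 2 };
        ToyFrame frame;
        int got_frame;
        if (toy_decode(d, &frame, &got_frame, &pkt) < 0)
            return -1;
        received += got_frame;
    }
    return received;
}
```

In this model the backlog inside the decoder grows by one frame per packet, and the signature gives the caller no way to learn that more frames are waiting.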
Encoding
Again we have 3 functions, and again subtitles are somewhat different, while audio and video are fairly uniform.
int avcodec_encode_something(AVCodecContext *avctx, AVPacket *p, const AVFrame *f, int *got_packet)
It is pretty much the dual of the decoding function: the context pointer is the same, a frame of raw data goes in and a packet of encoded data comes out. Again we have a pointer to signal whether there was enough data and an encoded packet has been output.
Problem:
Again, a single AVFrame fed to the encoder might produce multiple AVPackets.
This happens when the HEVC “workaround” for interlaced content makes the encoder output the two fields as separate encoded frames.
Again, the API cannot cope with it cleanly, and threaded or otherwise parallel encoding only barely fits the model.
Decoupling the process
To fix this issue (and make our users' lives simpler) the idea is to split the function that feeds data from the one that provides the processed data.
int avcodec_decode_push(AVCodecContext *avctx, AVPacket *packet);
int avcodec_decode_pull(AVCodecContext *avctx, AVFrame *frame);
int avcodec_decode_need_data(AVCodecContext *avctx);
int avcodec_decode_have_data(AVCodecContext *avctx);

int avcodec_encode_push(AVCodecContext *avctx, AVFrame *frame);
int avcodec_encode_pull(AVCodecContext *avctx, AVPacket *packet);
int avcodec_encode_need_data(AVCodecContext *avctx);
int avcodec_encode_have_data(AVCodecContext *avctx);
Four functions now replace a single one, so why is this simpler?
The current workflow is more or less like
while (get_packet_from_demuxer(&pkt)) {
    ret = avcodec_decode_something(avctx, frame, &got_frame, &pkt);
    if (got_frame) {
        render_frame(frame);
    }
    if (ret < 0) {
        manage_error(ret);
    }
}
get_packet_from_demuxer() is a function that either dequeues encoded data from some queue or calls the demuxer directly (beware: having your I/O-intensive demuxer function block your CPU-intensive decoding function isn’t nice). Likewise, render_frame() either talks directly to some kind of I/O subsystem or enqueues the data so that the actual rendering (including format conversion, overlaying and scaling) happens in another thread.
The new API makes it much easier to keep the separate areas of concern apart, so they won’t trip over each other, while the casual user would have something like
while (ret >= 0) {
    while ((ret = avcodec_decode_need_data(avctx)) > 0) {
        ret = get_packet_from_demuxer(&pkt);
        if (ret < 0)
            ...
        ret = avcodec_decode_push(avctx, &pkt);
        if (ret < 0)
            ...
    }
    while ((ret = avcodec_decode_have_data(avctx)) > 0) {
        ret = avcodec_decode_pull(avctx, frame);
        if (ret < 0)
            ...
        render_frame(frame);
    }
}
That probably takes a few more lines.
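The push/pull split can be tried out on a small self-contained model. The sketch below is only an illustration under stated assumptions: DemoCodec, DemoPacket, DemoFrame and the demo_*() functions are hypothetical names, not part of libavcodec, and the policy that need_data() reports true only when no output is waiting is just one possible choice.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_QUEUE_MAX 8

/* Hypothetical stand-ins for illustration; these are NOT libavcodec types. */
typedef struct DemoPacket { int first_id; int nframes; } DemoPacket;
typedef struct DemoFrame  { int id; } DemoFrame;
typedef struct DemoCodec {
    DemoFrame queue[DEMO_QUEUE_MAX]; /* decoded frames waiting to be pulled */
    size_t head, count;
} DemoCodec;

/* Assumed policy: ask for input only while no output is waiting. */
int demo_need_data(const DemoCodec *c) { return c->count == 0; }
int demo_have_data(const DemoCodec *c) { return c->count > 0; }

/* Feeding side: one packet may enqueue any number of frames. */
int demo_push(DemoCodec *c, const DemoPacket *pkt)
{
    assert(pkt->nframes >= 0);
    if (c->count + (size_t)pkt->nframes > DEMO_QUEUE_MAX)
        return -1;
    for (int i = 0; i < pkt->nframes; i++) {
        size_t tail = (c->head + c->count) % DEMO_QUEUE_MAX;
        c->queue[tail].id = pkt->first_id + i;
        c->count++;
    }
    return 0;
}

/* Draining side: exactly one frame per call, or an error if none is ready. */
int demo_pull(DemoCodec *c, DemoFrame *out)
{
    if (c->count == 0)
        return -1;
    *out = c->queue[c->head];
    c->head = (c->head + 1) % DEMO_QUEUE_MAX;
    c->count--;
    return 0;
}

/* Decode npackets two-frame packets. Returns the number of frames pulled
 * and stores the id of the last one in *last_id. */
int demo_decode_all(DemoCodec *c, int npackets, int *last_id)
{
    int fed = 0, pulled = 0;
    while (fed < npackets || demo_have_data(c)) {
        while (fed < npackets && demo_need_data(c)) {
            DemoPacket pkt = { 2 * fed, 2 };
            if (demo_push(c, &pkt) < 0)
                return -1;
            fed++;
        }
        while (demo_have_data(c)) {
            DemoFrame frame;
            if (demo_pull(c, &frame) < 0)
                return -1;
            *last_id = frame.id;
            pulled++;
        }
    }
    return pulled;
}
```

With two frames per packet, every frame reaches the caller and nothing is left queued at the end, because the number of pulls no longer has to match the number of pushes.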
Asynchronous API
Since the decoupled API is that simple, it is possible to craft something more immediate for the casual user.
typedef struct AVCodecDecodeCallback {
    int (*pull_packet)(void *priv, AVPacket *pkt);
    int (*push_frame)(void *priv, AVFrame *frame);
    void *priv_data;
} AVCodecDecodeCallback;

int avcodec_register_decode_callbacks(AVCodecContext *avctx, AVCodecDecodeCallback *cb);

int avcodec_decode_loop(AVCodecContext *avctx)
{
    AVCodecDecodeCallback *cb = avctx->cb;
    int ret;
    /* pkt and frame are scratch storage; this sketch leaves their declarations out */

    /* feed the decoder for as long as it asks for input */
    while ((ret = avcodec_decode_need_data(avctx)) > 0) {
        ret = cb->pull_packet(cb->priv_data, &pkt);
        if (ret < 0)
            return ret;
        ret = avcodec_decode_push(avctx, &pkt);
        if (ret < 0)
            return ret;
    }
    /* hand every ready frame to the user */
    while ((ret = avcodec_decode_have_data(avctx)) > 0) {
        ret = avcodec_decode_pull(avctx, frame);
        if (ret < 0)
            return ret;
        ret = cb->push_frame(cb->priv_data, frame);
        if (ret < 0)
            return ret;
    }
    return ret;
}
So the actual minimum decoding loop can be just 2 calls:
ret = avcodec_register_decode_callbacks(avctx, cb);
if (ret < 0)
    ...
while ((ret = avcodec_decode_loop(avctx)) >= 0);
Cute, isn’t it?
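For readers who want to run the callback idea, here is a compact toy version. It is a sketch under assumptions, not the proposed implementation: CbCodec, CbCallbacks, cb_register(), cb_decode_loop(), CB_EOF and the App helpers are all hypothetical names, and "decoding" is reduced to copying the packet id into the frame.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are NOT libavcodec types. */
typedef struct CbPacket { int id; } CbPacket;
typedef struct CbFrame  { int id; } CbFrame;

typedef struct CbCallbacks {
    int (*pull_packet)(void *priv, CbPacket *pkt);
    int (*push_frame)(void *priv, const CbFrame *frame);
    void *priv_data;
} CbCallbacks;

typedef struct CbCodec {
    const CbCallbacks *cb;
    CbFrame ready;  /* frame decoded but not yet delivered */
    int has_ready;
} CbCodec;

#define CB_EOF (-1) /* assumed end-of-stream code returned by pull_packet */

int cb_register(CbCodec *c, const CbCallbacks *cb)
{
    if (!cb || !cb->pull_packet || !cb->push_frame)
        return -2;
    c->cb = cb;
    return 0;
}

/* One iteration: fetch a packet if nothing is pending, "decode" it,
 * then deliver the frame through the user's callback. */
int cb_decode_loop(CbCodec *c)
{
    CbPacket pkt;
    int ret;
    assert(c->cb != NULL);
    if (!c->has_ready) {
        ret = c->cb->pull_packet(c->cb->priv_data, &pkt);
        if (ret < 0)
            return ret;
        c->ready.id = pkt.id; /* toy decode: frame id equals packet id */
        c->has_ready = 1;
    }
    ret = c->cb->push_frame(c->cb->priv_data, &c->ready);
    if (ret < 0)
        return ret;           /* frame stays pending for a later retry */
    c->has_ready = 0;
    return 0;
}

/* Application side: a packet source that counts up and a sink that tallies. */
typedef struct App { int next_id, total, rendered, sum; } App;

int app_pull_packet(void *priv, CbPacket *pkt)
{
    App *a = priv;
    if (a->next_id >= a->total)
        return CB_EOF;
    pkt->id = a->next_id++;
    return 0;
}

int app_push_frame(void *priv, const CbFrame *frame)
{
    App *a = priv;
    a->rendered++;
    a->sum += frame->id;
    return 0;
}

/* The whole user-visible decoding loop: register, then spin. */
int app_run(App *a)
{
    CbCallbacks cb = { app_pull_packet, app_push_frame, a };
    CbCodec codec = { 0 };
    int ret = cb_register(&codec, &cb);
    if (ret < 0)
        return ret;
    while ((ret = cb_decode_loop(&codec)) >= 0)
        ;
    return ret;
}
```

The application only supplies a source and a sink; the order in which they are called is entirely up to the loop.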
Theory is simple …
… the practice not so much:
– there are plenty of implementation issues to take into account.
– LOTS of tedious work converting all the codecs to the new API.
– lots of details to iron out (e.g. should have_data() and need_data() block or not?)
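For the conversion work listed above, one conceivable transitional step is a generic wrapper that exposes an unmodified one-packet, one-frame decoder through the push/pull interface. The sketch below is hypothetical throughout: WrappedDecoder, legacy_decode_fn, the wrapped_*() functions and the delay_decode() example are invented names, and the single-slot buffer assumes the wrapped decoder never emits more than one frame per packet.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are NOT libavcodec types. */
typedef struct WPacket { int id; } WPacket;
typedef struct WFrame  { int id; } WFrame;

/* Legacy-style entry point: consume one packet, maybe emit one frame. */
typedef int (*legacy_decode_fn)(void *priv, WFrame *out, int *got_frame,
                                const WPacket *pkt);

typedef struct WrappedDecoder {
    legacy_decode_fn decode;
    void *priv;
    WFrame slot;   /* the single frame buffered between push and pull */
    int slot_full;
} WrappedDecoder;

int wrapped_need_data(const WrappedDecoder *w) { return !w->slot_full; }
int wrapped_have_data(const WrappedDecoder *w) { return w->slot_full; }

/* push: run the legacy call and park its output, if any, in the slot. */
int wrapped_push(WrappedDecoder *w, const WPacket *pkt)
{
    int got_frame = 0, ret;
    assert(w->decode != NULL);
    if (w->slot_full)
        return -1; /* caller must pull before pushing again */
    ret = w->decode(w->priv, &w->slot, &got_frame, pkt);
    if (ret < 0)
        return ret;
    w->slot_full = got_frame;
    return 0;
}

/* pull: hand out the parked frame and free the slot. */
int wrapped_pull(WrappedDecoder *w, WFrame *out)
{
    if (!w->slot_full)
        return -1;
    *out = w->slot;
    w->slot_full = 0;
    return 0;
}

/* Example legacy decoder with one packet of delay: each call emits the
 * frame belonging to the previous packet. */
typedef struct DelayState { int have_prev; int prev_id; } DelayState;

int delay_decode(void *priv, WFrame *out, int *got_frame, const WPacket *pkt)
{
    DelayState *s = priv;
    *got_frame = s->have_prev;
    if (s->have_prev)
        out->id = s->prev_id;
    s->prev_id = pkt->id;
    s->have_prev = 1;
    return 0;
}
```

A wrapper like this changes nothing for codecs that can produce several frames per packet; those would still need a real queue and native support.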
We did radical overhauls before, such as introducing reference-counted AVFrames thanks to Anton, so we aren’t much scared of reshaping and cleaning the codebase once more.
If you like the ideas posted above or want to discuss them further, you can join the Libav IRC channel or mailing list to discuss and help.
Converting is trivial – you just need to provide default wrappers for current decoders (maybe something more complicated for multiframe packed audio).
The idea by itself LGTM.
What if got_frame is the number of decoded frames?
while (get_packet_from_demuxer(&pkt)) {
    ret = avcodec_decode_something(avctx, frame, &got_frame, pkt);
    while (got_frame > 0) {
        ret = avcodec_decode_something(avctx, frame, &got_frame, special_pkt_or_null);
        if (ret < 0) {
            manage_error(ret);
        }
        render_frame(frame);
    }
    if (ret < 0) {
        manage_error(ret);
    }
}
You can work around the API as you suggested and have it return a special got_frame that requires calling it again with a NULL packet in order to get the frame, but it is a propelled pig: it would fly OK, but the landing would be problematic.
The whole exercise is due to strange encoding layouts such as MVC and other multi-layer encodings, better threading models, and hardware acceleration APIs such as QSV, NVENC, Apple’s VDA and VT, and the Raspberry Pi’s MMAL.
Incidentally, the asynchronous model would let you be extremely efficient in some scenarios, so ideally you’d like to move away from the current model completely, API-wise.
I guess I should find some time to explain better why it works only up to a point in a whole blogpost.
Thanks a lot for the comment!