Rethinking AVFormat – part 1

Container formats should be just a boring application of serialization of multiple arrays of tuples timestamp-binary blob.

Instead there are tons of implementation details and there are fun
and exceedingly annoying means to lose your sanity.

This post is yet another post about APIs you can see other here and here.

Current Status

In Libav we have libavformat taking care of general I/O, Muxing, Demuxing.

This blog post will not cover the additional grouping given by Programs, Chapters and such to not make the whole article huge and just focus on the basics.

I/O

The AVIO abstraction provides a mean to uniformly access content stored in files, available as remote streams (e.g. served through http or rtmp) or through custom implementations.

This part of the API is rightly coupled with the Muxer and Demuxer implementation.

It uses the common Context pattern you can find across the rest of Libav with some of twists:

  • The protocol handler can be guessed using the url provided, e.g. file:///tmp/foo.
  • The functions that allocate a context take an extra parameter than the usual options AVDictionary in the form of a callback function.
  • You can create your own custom protocol easily.
int avio_open2(AVIOContext **s, const char *url, int flags, const AVIOInterruptCB *int_cb, AVDictionary **options)

AVIOContext *avio_alloc_context(unsigned char *buffer, int buffer_size, int write_flag, void *opaque,
                                int(*read_packet)(void *opaque, uint8_t *buf, int buf_size),
                                int(*write_packet)(void *opaque, uint8_t *buf, int buf_size),
                                int64_t(*seek)(void *opaque, int64_t offset, int whence))
int avio_closep(AVIOContext **s);

The api tries to mimic the C stdio plus lots of API sugar.

# core functions
int avio_read(AVIOContext *s, unsigned char *buf, int size);
void avio_write(AVIOContext *s, const unsigned char *buf, int size);
int64_t avio_seek(AVIOContext *s, int64_t offset, int whence);


# simple integer readers
int          avio_r8  (AVIOContext *s);
uint64_t     avio_rb64(AVIOContext *s);
uint64_t     avio_rl64(AVIOContext *s);
unsigned int avio_rb16(AVIOContext *s);
unsigned int avio_rb24(AVIOContext *s);
unsigned int avio_rb32(AVIOContext *s);
unsigned int avio_rl16(AVIOContext *s);
unsigned int avio_rl24(AVIOContext *s);
unsigned int avio_rl32(AVIOContext *s);

# simple integer writers
void avio_w8(AVIOContext *s, int b);
void avio_wb16(AVIOContext *s, unsigned int val);
void avio_wb24(AVIOContext *s, unsigned int val);
void avio_wb32(AVIOContext *s, unsigned int val);
void avio_wb64(AVIOContext *s, uint64_t val);
void avio_wl16(AVIOContext *s, unsigned int val);
void avio_wl24(AVIOContext *s, unsigned int val);
void avio_wl32(AVIOContext *s, unsigned int val);
void avio_wl64(AVIOContext *s, uint64_t val);


# utf8 and utf16 strings
int avio_get_str(AVIOContext *pb, int maxlen, char *buf, int buflen);

int avio_get_str16le(AVIOContext *pb, int maxlen, char *buf, int buflen);
int avio_get_str16be(AVIOContext *pb, int maxlen, char *buf, int buflen);

int avio_put_str(AVIOContext *s, const char *str);

int avio_put_str16le(AVIOContext *s, const char *str);

... (and more) ...

Buffering

All the function use an intermediate buffer to back reads and writes, the buffer can be explicitly flushed or it gets flushed automatically once the request would end outside it.

void avio_flush(AVIOContext *s);

A special kind of AVIOContext is a dynamic write buffer, it extends on demand and can be used to build complex recourring patterns once and write them as many time as needed.

int avio_open_dyn_buf(AVIOContext **s);

int avio_close_dyn_buf(AVIOContext *s, uint8_t **pbuffer);

Error handling

An I/O layer has to take in account the fact the resource being read or written could be abruptly disappear or suddenly slow down. This is valid for both local and remote resources.

The internal buffer allocation might fail.

A seek too far could lead to the end of file.

AVIO approach to errors is quite simplicistic:
– A write can silently fail.
– A failing read just returns 0-ed buffer or value.
– All the functions set the error field or the eof_reached field.

Is up to the user to decide when to check for I/O problems or leverage the AVIOInterruptCB to implement timeouts or other mean to interrupt a read or a write that otherwise would just quietly block till it is completed.

Demuxing (and Probing)

The AVFormat part taking care of input streams can be split in three: Probing the data to guess the right demuxer, the actual Demuxing and optionally parse the demuxed data and fit it in packets containing the information needed by the decoder to decode a frame of video or a matching amount of audio samples, later I call it frame-worth amount of data and I call this process chopping amorphous data streams. It is colorful as expression but represents quite well the endeavor.

Probing

The Probe functions take an arbitrary big chunk of data (stored in a AVProbeData struct) and figure out which demuxer should be able to actually parse it correctly.

As a rule of thumb probes need to be fast since all of them have to be run over the data at least once and possibly multiple times since if the result is not really conclusive increasing the data and trying again is an option.

AVInputFormat *av_probe_input_format2(AVProbeData *pd, int is_opened,
                                      int *score_max);

An helper function to probe from an AVIOContext and get the possible input format is provided.

int av_probe_input_buffer(AVIOContext *pb, AVInputFormat **fmt,
                          const char *filename, void *logctx,
                          unsigned int offset, unsigned int max_probe_size);

It used internally by avformat_open_input to automatically figure out the demuxer to use and it might look a little confusing.

Demuxing

Once that the input format is either guessed or selected the actual muxing conceptually is just providing AVPackets
as they are parsed. You might want to reposition within the stream at random times (the infamous seeking opening yet another can of worms).

int avformat_open_input(AVFormatContext **ps, const char *filename,
                        AVInputFormat *fmt, AVDictionary **options);

int av_read_frame(AVFormatContext *s, AVPacket *pkt);

void avformat_close_input(AVFormatContext **ps);
Figuring out the data inside the format

Some container formats keep the information regarding their contents in a global header at the start of the file, other, that could have new data streams appearing at random times, do not.

Since there is no easy mean to figure out which kind of data they are storing, the only safe way to figure out is to try to decode some packets in order to know which kind of data is available, avformat_find_stream_info.

int avformat_find_stream_info(AVFormatContext *ic, AVDictionary **options);

The apparently simple function does a lot of work behind the scenes: it demuxes and decodes a settable number of packets before giving up and keeps all of them in an internal queue so that they will be available for demuxing even if the input stream is not seekable.

Getting the data outside

Containers such as MPEG PS mux data in small fixed-sized chunks
while usually muxers and decoders expect to receive AVPackets containing enough data to produce a frame.

Specific parsers can be inserted automatically to take amorphous stream of demuxed data and chop out of it AVPackets containing the expected amount of data.

This happens usually automatically so the user does not have to care about it as long as the codec parser is present.

Timestamps

The multimedia data is expected to carry a timestamp to present at the same time video frames and audio frames (and subtitles).

Some containers do provide directly such timestamps, other do not, requiring some amount of guesswork by some heuristics that might or might not work depending on the codec at hand.

For example, if the container is supposed to not allow variable frame rate, the implicit time stamp for video can be deduced from the frame number. This might not work as expected if the codec uses B-frames and requires some form
of reordering.

This part in Libav is sort of hidden and often causing a number of problems.

Seeking

Seeking is quite a different and large can of worms.

Ideally seeking just sets the AVIOContext to a certain position and the demuxer keeps working from there.

int av_seek_frame(AVFormatContext *s, int stream_index,
                  int64_t timestamp, int flags);

Depending on the container format and the codec picking the correct byte offset from the user provided timestamp can be incredibly simple or really complex, with various degrees of precision.

Some format provide an precise index so a plain lookup is enough, a dichotomic search looking for the closest I-frame is the common case and in the worst situation a linear search might be required.

In some cases auxiliary indexes are built to speed up seeking within previously parsed areas.

Seeking is not fun at the demuxer level and gets even worse at the codec level if the data provided is not the one expected.

Muxing

Muxing is sort of simpler than demuxing. The output format is always known and the data always come in AVPackets matching a frame-worth of raw data and possibly sporting correct timestamps.

API-wise it expects an AVFormatContext with the oformat set to the correct AVOutputFormat and the pb
set with an allocated AVIOContext and populated AVStreams.

Once the AVFormatContext is configured is possible to write the packets. First the global header should be written, then as many packets as needed are muxed, interleaving audio and video so that demuxing and seeking work correctly.

int avformat_write_header(AVFormatContext *s, AVDictionary **options);

int av_interleaved_write_frame(AVFormatContext *s, AVPacket *pkt);

int av_write_trailer(AVFormatContext *s);

Bitstream filtering

Some codecs have multiple possible representation, e.g. H264 has the AVCC bitstream format and the Annex B bitstream format. Come containers support both, other expect only one or the other. Currently the correct converter from a bitstream to another must be inserted manually.

Packet interleaving

Certain container formats have quite peculiar muxing rules. This is normally hidden from the user, in certain cases being able to override it is a boon.

Shortcomings summary

In the next post I will explain how I would improve the situation, today post is mainly to introduce the structure of AVFormat and start explaining what should be fixed. Here a short list of what I’d like to fix sooner than later.

Non-uniform API

  • There is quite a mixture of av_ and avformat_ namespaces.
  • The muxing and demuxing APIs are sufficiently confusing (and surely I should complete my avformat_open_output to reduce the boilerplate)

Abstractions Leaking the wrong way

  • The demuxing side automagically inserts parsers to chop data streams in a frame-worth amount of data while the muxing side would just fail if the bitstream provided is not matching the one required by the container format.
  • There is quite of hidden magic happening in avformat_find_stream_info and just recently we added options to at least flush the buffer it keeps to probe for codecs. Having a better function and a better mean to control this kind of internal buffer would be surely appreciated by the user that need to keep the latency low.
  • There is no good mean to be notified if the number of streams change (new streams found or old streams disappearing).

Bad implementations

  • The old muxers sometimes do not even use the now-available internals (e.g. the interleaver helpers) but implement internally queues and logic that should be now common and shared across all the muxers.
  • While AVCodec has (now) quite an uniform mean to slice bytes and bits, avformat is not leveraging it beside few places.

PS: Kostya prefers to provide both amorphous stream and chopped packets. It makes sense since you might have some codec you cannot parse but you can sort of safely remux if the container is the same.
For the common case I’d rather suggest to use a set of functions that always insert parsers when they can both demuxing and muxing and provide another set of functions to get arbitrary lumps of stream as provided by the container format.

2 thoughts on “Rethinking AVFormat – part 1”

    1. I’m making notes to see what’s wrong currently and how it should work better in my opinion.

      The notes can be used in 3 ways:
      – Use the current API sort of effectively
      – Try to fix it
      – Not make the same error again

      Probably soon I’ll start writing some prototypes.

Leave a Reply to lu_zero Cancel reply

Your email address will not be published. Required fields are marked *