Multimedia formats require a lot of computational power since they rely on fairly complex mathematical transformations. Most of these transformations can be implemented efficiently in silicon, requiring orders of magnitude less power while remaining quite fast to execute.
Hardware acceleration
Most current platforms, be they desktop, mobile or server, sport some kind of hardware unit to offload decoding and encoding of multimedia formats.
These units are usually accessed through platform-specific APIs, sometimes even codec-specific ones, making the whole implementation experience quite painful and very time consuming.
Depending on the specific hwaccel implementation, it may be bound to the GPU and use GPU memory, thus requiring non-system memory to be managed in specific ways. This adds a burden for those who would just like some quick gains, while opening up a world of interesting optimization possibilities such as zero-copy transcoding pipelines or in-GPU pixel-format conversion, scaling and blending.
There are some generic wrappers such as vdpau, vaapi, dxva2 and vt that abstract away some of the complexity and provide a more uniform interface. Usually, though, a proper (and possibly near-transparent) fallback is still needed for the situations in which the hardware cannot really manage an advanced codec profile, so leveraging the generic abstractions solves only part of the problem. As far as decoding goes, though, it provides a large performance boost while requiring some effort in managing the non-system memory.
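To give an idea of the capability check such a fallback hinges on, here is a rough vaapi sketch (X11 display, minimal error handling) that asks the driver whether it advertises H.264 High profile decoding at all; whether a given stream actually decodes is a separate question:

    #include <stdlib.h>
    #include <va/va.h>
    #include <va/va_x11.h>
    #include <X11/Xlib.h>

    /* Hedged sketch: query which profiles the vaapi driver exposes, so the
     * application knows when it has to fall back to software decoding. */
    int has_h264_high(void)
    {
        Display *x11 = XOpenDisplay(NULL);
        if (!x11)
            return 0;

        VADisplay dpy = vaGetDisplay(x11);
        int major, minor, num, found = 0;

        if (vaInitialize(dpy, &major, &minor) == VA_STATUS_SUCCESS) {
            VAProfile *profiles = malloc(vaMaxNumProfiles(dpy) * sizeof(*profiles));
            if (profiles && vaQueryConfigProfiles(dpy, profiles, &num) == VA_STATUS_SUCCESS)
                for (int i = 0; i < num; i++)
                    found |= profiles[i] == VAProfileH264High;
            free(profiles);
            vaTerminate(dpy);
        }
        XCloseDisplay(x11);
        return found;
    }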
For most users the learning curve is too steep for this to be really useful.
Libav and hwaccel
Hardware acceleration support happened to be implemented more or less around a number of specific implementations, with a quite non-uniform approach: vaapi had some hooks in the codecs, while vdpau had full decoders until it was ported to the same interface and made much nicer to use thanks to Rémi. It also required quite a lot of backend-specific boilerplate code to set up the implementation-specific context and then manage the opaque buffers the decoder outputs.
High level functionality
The hwaccel infrastructure is currently focused on the following items:
– the fallback from hardware to software should be as seamless as possible
– basic hardware decoders must be taken into account (e.g. for h264 some accept single NAL units and can’t parse the bitstream on their own)
– the user must have a means to control the context setup and the full memory management
To do that, the normal software decoder is used to parse the bitstream; depending on whether the hwaccel is enabled or not, the parsed data is routed to the software or the hardware decoder, and the output frames are then managed by the decoder's frame reordering functionality, if present.
This way falling back, even from one specific hwaccel to another, is fairly simple, at least conceptually: every time new extradata appears it is parsed and fed to the first hwaccel setup code; if that is not supported, optional fallback hwaccels can try, and eventually the software decoder is picked.
The decoded frames, whether opaque hw-specific GPU memory or normal system memory, go through the same codepath, and the user has a means to set up the video rendering pipeline to take this into account.
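From the user side the fallback point boils down to the get_format callback; here is a minimal sketch, where app_setup_vdpau() is a hypothetical helper standing in for the backend-specific context setup:

    #include <libavcodec/avcodec.h>

    /* hypothetical application helper: creates the VDPAU device/decoder and
     * fills avctx->hwaccel_context; returns < 0 on failure */
    int app_setup_vdpau(AVCodecContext *avctx);

    /* The decoder offers the pixel formats it can output (hardware ones
     * first); pick the first one the application manages to set up,
     * otherwise defer to the default (software) choice. */
    static enum AVPixelFormat pick_format(AVCodecContext *avctx,
                                          const enum AVPixelFormat *fmt)
    {
        for (int i = 0; fmt[i] != AV_PIX_FMT_NONE; i++) {
            if (fmt[i] == AV_PIX_FMT_VDPAU && app_setup_vdpau(avctx) >= 0)
                return fmt[i];                         /* hwaccel path */
        }
        return avcodec_default_get_format(avctx, fmt); /* software fallback */
    }

The callback is installed by setting avctx->get_format before opening the decoder.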
Limits of the infrastructure
All software evolves. Since the minimalistic approach to hardware acceleration requires a huge amount of boilerplate code and deep knowledge of the bitstream formats in use, the newer APIs for the newer hardware tried to improve on that and be easier to manage for less savvy users.
APIs such as mfx try to abstract everything away and just require to be fed the input bitstream, so that the hardware input buffers can be filled and the frames produced, in presentation order, once the (parallel) decoding process yields them. They work in pull mode instead of push: when more data is needed it gets requested, and when a frame is ready the caller gets notified.
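A hedged sketch of what that pull-mode loop looks like with libmfx, as I understand the Media SDK calls; the app_* helpers are hypothetical and session setup/teardown is omitted:

    #include <mfxvideo.h>

    /* hypothetical application helpers */
    mfxFrameSurface1 *app_get_free_surface(void);
    int app_refill_bitstream(mfxBitstream *bs);
    void app_output_frame(const mfxFrameSurface1 *frame);

    void decode_loop(mfxSession session, mfxBitstream *bs)
    {
        for (;;) {
            mfxFrameSurface1 *work = app_get_free_surface();
            mfxFrameSurface1 *out  = NULL;
            mfxSyncPoint      sync = NULL;
            mfxStatus ret = MFXVideoDECODE_DecodeFrameAsync(session, bs, work,
                                                            &out, &sync);
            if (ret == MFX_ERR_MORE_DATA) {
                if (app_refill_bitstream(bs) < 0) /* pulled for more input */
                    break;                        /* end of stream */
            } else if (ret == MFX_ERR_MORE_SURFACE) {
                continue;                         /* needs another work surface */
            } else if (ret == MFX_ERR_NONE) {
                /* the frame comes out already in presentation order */
                MFXVideoCORE_SyncOperation(session, sync, 60000 /* ms */);
                app_output_frame(out);
            } else {
                break;                            /* real error */
            }
        }
    }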
For a user most of the headaches related to frame management and elementary stream parsing are gone, or almost gone, since some formats have multiple elementary stream representations and only a single one is supported…
For me (and then Anton), having an API like that poses multiple problems.
– having to feed a bitstream requires constructing it back from the software parsing; this is not terrible, it had already been done for vda (see the sketch after this list).
– the decoder wants to get the data only when the hardware buffers require more, and that would require at least a queue.
– the frames output are already in presentation order, which requires bypassing the frame reordering logic.
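As an illustration of the first point, reassembling a raw buffer out of the NAL units the software parser already split up is roughly this kind of busywork (simplified sketch, not the actual vda code; in practice SPS/PPS from the extradata have to be handled too):

    #include <stdint.h>
    #include <string.h>

    /* Prepend an Annex B start code to each already-parsed NAL unit;
     * dst is assumed to be large enough. */
    static size_t nals_to_annexb(uint8_t *dst, const uint8_t *const *nals,
                                 const size_t *sizes, int nb_nals)
    {
        static const uint8_t startcode[4] = { 0, 0, 0, 1 };
        size_t off = 0;

        for (int i = 0; i < nb_nals; i++) {
            memcpy(dst + off, startcode, sizeof(startcode));
            off += sizeof(startcode);
            memcpy(dst + off, nals[i], sizes[i]);
            off += sizes[i];
        }
        return off;
    }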
Fitting a new-style hardware acceleration API in Libav
I tried the following approaches:
– Consider libmfx a normal third party decoder
– Anton complained about completely removing the ability to use hardware memory
– Implement additional hooks to keep the hwaccel interface but avoid the bitstream parsing and frame reordering.
– Anton and others complained about preventing the transparent fallback
Then I let Anton try for himself, and in the end we agreed that the best option would be to provide an interim solution that lets the user manage the memory, and then try to complete hwaccel2 at a later time.
More on the topic later 🙂