codecs: h264decoder: Add support for output delay
Some decoding APIs support delayed output: the call that decodes a frame does not have to be immediately followed by the corresponding call that retrieves the decoded frame. For instance, a subclass might be able to queue decoding of multiple frames and then fetch only the oldest decoded frame. When the underlying decoding API supports this, delayed output can improve throughput.
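As a rough illustration of what this looks like from a subclass' point of view (the vfunc name and signature below are assumptions based on this discussion, not a confirmed API), the base class would ask the subclass how many frames it may hold back, taking liveness into account:

```c
/* Sketch only: header, vfunc name and signature are assumptions. */
#define GST_USE_UNSTABLE_API
#include <gst/codecs/gsth264decoder.h>

static guint
gst_my_h264_dec_get_preferred_output_delay (GstH264Decoder * decoder,
    gboolean is_live)
{
  /* In live pipelines extra latency is undesirable, so report no delay. */
  if (is_live)
    return 0;

  /* Otherwise allow a few in-flight decode requests so the hardware
   * queue never runs dry between finish_frame calls. */
  return 4;
}

/* ... and in class_init the subclass would hook this up, e.g.:
 * h264decoder_class->get_preferred_output_delay =
 *     gst_my_h264_dec_get_preferred_output_delay;
 */
```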
- Resolved by Nicolas Dufresne
@ndufresne I guess delaying output would be helpful for v4l2 stateless decoder perhaps?
added 78 commits
- dcc19a0c...0b0bf1b0 - 76 commits from branch gstreamer:master
- 42441089 - codecs: h264decoder: Add support for output delay
- a6d36984 - nvh264sldec: Add support for output-delay to improve throughput performance
Measured transcoding performance using https://download.blender.org/durian/movies/Sintel.2010.1080p.mkv (tested on a Windows 10 desktop with an RTX 2080)
pipeline:
gst-launch-1.0 filesrc location=Sintel.2010.1080p.mkv ! matroskademux ! h264parse ! nvh264sldec ! queue max-size-time=0 max-size-buffers=3 max-size-bytes=0 ! nvh264enc preset=hp ! fakesink
Time taken (average of 3 runs)
- AS-IS: 30.06 sec
- TO-BE: 28.48 sec
Edited by Seungha Yang
Ok, got something running (a bit ugly) on V4L2 now. Anything higher than 1 for the delay yields the same performance, it simply uses more RAM. So I'm decoding Sony Sushi 4K Demo.mkv, a 4K60 video from the 4kmedia website, which is 2m32.8s long.
- Render delay 0: 2m47.3s
- Render delay 1+: 2m19.9s
Now, I find the virtual function a bit weird to use, but I totally understand that this is the only way to convey the liveness.
Now the delay seems highly HW specific. I guess NVidia is considered one HW and we have documentation for a logical default, but for V4L2, 1 is enough for all single-core designs, whereas for RPi, which has two decode threads (a bit like hyperthreading on a CPU), a delay of 2 would be needed to never starve the pipeline.
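A minimal sketch of how such a per-hardware default could be expressed (the helper and the num_decode_threads parameter are hypothetical, not an existing V4L2 query):

```c
#include <glib.h>

/* Hypothetical helper: pick the render delay from per-hardware knowledge.
 * "num_decode_threads" is an assumed description of the HW, not an
 * existing V4L2 API. */
static guint
pick_render_delay (gboolean is_live, guint num_decode_threads)
{
  if (is_live)
    return 0;                   /* never add latency to live pipelines */

  /* One in-flight request per hardware decode thread keeps the engine
   * busy (1 for single-core designs, 2 for the RPi's two threads);
   * anything higher just uses more RAM. */
  return MAX (num_decode_threads, 1);
}
```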
- Resolved by Nicolas Dufresne
mentioned in merge request !1881 (merged)
assigned to @gstreamer-merge-bot
mentioned in commit seungha.yang/gst-plugins-bad@fba807be
mentioned in commit seungha.yang/gst-plugins-bad@86e312c1
added 4 commits
- a6d36984...a417a761 - 2 commits from branch gstreamer:master
- 86e312c1 - codecs: h264decoder: Add support for output delay
- fba807be - nvh264sldec: Add support for output-delay to improve throughput performance
I can speak for v4l. In that case we have a request queue and we set a maximum number of requests, so if that max is reached we sync right after queuing the current request. Otherwise, we'll sync on job completion (wait for the request to complete) before calling finish_frame.
The max is needed for two reasons: we want to limit the bitstream memory overhead that allowing reorder depth + delay would cost, and also limit the memory overhead for per-slice decoders.
Requests are processed in order by the driver; if they were processed in parallel, some extra work would perhaps be needed.
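A rough sketch of that bounded request queue pattern (the Request type and the two helpers are hypothetical, not the actual v4l2codecs code):

```c
#include <glib.h>

/* Hypothetical request type: in real code this would wrap a media
 * request fd plus its associated codec frame. */
typedef struct _Request Request;

/* Hypothetical helpers, declared only so the sketch is self-contained. */
void wait_for_request_done (Request * req);
void finish_frame_for_request (Request * req);

/* Bounded in-flight request queue: requests are queued to the driver up
 * to a configured maximum; once full, the oldest request is waited on
 * and its frame finished, so memory stays bounded and frames come out
 * in decode order. */
typedef struct
{
  GQueue pending;               /* queued requests, oldest first */
  guint max_pending;            /* bounded by reorder depth + render delay */
} RequestPool;

static void
submit_request (RequestPool * pool, Request * req)
{
  g_queue_push_tail (&pool->pending, req);

  if (g_queue_get_length (&pool->pending) < pool->max_pending)
    return;                     /* still room: let the driver work ahead */

  /* Queue is full: sync on the oldest request before accepting more. */
  {
    Request *oldest = g_queue_pop_head (&pool->pending);
    wait_for_request_done (oldest);
    finish_frame_for_request (oldest);
  }
}
```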