codecs: h264decoder: Add support for output delay
Some decoding APIs support delayed output: the call that decodes a frame does not have to be immediately followed by the corresponding call that retrieves the decoded frame. For instance, a subclass might be able to queue decoding of multiple frames and then fetch only the oldest decoded frame. When the underlying decoding API supports this, delayed output can improve throughput.
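As a rough illustration of what this looks like from a subclass' point of view (the vfunc name and signature below are assumptions based on this discussion, not a confirmed API), the base class would ask the subclass how many frames it may hold back, taking liveness into account:

```c
/* Sketch only: header, vfunc name and signature are assumptions. */
#define GST_USE_UNSTABLE_API
#include <gst/codecs/gsth264decoder.h>

static guint
gst_my_h264_dec_get_preferred_output_delay (GstH264Decoder * decoder,
    gboolean is_live)
{
  /* In live pipelines extra latency is undesirable, so report no delay. */
  if (is_live)
    return 0;

  /* Otherwise allow a few in-flight decode requests so the hardware
   * queue never runs dry between finish_frame calls. */
  return 4;
}

/* ... and in class_init the subclass would hook this up, e.g.:
 * h264decoder_class->get_preferred_output_delay =
 *     gst_my_h264_dec_get_preferred_output_delay;
 */
```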
- Resolved by Nicolas Dufresne
@ndufresne I guess delaying output would be helpful for v4l2 stateless decoder perhaps?
added 78 commits
- dcc19a0c...0b0bf1b0 - 76 commits from branch gstreamer:master
- 42441089 - codecs: h264decoder: Add support for output delay
- a6d36984 - nvh264sldec: Add support for output-delay to improve throughput performance
Measured transcoding performance using https://download.blender.org/durian/movies/Sintel.2010.1080p.mkv (tested on a Windows 10 desktop with an RTX 2080)
pipeline:
gst-launch-1.0 filesrc location=Sintel.2010.1080p.mkv ! matroskademux ! h264parse ! nvh264sldec ! queue max-size-time=0 max-size-buffers=3 max-size-bytes=0 ! nvh264enc preset=hp ! fakesink
Time taken (average of 3 runs)
- AS-IS: 30.06 sec
- TO-BE: 28.48 sec
Edited by Seungha Yang
Ok, got something running (a bit ugly) on V4L2 now. Anything higher than 1 for the delay yields the same performance, it simply uses more RAM. So I'm decoding Sony Sushi 4K Demo.mkv, a 4K60 video from the 4kmedia website, which is 2m32.8s long.
- Render delay 0: 2m47.3s
- Render delay 1+: 2m19.9s
Now, I find the virtual function a bit weird to use, but I totally understand that this is the only way to convey the liveness.
Now the delay seems highly HW specific. I guess NVidia is considered one HW and we have documentation for a logical default, but for V4L2, 1 is enough for all single-core designs, whereas for RPi, which has two decode threads (a bit like hyperthreading on a CPU), a delay of 2 would be needed to never starve the pipeline.
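A minimal sketch of how such a per-hardware default could be expressed (the helper and the num_decode_threads parameter are hypothetical, not an existing V4L2 query):

```c
#include <glib.h>

/* Hypothetical helper: pick the render delay from per-hardware knowledge.
 * "num_decode_threads" is an assumed description of the HW, not an
 * existing V4L2 API. */
static guint
pick_render_delay (gboolean is_live, guint num_decode_threads)
{
  if (is_live)
    return 0;                   /* never add latency to live pipelines */

  /* One in-flight request per hardware decode thread keeps the engine
   * busy (1 for single-core designs, 2 for the RPi's two threads);
   * anything higher just uses more RAM. */
  return MAX (num_decode_threads, 1);
}
```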
- Resolved by Nicolas Dufresne
mentioned in merge request !1881 (merged)
assigned to @gstreamer-merge-bot
mentioned in commit seungha.yang/gst-plugins-bad@fba807be
mentioned in commit seungha.yang/gst-plugins-bad@86e312c1
added 4 commits
- a6d36984...a417a761 - 2 commits from branch gstreamer:master
- 86e312c1 - codecs: h264decoder: Add support for output delay
- fba807be - nvh264sldec: Add support for output-delay to improve throughput performance
I can speak for v4l. In that case we have a request queue and we set a maximum number of requests, so if that max is reached we sync right after queuing the current request. Otherwise, we'll sync on job completion (wait for the request to complete) before calling finish_frame.
The max is needed for two reasons: we want to limit the bitstream memory overhead that allowing reorder depth + delay would cost, and also limit the memory overhead for per-slice decoders.
Requests are processed in order by the driver; if they were processed in parallel, some extra work would perhaps be needed.
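A rough sketch of that bounded request queue pattern (the Request type and the two helpers are hypothetical, not the actual v4l2codecs code):

```c
#include <glib.h>

/* Hypothetical request type: in real code this would wrap a media
 * request fd plus its associated codec frame. */
typedef struct _Request Request;

/* Hypothetical helpers, declared only so the sketch is self-contained. */
void wait_for_request_done (Request * req);
void finish_frame_for_request (Request * req);

/* Bounded in-flight request queue: requests are queued to the driver up
 * to a configured maximum; once full, the oldest request is waited on
 * and its frame finished, so memory stays bounded and frames come out
 * in decode order. */
typedef struct
{
  GQueue pending;               /* queued requests, oldest first */
  guint max_pending;            /* bounded by reorder depth + render delay */
} RequestPool;

static void
submit_request (RequestPool * pool, Request * req)
{
  g_queue_push_tail (&pool->pending, req);

  if (g_queue_get_length (&pool->pending) < pool->max_pending)
    return;                     /* still room: let the driver work ahead */

  /* Queue is full: sync on the oldest request before accepting more. */
  {
    Request *oldest = g_queue_pop_head (&pool->pending);
    wait_for_request_done (oldest);
    finish_frame_for_request (oldest);
  }
}
```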