qtdemux: Fix race on pad reconnection
While working on gst-plugins-bad!907 (merged) I wondered why qtdemux did not expose the same bug, since both use a flowcombiner and qtdemux does not seem to reset it in response to RECONFIGURE events.
By adding some logging I realized the order in which the frames were chained by qtdemux happened to be perfect to avoid all pads reaching not-linked and reproduce the bug... but should one more frame be chained (from the other srcpad, since they are alternating) the bug would reproduce.
In practice, I was never able to reproduce that race in an unpatched qtdemux: even with rr record --chaos
it was too unlikely. But that doesn't prove the race is not there and that it couldn't happen... so with more analysis I wrote a table with a scheduling that would ensure the race was reproduced. Then I made a patch to demonstrate it: by adding strategic waits using a step counter, a mutex and a condition variable, the threads run in an order that reproduces the bug consistently.
In this race there are three relevant threads:
- The qtdemux streaming thread (
qtdemux:sink
). qtdemux is working in pull mode here. - The thread sending the select-streams event (let's consider it to be the main thread here for simplicity,
main
). - The streaming thread of the intially active multiqueue srcpad downstream (
multiqueue:src_0
).
The following is an order of actions that reproduces the issue:
decodebin3 |
+-----------------------------+ +-----------------------+ |
| +-------+ +--------+ +-----------+ +---------------------+
| | src_0 |---->| sink_0 | | src_0 [T] |--+ | |
+----------+ +-------+ +--------+ +-----------+ | +------+ |
| sink [T] | qtdemux | | multiqueue | +-->| sink | audioconcat |
+----------+ +-------+ +--------+ +-----------+ +------+ |
| | src_1 |---->| sink_1 | | src_1 [T] | | |
| +-------+ +--------+ +-----------+ +---------------------+
+-----------------------------+ +-----------------------+ |
|
-
qtdemux:sink
completely chains a buffer throughsrc_0
, returningGST_FLOW_OK
atmultiqueue:sink_0
. -
qtdemux:sink
completely chains a buffer throughsrc_1
, returningGST_FLOW_NOT_LINKED
atmultiqueue:sink_1
. Because of the flow combiner in qtdemux, qtdemux continues. -
multiqueue:src_0
picks the buffer and calls the chain function of it's peer to push it downstream. - Before that chain ends,
main
sends aselect-stream
event upstream to switch the enabled audio track. In consequence, decodebin3 adds an IDLE probe to multiqueue:src_0. -
multiqueue:src_0
returns from the chain function, which immediately triggers the IDLE probe function, which unlinks downstream fromsrc_0
and links it withsrc_1
. -
qtdemux:sink
completely chains a buffer throughsrc_0
... now returningGST_FLOW_NOT_LINKED
. Both pads are not linked, the flow combiner state becomesGST_FLOW_NOT_LINKED
too and an error is propagated.
Things to note:
multiqueue's srcpads return the latest chain return of the matching multiqueue sinkpad. This is necessary because otherwise the queue would be blocking, which would defeat its purpose.
There are things that could prevent this from happening, but don't:
-
When the pads are relinked, a
RECONFIGURE
event is sent upstream. multiqueue receives it and resetssrcresult
toGST_FLOW_OK
, and also signals a condition variable to awake the srcpad streaming thread.This makes multiqueue sinkpads returnGST_FLOW_OK
... until the awaken srcpad setssrcresult
toGST_FLOW_NOT_LINKED
again pretty quickly. In every run this happened before qtdemux chained the next buffer. -
qtdemux also receives the
RECONFIGURE
event, but doesn't react to it. If qtdemux reacted to it by resetting the flowcombiner like in the case of testsrcbin, the problem would be fixed because by the time flow is possible through the pads (when the relink is done and the pads are unlocked), one of them is guaranteed to return GST_FLOW_OK.
Repro:
Checkout the branch qtdemux-switch-track-race
in https://gitlab.freedesktop.org/ntrrgc/gst-plugins-base and https://gitlab.freedesktop.org/ntrrgc/gst-plugins-good.
Then play an MP4 file with two audios in playbin3, like this one: double-audio-long.mp4
gst-play-1.0 --use-playbin3 double-audio-long.mp4
Fix:
This MR. Rebase the provided patch on top of the repro to see how the track switch works even in the bad scheduling case.