waylandsink: add support for wayland presentation time interface
Submitted by Wonchul Lee
Link to original bug (#768079)
Description
I'm bringing over the comments on this task and wrapping each writer's name in angle brackets; sorry for the poor readability.
Waylandsink was previously handled by George Kiagiadakis, who had written presentation time interface code for a demo, but the interface has since changed and settled down as a stable protocol.
I started from George's work (http://cgit.collabora.com/git/user/gkiagia/gst-plugins-bad.git/log/?h=demo), removing the presentation queue and accounting for display stack delay.
That approach predicted the display stack latency from wl_commit/damage/attach to frame presentation, but Pekka Paalanen (pq) advised that it would not reliably estimate the delay from wl_surface_commit() to display.
(excerpt from the comments)
<pq>
wonchul, if you are trying to estimate the delay from wl_surface_commit() to display, and you don't sync the time you call commit() to the incoming events, that's going to be a lot less accurate.
<pq>
no, I literally meant replacing the queueing protocol calls with a queue implementation in the sink, so you don't use the queueing protocol anymore, but rely only on the feedback protocol to trigger attach+commits from the queue.
<pq>
the queue being a timestamp-ordered list of frames, just like in the weston implementation.
So, estimating the delay that way from Wayland is not very accurate.
Instead, I added a queue in waylandsink that holds buffers before render() is done.
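A minimal sketch of what such a queue could look like, assuming GLib/GStreamer types; the helper name is hypothetical and this is not the actual patch:

```c
/* Sketch only: keep frames ordered by PTS instead of rendering
 * immediately; a feedback-driven thread pops the next due frame. */
#include <gst/gst.h>

static gint
compare_pts (gconstpointer a, gconstpointer b, gpointer user_data)
{
  GstClockTime ta = GST_BUFFER_PTS ((GstBuffer *) a);
  GstClockTime tb = GST_BUFFER_PTS ((GstBuffer *) b);

  return (ta < tb) ? -1 : (ta > tb) ? 1 : 0;
}

/* called from render() instead of attaching/committing right away */
static void
enqueue_frame (GQueue * frame_queue, GstBuffer * buffer)
{
  g_queue_insert_sorted (frame_queue, gst_buffer_ref (buffer),
      compare_pts, NULL);
}
```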
<Olivier Crête>
I'm a bit concerned about adding a queue in the sink that would increase the latency unnecessarily. I wonder if this could be done while queueing around 1 buffer there in normal streaming. Are we talking about queuing the actual frames or just information about the frames?
<Wonchul Lee>
I've queued references to the frames and tried to render based on the Wayland presentation clock.
Adding a queue in the sink could introduce some delay depending on the specific content. It's not clear to me yet which factor causes the delay, but yes, it would increase the latency at the moment.
The idea was to disable clock synchronization in gstbasesink and render (wayland commit/damage/attach) frames based on the calibrated Wayland clock. I pushed references to the GstBuffers onto the queue and set an async clock callback to request rendering at the right time, then rendered or dropped each frame depending on its adjusted timestamp.
This change has an issue: the adjusted timestamp at which rendering is requested keeps slipping later than expected, and in some cases that caused most of the frames to be dropped, since the adjusted timestamp was always late.
So I'm now looking at audiobasesink as a reference for adjusting the clock synchronization of frames against the Wayland clock.
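A rough sketch of the async-wakeup idea described above (the field and helper names are hypothetical; `target` is an absolute pipeline clock time):

```c
static gboolean
render_cb (GstClock * clock, GstClockTime time, GstClockID id,
    gpointer user_data)
{
  GstWaylandSink *sink = user_data;

  /* pop the due frame and do wl_surface_attach() / wl_surface_damage()
   * / wl_surface_commit(); drop it if the adjusted timestamp is late */
  render_next_frame (sink);   /* hypothetical helper */
  return TRUE;
}

static void
schedule_render (GstWaylandSink * sink, GstClockTime target)
{
  GstClock *clock = gst_element_get_clock (GST_ELEMENT (sink));
  GstClockID id = gst_clock_new_single_shot_id (clock, target);

  sink->pending_clock_id = id;  /* assumed field, for unscheduling on flush */
  gst_clock_id_wait_async (id, render_cb, sink, NULL);
  gst_object_unref (clock);
}
```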
<Olivier Crête>
This work has two separate goals:
When the video has a different framerate than the display, it should drop frames more or less evenly: if you need to display 4 out of every 5 frames, it should be something like 1,2,3,4,6,7,8,9,11,...; if you need to display 30 out of 60 frames, it should display 1,3,5,7,9, etc. Currently, GstBaseSink is not very clever about that (a possible decision rule is sketched after the next paragraph).
And we have to be careful, as this can also be caused by the compositor not being able to keep up. Just because the display can do 60 fps doesn't mean the compositor can actually produce 60 new frames; it could be limited to a lower number, so we'll also have to make sure we're protected against that.
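One possible even-drop decision rule, as an illustrative sketch only (it assumes the keep ratio is known):

```c
#include <glib.h>

/* Spread drops evenly with an error accumulator: keep `num` out of
 * every `den` input frames, e.g. num=4, den=5 or num=30, den=60. */
typedef struct {
  gint num;   /* frames to keep per cycle */
  gint den;   /* cycle length             */
  gint acc;   /* running error            */
} DropState;

static gboolean
should_display (DropState * s)
{
  s->acc += s->num;
  if (s->acc >= s->den) {
    s->acc -= s->den;
    return TRUE;    /* display this frame */
  }
  return FALSE;     /* drop it            */
}
```

With num=4, den=5 this yields a steady drop-one-in-five pattern rather than bursts of consecutive drops.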
We want to guess the latency added by the display stack. The current GStreamer video sinks more or less assume that a buffer is rendered immediately when the render() vmethod returns, but that is not really how current display hardware works, especially with double or triple buffering. So we want to know how far in advance to submit the buffer, but not so early that it gets displayed one interval too soon.
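For reference, measuring that delay with the stable presentation-time protocol could look roughly like this (sketch only; `now_on_presentation_clock()` and the stats handling are hypothetical):

```c
#include <stdint.h>
#include <wayland-client.h>
#include "presentation-time-client-protocol.h"

struct frame_stats {
  uint64_t commit_time_ns;   /* sampled on the wp_presentation clock */
};

static void
feedback_sync_output (void *data, struct wp_presentation_feedback *fb,
    struct wl_output *output)
{
  /* not needed for latency measurement */
}

static void
feedback_presented (void *data, struct wp_presentation_feedback *fb,
    uint32_t tv_sec_hi, uint32_t tv_sec_lo, uint32_t tv_nsec,
    uint32_t refresh, uint32_t seq_hi, uint32_t seq_lo, uint32_t flags)
{
  struct frame_stats *stats = data;
  uint64_t present_ns =
      ((((uint64_t) tv_sec_hi << 32) | tv_sec_lo) * 1000000000ULL) + tv_nsec;
  uint64_t latency_ns = present_ns - stats->commit_time_ns;

  /* feed latency_ns into a running estimate of the display-stack delay */
  wp_presentation_feedback_destroy (fb);
}

static void
feedback_discarded (void *data, struct wp_presentation_feedback *fb)
{
  wp_presentation_feedback_destroy (fb);
}

static const struct wp_presentation_feedback_listener feedback_listener = {
  feedback_sync_output, feedback_presented, feedback_discarded,
};

/* per frame: request feedback just before attach/damage/commit */
static void
submit_with_feedback (struct wp_presentation *pres, struct wl_surface *surf,
    struct frame_stats *stats)
{
  struct wp_presentation_feedback *fb = wp_presentation_feedback (pres, surf);

  wp_presentation_feedback_add_listener (fb, &feedback_listener, stats);
  stats->commit_time_ns = now_on_presentation_clock ();  /* hypothetical */
  /* wl_surface_attach() / wl_surface_damage() / wl_surface_commit() here */
}
```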
I just asked @nicolas a quick question about how he thought we should do this; we then spent two hours whiteboarding ideas and have barely been able to define the problem.
Here are some ideas we bounced around:
After submitting one frame (the first frame? the preroll frame?), we can get an idea of the upper bound of the latency for the live pipeline case. It should be the time between the moment a frame was submitted and when it was finally rendered, plus the "refresh". We can probably delay sending async-done until the presented event for the first frame has arrived.
For the non-live case, we can probably find a way to submit each frame as early as possible before the next one. Finding that time is the tricky part, I think.
@wonchul: could you summarize the different things you tried, what the hypotheses were and what the results were? It's important to keep these kinds of records for the Tax R&D filings (and so we can keep up with your work).
What is the logic behind the seq field, and how do you expect it can be used? Do you know of any example where it is used?
I'm also not sure how we can detect the case where the compositor cannot keep up. Or the case where the compositor is gnome-shell and has a GC that makes it miss a couple of frames for no good reason?
From the info in the presented event (or any other way), is there a way we can evaluate the latest time at which we can submit a buffer and still have it arrive in time for a specific refresh? Or do we have to try, and then do some kind of search to find what those deadlines are in practice?
<Pekka Paalanen>
seq field of wp_presentation_feedback.presented event:
No examples of use, I don't think. I didn't originally consider it needed, but it was added to allow implementing GLX_OML_sync_control on top of it. I do not think we should generally depend on seq unless you specifically care about the refresh count rather than the timings. My intention with the design was that new code can work better with timestamps, while old code that you don't want to port to timestamps can keep using seq as it always has. Timestamps are "accurate", while seq may have been estimated from a clock in the kernel and may change its rate, or may not have a constant rate at all.
seq comes from a time when display refresh was a known, guaranteed constant frequency, and you could use it as a clock simply by counting cycles. I believe all timing-sensitive X11 apps have been written with this assumption. But it is no longer exactly true, it has caveats (hard to maintain across video mode switches or display suspends, lacking hardware support, etc.), and with new display tech it will become even less true (variable refresh rates, self-refresh panels, ...).
seq is not guaranteed to be provided; it may be zero depending on the graphics stack used by the compositor. I'm also not sure what it means if you don't have both VSYNC and HW_COMPLETION in flags.
The timestamp, OTOH, is always provided, but it may have some caveats, which should be indicated by unset bits in flags.
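In code terms, that reading of the presented event might look like this (a sketch; the error-margin handling is left as a comment):

```c
#include <stdbool.h>
#include <stdint.h>
#include "presentation-time-client-protocol.h"

/* The timestamp is always there; flags say how much to trust it.
 * Only fall back to seq (seq_hi/seq_lo, possibly 0) when you truly
 * need refresh counts rather than timings. */
static uint64_t
feedback_timestamp_ns (uint32_t tv_sec_hi, uint32_t tv_sec_lo,
    uint32_t tv_nsec, uint32_t flags)
{
  bool vsync = flags & WP_PRESENTATION_FEEDBACK_KIND_VSYNC;
  bool hw_clock = flags & WP_PRESENTATION_FEEDBACK_KIND_HW_CLOCK;
  bool hw_done = flags & WP_PRESENTATION_FEEDBACK_KIND_HW_COMPLETION;

  if (!vsync || !hw_clock || !hw_done) {
    /* timestamp is an estimate: widen the error margins in any
     * latency statistics instead of trusting it exactly */
  }

  return ((((uint64_t) tv_sec_hi << 32) | tv_sec_lo) * 1000000000ULL)
      + tv_nsec;
}
```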
Compositor not keeping up:
Maybe you could use the tv + refresh from the presented event to guess when the compositor should be presenting your frame, and compare afterwards with what actually happened?
I can't really think of a good way to know whether the compositor cannot keep up, or why it cannot. Hiccups can happen, and the compositor probably won't know why either. All I can say is: collect statistics and analyze them over time. This might be a topic for further investigation, but to get more information about which steps take too much time we need kernel support (explicit fencing) that is still being developed, and the compositor has to be made to use that information.
Only hand-waving, sorry.
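Still, the statistics idea above can be sketched (all names are hypothetical):

```c
#include <stdint.h>

static void
record_skipped_cycle (int64_t error_ns)
{
  /* hypothetical bookkeeping: accumulate and analyze over time */
  (void) error_ns;
}

/* Predict the next presentation as previous + refresh; a large
 * positive error suggests the compositor skipped a cycle. */
static void
track_miss (uint64_t prev_present_ns, uint64_t refresh_ns,
    uint64_t actual_present_ns)
{
  uint64_t expected_ns = prev_present_ns + refresh_ns;
  int64_t error_ns = (int64_t) (actual_present_ns - expected_ns);

  if (error_ns > (int64_t) (refresh_ns / 2))
    record_skipped_cycle (error_ns);
}
```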
Finding the deadline:
I don't think there is a way to really know, and the compositor might also be adjusting its own schedule, so it might be variable.
The way I imagined it is that from the presented event you compute the time of the next possible presentation, and if you want to hit that, you submit a frame ASAP. This should get you just below one display-frame-cycle of latency in any case, if your rendering is already complete.
If we really need the deadline, that would call for extending the protocol, so that the compositor could tell you when the deadline is. The compositor chooses the deadline based on how fast it thinks it can do a composition and hit the right vblank.
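The ASAP strategy described above reduces to a small computation (a sketch, with all values on the wp_presentation clock):

```c
#include <stdint.h>

/* Earliest vblank we can still aim for, stepping forward in whole
 * refresh cycles from the last presented event. */
static uint64_t
next_presentation_ns (uint64_t last_present_ns, uint64_t refresh_ns,
    uint64_t now_ns)
{
  uint64_t target = last_present_ns;

  while (target <= now_ns)
    target += refresh_ns;
  return target;
}
```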
<Wonchul Lee>
About the latency: I tried to measure the latency added by the display stack, from the wl commit/damage/attach to the presented frame. It's a variable delay depending on the situation, as pq mentioned before, and it could disturb targeting the next presentation. We could estimate an optimal latency by accumulating it and observing the gap via presentation feedback, but that may not always be reliable.
I tried to synchronize the GStreamer clock with the presentation feedback so frames render on time, and added a queue in GstWaylandSink to request a render on each presentation feedback if a frame is due, similar to what George did. It doesn't fit well with GstBaseSink, though: GstWaylandSink needs to disable the BaseSink time synchronization and do the computation itself. I ran into unexpected underflow (a consistently increasing delay) when playing an mpegts stream, so it also needs proper QoS handling to prevent underflow.
It would be good to get a reliable latency figure from the display stack to use when synchronizing presentation time, whether GstWaylandSink computes it itself or not; there is a latency we're missing either way, though I'm not sure that's feasible.
<Pekka Paalanen>
@wonchul btw. what do you mean when you say "synchronize GStreamer clock time with presentation feedback"?
Does it mean something other than looking at which clock is advertised by wp_presentation.clock_id and then synchronizing the GStreamer clock with clock_gettime() using the given clock id? Or does synchronizing mean something other than being able to convert a timestamp from one clock domain to the other?
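For clarity, that interpretation would be roughly the following (a sketch; listener registration and error handling are omitted or abbreviated):

```c
#include <stdint.h>
#include <time.h>
#include "presentation-time-client-protocol.h"

static clockid_t presentation_clock;   /* filled in by the event below */

/* handler for the wp_presentation.clock_id event, which advertises
 * which clock the feedback timestamps use */
static void
handle_clock_id (void *data, struct wp_presentation *pres, uint32_t clk_id)
{
  presentation_clock = (clockid_t) clk_id;   /* e.g. CLOCK_MONOTONIC */
}

static const struct wp_presentation_listener presentation_listener = {
  handle_clock_id,
};

/* Sample the same clock domain as the feedback timestamps, so they
 * can be compared (or calibrated against the GStreamer clock). */
static uint64_t
presentation_now_ns (void)
{
  struct timespec ts;

  clock_gettime (presentation_clock, &ts);
  return (uint64_t) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}
```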
<Nicolas Dufresne>
@pq I would need some clarification about submitting frames ASAP. If we blindly do that, frames will get displayed too soon on screen (in playback, decoders are much faster than the expected render speed). In GStreamer, we have infrastructure to wait until the moment is right. The logic (simplified) is to wait for the right moment minus the "currently expected" render latency, and then submit. This is in the playback case, of course, and is meant to ensure the best possible A/V sync. In that case we expect the presentation information to help in constantly correcting that moment. What we're missing is some semantics, as just blindly obeying the computed render delay of the last frames does not seem like the best idea. We expected to be able to calculate, or estimate, a submission window that will (most of the time) hit the screen at an estimated time.
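One way to wire a presentation-feedback estimate into that existing infrastructure, as a sketch (the threshold and helper are assumptions, not an agreed design):

```c
#include <gst/base/gstbasesink.h>

/* GstBaseSink already waits until (PTS - render delay) before calling
 * ::render(), so a display-stack estimate could be fed back there. */
static void
update_render_delay (GstBaseSink * sink, GstClockTime estimate_ns)
{
  GstClockTime old = gst_base_sink_get_render_delay (sink);

  /* render delay is part of the pipeline latency, so only significant
   * changes should trigger a latency reconfigure */
  if (ABS (GST_CLOCK_DIFF (old, estimate_ns)) > 1 * GST_MSECOND) {
    gst_base_sink_set_render_delay (sink, estimate_ns);
    gst_element_post_message (GST_ELEMENT (sink),
        gst_message_new_latency (GST_OBJECT (sink)));
  }
}
```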
For the live case, we're still quite screwed; nothing seems to improve our situation. We need to pick a latency at the start, and if we later find that this latency was too small (the latency is the window in which we are able to adapt), we end up damaging the audio (a glitch) to increase that latency window. So again, some semantics we could use to calculate a pessimistic latency from the first presentation report would be nice.
<Olivier Crête>
I think that in the live case you can probably keep a 1-frame queue at the sink, so when a new frame arrives, you can decide whether to present the queued one at the next refresh or replace it with the new one. Then the thread that talks to the compositor (and receives the events, etc.) can pick buffers from that "queue" to send to the compositor.
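A minimal sketch of such a one-frame slot (names hypothetical):

```c
#include <gst/gst.h>

typedef struct {
  GMutex lock;
  GstBuffer *pending;   /* at most one undisplayed frame */
} LateSlot;

/* streaming thread: newest frame wins, the older one is dropped */
static void
slot_put (LateSlot * s, GstBuffer * buf)
{
  g_mutex_lock (&s->lock);
  gst_buffer_replace (&s->pending, buf);
  g_mutex_unlock (&s->lock);
}

/* compositor thread: take whatever is current at each refresh */
static GstBuffer *
slot_take (LateSlot * s)
{
  GstBuffer *buf;

  g_mutex_lock (&s->lock);
  buf = s->pending;
  s->pending = NULL;
  g_mutex_unlock (&s->lock);
  return buf;   /* may be NULL if nothing new arrived */
}
```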
<Nicolas Dufresne>
OK, that makes sense for non-live. It would be nice to document the intended use; it was far from obvious. We kept thinking we needed to look at the numbers, but didn't understand at first that the moment we get called back is what matters. You seem to assume that we can "pick" a frame, as if the sink were pulling whatever it wants at random; that's unfortunately not how things work. We can, though, introduce a small queue (a "late" queue) so we only start blocking upstream when that queue is full, and that would help in making decisions.
For live it's much more complex. The whole story about declared latency exists because if we don't declare any latency, that queue will always be empty; worst case, the report will always tell us that we displayed the frame late. I'm quite sure you told me that the render pipeline can have multiple stages, where submitting frames 1, 2, 3 one vblank apart has them rendered on vblanks 3, 4, 5, with effectively 3 vblanks of latency. That latency is what we need to report for proper A/V sync in a live pipeline, and changing it has to be done with care, as it breaks the audio. There we need some ideas, because right now we have no clue.
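As a back-of-envelope sketch of what such a multi-stage pipeline implies for the latency we would declare (assuming the depth were known, which is exactly what we lack):

```c
#include <gst/gst.h>

/* A frame submitted just before vblank N that reaches the screen at
 * vblank N+depth adds depth refresh cycles of latency, e.g. depth 2
 * at 60 Hz is 2 * 16.67 ms = 33.3 ms to declare upstream. */
static GstClockTime
display_stack_latency (guint depth, GstClockTime refresh_interval)
{
  return depth * refresh_interval;
}
```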