rtpjitterbuffer: Fixing timer thread CPU overhead due to high wake-up rate
This bugs is related to #198 and like to #318, but the work I'd like to do goes beyond that scope. While testing performance of RIST implementation, it was found that the receiver (the one running a jitterbuffer) is an important overhead in comparison to the sender. What I found through perf analyses is that the timer thread occupy a much large space then we anticipated. Analysing the stack through the kernel raised our attention toward the wakeup rate. Most of the time seems to be spent in kernel context switching routine.
https://people.collabora.com/~nicolas/rist/2019-05-28-ristsrc-cpu-graph.svg
While tracing the wakeup, I notice in the default case (no drop, so just RTP/RTCP flow) all reschedule was done without advancing the time-out. So I thought it would not be too hard to fix, just need to properly check the timeout, but it turns out to be more difficult.
The Plan
So I collected a bit of information around to try and find what else was in progress. I don't have years in front of me, so I'm not going to finish everything everyone has ever started, but this plan is to try and get toward this goal. Here's the prior-art:
JBTimer
https://github.com/pexip/gst-plugins-good/commits/jbtimers
This is WIP to split-out one of the two timers from the GstElement code. I think this is a good idea, but I'd like to diverge a bit from it, and base the implementation on the TimerQueue.
Instead of copying the existing API, I'll write an API for it, and then port the jitterbuffer to use it. This is a bit of a shortcut I agree, but this is my style (if someone else was writing the code, it would be done differently for sure).
GstPriQueue
https://github.com/pexip/gstreamer/commit/03bfa045f9d81126e7a1e805fd4a18a6ed7f273a
That looks like the perfect data structure for the work, but I don't really have time to carry this work. Specially that I would need to actually read the code and the related papers. But as the extracted timer object will be based on queues, it should be easier to replace and benchmark. The benchmarking after my work should be much more accurate as each spurious wakeup causes an o(n) lookup each time. I also believe that in simple uses cases like RIST, I will be unable to make any gain after all from such a data structure.
The Plan
So I'll do like the JBTimer, which I have decided to name RTPTimerQueue. The RTPTimerData structure will be moved (and renamed RTPTimer, and will be turned into a GList compatible structure (like the RTPJitterBufferItem). This is a little hack to avoid some allocation and to allow zero-copy transfers from one queue to another. All required helpers will be moved. The workflow to use this will be different then the timers array, were we basically lookup the entire array each time.
So timers will be sorted at insertion base on the timeout. As new timers tends to have later timeout, the sorting won't really be a(n). For the RTX queue, the insertion will basically always be o(1).
Instead of saving bunch of random information about the current timer in priv, the access to the current timer will be done by accessing the head of the queue. The threading and the GstClockID will remains in the GstElement, at least for now. The new timer loop should only have two parts instead of 3. Top part will be to process all expired timer, and second part is to wait. The main issue with current implementation is that we wakeup for a saved current timer, and that's why we ended up having to wake it up all the time since we need to refresh this current timer.
Testing
- The usual unit test will of course have to pass
- I'll use the test_performance to make sure the results are good (at least don't regress)
- Move flamegraph will be produced with same RIST flow I had before
- I'll notify @pexip for candiates
Branch
Work will pushed to this branch (careful, I constantly rebase my work branches):
https://gitlab.freedesktop.org/ndufresne/gst-plugins-good/commits/jitterbuffer-timer-reschedule