CPU pegged on thread in Janus likely due to libnice

Copied from: https://github.com/meetecho/janus-gateway/issues/2015 (was asked to escalate here)

Hi all, I am an engineer working on the same project as @mqp was when this issue was opened: https://github.com/meetecho/janus-gateway/issues/1260

At the time, that issue was closed because the CPU pegging we had seen appeared to have stopped. In fact it had not, and we never followed up. In the interim, we decided to wait and see how things fared once we upgraded Janus and its various dependencies.

Unfortunately, after the upgrade, we still see the behavior of pegged threads. Some relevant info:

- We have upgraded to Janus 0.7.6. We cannot fully upgrade to the latest version because of plugin incompatibilities that still need to be addressed. I realize it's frowned upon to open bugs for older versions, but I would at least like to know whether there is reason to believe this problem is fixed in a newer version, so we can prioritize the plugin migration.
- We have upgraded the various dependencies; notably, libnice is at master HEAD as of a few days ago.

What we see is that occasionally an hloop thread gets pegged: ps reports ~50% CPU utilization for it, and one of the CPUs sits at 100%. Attached is a perf trace, perf.report.txt, showing that the time is spent polling in g_poll. These events seem relatively rare and are hard to reproduce. We managed to reproduce one under load testing, but in general there is no systematic way to trigger it. It appears to be a very low-probability event, though it does happen more frequently against production traffic than under artificial load testing (our load test uses headless Chrome browsers).
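For anyone trying to catch this on their own nodes, here is a minimal sketch of one way to spot the pegged thread from per-thread `ps` output. It assumes a Linux procps `ps` invoked as `ps -L -o tid,pcpu,comm -p <pid>`. The function names and the 40% threshold are illustrative choices, not part of our tooling.

```python
import subprocess


def parse_busy_threads(ps_output, min_cpu=40.0):
    """Return (tid, pcpu, name) for threads at or above min_cpu, busiest first."""
    busy = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)  # the thread name may contain spaces
        if len(parts) < 3:
            continue
        try:
            tid = int(parts[0])
            cpu = float(parts[1])
        except ValueError:
            continue  # ignore rows that do not look like "tid pcpu name"
        if cpu >= min_cpu:
            busy.append((tid, cpu, parts[2].strip()))
    return sorted(busy, key=lambda t: t[1], reverse=True)


def busy_threads(pid, min_cpu=40.0):
    """Run ps for one process and return its busiest threads (hypothetical helper)."""
    out = subprocess.run(
        ["ps", "-L", "-o", "tid,pcpu,comm", "-p", str(pid)],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_busy_threads(out, min_cpu)
```

As far as I know, `ps` computes `pcpu` as CPU time divided by elapsed lifetime, so a thread that only started spinning partway through its life can show well under 100% even while it saturates a core. That may be why we see ~50% in ps alongside a CPU at 100%.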

On one of our production nodes, we managed to grab all the relevant handle metadata from the admin API. The only fishy thing was that one session had a relatively large number of handles (25 or so) compared to the others, and all of them but one had false flags around offer negotiation. Other sessions had similar non-negotiated handles, but only this one 'degenerate' session had so many handles with just a single successfully negotiated one. Hard to say whether this is useful info. Here is a dump of the handles: all_handles.json. I did a bunch of ad hoc analysis looking for telltale signs that a session was 'corrupted' based on its metadata. The best I could come up with is this same pattern: a single successfully negotiated handle alongside many non-negotiated ones, whereas every other session had at least two successfully negotiated handles.
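The kind of check described above can be sketched roughly as follows. This is not the exact script used: the layout of all_handles.json is assumed here to be a JSON list of handle objects, each with a `session_id` and a `flags` object containing a boolean `negotiated`. Those field names, and the thresholds, are assumptions for illustration.

```python
import json
from collections import defaultdict


def suspicious_sessions(handles, min_handles=10, max_negotiated=1):
    """Map session_id -> (total handles, negotiated handles) for sessions
    with many handles but at most max_negotiated negotiated ones."""
    totals = defaultdict(int)
    negotiated = defaultdict(int)
    for h in handles:
        sid = h["session_id"]  # assumed field name
        totals[sid] += 1
        # Assumed layout: per-handle boolean flags, including "negotiated".
        if h.get("flags", {}).get("negotiated", False):
            negotiated[sid] += 1
    return {
        sid: (totals[sid], negotiated[sid])
        for sid in totals
        if totals[sid] >= min_handles and negotiated[sid] <= max_negotiated
    }


def report(path):
    """Load a handle dump from disk and return the flagged sessions."""
    with open(path) as f:
        return suspicious_sessions(json.load(f))
```

Calling `report("all_handles.json")` on a dump in that assumed shape would return only the sessions that match the 'degenerate' pattern, which makes it easy to compare nodes with and without a pegged thread.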

Some things we tried/discovered:

- Forcibly detaching the handles and/or destroying the sessions from the admin API did not stop the CPU from being pegged.
- The CPU pegging has no negative side effects other than reduced capacity. Of course, once all the CPUs are pegged the server becomes unavailable, but for now we are mitigating this by running servers with many CPUs and doing nightly restarts.
- The problem manifests both when a thread is spawned for each handle and when we enable the new fixed thread feature (we are running 128 threads in production).
- Running lsof shows that each thread has the same number of open file handles, and the non-eventfd ones are identical (files, etc.).
- Digging through older issues, with the caveat that these may be long resolved (I could not determine this), I found a few relevant things:
  - This issue with libnice: #14. There did not seem to be a way to determine if/when this degenerate case occurs, though our stack traces from perf seem consistent with it. We are not running under callgrind, so there is no clear way to see non-sampled call counts.
  - This issue from a year ago seems consistent as well, though again I can't be 100% certain: #72 (closed)

Thanks for any assist!

Edited Nov 18, 2020 by Olivier Crête