ts: implementations evaluation
Testing threadshare variants
ts-standalone
ts-standalone is a tool designed to test the performance of the threadshare
runtime and the implementation variants.
The tool instantiates a pipeline with S sub-pipelines, each consisting of a source
and a sink. Each source pushes buffers at a fixed interval and the matching
sink receives the buffers and logs statistics such as the actual interval between
buffers and the duration between the buffer's generation and its reception. This
last measure will be referred to as the latency in the rest of this description.
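The two statistics described above can be sketched as follows. This is a minimal illustration, not the code used by ts-standalone: the `Sample` struct and the `mean_interval_and_latency` function are hypothetical names, and timestamps are assumed to be durations measured from the start of the run.

```rust
use std::time::Duration;

/// Hypothetical record for one received buffer: when the source generated it
/// and when the sink received it, both measured from the start of the run.
#[derive(Clone, Copy, Debug)]
struct Sample {
    generated: Duration,
    received: Duration,
}

/// Returns (mean interval between consecutive receptions, mean latency).
/// Returns `None` when fewer than two samples are available.
fn mean_interval_and_latency(samples: &[Sample]) -> Option<(Duration, Duration)> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as u32;

    // Interval: time elapsed between the reception of two consecutive buffers.
    let interval_sum: Duration = samples
        .windows(2)
        .map(|pair| pair[1].received.saturating_sub(pair[0].received))
        .sum();

    // Latency: time elapsed between a buffer's generation and its reception.
    let latency_sum: Duration = samples
        .iter()
        .map(|sample| sample.received.saturating_sub(sample.generated))
        .sum();

    Some((interval_sum / (n - 1), latency_sum / n))
}

fn main() {
    // Three buffers generated every 20 ms, each received 2 µs after generation.
    let samples: Vec<Sample> = (0..3u64)
        .map(|i| Sample {
            generated: Duration::from_millis(20 * i),
            received: Duration::from_millis(20 * i) + Duration::from_micros(2),
        })
        .collect();

    let (interval, latency) = mean_interval_and_latency(&samples).unwrap();
    println!("mean interval: {interval:?}, mean latency: {latency:?}");
}
```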
When compiled with the tuning feature, the statistics also include the total
duration the threadshare scheduler has spent parked. Considered as a percentage of
the processing duration, the parked duration gives a good approximation of
100% - CPU usage.
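As a small worked example of this relation, the sketch below derives the parked percentage and the approximate CPU usage from two durations. The function names are hypothetical and are not part of the threadshare runtime.

```rust
use std::time::Duration;

/// Parked duration expressed as a percentage of the processing duration.
fn parked_percent(parked: Duration, processing: Duration) -> f64 {
    if processing.is_zero() {
        return 0.0;
    }
    100.0 * parked.as_secs_f64() / processing.as_secs_f64()
}

/// Approximate CPU usage, following the relation `100% - parked percentage`.
fn approx_cpu_usage_percent(parked: Duration, processing: Duration) -> f64 {
    100.0 - parked_percent(parked, processing)
}

fn main() {
    // Illustrative values only: a scheduler parked for 75 s out of 100 s.
    let parked = Duration::from_secs(75);
    let processing = Duration::from_secs(100);
    println!(
        "parked: {:.1}%, approx. CPU usage: {:.1}%",
        parked_percent(parked, processing),
        approx_cpu_usage_percent(parked, processing)
    );
}
```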
Statistics start after a ramp up period and stop early enough so as to avoid transient effects.
ts-standalone already implements a sink for which the PadSinkHandler passes
buffers to a ts runtime Task via an asynchronous channel. This design is
convenient from a developer's perspective as they can take advantage of the
TaskImpl functions accepting &mut self, which means no synchronization
primitives are required in the hot path.
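The shape of this design can be illustrated with the standard library alone. The sketch below is an analogy under stated assumptions: an OS thread and a `std::sync::mpsc` channel stand in for the ts runtime Task and its asynchronous channel, and `Buffer`, `SinkTask` and `run_task_sink` are hypothetical names. It only shows why the loop state can be mutated through `&mut self` without any lock.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for a buffer; the real element handles GStreamer buffers.
struct Buffer(u64);

/// State owned exclusively by the task loop.
#[derive(Default)]
struct SinkTask {
    buffer_count: u64,
    payload_sum: u64,
}

impl SinkTask {
    /// Takes `&mut self`: the loop owns the state, so no lock is needed here.
    fn handle_buffer(&mut self, buffer: Buffer) {
        self.buffer_count += 1;
        self.payload_sum += buffer.0;
    }
}

/// Plays the role of the pad handler: forwards each buffer through the channel
/// and returns the (count, sum) computed by the task loop.
fn run_task_sink(payloads: Vec<u64>) -> (u64, u64) {
    let (sender, receiver) = mpsc::channel::<Buffer>();

    let task_loop = thread::spawn(move || {
        let mut task = SinkTask::default();
        // The iteration ends once every sender has been dropped.
        for buffer in receiver {
            task.handle_buffer(buffer);
        }
        (task.buffer_count, task.payload_sum)
    });

    for payload in payloads {
        sender.send(Buffer(payload)).expect("task loop stopped");
    }
    drop(sender);

    task_loop.join().expect("task loop panicked")
}

fn main() {
    let (count, sum) = run_task_sink(vec![1, 2, 3]);
    println!("handled {count} buffers, payload sum {sum}");
}
```

Note that the statistics only become visible to the caller once the loop hands them back, which mirrors the visibility drawback discussed next.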
However, one drawback of using a Task is that fields which are necessary to the
Task loop implementation are not visible to the element (unless, of course, they
are shared as an Arc<Mutex<_>>). Another drawback is that buffers need to go
through a channel, which may impact performance. The Task loop implementation
itself relies on trait functions. As of rustc 1.65.0, async trait functions
are not available, forcing the functions to return boxed Futures, which
comes with heap allocation overhead and increased cache misses.
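To make the boxed Future point concrete, here is a minimal sketch using only the standard library. `SimpleTaskImpl` is a hypothetical, simplified trait and not the actual threadshare TaskImpl; `block_on` is a toy busy-polling executor that is only suitable for futures which are immediately ready.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// The kind of type a trait function has to return when `async fn` is not
/// available in traits: a heap-allocated, type-erased future.
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

/// Hypothetical, simplified task trait (not the threadshare `TaskImpl`).
trait SimpleTaskImpl {
    fn handle_item(&mut self, item: u32) -> BoxFuture<'_, u32>;
}

struct Counter {
    total: u32,
}

impl SimpleTaskImpl for Counter {
    fn handle_item(&mut self, item: u32) -> BoxFuture<'_, u32> {
        // Every call allocates the future on the heap.
        Box::pin(async move {
            self.total += item;
            self.total
        })
    }
}

/// Waker that does nothing, enough for the toy executor below.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor: polls in a loop until the future completes.
fn block_on<T>(mut future: BoxFuture<'_, T>) -> T {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(value) = future.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

fn main() {
    let mut counter = Counter { total: 0 };
    let first = block_on(counter.handle_item(2));
    let second = block_on(counter.handle_item(3));
    println!("totals after each item: {first}, {second}");
}
```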
ts-udpsink
ts-udpsink used to be based on the Task design. In this element, the sockets'
clients can be listed or modified via the element's properties. A command-oriented
channel made it possible to update the client list alongside the usual buffer
handling. While it works in practice, this design adds a bit of complexity.
Extending ts-standalone
In this MR, I decided to evaluate the impact of using a Task in a ts-sink
element. To this end, I implemented 2 additional sink elements which handle the
buffers and log statistics in the PadSinkHandler:
- One sink uses the sync Mutex from std. Note that this Mutex cannot be kept
  locked across await points, which is a no-go for some implementations.
- The other uses the async Mutex from the futures crate.
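The sync Mutex variant can be sketched as below, using only the standard library. `MutexSinkHandler` and `Stats` are hypothetical names and this is not the actual PadSinkHandler implementation; OS threads stand in for the streaming contexts. The point is that the state stays reachable from outside the handler, and that the lock is taken and released within a single synchronous scope, so it never spans an await point.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Statistics shared between the handler and whoever wants to read them.
#[derive(Default)]
struct Stats {
    buffer_count: u64,
    payload_sum: u64,
}

/// Hypothetical handler: processes buffers in place instead of forwarding
/// them to a task through a channel.
#[derive(Clone, Default)]
struct MutexSinkHandler {
    stats: Arc<Mutex<Stats>>,
}

impl MutexSinkHandler {
    fn handle_buffer(&self, payload: u64) {
        // The guard is dropped at the end of this function: the lock is only
        // held for the duration of the update.
        let mut stats = self.stats.lock().unwrap();
        stats.buffer_count += 1;
        stats.payload_sum += payload;
    }

    /// The state remains visible from outside the handler.
    fn snapshot(&self) -> (u64, u64) {
        let stats = self.stats.lock().unwrap();
        (stats.buffer_count, stats.payload_sum)
    }
}

fn main() {
    let handler = MutexSinkHandler::default();

    // Four producers share the same handler.
    let workers: Vec<_> = (0..4u64)
        .map(|id| {
            let handler = handler.clone();
            thread::spawn(move || {
                for _ in 0..1000 {
                    handler.handle_buffer(id);
                }
            })
        })
        .collect();

    for worker in workers {
        worker.join().expect("worker panicked");
    }

    let (count, sum) = handler.snapshot();
    println!("handled {count} buffers, payload sum {sum}");
}
```

An async Mutex would be used the same way, except that locking is itself awaited and the guard may then be held across further await points.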
I also tested the Mutex from the async-lock crate, but my tests didn't show
significant improvements over the one from the futures crate.
The diagrams below compare the 2 new Mutex-based sinks to the Task sink.
According to these tests, the difference in parked duration between the sync
and async Mutex is marginal.
As the number of streams increases, the parked duration for the Task variant
drops significantly compared to the Mutex variants. Note however that the
sink elements perform no async I/O or timer operations and that the
statistics are computed and logged by only one element regardless of the number
of streams. See below for tests on a closer-to-reality use case. I believe this
diagram is still valuable as an evaluation of the intrinsic impact of the
implementation.
Regardless of the number of streams, the latency goes from around 1.5µs for
both Mutex variants to about 5ms for the Task variant.
There might be better mechanisms for passing buffers than the async channel.
We only need to pass one item at a time. ts-standalone should make it easy
to experiment with alternative solutions to improve CPU usage and hopefully
reduce latency. Meanwhile, using a Mutex variant when applicable seems like a
better choice.
Migrating ts-udpsink back to a Mutex variant
Based on the above results, I decided to migrate ts-udpsink back to a Mutex
variant. For this element, we have no other choice but to use an async Mutex
since we need to hold the lock across the await points where the sockets send
the buffers.
I decided to create a model similar to ts-standalone using the existing
udpsrc-benchmark-sender and the benchmark receiver under the ts examples.
I also wanted to make sure the operating point was properly chosen for real-life
data. For this reason, I created the ts-audiotestsrc element, so that we
can listen to the buffers sent by ts-udpsink. I tuned the buffer duration to a
value which plays well with this model (10ms of mono raw audio, signed 16-bit
samples @44100 samples/s).
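For reference, the arithmetic behind this operating point can be worked out as follows. The helper names are hypothetical; only the format parameters (10ms, mono, 16-bit, 44100 samples/s) come from the text above.

```rust
/// Samples per channel contained in one buffer of `buffer_ms` milliseconds.
fn samples_per_buffer(rate_hz: u64, buffer_ms: u64) -> u64 {
    rate_hz * buffer_ms / 1000
}

/// Payload size in bytes of one raw audio buffer.
fn bytes_per_buffer(rate_hz: u64, buffer_ms: u64, channels: u64, bytes_per_sample: u64) -> u64 {
    samples_per_buffer(rate_hz, buffer_ms) * channels * bytes_per_sample
}

/// Buffers emitted per second by a single stream.
fn buffers_per_second(buffer_ms: u64) -> u64 {
    1000 / buffer_ms
}

fn main() {
    // 10 ms of mono, signed 16-bit (2 bytes) samples at 44100 samples/s.
    let (rate_hz, buffer_ms, channels, bytes_per_sample) = (44_100, 10, 1, 2);

    println!(
        "{} samples and {} bytes per buffer",
        samples_per_buffer(rate_hz, buffer_ms),
        bytes_per_buffer(rate_hz, buffer_ms, channels, bytes_per_sample)
    );

    // Aggregate buffer rate for the lowest and highest stream counts
    // used in the udp benchmark.
    for streams in [100u64, 500] {
        println!(
            "{streams} streams: {} buffers/s",
            streams * buffers_per_second(buffer_ms)
        );
    }
}
```

Each buffer therefore carries 441 samples, that is 882 bytes of raw payload, and each stream emits 100 buffers per second.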
Here, the overhead of the Task model still stands out, though it is not as
pronounced as in the ts-standalone case. I believe this is due to the fact
that the udp-sender has significant processing to perform, which flattens the
overhead of the runtime itself.
Additional details
For each test case, the tool is invoked 10 times so as to account for variations in the memory layouts and to flatten other external influences. Each iteration runs about 2 min worth of buffers.
Specs
- CPU: Core i5-7200U (2 physical cores)
- Performance governor
- Mem: 12 GB
- Linux kernel: 5.19.16-301.fc37.x86_64
- Airplane mode activated
- ulimit -Sn 30000
- rustc 1.65.0 (897e37553 2022-11-02)
- gst-plugin-rs base @ 7ac29827
- gstreamer-rs @ efb85f416e0edb44e6543e0123d5706de43c529b
Commands
ts-standalone
export GST_DEBUG=ts-standalone*:4
export GST_DEBUG_NO_COLOR=1
export BUFFERS=5000
export STREAMS=7000
export FILE=~/temp/ts-standalone.${STREAMS}x${BUFFERS}.log; mv ${FILE} ${FILE}.old; for SINK in sync-mutex async-mutex task; do echo "-----------"; echo $SINK; echo "-----------"; for i in {1..10}; do echo $i; target/release/examples/ts-standalone -n $BUFFERS -s $STREAMS --sink $SINK >>${FILE} 2>&1; done; done;
Then a log parser extracts the statistics and generates a CSV file.
udp benchmark
export GST_DEBUG=ts-audiotestsrc:4
export GST_DEBUG_NO_COLOR=1
export BUFFERS=10000
export FILE=~/temp/ts-udpsender.${BUFFERS}.log; mv ${FILE} ${FILE}.old; for STREAMS in 100 200 300 400 500; do echo "-----------"; echo $STREAMS; echo "-----------"; echo "-----------" >> ${FILE}; echo $STREAMS >> ${FILE}; echo "-----------" >> ${FILE}; for i in {1..10}; do echo $i; echo $i >> ${FILE}; target/release/examples/udpsrc-benchmark-sender ${STREAMS} test ${BUFFERS} >>${FILE} 2>&1; done; done;
Buffers are consumed in another terminal:
target/release/examples/benchmark 700 ts-udpsrc 1 20