rusticl: Clpeak gives wrong results with rusticl
When evaluating rusticl (version 23.1.9) in an embedded context (Khadas VIM3), I wanted to test clpeak (version 1.1.2). The results I got are implausibly high:
```
# RUSTICL_ENABLE=panfrost clpeak

Platform: rusticl
  Device: Mali-G52 (Panfrost)
    Driver version  : 23.1.8 (Linux ARM64)
    Compute units   : 2
    Clock frequency : 800 MHz

    Global memory bandwidth (GBPS)
      float   : 71392.41
      float2  : 78033.56
      float4  : 76695.85
      float8  : 79418.77
      float16 : 83886.08

    Single-precision compute (GFLOPS)
      float   : 358910.94
      float2  : 292838.69
      float4  : 296204.62
      float8  : 305329.44
      float16 : 298261.62

    No half precision support! Skipped
    No double precision support! Skipped

    Integer compute (GIOPS)
      int   : 450521.03
      int2  : 272985.22
      int4  : 285064.19
      int8  : 262957.16
      int16 : 276500.03

    Integer compute Fast 24bit (GIOPS)
      int   : 212622.12
      int2  : 242197.41
      int4  : 528069.75
      int8  : 252645.14
      int16 : 250679.03

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 5.00
      enqueueReadBuffer               : 5.01
      enqueueWriteBuffer non-blocking : 5.00
      enqueueReadBuffer non-blocking  : 5.01
      enqueueMapBuffer(for read)      : 7110.87
        memcpy from mapped ptr        : 4.95
      enqueueUnmap(after write)       : 7918.45
        memcpy to mapped ptr          : 4.74

    Kernel launch latency : 0.00 us
```
I think these values are so high because the measured time is too short. The executed code is the following (I haven't enabled the useEventTimer option):
(clpeak: src/clpeak.cpp)

```cpp
  if (useEventTimer)
  {
    ...
  }
  else // std timer
  {
    Timer timer;
    timer.start();
    for (uint i = 0; i < iters; i++)
    {
      queue.enqueueNDRangeKernel(kernel, cl::NullRange, globalSize, localSize);
      queue.flush();
    }
    queue.finish();
    timed = timer.stopAndTime();
  }

  return (timed / static_cast<float>(iters));
}
```
The command queue is flushed at each iteration of the loop, and then a finish() call is made to wait until all the commands have been executed. However, when using rusticl, calling flush() runs the following code:
(src/gallium/frontends/rusticl/api)

```rust
pub fn flush_queue(command_queue: cl_command_queue) -> CLResult<()> {
    // CL_INVALID_COMMAND_QUEUE if command_queue is not a valid host command-queue.
    command_queue.get_ref()?.flush(false)
}
```
(src/gallium/frontends/rusticl/core)

```rust
pub fn flush(&self, wait: bool) -> CLResult<()> {
    let mut p = self.pending.lock().unwrap();
    let events = p.clone();

    // This should never ever error, but if it does return an error
    self.chan_in
        .send((*p).drain(0..).collect())
        .map_err(|_| CL_OUT_OF_HOST_MEMORY)?;

    if wait {
        for e in events {
            e.wait();
        }
    }

    Ok(())
}
```
Each time flush is called, the commands contained in the command queue are sent over an mpsc channel. During this step, ownership of the commands is passed from the command queue to the associated device, which will execute them. The command queue is therefore empty after each iteration of the loop.
After the loop, when the finish function is called, the following code is executed:
(src/gallium/frontends/rusticl/api)

```rust
pub fn finish_queue(command_queue: cl_command_queue) -> CLResult<()> {
    // CL_INVALID_COMMAND_QUEUE if command_queue is not a valid host command-queue.
    let q = command_queue.get_ref()?;

    for q in q.dependencies_for_pending_events() {
        q.flush(false)?;
    }

    q.flush(true)
}
```
A new call to flush is made, this time with wait=true. However, the command queue is already empty because of the earlier flushes, so there is no event to wait for. The program finishes without waiting for the commands that were flushed, which leads to the aberrant benchmark values.
When I removed the flush in the loop, or replaced it with a finish, the values obtained were consistent.
I don't know whether the problem should be fixed in clpeak or in rusticl. The documentation does not clearly state whether finish is supposed to wait for commands submitted by an earlier flush call.