radeonsi: RustiCL (ACO) vs Clover (LLVM)
My latest numbers for my poor Intel Xeon X3470 (Nehalem), 3 GHz, 4/8 c/t,
GFX8 (Polaris 20, 8 GB), PCIe 2 system:
clpeak
Platform: Clover
Device: AMD Radeon RX 580 Series (radeonsi, polaris10, ACO, DRM 3.59, 6.12.6-1.gfb072de-default)
Driver version : 25.0.0-devel (Linux x64)
Compute units : 36
Clock frequency : 1411 MHz
Global memory bandwidth (GBPS)
float : 2.64
float2 : 2.65
float4 : 2.65
float8 : 2.17
float16 : 2.05
Single-precision compute (GFLOPS)
float : 3209.49
float2 : 3208.70
float4 : 3205.00
float8 : 3193.70
float16 : 3158.80
No half precision support! Skipped
Double-precision compute (GFLOPS)
double : 403.84
double2 : 403.80
double4 : 403.25
double8 : 401.71
double16 : 390.15
Integer compute (GIOPS)
int : 1260.00
int2 : 1236.25
int4 : 1253.34
int8 : 1251.42
int16 : 1250.63
Integer compute Fast 24bit (GIOPS)
int : 5529.18
int2 : 5352.50
int4 : 5265.23
int8 : 5216.86
int16 : 5109.00
Integer char (8bit) compute (GIOPS)
char : 6093.27
char2 : 3527.38
char4 : 3490.04
char8 : 3268.79
char16 : 3262.62
Integer short (16bit) compute (GIOPS)
short : 6000.48
short2 : 3774.82
short4 : 3531.09
short8 : 3488.43
short16 : 3497.31
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 5.04
enqueueReadBuffer : 5.07
enqueueWriteBuffer non-blocking : 5.04
enqueueReadBuffer non-blocking : 5.07
enqueueMapBuffer(for read) : 3154.82
memcpy from mapped ptr : 5.05
enqueueUnmap(after write) : 3852.68
memcpy to mapped ptr : 5.03
Kernel launch latency : 240.69 us
Platform: rusticl
Device: AMD Radeon RX 580 Series (radeonsi, polaris10, ACO, DRM 3.59, 6.12.6-1.gfb072de-default)
Driver version : 25.0.0-devel (git-2bb6db3f) (Linux x64)
Compute units : 36
Clock frequency : 1411 MHz
Global memory bandwidth (GBPS)
float : 184.33
float2 : 180.36
float4 : 186.15
float8 : 174.00
float16 : 181.05
Single-precision compute (GFLOPS)
float : 6193.49
float2 : 6173.33
float4 : 5958.12
float8 : 5918.83
float16 : 5827.80
No half precision support! Skipped
Double-precision compute (GFLOPS)
double : 401.21
double2 : 401.20
double4 : 400.05
double8 : 398.89
double16 : 397.50
Integer compute (GIOPS)
int : 1249.90
int2 : 1243.64
int4 : 1242.55
int8 : 1241.08
int16 : 1240.84
Integer compute Fast 24bit (GIOPS)
int : 1246.70
int2 : 1240.93
int4 : 1240.65
int8 : 1239.58
int16 : 1241.88
Integer char (8bit) compute (GIOPS)
char : 1028.33
char2 : 5739.72
char4 : 5444.48
char8 : 5432.31
char16 : 5397.22
Integer short (16bit) compute (GIOPS)
short : 1009.90
short2 : 5577.88
short4 : 5304.85
short8 : 5393.45
short16 : 5353.42
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 4.68
enqueueReadBuffer : 4.76
enqueueWriteBuffer non-blocking : 4.73
enqueueReadBuffer non-blocking : 4.79
enqueueMapBuffer(for read) : 3.45
memcpy from mapped ptr : 4.89
enqueueUnmap(after write) : 4.85
memcpy to mapped ptr : 4.95
Kernel launch latency : 61.91 us
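
To make the two runs easier to compare, here is a small sketch that computes RustiCL/Clover ratios for a few scalar rows copied from the output above. It also estimates the theoretical FP32 peak, assuming 64 lanes per compute unit and 2 FLOP per lane per clock (one FMA), which is the usual figure for GCN parts. The dictionary keys and helper names are made up for illustration, and the peak is only a rough ceiling, since the reported 1411 MHz may not be the sustained clock.

```python
# Scalar rows copied from the clpeak output above.
# Keys are illustrative names, not clpeak identifiers.
clover = {
    "global_bw_float_gbps": 2.64,
    "sp_float_gflops": 3209.49,
    "dp_double_gflops": 403.84,
    "int_giops": 1260.00,
    "int24_giops": 5529.18,
    "launch_latency_us": 240.69,
}
rusticl = {
    "global_bw_float_gbps": 184.33,
    "sp_float_gflops": 6193.49,
    "dp_double_gflops": 401.21,
    "int_giops": 1249.90,
    "int24_giops": 1246.70,
    "launch_latency_us": 61.91,
}


def ratio(metric: str) -> float:
    """RustiCL value divided by Clover value for one metric."""
    return rusticl[metric] / clover[metric]


# Assumption: 64 lanes per CU and 2 FLOP per lane per clock (FMA).
COMPUTE_UNITS = 36    # from the clpeak output
CLOCK_GHZ = 1.411     # from the clpeak output
LANES_PER_CU = 64     # assumed (GCN)
FLOP_PER_LANE = 2     # assumed (one FMA per clock)

peak_fp32_gflops = COMPUTE_UNITS * LANES_PER_CU * FLOP_PER_LANE * CLOCK_GHZ


def fp32_efficiency(platform: dict) -> float:
    """Measured scalar FP32 throughput as a fraction of the estimated peak."""
    return platform["sp_float_gflops"] / peak_fp32_gflops


if __name__ == "__main__":
    for metric in clover:
        print(f"{metric:24s} rusticl/clover = {ratio(metric):8.2f}")
    print(f"estimated FP32 peak: {peak_fp32_gflops:.1f} GFLOPS")
    print(f"clover  FP32 efficiency: {fp32_efficiency(clover):.1%}")
    print(f"rusticl FP32 efficiency: {fp32_efficiency(rusticl):.1%}")
```

Note that for launch latency a ratio below 1 favours RustiCL, while for the throughput rows a ratio above 1 does.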