I did some microbenchmarking of the `image_bvh_intersect_ray` instruction. This is done using a shader that has a loop with

```
image_bvh64_intersect_ray v[16:19], v[0:15], s[8:11]
s_waitcnt 0
```

(unrolled 32 times). Every node in a wave64 is in a different cache line, but the BVH pointer stays constant across iterations, so the nodes are likely in cache.

# The Basics

To start with, the 1 node per CU per cycle throughput from marketing materials seems correct. With an execution mask of all ones I get ~2.65 Ginsns/sec (or ~169 Gnodes/sec). With an execution mask of `0xAAAA...` I get ~5.3 Ginsns/sec (still ~169 Gnodes/sec). Basically any exec mask with a decent number of lanes enabled gives perfect scaling.
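
As a sanity check, the lane math behind those numbers works out as follows (a quick model using only the measurements above; each instruction intersects one node per active lane):

```python
def gnodes_per_sec(ginsns_per_sec: float, active_lanes: int) -> float:
    # Each image_bvh*_intersect_ray instruction processes one node
    # per lane that is enabled in the exec mask.
    return ginsns_per_sec * active_lanes

# Full wave64: all 64 lanes enabled.
print(gnodes_per_sec(2.65, 64))  # → ~169.6
# exec = 0xAAAA...: every other lane, so 32 of 64 enabled.
print(gnodes_per_sec(5.3, 32))   # → ~169.6
```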

# Node types

There are 2 node types that always just get `0xffffffff` outputs from the intersection instruction, and for which we actually don't need the instruction: 6 & 7, i.e. instance and AABB leaves. However, selecting these node types doesn't seem to make the instruction execute faster.
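
For reference, a tiny helper for the "skip types 6 and 7" idea. The assumption here (based on my reading of the RDNA BVH node pointer encoding, not stated in this post) is that the node type sits in the low 3 bits of the node pointer:

```python
# Assumption: node type is encoded in the low 3 bits of the node pointer
# (6 = instance, 7 = AABB leaf, matching the list above).
def node_type(node_ptr: int) -> int:
    return node_ptr & 0x7

def needs_intersect(node_ptr: int) -> bool:
    # Types 6 and 7 always produce 0xffffffff, so the instruction
    # is effectively a no-op for them.
    return node_type(node_ptr) < 6
```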

This would suggest the obvious improvement of surrounding the instruction with an `if (node_type < 6)`, but a quick attempt resulted in no improvement on Q2RTX.

TODO: test mixed workloads.

# Few lanes enabled

With one lane enabled we only get ~14 Ginsns/sec, which is a far cry from the ~169 Ginsns/sec one would expect with perfect scaling. In fact, below 12 enabled lanes you don't get any further improvement. Even worse, the granularity is 12 lanes per half in wave64 (unless a half has 0 lanes enabled, in which case that half can be disabled entirely). So a mask of `exec_lo = 0x1` and `exec_hi = 0x1` only gives you ~7 Ginsns/sec.

However, the hardware doesn't seem to care how the enabled lanes are distributed within each half.
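
Putting the numbers in this section together, the behavior fits a simple model where each wave64 half with any lanes enabled is charged for at least 12 node slots. This is a hypothetical model fitted to the measurements above, not documented hardware behavior:

```python
def model_ginsns_per_sec(lanes_lo: int, lanes_hi: int,
                         peak_gnodes: float = 169.6) -> float:
    """Model: a half with any lanes enabled costs max(lanes, 12) node slots;
    a fully disabled half costs nothing."""
    def cost(lanes: int) -> int:
        return 0 if lanes == 0 else max(lanes, 12)
    total = cost(lanes_lo) + cost(lanes_hi)
    return peak_gnodes / total

print(round(model_ginsns_per_sec(32, 32), 2))  # → 2.65 (full wave64)
print(round(model_ginsns_per_sec(1, 0), 1))    # → 14.1 (one lane, one half)
print(round(model_ginsns_per_sec(1, 1), 1))    # → 7.1 (exec_lo = exec_hi = 0x1)
```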