I did some microbenchmarking of the `image_bvh_intersect_ray` instruction. This is done using a shader that has a loop with

```
image_bvh64_intersect_ray v[16:19], v[0:15], s[8:11]
s_waitcnt 0
```

(unrolled 32 times). Every node in a wave64 is in a different cache line, but the BVH pointer stays constant across iterations, so the nodes are likely in cache.

# The Basics

To start with, the 1 node per CU per cycle throughput from marketing materials seems correct. With an execution mask of all ones I get ~2.65 Ginsns/sec (or ~169 Gnodes/sec). With an execution mask of `0xAAAA...` I get ~5.3 Ginsns/sec (still ~169 Gnodes/sec). Basically any exec mask with a decent number of lanes enabled gives perfect scaling.
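
As a sanity check, the lane math behind those numbers works out as follows (a quick model using only the measurements above; each instruction intersects one node per active lane):

```python
def gnodes_per_sec(ginsns_per_sec: float, active_lanes: int) -> float:
    # Each image_bvh*_intersect_ray instruction processes one node
    # per lane that is enabled in the exec mask.
    return ginsns_per_sec * active_lanes

# Full wave64: all 64 lanes enabled.
print(gnodes_per_sec(2.65, 64))  # → ~169.6
# exec = 0xAAAA...: every other lane, so 32 of 64 enabled.
print(gnodes_per_sec(5.3, 32))   # → ~169.6
```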

# Node types

There are 2 node types that always just get `0xffffffff` outputs from the intersection instruction, and for which we actually don't need the instruction: 6 & 7, i.e. instance and AABB leaves. However, selecting these node types doesn't seem to make the instruction execute faster.
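
For reference, a tiny helper for the "skip types 6 and 7" idea. The assumption here (based on my reading of the RDNA BVH node pointer encoding, not stated in this post) is that the node type sits in the low 3 bits of the node pointer:

```python
# Assumption: node type is encoded in the low 3 bits of the node pointer
# (6 = instance, 7 = AABB leaf, matching the list above).
def node_type(node_ptr: int) -> int:
    return node_ptr & 0x7

def needs_intersect(node_ptr: int) -> bool:
    # Types 6 and 7 always produce 0xffffffff, so the instruction
    # is effectively a no-op for them.
    return node_type(node_ptr) < 6
```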

This would suggest the obvious improvement of surrounding the instruction with an `if (node_type < 6)`, but a quick attempt resulted in no improvement on Q2RTX.

TODO: test mixed workloads.

# Few lanes enabled

With one lane enabled we only get ~14 Ginsns/sec, which is a far cry from the ~169 Ginsns/sec one would expect with perfect scaling. In fact, below 12 enabled lanes you don't get any further improvement. Even worse, the granularity is 12 lanes per half in wave64 (unless a half has 0 lanes enabled, in which case that half can be disabled entirely). So a mask of `exec_lo = 0x1` and `exec_hi = 0x1` only gives you ~7 Ginsns/sec.

However, the hardware doesn't seem to care how the enabled lanes are distributed within each half.
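
Putting the numbers in this section together, the behavior fits a simple model where each wave64 half with any lanes enabled is charged for at least 12 node slots. This is a hypothetical model fitted to the measurements above, not documented hardware behavior:

```python
def model_ginsns_per_sec(lanes_lo: int, lanes_hi: int,
                         peak_gnodes: float = 169.6) -> float:
    """Model: a half with any lanes enabled costs max(lanes, 12) node slots;
    a fully disabled half costs nothing."""
    def cost(lanes: int) -> int:
        return 0 if lanes == 0 else max(lanes, 12)
    total = cost(lanes_lo) + cost(lanes_hi)
    return peak_gnodes / total

print(round(model_ginsns_per_sec(32, 32), 2))  # → 2.65 (full wave64)
print(round(model_ginsns_per_sec(1, 0), 1))    # → 14.1 (one lane, one half)
print(round(model_ginsns_per_sec(1, 1), 1))    # → 7.1 (exec_lo = exec_hi = 0x1)
```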