Changes

Bas Nieuwenhuizen · ce67a5c3
--- a/Raytracing.md
+++ b/Raytracing.md
+## Hardware info
+See [RDNA2 hardware BVH](https://gitlab.freedesktop.org/bnieuwenhuizen/mesa/-/wikis/RDNA2-hardware-BVH)
+## Traversal performance challenges
+A simple traversal algorithm via DFS is something like this
+```
+void traverse(Node root, Ray R)
+{
+   std::stack<pair<Node, uint32_t>> stack;
+   stack.push({root, 0});
+   double t_best_num = inf;
+   float t_best_denom = 1;
+   float I_best_num;
+   float J_best_num;
+   uint32_t triangle_best;
+   uint32_t geometry_best;
+   while(!stack.empty()) {
+      par<Node, uint32_t cur = stack.top();
+      stack.pop();
+      if (isGeometry(cur.first)) {
+        if (cull(R, cur.first)) /* check cull mask */
+          continue;
+        cur.second = cur.first.geometry;
+        cur.first = cur.first.geometryRoot;
+      }
+      Result res = intersect(R, cur.first);
+      if (isTriangle(cur.first)) {
+         if (abs(t_best_num * res.t_denom) > abs(res.t_num * best_t_denom)) {
+            t_best_num = res.t_num;
+            t_best_denom = res.t_denom;
+            I_best = res.I_num;
+            J_best = res.J_num;
+            triangle_best = cur.first.triangle_id;
+         }
+      } else {
+         if (res.box3 != 0xffffffff)
+            stack.push_back({res.box3, cur.second});
+         if (res.box2 != 0xffffffff)
+            stack.push_back({res.box2, cur.second});
+         if (res.box1 != 0xffffffff)
+            stack.push_back({res.box1, cur.second});
+         if (res.box0 != 0xffffffff)
+            stack.push_back({res.box0, cur.second});
+      }
+   }
+   if (!std::isinf(t_best)) {
+      reportClosestHit(t_best_num/t_best_denom,
+                       I_best/t_best_denom,
+                       J_best/t_best_denom,
+                       geometry_best,
+                       triangle_best);
+   } else {
+      reportMiss();
+   }
+}
+```
+(also missing the leaf box handling, intersection and any-hit shaders)
+Now this has a couple of performance challenges:
+1. The stack depth is non-uniform so we're reading and writing non-uniformly indexed data. This is not supported for VGPRs, so we either have to have a select tree (or at similar cost move the entire stack up/down in VGPRs), a scalaring loop, use LDS or use VMEM. None of these solutions is particularly performant.
+2. There is a bunch of divergence of control flow  wrt.
+   a. How many iterations of the loop need to be done.
+   b. Whether the triangle/box/geometry path get taken.
+3. In the triangle case on hit we require an extra load which with divergent nodes could mean we are limited by the 1 cacheline/cycle throughput and hence the load could be as expensive as the intersection. (Unless the HW can make the load piggyback off the intersection in the same cycle)
+Note that in the inner loop we likely don't really have too much room to reshuffle as this could be ~50 ALU + ~50 VMEM cycles and a reshuffle is not trivial. (and for real benefits it needs to be cross-wavefront ...)
+Maybe splitting intersection & any-hit shaders into an outer loop and reshuffling there based on shader id would make sense though.