|
|
## Hardware info
|
|
|
|
|
|
See [RDNA2 hardware BVH](https://gitlab.freedesktop.org/bnieuwenhuizen/mesa/-/wikis/RDNA2-hardware-BVH)
|
|
|
|
|
|
## Traversal performance challenges
|
|
|
|
|
|
A simple traversal algorithm via DFS is something like this
|
|
|
```
|
|
|
void traverse(Node root, Ray R)
|
|
|
{
|
|
|
std::stack<pair<Node, uint32_t>> stack;
|
|
|
stack.push({root, 0});
|
|
|
|
|
|
double t_best_num = inf;
|
|
|
float t_best_denom = 1;
|
|
|
float I_best_num;
|
|
|
float J_best_num;
|
|
|
uint32_t triangle_best;
|
|
|
uint32_t geometry_best;
|
|
|
while(!stack.empty()) {
|
|
|
par<Node, uint32_t cur = stack.top();
|
|
|
stack.pop();
|
|
|
|
|
|
if (isGeometry(cur.first)) {
|
|
|
if (cull(R, cur.first)) /* check cull mask */
|
|
|
continue;
|
|
|
cur.second = cur.first.geometry;
|
|
|
cur.first = cur.first.geometryRoot;
|
|
|
}
|
|
|
|
|
|
Result res = intersect(R, cur.first);
|
|
|
if (isTriangle(cur.first)) {
|
|
|
if (abs(t_best_num * res.t_denom) > abs(res.t_num * best_t_denom)) {
|
|
|
t_best_num = res.t_num;
|
|
|
t_best_denom = res.t_denom;
|
|
|
I_best = res.I_num;
|
|
|
J_best = res.J_num;
|
|
|
triangle_best = cur.first.triangle_id;
|
|
|
}
|
|
|
} else {
|
|
|
if (res.box3 != 0xffffffff)
|
|
|
stack.push_back({res.box3, cur.second});
|
|
|
if (res.box2 != 0xffffffff)
|
|
|
stack.push_back({res.box2, cur.second});
|
|
|
if (res.box1 != 0xffffffff)
|
|
|
stack.push_back({res.box1, cur.second});
|
|
|
if (res.box0 != 0xffffffff)
|
|
|
stack.push_back({res.box0, cur.second});
|
|
|
}
|
|
|
}
|
|
|
|
|
|
if (!std::isinf(t_best)) {
|
|
|
reportClosestHit(t_best_num/t_best_denom,
|
|
|
I_best/t_best_denom,
|
|
|
J_best/t_best_denom,
|
|
|
geometry_best,
|
|
|
triangle_best);
|
|
|
} else {
|
|
|
reportMiss();
|
|
|
}
|
|
|
}
|
|
|
```
|
|
|
|
|
|
(also missing the leaf box handling, intersection and any-hit shaders)
|
|
|
|
|
|
Now this has a couple of performance challenges:
|
|
|
|
|
|
1. The stack depth is non-uniform so we're reading and writing non-uniformly indexed data. This is not supported for VGPRs, so we either have to have a select tree (or at similar cost move the entire stack up/down in VGPRs), a scalaring loop, use LDS or use VMEM. None of these solutions is particularly performant.
|
|
|
|
|
|
2. There is a bunch of divergence of control flow wrt.
|
|
|
|
|
|
a. How many iterations of the loop need to be done.
|
|
|
b. Whether the triangle/box/geometry path get taken.
|
|
|
|
|
|
3. In the triangle case on hit we require an extra load which with divergent nodes could mean we are limited by the 1 cacheline/cycle throughput and hence the load could be as expensive as the intersection. (Unless the HW can make the load piggyback off the intersection in the same cycle)
|
|
|
|
|
|
Note that in the inner loop we likely don't really have too much room to reshuffle as this could be ~50 ALU + ~50 VMEM cycles and a reshuffle is not trivial. (and for real benefits it needs to be cross-wavefront ...)
|
|
|
|
|
|
Maybe splitting intersection & any-hit shaders into an outer loop and reshuffling there based on shader id would make sense though.
|
|
|
|