

## Hardware info






See [RDNA2 hardware BVH](https://gitlab.freedesktop.org/bnieuwenhuizen/mesa//wikis/RDNA2hardwareBVH)






## Traversal performance challenges






A simple traversal algorithm via DFS is something like this



```



void traverse(Node root, Ray R)



{



std::stack<pair<Node, uint32_t>> stack;



stack.push({root, 0});






double t_best_num = inf;



float t_best_denom = 1;



float I_best_num;



float J_best_num;



uint32_t triangle_best;



uint32_t geometry_best;



while(!stack.empty()) {



par<Node, uint32_t cur = stack.top();



stack.pop();






if (isGeometry(cur.first)) {



if (cull(R, cur.first)) /* check cull mask */



continue;



cur.second = cur.first.geometry;



cur.first = cur.first.geometryRoot;



}






Result res = intersect(R, cur.first);



if (isTriangle(cur.first)) {



if (abs(t_best_num * res.t_denom) > abs(res.t_num * best_t_denom)) {



t_best_num = res.t_num;



t_best_denom = res.t_denom;



I_best = res.I_num;



J_best = res.J_num;



triangle_best = cur.first.triangle_id;



}



} else {



if (res.box3 != 0xffffffff)



stack.push_back({res.box3, cur.second});



if (res.box2 != 0xffffffff)



stack.push_back({res.box2, cur.second});



if (res.box1 != 0xffffffff)



stack.push_back({res.box1, cur.second});



if (res.box0 != 0xffffffff)



stack.push_back({res.box0, cur.second});



}



}






if (!std::isinf(t_best)) {



reportClosestHit(t_best_num/t_best_denom,



I_best/t_best_denom,



J_best/t_best_denom,



geometry_best,



triangle_best);



} else {



reportMiss();



}



}



```






(also missing the leaf box handling, intersection and anyhit shaders)






Now this has a couple of performance challenges:






1. The stack depth is nonuniform so we're reading and writing nonuniformly indexed data. This is not supported for VGPRs, so we either have to have a select tree (or at similar cost move the entire stack up/down in VGPRs), a scalaring loop, use LDS or use VMEM. None of these solutions is particularly performant.






2. There is a bunch of divergence of control flow wrt.






a. How many iterations of the loop need to be done.



b. Whether the triangle/box/geometry path get taken.






3. In the triangle case on hit we require an extra load which with divergent nodes could mean we are limited by the 1 cacheline/cycle throughput and hence the load could be as expensive as the intersection. (Unless the HW can make the load piggyback off the intersection in the same cycle)






Note that in the inner loop we likely don't really have too much room to reshuffle as this could be ~50 ALU + ~50 VMEM cycles and a reshuffle is not trivial. (and for real benefits it needs to be crosswavefront ...)






Maybe splitting intersection & anyhit shaders into an outer loop and reshuffling there based on shader id would make sense though.



