Changes

Bas Nieuwenhuizen · 76fe3a84
--- a/Raytracing.md
+++ b/Raytracing.md
@@ -78,3 +78,40 @@ Note that in the inner loop we likely don't really have too much room to reshuff
 Maybe splitting intersection & any-hit shaders into an outer loop and reshuffling there based on shader id would make sense though.
+### Solving Loop Iteration Divergence
+What might be faster than full-blown reshuffling algorithms is some kind of work-stealing.
+Two considerations here:
+1. To prevent work-stealing overhead we should try to steal something higher in the stack than a leaf node. That way we won't have to run the work-stealing algorithm every iteration.
+2. For inline raytracing we may not be able to use LDS, so getting the merge-back of stolen work to work correctly can be a bit messy.
+### Solving Non-Uniform Stack Accesses
+A couple of possible solutions:
+1) Select-tree of size N. Takes (M+1)N instructions for M reads with consecutive indices and (M+1)N instructions for M writes. Prohibitive for large N.
+2) Move the stack up & down on push/pop. Needs only 1 instruction for reads and MN for M writes, but pushing/popping multiple entries at a time can result in code bloat & significant divergence. The NIR array lowering will not do this for us. Prohibitive for large N.
+3) LDS. Most flexible, but can't be large while keeping good occupancy. Also likely not allowed for inline raytracing, as the app can use the full LDS.
+4) VMEM. Slow and cache-thrashing, but otherwise ok.
+A good solution might be a combination of 1/2 + 4, keeping a small local part of the stack for frequent operations and then regularly pushing/popping big chunks into/from VMEM. Will need some work to avoid significant divergence.
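
The 2 + 4 combination could look roughly like the CPU-side sketch below. It is only an illustration under assumptions: `hybrid_stack`, `LOCAL_SIZE`, `CHUNK` and `BACKING_SIZE` are made-up names and sizes, the `backing` array merely stands in for per-lane VMEM scratch, and no attempt is made to model divergence between lanes. The local part is shifted with fixed loop bounds so that, once unrolled, every access uses a constant index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCAL_SIZE 8    /* hypothetical size of the register-resident part */
#define CHUNK 4         /* hypothetical number of entries moved per spill/refill */
#define BACKING_SIZE 64 /* stand-in for a per-lane VMEM scratch area */

struct hybrid_stack {
    uint32_t local[LOCAL_SIZE];     /* local[0] is always the top of the stack */
    uint32_t local_count;           /* number of valid entries in local[] */
    uint32_t backing[BACKING_SIZE]; /* spilled entries, oldest at index 0 */
    uint32_t backing_count;
};

static void hybrid_stack_init(struct hybrid_stack *s)
{
    for (uint32_t i = 0; i < LOCAL_SIZE; i++)
        s->local[i] = 0;
    s->local_count = 0;
    s->backing_count = 0;
}

/* Returns false if the backing store would overflow. */
static bool hybrid_stack_push(struct hybrid_stack *s, uint32_t value)
{
    assert(s->local_count <= LOCAL_SIZE);
    if (s->local_count == LOCAL_SIZE) {
        if (s->backing_count + CHUNK > BACKING_SIZE)
            return false;
        /* Spill the CHUNK oldest local entries in one go, oldest first. */
        for (uint32_t i = 0; i < CHUNK; i++)
            s->backing[s->backing_count++] = s->local[LOCAL_SIZE - 1 - i];
        s->local_count -= CHUNK;
    }
    /* Move the whole local part down by one slot (constant indices once unrolled). */
    for (uint32_t i = LOCAL_SIZE - 1; i > 0; i--)
        s->local[i] = s->local[i - 1];
    s->local[0] = value;
    s->local_count++;
    return true;
}

/* Returns false if the stack is empty. */
static bool hybrid_stack_pop(struct hybrid_stack *s, uint32_t *value)
{
    /* The local part is only ever empty when the backing store is empty too. */
    if (s->local_count == 0)
        return false;
    *value = s->local[0];
    /* Move the whole local part up by one slot. */
    for (uint32_t i = 0; i + 1 < LOCAL_SIZE; i++)
        s->local[i] = s->local[i + 1];
    s->local_count--;
    if (s->local_count == 0 && s->backing_count > 0) {
        /* Refill a chunk; the newest spilled entry becomes the new top. */
        uint32_t n = s->backing_count < CHUNK ? s->backing_count : CHUNK;
        for (uint32_t i = 0; i < n; i++)
            s->local[i] = s->backing[--s->backing_count];
        s->local_count = n;
    }
    return true;
}
```

Pushing 20 values and popping them again returns them in reverse order, with the backing array only touched once per `CHUNK` operations in the steady state. Whether spilling the oldest entries or refilling only when the local part is empty is the right policy for a real traversal is an open question here.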
+### Should we retest parent box nodes?
+The current algorithm can have effectively 3 nodes per BVH level on the stack which is quite inefficient and will blow up our stack size.
+Furthermore, if we have a hit in a child we can reduce the ray extent. If we hit in the closest child of a node, maybe we do not need to visit the other children anymore. In that case it would be more efficient to test the parent instead of all the other child nodes.
+So if, besides the first child, we put a node "retest parent from child offset M" on the stack, then we can (a) reduce the stack space to 1 node/level and (b) maybe avoid some ray/box intersection checks.
+Of course, if we don't cull any boxes this way then this can be a net negative in the number of box nodes processed.
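
To make the stack discipline concrete, here is a deliberately simplified sketch of the "retest parent" idea. Everything in it is an assumption made for illustration: the BVH is one-dimensional (child bounds are intervals, the ray starts at 0 and points towards +x), primitives are points, and names such as `trace_1d`, `pick_child` and `LEAF_BIT` are hypothetical. Each stack entry means "retest this node and continue with the child of the given rank in near-to-far order", so a node never has more than one entry on the stack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WIDTH 4               /* children per box node */
#define INVALID 0xFFFFFFFFu   /* unused child slot */
#define LEAF_BIT 0x80000000u  /* child is a primitive, not a box node */
#define MAX_STACK 64

struct node {
    float lo[WIDTH], hi[WIDTH]; /* 1D bounds of each child */
    uint32_t child[WIDTH];      /* node index, LEAF_BIT | primitive id, or INVALID */
};

struct entry { uint32_t node; uint32_t rank; }; /* "retest node from rank" */

static bool child_hit(const struct node *n, int i, float tmax)
{
    return n->child[i] != INVALID && n->hi[i] >= 0.0f && n->lo[i] <= tmax;
}

static float child_dist(const struct node *n, int i)
{
    return n->lo[i] > 0.0f ? n->lo[i] : 0.0f;
}

/* Index of the child with the given rank among the children hit with the
 * current tmax, ordered by (distance, index), or -1 if there is none.
 * Shrinking tmax only removes the farthest hits, so ranks stay stable. */
static int pick_child(const struct node *n, float tmax, uint32_t rank)
{
    for (int i = 0; i < WIDTH; i++) {
        if (!child_hit(n, i, tmax))
            continue;
        uint32_t before = 0;
        for (int j = 0; j < WIDTH; j++) {
            if (j == i || !child_hit(n, j, tmax))
                continue;
            float di = child_dist(n, i), dj = child_dist(n, j);
            if (dj < di || (dj == di && j < i))
                before++;
        }
        if (before == rank)
            return i;
    }
    return -1;
}

/* Returns the closest hit distance (or the initial tmax on a miss) and
 * counts how often a box node was tested or retested. */
static float trace_1d(const struct node *nodes, uint32_t root, float tmax,
                      unsigned *box_tests)
{
    struct entry stack[MAX_STACK];
    unsigned sp = 0;
    stack[sp++] = (struct entry){root, 0};
    while (sp > 0) {
        struct entry e = stack[--sp];
        const struct node *n = &nodes[e.node];
        (*box_tests)++;
        int i = pick_child(n, tmax, e.rank); /* retest with the current tmax */
        if (i < 0)
            continue; /* remaining children are culled or do not exist */
        assert(sp + 2 <= MAX_STACK);
        if (e.rank + 1 < WIDTH)
            stack[sp++] = (struct entry){e.node, e.rank + 1};
        if (n->child[i] & LEAF_BIT) {
            float t = child_dist(n, i);
            if (t < tmax)
                tmax = t; /* shorten the ray extent */
        } else {
            stack[sp++] = (struct entry){n->child[i], 0};
        }
    }
    return tmax;
}
```

The `box_tests` counter makes the trade-off visible: every pop is a retest of a whole box node, so when no child gets culled by the shortened extent the count is higher than with the push-all-children approach.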