... | ... | @@ -104,6 +104,8 @@ Couple of solutions: |
|
|
|
|
|
A good solution might be a combination of 1/2 + 4: keep a small local part of the stack for frequent operations and regularly push/pop big chunks into/from VMEM. This will need some work to avoid significant divergence.
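A minimal CPU-side sketch of that combined scheme, to make the push/pop pattern concrete. Everything here is an assumption for illustration: the names (`hybrid_stack`, `hs_push`, `hs_pop`), the sizes (`LOCAL_CAP`, `CHUNK`, `VMEM_CAP`), and the plain array standing in for VMEM. A real shader would likely keep the local part in registers or LDS and use a ring buffer instead of `memmove`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LOCAL_CAP 8   /* hypothetical size of the small local part */
#define CHUNK     4   /* hypothetical spill/refill granularity */
#define VMEM_CAP  64  /* hypothetical capacity of the backing store */

struct hybrid_stack {
    uint32_t local[LOCAL_CAP]; /* fast part, used by every push/pop */
    uint32_t local_count;
    uint32_t vmem[VMEM_CAP];   /* stands in for per-ray VMEM scratch */
    uint32_t vmem_count;       /* always a multiple of CHUNK */
    uint32_t spills, refills;  /* counters, only for inspection */
};

static void hs_init(struct hybrid_stack *s)
{
    memset(s, 0, sizeof(*s));
}

/* Returns 1 on success, 0 if the backing store is full. */
static int hs_push(struct hybrid_stack *s, uint32_t node)
{
    if (s->local_count == LOCAL_CAP) {
        if (s->vmem_count + CHUNK > VMEM_CAP)
            return 0;
        /* Spill the CHUNK oldest entries (bottom of the local part). */
        memcpy(&s->vmem[s->vmem_count], s->local, CHUNK * sizeof(uint32_t));
        s->vmem_count += CHUNK;
        /* Slide the remaining entries down to the bottom. */
        memmove(s->local, &s->local[CHUNK],
                (LOCAL_CAP - CHUNK) * sizeof(uint32_t));
        s->local_count -= CHUNK;
        s->spills++;
    }
    s->local[s->local_count++] = node;
    return 1;
}

/* Returns 1 and writes *node on success, 0 if the stack is empty. */
static int hs_pop(struct hybrid_stack *s, uint32_t *node)
{
    if (s->local_count == 0) {
        if (s->vmem_count == 0)
            return 0;
        assert(s->vmem_count % CHUNK == 0);
        /* Refill the local part with the most recently spilled chunk. */
        s->vmem_count -= CHUNK;
        memcpy(s->local, &s->vmem[s->vmem_count], CHUNK * sizeof(uint32_t));
        s->local_count = CHUNK;
        s->refills++;
    }
    *node = s->local[--s->local_count];
    return 1;
}
```

With these sizes, the backing store is only touched once per `CHUNK` pushes or pops in the worst case, and not at all while the depth stays within `LOCAL_CAP`. The sketch says nothing about divergence: on a GPU each lane would hit the spill/refill branch at a different time, which is the part that still needs work.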
|
|
|
|
|
|
Stackless traversal is possible, but (a) we might not be able to fit a parent pointer in the fp16 box nodes without significant overhead, (b) it would need 1 load per level (instead of 1 load + 1 store per M (M=4/8?) levels in the combined solution above), and (c) it needs a bunch of logic to figure out the next child, which would probably involve doing the intersection test again, which leads us to ...
|
|
|
|
|
|
### Should we retest parent box nodes?
|
|
|
|
|
|
The current algorithm can effectively have 3 nodes per BVH level on the stack, which is quite inefficient and will blow up our stack size.
|
... | ... | |