A good solution might be a combination of 1/2 + 4: keep a small local part of the stack for frequent operations and periodically push/pop big chunks into/from VMEM. This will need some work to avoid significant divergence.
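As a rough illustration of the combined approach, here is a minimal CPU-side sketch in C. All names and sizes (`hybrid_stack`, `LOCAL_CAP`, `CHUNK`, `SPILL_CAP`) are hypothetical, and the `spill` array only stands in for VMEM; it does not model divergence between lanes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LOCAL_CAP 8   /* assumed size of the small local part */
#define CHUNK     4   /* assumed number of entries moved per spill/refill */
#define SPILL_CAP 256 /* assumed size of the backing store (stand-in for VMEM) */

typedef struct {
    uint32_t local[LOCAL_CAP]; /* fast part, used by every push/pop */
    uint32_t local_count;
    uint32_t spill[SPILL_CAP]; /* slow part, only touched in CHUNK-sized moves */
    uint32_t spill_count;      /* always a multiple of CHUNK */
    uint32_t spill_ops;        /* number of chunk transfers, for inspection */
} hybrid_stack;

static void hs_init(hybrid_stack *s)
{
    s->local_count = 0;
    s->spill_count = 0;
    s->spill_ops = 0;
}

static void hs_push(hybrid_stack *s, uint32_t node)
{
    if (s->local_count == LOCAL_CAP) {
        /* Local part is full: move the CHUNK oldest entries out at once. */
        assert(s->spill_count + CHUNK <= SPILL_CAP);
        memcpy(&s->spill[s->spill_count], s->local, CHUNK * sizeof(uint32_t));
        s->spill_count += CHUNK;
        /* Shift the remaining entries down. A shader would more likely use
         * a ring index here; memmove just keeps the sketch short. */
        memmove(s->local, &s->local[CHUNK],
                (LOCAL_CAP - CHUNK) * sizeof(uint32_t));
        s->local_count -= CHUNK;
        s->spill_ops++;
    }
    s->local[s->local_count++] = node;
}

/* Returns 1 and writes *node on success, 0 if the whole stack is empty. */
static int hs_pop(hybrid_stack *s, uint32_t *node)
{
    if (s->local_count == 0) {
        if (s->spill_count == 0)
            return 0;
        /* Local part is empty: bring back the most recently spilled chunk. */
        s->spill_count -= CHUNK;
        memcpy(s->local, &s->spill[s->spill_count], CHUNK * sizeof(uint32_t));
        s->local_count = CHUNK;
        s->spill_ops++;
    }
    *node = s->local[--s->local_count];
    return 1;
}
```

With these sizes, pushing 20 entries and popping them all again touches the backing store only 6 times (3 spills, 3 refills) rather than once per operation.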
Stackless traversal is possible, but (a) we might not be able to fit a parent pointer in the fp16 box nodes without significant overhead, (b) it would need 1 load per level (instead of 1 load + 1 store per M levels (M = 4 or 8?) in the combined solution above), and (c) it needs a bunch of logic to figure out the next child, which would probably involve doing the intersection test again, which leads us to ...
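To make points (b) and (c) concrete, here is a minimal sketch in C of parent-pointer traversal over a 4-wide tree. The `node` layout, `WIDTH`, `INVALID`, and the `hit_from_mask` callback (a stand-in for the ray/box test) are all assumptions for illustration, not the actual node format.

```c
#include <stdint.h>

#define WIDTH   4
#define INVALID 0xffffffffu

/* Hypothetical node layout with an explicit parent index. */
typedef struct {
    uint32_t parent;       /* INVALID for the root */
    uint32_t child[WIDTH]; /* INVALID for unused slots */
} node;

typedef int (*hit_fn)(uint32_t node_index, void *user);

/* Stand-in for the ray/box test: bit n of the mask says whether node n is hit.
 * Only valid for node indices below 32. */
static int hit_from_mask(uint32_t n, void *user)
{
    return (int)((*(const uint32_t *)user >> n) & 1u);
}

/* First hit child of n at slot >= from, or INVALID if there is none. */
static uint32_t next_hit_child(const node *nodes, uint32_t n, uint32_t from,
                               hit_fn hit, void *user)
{
    for (uint32_t i = from; i < WIDTH; i++) {
        uint32_t c = nodes[n].child[i];
        if (c != INVALID && hit(c, user))
            return c;
    }
    return INVALID;
}

/* Visits all hit nodes in pre-order without a stack.
 * Writes up to out_cap visited indices to out and returns the visit count. */
static uint32_t traverse_stackless(const node *nodes, uint32_t root,
                                   hit_fn hit, void *user,
                                   uint32_t *out, uint32_t out_cap)
{
    uint32_t count = 0;
    if (!hit(root, user))
        return 0;

    uint32_t cur = root;
    for (;;) {
        if (count < out_cap)
            out[count] = cur;
        count++;

        /* Going down: one node load per level. */
        uint32_t next = next_hit_child(nodes, cur, 0, hit, user);
        while (next == INVALID) {
            if (cur == root)
                return count;
            /* Going up: reload the parent, find which slot we came from,
             * then test the remaining siblings. With a real box node, where
             * all children are tested together, this repeats that test. */
            uint32_t p = nodes[cur].parent;
            uint32_t slot = 0;
            while (nodes[p].child[slot] != cur)
                slot++;
            next = next_hit_child(nodes, p, slot + 1, hit, user);
            cur = p;
        }
        cur = next;
    }
}
```

Note that every step back up reloads the parent and evaluates the hit test on its remaining children, which is the retesting cost mentioned in (c).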
### Should we retest parent box nodes?
The current algorithm can effectively have 3 nodes per BVH level on the stack, which is quite inefficient and will blow up our stack size.