A good solution might be a combination of 1/2 + 4: keep a small local part of the stack for frequent operations, and regularly push/pop big chunks into/from VMEM. This will need some work to avoid significant divergence.
(for non-inline we might just do LDS + VMEM since that is way more efficient and we have 32 dwords/lane)
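To make the combined idea concrete, here is a rough single-lane sketch in plain C, not actual driver or shader code. Everything in it is an assumption for illustration: the names (`hybrid_stack`, `hs_push`, `hs_pop`), the sizes (`LOCAL_CAP`, `CHUNK`, `SPILL_CAP`), and the use of an ordinary array as a stand-in for VMEM. It ignores divergence entirely, since it only models one lane.

```c
#include <stdint.h>
#include <string.h>

#define LOCAL_CAP 8   /* hypothetical size of the small fast stack */
#define CHUNK     4   /* hypothetical number of entries moved per spill/refill */
#define SPILL_CAP 64  /* hypothetical size of the array standing in for VMEM */

typedef struct {
    uint32_t local[LOCAL_CAP];  /* stand-in for the small local part */
    uint32_t local_count;
    uint32_t spill[SPILL_CAP];  /* stand-in for the VMEM part */
    uint32_t spill_count;       /* always a multiple of CHUNK */
    uint32_t spill_stores;      /* number of chunk stores to the slow part */
    uint32_t spill_loads;       /* number of chunk loads from the slow part */
} hybrid_stack;

static void hs_init(hybrid_stack *s)
{
    memset(s, 0, sizeof(*s));
}

/* Push one node id. Returns 0 if the backing store is exhausted. */
static int hs_push(hybrid_stack *s, uint32_t node)
{
    if (s->local_count == LOCAL_CAP) {
        if (s->spill_count + CHUNK > SPILL_CAP)
            return 0;
        /* Spill the CHUNK oldest entries (bottom of the local part). */
        memcpy(&s->spill[s->spill_count], s->local, CHUNK * sizeof(uint32_t));
        s->spill_count += CHUNK;
        /* Slide the remaining entries down to the bottom. */
        memmove(s->local, s->local + CHUNK,
                (LOCAL_CAP - CHUNK) * sizeof(uint32_t));
        s->local_count -= CHUNK;
        s->spill_stores++;
    }
    s->local[s->local_count++] = node;
    return 1;
}

/* Pop one node id. Returns 0 if the whole stack is empty. */
static int hs_pop(hybrid_stack *s, uint32_t *node)
{
    if (s->local_count == 0) {
        if (s->spill_count == 0)
            return 0;
        /* Refill the local part with the most recently spilled chunk. */
        s->spill_count -= CHUNK;
        memcpy(s->local, &s->spill[s->spill_count], CHUNK * sizeof(uint32_t));
        s->local_count = CHUNK;
        s->spill_loads++;
    }
    *node = s->local[--s->local_count];
    return 1;
}
```

With these made-up sizes, pushing 20 entries and popping them all again costs 3 chunk stores and 3 chunk loads rather than one slow access per entry. Leaving half of the local part in place after a spill also means a push/pop pattern that hovers around the boundary does not spill on every operation.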
Stackless traversal is possible, but (a) we might not be able to fit a parent pointer in the fp16 box nodes without significant overhead, (b) it would need 1 load per level (instead of 1 load + 1 store per M levels (M = 4/8?) in the combined solution above), and (c) it needs a bunch of logic to figure out the next child, which would probably involve doing the intersection test again, which leads us to ...