@@ -104,6 +104,8 @@ Couple of solutions:

A good solution might be a combination of 1/2 + 4: keep a small local part of the stack for frequent operations and regularly push/pop big chunks to/from VMEM. This will need some work to avoid significant divergence.
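A minimal CPU-side sketch of that combined scheme, to make the chunked spilling concrete. Everything here is hypothetical and only for illustration: the `hybrid_stack` struct, the `hs_*` helpers and the `LOCAL_CAP`/`CHUNK` sizes are made up, the `spill` array merely stands in for VMEM scratch, and a real shader would more likely use a ring buffer than the `memmove` below.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* All names and sizes are assumptions for this sketch, not driver code. */
#define LOCAL_CAP 8   /* entries kept in the small local part */
#define CHUNK     4   /* entries moved per spill/refill */
#define SPILL_CAP 256 /* plain array standing in for VMEM scratch */

struct hybrid_stack {
   uint32_t local[LOCAL_CAP];
   uint32_t local_count;
   uint32_t spill[SPILL_CAP];
   uint32_t spill_count;  /* always a multiple of CHUNK */
   uint32_t spill_stores; /* chunk transfers local -> spill */
   uint32_t spill_loads;  /* chunk transfers spill -> local */
};

static void hs_init(struct hybrid_stack *s)
{
   memset(s, 0, sizeof(*s));
}

static void hs_push(struct hybrid_stack *s, uint32_t v)
{
   if (s->local_count == LOCAL_CAP) {
      /* Local part is full: move the CHUNK oldest entries out in one go. */
      assert(s->spill_count + CHUNK <= SPILL_CAP);
      memcpy(&s->spill[s->spill_count], s->local, CHUNK * sizeof(uint32_t));
      s->spill_count += CHUNK;
      /* Slide the newer entries down to the bottom of the local part. */
      memmove(s->local, &s->local[CHUNK],
              (LOCAL_CAP - CHUNK) * sizeof(uint32_t));
      s->local_count -= CHUNK;
      s->spill_stores++;
   }
   s->local[s->local_count++] = v;
}

/* Returns 1 and writes *out on success, 0 if the whole stack is empty. */
static int hs_pop(struct hybrid_stack *s, uint32_t *out)
{
   if (s->local_count == 0) {
      if (s->spill_count == 0)
         return 0;
      /* Local part is empty: bring back the most recently spilled chunk. */
      s->spill_count -= CHUNK;
      memcpy(s->local, &s->spill[s->spill_count], CHUNK * sizeof(uint32_t));
      s->local_count = CHUNK;
      s->spill_loads++;
   }
   *out = s->local[--s->local_count];
   return 1;
}
```

With these sizes, pushing 20 entries costs 3 chunk stores and popping them all back costs 3 chunk loads; every other push/pop touches only the local part. The sketch is single-lane, so it says nothing about the divergence problem mentioned above.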

(For non-inline, we might just do LDS + VMEM, since that is way more efficient and we have 32 dwords/lane.)

Stackless traversal is possible, but (a) we might not be able to fit a parent pointer in the fp16 box nodes without significant overhead, (b) it would need 1 load per level (instead of 1 load + 1 store per M levels, M=4/8?, in the combined solution above), and (c) it needs a bunch of logic to figure out the next child, which would probably involve doing the intersection test again, which leads us to ...
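To illustrate points (b) and (c), here is a rough sketch of parent-pointer traversal on a plain binary tree. It is a generic textbook-style walk, not the actual BVH node layout: the `node` struct, `demo_tree` and `stackless_visit` are hypothetical, there is no ray/box test at all, and every child is treated as "hit".

```c
#include <assert.h>
#include <stdint.h>

#define NO_NODE (-1)

/* Hypothetical node layout for this sketch: explicit parent pointer,
 * two child slots. */
struct node {
   int32_t parent;
   int32_t child[2];
};

/* Complete binary tree with 7 nodes: 0 -> (1, 2), 1 -> (3, 4), 2 -> (5, 6). */
static const struct node demo_tree[7] = {
   {NO_NODE, {1, 2}},
   {0, {3, 4}},
   {0, {5, 6}},
   {1, {NO_NODE, NO_NODE}},
   {1, {NO_NODE, NO_NODE}},
   {2, {NO_NODE, NO_NODE}},
   {2, {NO_NODE, NO_NODE}},
};

/* Walks the tree depth-first without a stack. The only state is the current
 * node and the node we just came from. Writes the first-visit order to
 * `order`, counts every node fetch in *node_loads, and returns the number
 * of nodes visited. */
static uint32_t stackless_visit(const struct node *nodes, int32_t root,
                                int32_t *order, uint32_t order_cap,
                                uint32_t *node_loads)
{
   uint32_t n = 0;
   int32_t cur = root;
   int32_t prev = NO_NODE;

   while (cur != NO_NODE) {
      const struct node *nd = &nodes[cur]; /* one load per step */
      int32_t next;
      (*node_loads)++;

      if (prev == nd->parent) {
         /* Arrived from above: first visit, descend if possible. */
         assert(n < order_cap);
         order[n++] = cur;
         if (nd->child[0] != NO_NODE)
            next = nd->child[0];
         else if (nd->child[1] != NO_NODE)
            next = nd->child[1];
         else
            next = nd->parent;
      } else if (prev == nd->child[0] && nd->child[1] != NO_NODE) {
         /* Came back from the first child: go to the second one. */
         next = nd->child[1];
      } else {
         /* Came back from the last child: go up. */
         next = nd->parent;
      }

      prev = cur;
      cur = next;
   }
   return n;
}
```

On `demo_tree` this fetches each inner node three times (once on the way down, once after each child) and each leaf once. In a real traversal, the "which child is next" step would also need to know which children the ray actually hit, so the box test would have to be redone or its result stored somewhere on the way back up.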

### Should we retest parent box nodes?