radv: Optimize rt-pipeline traversal loop
This adds a few optimizations to the rt-pipeline traversal loop that I found while writing !14565 (merged):
- Using an array as the traversal stack is actually faster than keeping the stack in shared memory. I don't know why yet.
- I moved a bit of code into an if statement, so it only runs when that condition is true.
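
For readers unfamiliar with the pattern, here is a rough C sketch of a traversal loop that keeps its stack in a per-invocation array and pushes children only inside the inner-node branch. It is not the NIR that radv emits: `struct node`, `count_leaves`, `STACK_SIZE` and `INVALID_NODE` are hypothetical names made up for illustration, and the node layout has nothing to do with the real BVH format.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_SIZE 32            /* hypothetical depth limit */
#define INVALID_NODE UINT32_MAX  /* hypothetical "no child" marker */

/* Hypothetical node layout for illustration only. */
struct node {
   uint32_t left;    /* index of left child, or INVALID_NODE */
   uint32_t right;   /* index of right child, or INVALID_NODE */
   uint32_t is_leaf; /* non-zero for leaf nodes */
};

/* Counts the leaves reachable from root using an array as the stack. */
static uint32_t
count_leaves(const struct node *nodes, uint32_t root)
{
   uint32_t stack[STACK_SIZE]; /* local array instead of shared memory */
   uint32_t top = 0;
   uint32_t leaves = 0;

   stack[top++] = root;
   while (top > 0) {
      const struct node *n = &nodes[stack[--top]];

      if (n->is_leaf) {
         leaves++;
      } else {
         /* The child pushes live inside this branch, so leaf
          * iterations skip them entirely. */
         if (n->right != INVALID_NODE) {
            assert(top < STACK_SIZE);
            stack[top++] = n->right;
         }
         if (n->left != INVALID_NODE) {
            assert(top < STACK_SIZE);
            stack[top++] = n->left;
         }
      }
   }
   return leaves;
}
```

Whether the compiler keeps such an array in registers or spills it to scratch depends on its size and on how it is indexed, so this sketch does not explain the speedup by itself.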
RX 6700 XT, Q2RTX: 26 fps -> 28 fps