Skip to content

nak,nir: Convert to unstructured control flow in NIR

Faith Ekstrand requested to merge gfxstrand/mesa:nak/uniform-cf into main

NVIDIA Turing and later hardware really wants to be unstructured. Instead of providing structure, it provides these warp barriers which allow you to force re-convergence. Apart from those, however, it looks just like a CPU with a single (possibly predicated) branch instruction. Getting proper re-convergence with this scheme is tricky and the analysis required to do it is best done in NIR. Doing so requires us to first beef up NIR's unstructured control-flow support.

Details

The previous NAK approach was to leave control-flow structured and sprinkle through bar_set/break/sync_nv intrinsics to force re-convergence after if ladders and loops. Unfortunately, this solution was pretty fundamentally flawed and the correct fix is more complicated.

The problem lies entirely with bar_break_nv. We were calling this intrinsic right before the break or continue jump to break out of any nested control-flow structures. So, for a multiply nested divergent if ladder inside a loop, break would invoke bar_break_nv for each if and then for the loop before doing the actual jump. This sounds good in theory until you realize that the entire state for the break is stored in per-lane GPU registers.

If you look at bar_break_nv, you'll see that it takes a barrier and returns another barrier. The semantics are a sort of &= operation but that's the best way to represent it in SSA. There's then a phi at the top of the loop which we hope back-end RA is able to coalesce properly. However, because the breaks only happened on the break path in the control-flow, the result of the break never makes it back into the loop for the next iteration. This means the loop always tries to sync on the original lane set, not the lane set with the break lanes removed. Somehow this didn't hang and I still don't have a solid answer for why.

So how do we fix this? Well, that depends very much on what the barriers actually do. Our current mental model (which I believe is correct) of the barrier instructions is that they're just a fancy version of ballot. Importantly, I don't think that, apart from that ballot op, there is any cross-lane communication within a warp internal to the break instruction. The barrier register contains a lane mask per-lane and oeprations are per-lane as well. Importantly, if you do a bar_break_nv, only the active have their view updated. This means that, in order to break any lanes out of a barrier, all lanes involved in the barrier need to be present at the bar_break_nv.

What does this mean practically? It means that in order to break we actually need to do a sort of cascade:

loop {
   if %c {
      break;
   }
}

becomes

%loop_bar = bar_set_nv
loop {
   %cont_bar = phi %loop_bar %cont_brk_bar
   if %c {
      goto cont_sync;
   }
cont_sync:
   bar_sync_nv %cont_bar
   %cont_brk_bar = bar_break_nv %cont_bar %c
   break_if %c
}
bar_sync_nv %loop_bar

In other words, in order for the break to work properly, we have to first merge all loop lanes at the continue point, break out of our barrier and then let the break lanes actually break. This way, when the other lanes loop back around, they know which lanes broke out of the loop so they can properly sync.

Of course, this all gets more complicated when there are multiple levels of nesting. For that, we track the scope depth and each break op breaks to a given depth which is tracked in a reg. After we sync, we check if any lanes need to break to a higher scope and, if they do, we jump to the next scope. This cascade ensures that we sync at each level and that every lane knows what all lanes it's grouped with so we can sync properly.

Edited by Faith Ekstrand

Merge request reports