broadcom/compiler: track pending ldtmu count with each TMU lookup
Use this information when scheduling QPU instructions to avoid merging a new TMU request into a previous ldtmu instruction when doing so may overflow the TMU output FIFO because that ldtmu stalls.
The underlying issue is that if the ldtmu can stall, there is no guarantee that it completes (and frees its output FIFO space) before the new lookup completes, so the two would race.
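
As a rough illustration of the scheduling rule described above, the check could look something like the sketch below. The struct names, field names and the function are hypothetical and are not taken from the compiler sources. The sketch also assumes a conservative policy: any pending ldtmu recorded with the lookup blocks the merge.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-lookup record, not the compiler's real data structure. */
struct tmu_lookup_info {
        /* Number of ldtmu instructions still outstanding (results not yet
         * read from the TMU output FIFO) when this lookup is requested.
         */
        uint32_t pending_ldtmu_count;
};

/* Hypothetical view of an instruction that has already been scheduled. */
struct sched_inst_info {
        bool has_ldtmu; /* the instruction carries an ldtmu */
};

/* Returns true if the new TMU request may share an instruction with prev.
 *
 * Assumed policy: if prev carries an ldtmu and earlier results are still
 * pending, the ldtmu may stall without freeing output FIFO space before the
 * new lookup completes, so the merge is refused.
 */
static bool
can_merge_tmu_request(const struct tmu_lookup_info *lookup,
                      const struct sched_inst_info *prev)
{
        if (!prev->has_ldtmu)
                return true;

        return lookup->pending_ldtmu_count == 0;
}
```

With this sketch, a lookup recorded with two pending ldtmus would not be merged into an instruction that carries an ldtmu, while the same lookup could still be merged into an instruction without one.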