iris: a few local memory prep patches
Here are a few patches from Mark, taken from our internal tree, that we could upstream today. They help us make the right decisions about uploading data to LMEM (device-local memory) vs. SMEM (system memory), and they also reduce the number of BOs per batch by using larger ones, which should be more CPU-efficient.