anv: hw descriptor state mismatch in supertuxkart
Strap on your crash helmet, this one's painful.
I've found some kind of weird corner case where some of the hardware states related to descriptors (`3DSTATE_CONSTANT_VS`, `flush_descriptor_sets()`) get out of sync with the command buffer state or something. The method for reproducing is simple:
- clone the debug branch I've made for this ticket (https://gitlab.freedesktop.org/zmike/mesa/-/commits/watkart) and build zink+anv like usual
- run `MESA_LOADER_DRIVER_OVERRIDE=zink supertuxkart --track="ravenbridge_mansion" -R`
- wait for craziness
What seems to be happening is that zink's caching of descriptor sets, combined with its weird outboard compute cmdbuf, is screwing up the hardware states. The debug branch above is hacked to use up to 2 descriptor sets: one for samplers (only in GFX cmdbufs) and one for everything else (compute will always use one set in this branch). Compared to `HEAD~2`, there are noticeable rendering regressions that aren't reproducible on RADV or lavapipe.
This is a tough problem, so I've added a bunch of debugging facilities to the branch to aid with testing, all in the form of environment variables which can be toggled to change runtime behavior:
- `ZINK_ONE_SET` forces zink to go back to using a single descriptor set at all times while using all the same codepaths; this restores previous (good) behavior, though it also doesn't actually do much reusing of descriptor sets
- `ZINK_ALWAYS_UPDATE` effectively disables caching, forcing `vkUpdateDescriptorSets` to be called even if no descriptors have changed; this has no effect other than to prove that updating doesn't resolve the issue
- `ZINK_NO_COMPUTE` disables compute extension support, forcing supertuxkart to use a different renderer; the good behavior is restored in this case, which proves that (a) the caching/reuse is not an issue and (b) this is likely somehow triggered by the compute batch's existence
Furthermore, the branch will print to stdout all the binding values for sampler descriptors along with the descriptor set. Using `diff` on the outputs will reveal that they are identical save for the descriptor set. Similarly, all shader output is identical with and without `ZINK_ONE_SET`.
Red herrings:
- don't bother checking validation errors; there's a bunch of them, but none are new in the commit triggering the issue (`HEAD~1`) or related to the issue
- barriers seem good, and I've tried jamming in tons of manual ones to verify just for hahas
- fencing is also fine, as this branch forces an explicit fence for every single scanout frame
Solutions I've found (not actual solutions, but ones which mitigate/resolve the issue):
- forcing `cmd_buffer->state.descriptors_dirty = VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT` on no-op graphics pipeline update (i.e., `old_pipeline == new_pipeline`)
- forcing `cmd_buffer->state.descriptors_dirty = VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT` on descriptor set binding
- forcing `VK_SHADER_STAGE_VERTEX_BIT` in this block from `genX_cmd_buffer.c` mitigates the issue somewhat but doesn't fully resolve it:
```c
/* We emit the binding tables and sampler tables first, then emit push
 * constants and then finally emit binding table and sampler table
 * pointers. It has to happen in this order, since emitting the binding
 * tables may change the push constants (in case of storage images). After
 * emitting push constants, on SKL+ we have to emit the corresponding
 * 3DSTATE_BINDING_TABLE_POINTER_* for the push constants to take effect.
 */
uint32_t dirty = 0;
if (descriptors_dirty) {
   dirty = flush_descriptor_sets(cmd_buffer,
                                 &cmd_buffer->state.gfx.base,
                                 descriptors_dirty,
                                 pipeline->shaders,
                                 ARRAY_SIZE(pipeline->shaders));
   cmd_buffer->state.descriptors_dirty &= ~dirty;
   /* wat */
   dirty |= VK_SHADER_STAGE_VERTEX_BIT;
}
```
I think that's everything I know at this point. I'm on GEN11 in case that matters.