
amd: implement a universal optimized compute image clear/blit shader and MSAA-resolving pixel shader

Marek Olšák requested to merge mareko/mesa:compute-blit into main

This MR optimizes all image clears, blits, and MSAA resolves. Most of the MR consists of radeonsi changes, but the last 3 commits move a lot of that work to amd/common, including the computation of compute dispatch parameters.

General ideas:

  • The CB_RESOLVE path, compute_image_fast_clear path, and compute blit path (supporting image clears and blits) get a new fail_if_slow parameter. If one of them fails, the next method is tried. If all of them fail, the pixel shader path is the fastest option.
  • The compute image clear/blit shader itself is optimized for Navi31. The selection of the fastest path is optimized for all generations (gfx6-11).
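
The fallback order described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `blit_path_fn` type, `try_paths_in_order`, and the two stub paths are hypothetical names, not functions from the MR; only the `fail_if_slow` parameter and the "try the next method, else use the pixel shader" behavior come from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical signature: a path returns false when it cannot handle the
 * operation, or when fail_if_slow is set and it expects to be slow. */
typedef bool (*blit_path_fn)(const void *op, bool fail_if_slow);

/* Try each path in order with fail_if_slow = true.
 * Returns the index of the first path that accepted the operation, or
 * num_paths if every path declined. In the latter case the caller would
 * fall back to the pixel shader path. */
static size_t try_paths_in_order(const blit_path_fn *paths, size_t num_paths,
                                 const void *op)
{
   assert(paths != NULL || num_paths == 0);

   for (size_t i = 0; i < num_paths; i++) {
      if (paths[i](op, true))
         return i;
   }
   return num_paths;
}

/* Stub paths for illustration only. */
static bool path_declines(const void *op, bool fail_if_slow)
{
   (void)op;
   (void)fail_if_slow;
   return false;
}

static bool path_accepts(const void *op, bool fail_if_slow)
{
   (void)op;
   (void)fail_if_slow;
   return true;
}
```

With the paths listed in the order CB_RESOLVE, compute_image_fast_clear, compute blit, a return value equal to the number of paths would mean "use the pixel shader path".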

Compute blit shader design:

  • It supports any non-scaled blit, including upside down flipping and resolving. It also supports clears.
  • Clearing/blitting 16B per lane usually has the best performance, which means the compute shader must clear/blit multiple pixels per lane. That's 1024B per wave64. For MSAA resolving, 32B read and 8B written per lane is one of the options. The best options are in the code.
  • The block of pixels that's processed by each lane can be 4x1, 2x2, 1x2, or even 2x1x2. Different options have different performance. The best options are in the code.
  • If a workgroup touches a 256B block, it must store the whole block and it must be the only workgroup touching that block. The code has logic that does this.
  • The workgroup size is variable. The most common options are 8x8, 4x4x4, and 64x1, but it could be anything if the clear/blit area is narrow.
  • VMEM clauses are a must. D16 and A16 are recommended for lower VGPR usage if allowed. Resolving in FP16 is recommended if allowed. Not loading/storing/resolving components that the format doesn't have is recommended.
  • The compute shader is obviously procedurally generated and its generated variants are cached.
  • Depth/stencil is not supported since a draw-based approach likely beats compute.
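
The 16B-per-lane arithmetic from the list above can be worked through in a few lines. This is a minimal sketch, assuming a simple "16 divided by the pixel size" rule; the helper names are hypothetical, and the actual per-format choices in the MR may differ (the description only says that the best options are in the code).

```c
#include <assert.h>

enum {
   TARGET_BYTES_PER_LANE = 16, /* from the description: 16B per lane */
   WAVE64_LANES = 64,
};

/* Assumed rule: how many pixels one lane would process to reach 16B,
 * clamped to at least 1 pixel for formats wider than 16B. */
static unsigned pixels_per_lane(unsigned bytes_per_pixel)
{
   assert(bytes_per_pixel > 0);

   unsigned n = TARGET_BYTES_PER_LANE / bytes_per_pixel;
   return n ? n : 1;
}

/* Bytes touched by one wave64 when every lane handles bytes_per_lane. */
static unsigned bytes_per_wave64(unsigned bytes_per_lane)
{
   return bytes_per_lane * WAVE64_LANES;
}
```

For a 4-byte-per-pixel format this gives 4 pixels per lane, which matches block shapes such as 4x1 or 2x2, and 16B per lane across 64 lanes gives the 1024B per wave64 mentioned above.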

Performance findings on the top GPUs of each gfx version (lower chips may differ):

  • If the tiling is thin, the pixel shader path for image clears, blits, and MSAA resolving usually outperforms all paths including the complicated compute shader in this MR.
  • In a few cases, the pixel shader path also outperforms the DCC comp_to_single clear.
  • The pixel shader path almost always outperforms the fixed-func CB_RESOLVE path.
  • If the tiling is thick or linear or the copy is L2T or T2L, the compute shader path usually outperforms all paths.
  • The compute shader and the pixel shader trade blows, with the pixel shader being faster in the more common cases.
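
The findings above can be condensed into a rough rule of thumb. The sketch below is not the selection logic from the MR; the enum and function names are hypothetical, L2T/T2L are read here as linear-to-tiled and tiled-to-linear copies, and "usually" in the findings means real cases can go the other way.

```c
#include <stdbool.h>

/* Hypothetical classification of the image tiling. */
enum tiling_kind { TILING_LINEAR, TILING_THIN, TILING_THICK };

enum preferred_path { PREFER_PIXEL_SHADER, PREFER_COMPUTE_SHADER };

/* Rule of thumb from the findings: thick or linear tiling, or a copy
 * between linear and tiled images, usually favors the compute shader;
 * thin tiling usually favors the pixel shader. */
static enum preferred_path likely_faster_path(enum tiling_kind tiling,
                                              bool is_l2t_or_t2l_copy)
{
   if (is_l2t_or_t2l_copy)
      return PREFER_COMPUTE_SHADER;

   if (tiling == TILING_THICK || tiling == TILING_LINEAR)
      return PREFER_COMPUTE_SHADER;

   return PREFER_PIXEL_SHADER;
}
```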

This depends on !28845 (merged).

Edited by Marek Olšák
