
gallium: compute pbos

Mike Blumenkrantz requested to merge zmike/mesa:cs-pbo into main

I was waiting on various other things to merge before putting this up, but the time has finally come to talk about doing pbo downloads with compute shaders.

As seen in #4735 (pbobench MR pending), this provides a performance boost in many cases, and at larger texture sizes the improvement becomes much more noticeable (2x-10x faster).

Here are some quick answers to questions I'm expecting:

What's the unit test status of this?

It passes KHR-GL33.packed_pixels* in CTS as well as 99.9% of piglit tests.

The lone failure I've seen is the GL_ARB_texture_cube_map_array case in spec@arb_pixel_buffer_object@pbo-getteximage, which somehow isn't reading any data from the texture. Probably this has something to do with the coordinate components, since I'm trying to read the cube map array as a 2D array. No idea, but I'm guessing it's trivial to fix once someone spots it.
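For context on why the coordinates are suspect: when a cube map array is read as a 2D array, each cube layer expands to six consecutive 2D layers, one per face, so the face has to be folded into the layer coordinate. A minimal sketch of that mapping (hypothetical helper for illustration, not code from this MR):

```c
#include <assert.h>

/* Each cube layer covers six 2D-array layers, in face order
 * +X, -X, +Y, -Y, +Z, -Z. If the face term is dropped (or the
 * multiply is missing), the shader reads the wrong layer entirely,
 * which would look like "no data" from the texture. */
static unsigned
cube_array_to_2d_layer(unsigned cube_layer, unsigned face)
{
   assert(face < 6);
   return cube_layer * 6 + face;
}
```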

How can I test this?

This MR branch is hardcoded to use the new codepath, so you can just build it and run stuff to see if it blows up. I've used iris and radeonsi for my testing.

Your driver must support 8-bit and 16-bit ALU operations and storage, however, as one of my goals here was to keep the constant buffer within a single vec4 of data.
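To illustrate why sub-32-bit support matters: fitting all the per-download parameters into one vec4 (16 bytes) of constant buffer requires packing fields at 8/16-bit granularity. The layout below is an assumption for illustration only, not the MR's actual struct:

```c
#include <stdint.h>

/* Hypothetical packing of download parameters into 16 bytes.
 * With only 32-bit types, these fields would not fit in one vec4,
 * which is why the driver needs 8-bit/16-bit ALU and storage. */
struct pbo_shader_consts {
   uint16_t width;       /* texture width in pixels */
   uint16_t height;      /* texture height in pixels */
   uint16_t depth;       /* depth or layer count */
   uint16_t row_stride;  /* pixels per row in the destination buffer */
   uint32_t dst_offset;  /* byte offset into the PBO */
   uint8_t  block_size;  /* bytes per pixel block */
   uint8_t  flags;       /* format-conversion/swizzle flags */
   uint16_t pad;         /* pad out to exactly one vec4 */
};

_Static_assert(sizeof(struct pbo_shader_consts) == 16,
               "constants must fit in a single vec4");
```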

Why ubershaders?

Initially I started with much more specialized shaders, but this became a massive problem when running something like CTS: shader compile times reached 10-20 minutes because there were so many variants. I'm not sure exactly what we want to do here, but at the least the ubershaders are still fast enough to make this method a compelling alternative, so this seems like a reasonable conceptual starting point.

Any work remaining?

As mentioned, this is currently hardcoded to use the new codepath. I had imagined that this would be used through the following mechanism:

  • a new mode added to opt in to the compute codepath
  • possibly also a new hybrid mode that uses heuristics to determine the optimal mechanism (e.g., blit for small texture sizes and compute for large texture sizes)?
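The hybrid heuristic could be as simple as a size cutoff. A sketch of that idea, where the function name and the ~1 MiB threshold are made up for illustration and real tuning would be per-driver:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical heuristic for the proposed hybrid mode: small transfers
 * stay on the existing blit path, large transfers go to the compute
 * path, matching the 2x-10x wins seen at larger texture sizes. */
static bool
use_compute_pbo_download(unsigned width, unsigned height,
                         unsigned depth, unsigned bytes_per_pixel)
{
   /* Assumed cutoff: compute wins once the transfer exceeds ~1 MiB. */
   const uint64_t threshold = 1024 * 1024;
   const uint64_t size =
      (uint64_t)width * height * depth * bytes_per_pixel;
   return size >= threshold;
}
```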

There's also the GL_ARB_texture_cube_map_array thing.

Otherwise, this should work without issues as-is.


I wrote this a few months ago, and it's still a bit rough. Feedback would be appreciated.

Depends on !11982 (merged) and !11983 (merged).

