
WIP: Optimize setting uniforms.

Ian Romanick requested to merge idr/mesa:wip/uniform-setter into main

I started working on this branch five years ago. At the time, I was trying to improve CPU performance on some of the early Atom CPUs with GPUs supported by the i965 driver. This was an exercise in pure frustration. Every change that I made had a random performance impact on the order of ±10%. This included changes to code that a particular benchmark never executed. I finally determined that those Atom CPUs are really, really sensitive to how functions and loops are aligned. Each change would affect the alignment of dozens of functions throughout the driver.

I tried the changes on Core systems, and I wasn't able to measure a significant change. It seemed to help, but all of the differences were within the range of run-to-run test noise.

So, I abandoned it for a long, long time. Now we have this new, shiny Iris driver that uses a lot less CPU, so there's some chance this might actually help. I've really just dusted off the original series by rebasing it onto over 17,000 new commits. I also fixed a couple of bugs that caused some piglit / CI failures. I have not done any new performance testing.

There are three things that happen in this series:

  1. Specialize some of the commonly used glUniform variants. Previously there were a small number of core functions that handled everything. This resulted in a lot of redundant or nonsensical checks. For example, the glUniform4iv path had checks to handle setting sampler uniforms, but that is not possible on that path. Specializing the functions grows the code, but deletes things from the execution paths.

  2. Implement a uniform cache. A shocking number of applications will repeatedly set uniforms to the values they already have. Sometimes this happens many times per frame. Each time that happens, a lot of unnecessary flushing, etc. occurs. The uniform cache just checks the new value against the old value and elides the work of changing the uniform when the values are the same.

  3. Add SSE2 optimizations. The SSE2 optimizations exist primarily to help the uniform cache checks. More could possibly be done here, but I don't know that it's worth the effort. I thought I had some changes that used later versions of SSE, but I cannot find that code now.
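
To illustrate items 2 and 3, here is a minimal sketch of a cached vec4 setter with an SSE2 compare and a scalar fallback. It is not code from this series: `struct vec4_slot`, `vec4_equal`, and `set_vec4_cached` are hypothetical names, and the storage layout is an assumption made only for the example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical storage for one vec4 uniform; not Mesa's actual layout. */
struct vec4_slot {
   uint32_t bits[4];
};

/* Return true when the 16 bytes at a and b are bitwise identical. */
static bool
vec4_equal(const void *a, const void *b)
{
#if defined(__SSE2__)
   /* Unaligned loads, so neither pointer needs 16-byte alignment. */
   const __m128i va = _mm_loadu_si128((const __m128i *) a);
   const __m128i vb = _mm_loadu_si128((const __m128i *) b);

   /* Every matching byte contributes one set bit to the 16-bit mask. */
   return _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) == 0xffff;
#else
   return memcmp(a, b, 16) == 0;
#endif
}

/* Hypothetical cached setter. Returns true only when the stored value
 * changed, i.e. when a caller would still have to flush and flag the
 * uniform state as dirty.
 */
static bool
set_vec4_cached(struct vec4_slot *slot, const void *value)
{
   if (vec4_equal(slot->bits, value))
      return false; /* Same value as before: skip all of the work. */

   memcpy(slot->bits, value, sizeof(slot->bits));
   return true;
}
```

The comparison is bitwise rather than a floating-point equality test, so 0.0 and -0.0 count as different values and a NaN matches an identical NaN. That errs on the side of doing the update, which is the safe direction for a cache like this.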

There is still one piglit failure on i965 and a small handful on Iris.
