NIR / glsl: Optimize soft fp64

Ian Romanick requested to merge idr/mesa:review/optimize-soft64 into master

The main purpose of this series is to improve the compilation performance of fp64 tests on platforms that use the soft-fp64 path. Many of these tests generate enormous numbers of instructions with soft-fp64. The OpenGL CTS test KHR-GL46.gpu_shader_fp64.builtin.inverse_dmat4 compiles to 97,676 instructions on Gen11. Because the shaders are very large and have many basic blocks, they spill heavily. The aforementioned test has 398 spills and 2,041 fills on Gen11. These factors cause the compiler to spend a huge amount of time in the register allocator.

The key observation that motivates this series is (spoiler alert!) that flow control is bad for GPUs. Flow control is removed in two ways. First, the soft-fp64 code is modified to be more GPU friendly. This code is a C-to-GLSL port of the Berkeley SoftFloat library. That library is highly optimized for CPUs, but GPUs require different optimization strategies. Nearly every commit in the series that modifies src/compiler/glsl/float64.glsl (the commits prefixed with soft-fp64) makes these source-level optimizations.
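
To give the flavor of those source-level changes, here is a hypothetical example (not a commit from this series; the helper names are modeled on SoftFloat's shift-right-jamming routines rather than taken from float64.glsl):

    /* CPU-style: early-out on the wide-shift case.  On a GPU, divergent
     * invocations serialize on this branch.  Precondition: 1 <= dist <= 63. */
    uint
    __shift_right_jam32_cpu(uint v, int dist)
    {
       if (dist >= 32)
          return v != 0u ? 1u : 0u;

       return (v >> dist) | uint((v << (32 - dist)) != 0u);
    }

    /* GPU-style: compute both results in one basic block and select.  The
     * shift amounts are clamped so both sides are well defined, and the
     * ternary becomes a flat bcsel in NIR instead of a branch. */
    uint
    __shift_right_jam32_gpu(uint v, int dist)
    {
       uint jam = uint((v << ((32 - dist) & 31)) != 0u);
       uint narrow = (v >> min(dist, 31)) | jam;

       return dist >= 32 ? uint(v != 0u) : narrow;
    }

The branchy version is the right call on a CPU, where the early return skips work; on a SIMD GPU the flat version wins because no invocation can skip work that its neighbors in the same group still need.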

Second, a number of micro-optimizations are added to nir_opt_algebraic. These optimizations try to reduce the size of small basic blocks so that they can be flattened by nir_opt_peephole_select.
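
The rules themselves are (search-expression, replacement) pairs in nir_opt_algebraic.py that pattern-match NIR, but a hypothetical example of the flavor, rendered as GLSL for readability:

    /* Hypothetical rule, shown at the GLSL level rather than as the
     * nir_opt_algebraic.py tuple that would actually implement it: */
    bool
    le_zero_before(int a)
    {
       return a < 0 || a == 0;   /* ilt + ieq + ior: three ALU ops */
    }

    bool
    le_zero_after(int a)
    {
       return a <= 0;            /* a single ige: one ALU op */
    }

Shaving two ops from a then- or else-block like this can push it under nir_opt_peephole_select's block-size threshold, at which point the whole if/else collapses into straight-line code ending in a bcsel.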

These combine to produce some pretty significant results. On Gen11, KHR-GL46.gpu_shader_fp64.builtin.inverse_dmat4 is reduced from 97,676 instructions / 398 spills / 2,041 fills to 67,545 instructions / 368 spills / 1,155 fills.

For more complete data on the fp64 tests, I extracted shaders from KHR-GL46.gpu_shader_fp64.builtin.*, KHR-GL46.gpu_shader_fp64.fp64.conversions, and KHR-GL46.gpu_shader_fp64.fp64.operators, and I ran shader-db on the 306 collected shaders. Step-by-step results are included in each commit. The results across the whole MR are:

Tiger Lake
total instructions in shared programs: 936596 -> 658127 (-29.73%)
instructions in affected programs: 930744 -> 652275 (-29.92%)
helped: 178
HURT: 0
helped stats (abs) min: 1 max: 31351 x̄: 1564.43 x̃: 348
helped stats (rel) min: 1.33% max: 86.36% x̄: 25.43% x̃: 29.65%
95% mean confidence interval for instructions value: -2030.85 -1098.02
95% mean confidence interval for instructions %-change: -27.51% -23.34%
Instructions are helped.

total cycles in shared programs: 7323908 -> 5344600 (-27.03%)
cycles in affected programs: 7273874 -> 5294566 (-27.21%)
helped: 169
HURT: 0
helped stats (abs) min: 2 max: 205170 x̄: 11711.88 x̃: 2424
helped stats (rel) min: 0.66% max: 91.19% x̄: 27.99% x̃: 27.87%
95% mean confidence interval for cycles value: -15101.19 -8322.58
95% mean confidence interval for cycles %-change: -30.93% -25.05%
Cycles are helped.

total spills in shared programs: 635 -> 445 (-29.92%)
spills in affected programs: 635 -> 445 (-29.92%)
helped: 3
HURT: 0

total fills in shared programs: 2064 -> 1323 (-35.90%)
fills in affected programs: 2064 -> 1323 (-35.90%)
helped: 3
HURT: 0


Ice Lake
total instructions in shared programs: 930721 -> 653447 (-29.79%)
instructions in affected programs: 925439 -> 648165 (-29.96%)
helped: 178
HURT: 0
helped stats (abs) min: 1 max: 30131 x̄: 1557.72 x̃: 349
helped stats (rel) min: 1.35% max: 88.37% x̄: 25.45% x̃: 29.68%
95% mean confidence interval for instructions value: -2014.52 -1100.92
95% mean confidence interval for instructions %-change: -27.51% -23.39%
Instructions are helped.

total cycles in shared programs: 7359162 -> 5341227 (-27.42%)
cycles in affected programs: 7309062 -> 5291127 (-27.61%)
helped: 169
HURT: 0
helped stats (abs) min: 2 max: 207447 x̄: 11940.44 x̃: 2462
helped stats (rel) min: 0.66% max: 91.24% x̄: 28.12% x̃: 28.24%
95% mean confidence interval for cycles value: -15385.29 -8495.60
95% mean confidence interval for cycles %-change: -31.02% -25.22%
Cycles are helped.

total spills in shared programs: 426 -> 368 (-13.62%)
spills in affected programs: 426 -> 368 (-13.62%)
helped: 3
HURT: 0

total fills in shared programs: 2089 -> 1155 (-44.71%)
fills in affected programs: 2089 -> 1155 (-44.71%)
helped: 3
HURT: 0

Now, the point of the series is to improve the run time of the tests. I tested this on a Gen8 system using a debug build with -O2 and no other special optimization flags. I ran the same set of tests mentioned above by running:

NIR_VALIDATE=false INTEL_DEBUG=soft64 ./glcts --deqp-caselist-file=caselist.txt

The results across the whole MR are:

x Mesa master
+ this MR
+--------------------------------------------------------------------------+
|+                                                                         |
|+                                                                         |
|+                                                                      xx |
|++                                                                     xxx|
|A|                                                                     |A||
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        178.34        180.53        179.35       179.332    0.86392708
+   5        103.49        104.32        103.74       103.818    0.32713911
Difference at 95.0% confidence
	-75.514 +/- 0.952682
	-42.1085% +/- 0.531239%
	(Student's t, pooled s = 0.653219)

I did not collect step-by-step results for this.

If we wish to improve soft-fp64 further, I would suggest the following steps, in order of increasing difficulty:

  1. Implement special "compare with zero" functions for all of the relational operators: <, <=, >, >=, and ==; != should be implemented as !(x == 0). In lower_doubles_instr_to_soft, we would have to detect that one of the operands is the constant 0 and emit different function calls. (See the sketch after this list.)

  2. Implement some special fused functions. For example, in the Intel driver we emit special, optimized code for constant * fsign(x). There may be cases where doing that is worthwhile for fp64; this would require similar changes to lower_doubles_instr_to_soft. Some things that come to mind are __fadd3_64 or maybe __fadd64_lt_zero (add, then compare the result for less-than zero).

  3. Implement a real __ffma64. The current function just calls __fmul64 and __fadd64, which is really inefficient. The Berkeley SoftFloat library might have an FMA implementation, so this might not be that much effort, but I suspect a direct port would be less efficient than the current implementation; optimizations similar to those in this MR would likely need to be applied there.
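
As a sketch of step 1 (the function name __flt64_zero is made up for this example, and it assumes the uvec2 low-word/high-word representation that float64.glsl uses for doubles):

    /* Hypothetical fast path: a < 0.0 against a known-constant 0.0.
     * True iff the sign bit is set and a is neither -0.0 nor NaN, so the
     * second operand never has to be unpacked at all. */
    bool
    __flt64_zero(uvec2 a)
    {
       uint hi_mag = a.y & 0x7FFFFFFFu;
       bool is_neg_zero = a.y == 0x80000000u && a.x == 0u;
       bool is_nan = hi_mag > 0x7FF00000u ||
                     (hi_mag == 0x7FF00000u && a.x != 0u);

       return int(a.y) < 0 && !is_neg_zero && !is_nan;
    }

lower_doubles_instr_to_soft would then detect the constant-zero source on the comparison and emit a call to something like this instead of the general two-operand function.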

@mattst88 @majanes @craftyguy @hopetech
