gl: optimize glcolorbalance by precomputing shader math

Precompute the rgb -> yuv conversion and color balance adjustment math so that the shader does minimal work per pixel.

Merging these 15+ steps into 3 steps let us jump from choppy 360p video to smooth 720p video on our underpowered embedded system.

The basic observations that enabled these optimizations are:

  1. The individual dot products of the yuv <-> rgb conversions can be merged into a matrix multiplication + addition
  2. The brightness and contrast steps can be merged into a single matrix multiplication + addition
  3. The hue and saturation steps can be merged into a single 2x2 matrix multiplication + addition
  4. clamp() can be applied to a vector as a whole instead of once per component
  5. The rgb -> yuv, contrast/brightness, and hue/saturation steps can be merged into a single matrix multiplication + addition

The detailed math:

If luma_to_narrow(luma) = luma * 219.0 / 256.0 + 16.0 * 219.0 / 256.0 / 256.0; say luma*a + b,
   where a = 219.0 / 256.0 and b = 16.0 * 219.0 / 256.0 / 256.0
And luma_to_full(luma)  = luma * 256.0 / 219.0 - 16.0 / 256.0; say luma*1/a + c,
   where c = -16.0 / 256.0 (so a*c = -b)
Then,
  luma_to_narrow(luma_to_full(x)*contrast) + brightness
  => (luma_to_full(x)*contrast)*a + b + brightness
  => (x*1/a + c)*contrast*a + b + brightness
  => x*contrast + contrast*a*c + b + brightness
  => x*contrast + b*(1 - contrast) + brightness

If yuva = rgb_to_yuv(rgba) (preserving the alpha channel) = rgba * yuva_matrix + yuva_constant, where yuva_constant = vec4(0.0625, 0.5, 0.5, 0)
And yuva.x = luma_to_narrow(luma_to_full(yuva.x)*contrast) + brightness
We can mix them into a single operation: rgba * yuva_contrast_matrix + yuva_contrast_constant

where yuva_contrast_matrix = matrix4x4(0.256816 * contrast, -0.148246,  0.439271, 0,
                                       0.504154 * contrast, -0.29102,  -0.367833, 0,
                                       0.0979137 * contrast, 0.439266, -0.071438, 0,
                                       0,                    0,         0,        1)

and yuva_contrast_constant = vector4d(0.0625 * contrast + (16.0 * 219.0 / 256.0 / 256.0) * (1.0 - contrast) + brightness,
                                      0.5,
                                      0.5,
                                      0)

If vec2 uv = yuv.yz
And yuv.y = 0.5 + (((uv.x - 0.5) * hue_cos + (uv.y - 0.5) * hue_sin) * saturation);
    yuv.z = 0.5 + (((0.5 - uv.x) * hue_sin + (uv.y - 0.5) * hue_cos) * saturation);

    => yuv.yz = vec2(0.5) + saturation*vec2(dot(vec2(yuv.y - 0.5, yuv.z - 0.5), vec2(hue_cos, hue_sin)),
                                            dot(vec2(yuv.z - 0.5, 0.5 - yuv.y), vec2(hue_cos, hue_sin)));
    => yuv.yz = vec2(0.5) + saturation*vec2(dot(vec2(yuv.y - 0.5, yuv.z - 0.5), vec2(hue_cos, hue_sin)),
                                            dot(vec2(yuv.y - 0.5, yuv.z - 0.5), vec2(-hue_sin, hue_cos)));
    => yuv.yz = vec2(0.5) + uv*mat2(hue_cos * saturation, hue_sin * saturation,
                                   -hue_sin * saturation, hue_cos * saturation); // where uv is now yuv.yz - vec2(0.5)
In order to get yuva.yz = yuva.yz - vec2(0.5) directly from the first step,
yuva_contrast_constant becomes Qt.vector4d(0.0625 * contrast + (16.0 * 219.0 / 256.0 / 256.0) * (1.0 - contrast) + brightness,
                                           0,
                                           0,
                                           0)
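
The merge can be checked numerically outside the shader. The following is a minimal Python sketch, not the code from this change: it runs the step-by-step formulas and the precomputed form on the same pixel so the two can be compared. The luma constant is obtained by expanding luma_to_narrow(luma_to_full(y) * contrast) + brightness directly, using a*c = -b. The function names and the hue_angle parameter (an angle in radians from which hue_cos and hue_sin are derived) are assumptions made for illustration.

```python
import math

A = 219.0 / 256.0                 # luma_to_narrow scale
B = 16.0 * 219.0 / 256.0 / 256.0  # luma_to_narrow offset
C = -16.0 / 256.0                 # luma_to_full offset

# rgb -> yuv coefficients: the columns of yuva_contrast_matrix at contrast = 1.
Y_COEFF = (0.256816, 0.504154, 0.0979137)
U_COEFF = (-0.148246, -0.29102, 0.439266)
V_COEFF = (0.439271, -0.367833, -0.071438)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def clamp01(x):
    return min(max(x, 0.0), 1.0)


def balance_stepwise(rgb, brightness, contrast, hue_angle, saturation):
    """Every step done per pixel, one small operation after another."""
    y = dot(rgb, Y_COEFF) + 0.0625
    u = dot(rgb, U_COEFF) + 0.5
    v = dot(rgb, V_COEFF) + 0.5
    hue_cos, hue_sin = math.cos(hue_angle), math.sin(hue_angle)
    luma_full = y / A + C                                # luma_to_full
    y_out = (luma_full * contrast) * A + B + brightness  # luma_to_narrow, then brightness
    u_out = 0.5 + ((u - 0.5) * hue_cos + (v - 0.5) * hue_sin) * saturation
    v_out = 0.5 + ((0.5 - u) * hue_sin + (v - 0.5) * hue_cos) * saturation
    return (clamp01(y_out), clamp01(u_out), clamp01(v_out))


def precompute(brightness, contrast, hue_angle, saturation):
    """Done once per parameter change; the results would be passed as uniforms."""
    luma_constant = 0.0625 * contrast + B * (1.0 - contrast) + brightness
    hue_cos, hue_sin = math.cos(hue_angle), math.sin(hue_angle)
    # The two columns of mat2(...), each dotted with the centred chroma.
    hue_sat_columns = ((hue_cos * saturation, hue_sin * saturation),
                       (-hue_sin * saturation, hue_cos * saturation))
    return luma_constant, hue_sat_columns


def balance_merged(rgb, contrast, luma_constant, hue_sat_columns):
    """Per-pixel work after precomputation: two multiply-adds and the clamps."""
    y = dot(rgb, Y_COEFF) * contrast + luma_constant
    uv = (dot(rgb, U_COEFF), dot(rgb, V_COEFF))  # chroma already centred on 0
    return (clamp01(y),
            clamp01(0.5 + dot(uv, hue_sat_columns[0])),
            clamp01(0.5 + dot(uv, hue_sat_columns[1])))


if __name__ == "__main__":
    params = dict(brightness=0.1, contrast=1.3, hue_angle=0.4, saturation=0.8)
    luma_constant, columns = precompute(**params)
    pixel = (0.2, 0.6, 0.9)
    print(balance_stepwise(pixel, **params))
    print(balance_merged(pixel, params["contrast"], luma_constant, columns))
```

Both printed tuples should agree up to floating-point rounding. All trigonometry and constant folding happens in precompute(), which is the part that moves out of the per-pixel path.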

I have used a 4x4 matrix instead of a 3x3 matrix so as to be able to do:

gl_FragColor = yuva * from_yuv_coeff_mat + from_yuv_bt601_offset * from_yuv_coeff_mat;

instead of:

gl_FragColor.rgb = yuv * from_yuv_coeff_mat + from_yuv_bt601_offset * from_yuv_coeff_mat;
gl_FragColor.a = rgba.a;
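
As an illustration of why the 4x4 form removes the separate alpha assignment, here is a small Python sketch of the same row-vector-times-matrix product. The yuv -> rgb coefficients and the offset are assumed BT.601 values, since the text does not list the contents of from_yuv_coeff_mat or from_yuv_bt601_offset, and the helper names are hypothetical.

```python
# Assumed BT.601 yuv -> rgb coefficients; the text does not list them.
FROM_YUV_COEFF_MAT = (
    (1.164,  1.164, 1.164, 0.0),  # contribution of y to (r, g, b, a)
    (0.0,   -0.391, 2.018, 0.0),  # contribution of u
    (1.596, -0.813, 0.0,   0.0),  # contribution of v
    (0.0,    0.0,   0.0,   1.0),  # alpha passes through unchanged
)
FROM_YUV_OFFSET = (-0.0625, -0.5, -0.5, 0.0)  # assumed offset, zero for alpha


def vec_mat(vec, mat):
    """Row vector times matrix, matching `vec * mat` in the text."""
    return tuple(sum(vec[i] * mat[i][j] for i in range(len(vec)))
                 for j in range(len(mat[0])))


# offset * matrix does not depend on the pixel, so it can be computed once.
OFFSET_TERM = vec_mat(FROM_YUV_OFFSET, FROM_YUV_COEFF_MAT)


def yuva_to_rgba(yuva):
    """One multiply-add yields r, g, b and a together."""
    product = vec_mat(yuva, FROM_YUV_COEFF_MAT)
    return tuple(p + o for p, o in zip(product, OFFSET_TERM))


if __name__ == "__main__":
    # Narrow-range black with alpha 0.3: rgb is about 0, alpha stays 0.3.
    print(yuva_to_rgba((0.0625, 0.5, 0.5, 0.3)))
```

Because the last row and column of the matrix are those of the identity and the alpha entry of the offset is zero, the fourth output component is the input alpha.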

If we can remove the clamp() step inside the shader, or apply it after rgba conversion, there are more performance benefits to reap. But I am not sure what the side effects will be in that case.
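
One way to probe the side effects is to compare the two clamp orders on a value that leaves [0, 1] in yuv space. The Python sketch below is only an experiment under stated assumptions: it uses assumed BT.601 yuv -> rgb coefficients and assumes the render target clamps the final rgb to [0, 1]. The function names are hypothetical.

```python
def clamp01(x):
    return min(max(x, 0.0), 1.0)


def yuv_to_rgb(y, u, v):
    # Assumed BT.601 coefficients; not taken from the text.
    y, u, v = y - 0.0625, u - 0.5, v - 0.5
    return (1.164 * y + 1.596 * v,
            1.164 * y - 0.391 * u - 0.813 * v,
            1.164 * y + 2.018 * u)


def clamp_then_convert(yuv):
    """Current order: clamp yuv, convert, then the framebuffer clamps rgb."""
    rgb = yuv_to_rgb(*(clamp01(c) for c in yuv))
    return tuple(clamp01(c) for c in rgb)


def convert_then_clamp(yuv):
    """Alternative order: convert the unclamped yuv, clamp only the rgb."""
    return tuple(clamp01(c) for c in yuv_to_rgb(*yuv))


if __name__ == "__main__":
    in_range = (0.5, 0.5, 0.5)
    out_of_range = (-0.1, 0.8, 0.5)  # e.g. luma pushed below 0 by the balance step
    print(clamp_then_convert(in_range), convert_then_clamp(in_range))
    print(clamp_then_convert(out_of_range), convert_then_clamp(out_of_range))
```

In-range pixels come out identical either way. For the out-of-range pixel the blue channel differs between the two orders, so moving the clamp changes the result for pixels that the balance step pushes outside [0, 1].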
