gl: optimize glcolorbalance by precomputing shader math
Precompute the rgb -> yuv conversion and color balance adjustment math so that the shader does minimal work per pixel.
Merging the shader's 15+ per-pixel steps into 3 let us go from choppy 360p video to smooth 720p video on our underpowered embedded system.
The basic observations that enabled these optimizations are:
- The individual dot products of the yuv <-> rgb conversions can be merged into a matrix multiplication + addition
- The brightness and contrast steps can be merged into a single matrix multiplication + addition
- The hue and saturation steps can be merged into a single 2x2 matrix multiplication + addition
- Individual clamp() operations can be applied on a vector as a whole instead of once per component
- The rgb -> yuv, contrast and brightness, and hue and saturation steps can all be merged into a single matrix multiplication + addition
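The merging in the list above relies on the fact that two "multiply by a matrix, then add a constant" steps applied back to back collapse into one such step. A minimal Python sketch of that identity, using the same row-vector * matrix order as the rest of this message. The helper names (vec_mat, mat_mat, vec_add, merge_affine) are made up for this illustration and are not part of the shader or the element:

```python
def vec_mat(v, m):
    """Row vector times matrix: result[j] = sum_i v[i] * m[i][j]."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def mat_mat(a, b):
    """Matrix product a * b, built one row at a time."""
    return [vec_mat(row, b) for row in a]


def vec_add(u, v):
    return [x + y for x, y in zip(u, v)]


def merge_affine(m1, c1, m2, c2):
    """Fold (x * m1 + c1) * m2 + c2 into a single x * m + c."""
    return mat_mat(m1, m2), vec_add(vec_mat(c1, m2), c2)
```

The merged matrix and constant depend only on the element's settings, not on the pixel, so they can be computed once on the CPU and handed to the shader as uniforms. clamp() is not a linear operation, so it cannot be folded in this way; it can only be applied to the whole vector at once.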
The detailed math:
If luma_to_narrow(luma) = luma * 219.0 / 256.0 + 16.0 * 219.0 / 256.0 / 256.0, say x*a + b,
where a = 219.0 / 256.0 and b = 16.0 * 219.0 / 256.0 / 256.0
And luma_to_full(luma) = luma * 256.0 / 219.0 - 16.0 / 256.0, say y*1/a + c,
where c = -16.0 / 256.0
Then,
luma_to_narrow(luma_to_full(x)*contrast) + brightness
=> ((luma_to_full(x)*contrast)*1/a + c) + brightness
=> (x*a +b)*contrast*1/a + c + brightness
=> x*contrast + contrast*b/a + brightness + c
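The last expansion step can be checked numerically. The sketch below only compares the final two lines of the derivation as written, i.e. it assumes the x*a + b scaling is applied before the contrast multiply and the 1/a, c scaling after it. The function names expanded and folded are made up for this check:

```python
A = 219.0 / 256.0                   # a
B = 16.0 * 219.0 / 256.0 / 256.0    # b
C = -16.0 / 256.0                   # c


def expanded(x, contrast, brightness):
    # (x*a + b)*contrast*1/a + c + brightness
    return (x * A + B) * contrast / A + C + brightness


def folded(x, contrast, brightness):
    # x*contrast + contrast*b/a + brightness + c
    return x * contrast + contrast * B / A + brightness + C
```

Note that b/a works out to 16.0 / 256.0 = 0.0625, so the constant part of the folded form is 0.0625 * contrast + brightness - 0.0625.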
If yuva = rgb_to_yuv(rgba) (preserve alpha channel) + vec4(0, -0.5, -0.5, 0) = rgba * yuva_matrix + yuva_constant
And yuva.x = luma_to_narrow(luma_to_full(x)*contrast) + brightness
We can mix them into a single operation: rgba * yuva_contrast_matrix + yuva_contrast_constant
where yuva_contrast_matrix = matrix4x4(0.256816 * contrast, -0.148246, 0.439271, 0,
0.504154 * contrast, -0.29102, -0.367833, 0,
0.0979137 * contrast, 0.439266, -0.071438, 0,
0, 0, 0, 1)
and yuva_contrast_constant = vector4d(0.0625 * contrast + contrast * ((16.0 * 219.0 / 256.0 / 256.0) / (219.0 / 256.0)) + brightness - (16.0 / 256.0),
0.5,
0.5,
0)
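As a sanity check, the merged form rgba * yuva_contrast_matrix + yuva_contrast_constant can be compared against doing the conversion and the luma adjustment one after the other. This is a rough Python sketch, not the shader code: the coefficients are copied from the matrix above, clamp() is left out, and the function names (merged, stepwise) are made up for this comparison:

```python
LUMA_A = 219.0 / 256.0
LUMA_B = 16.0 * 219.0 / 256.0 / 256.0


def yuva_contrast_matrix(contrast):
    return [
        [0.256816 * contrast, -0.148246, 0.439271, 0.0],
        [0.504154 * contrast, -0.29102, -0.367833, 0.0],
        [0.0979137 * contrast, 0.439266, -0.071438, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]


def yuva_contrast_constant(contrast, brightness):
    y = 0.0625 * contrast + contrast * (LUMA_B / LUMA_A) + brightness - 16.0 / 256.0
    return [y, 0.5, 0.5, 0.0]


def merged(rgba, contrast, brightness):
    """One row-vector * matrix product plus one constant."""
    m = yuva_contrast_matrix(contrast)
    k = yuva_contrast_constant(contrast, brightness)
    return [sum(rgba[i] * m[i][j] for i in range(4)) + k[j] for j in range(4)]


def stepwise(rgba, contrast, brightness):
    """Convert to yuv first, then adjust luma using the folded formula above."""
    r, g, b, alpha = rgba
    y = 0.256816 * r + 0.504154 * g + 0.0979137 * b + 0.0625
    u = -0.148246 * r - 0.29102 * g + 0.439266 * b + 0.5
    v = 0.439271 * r - 0.367833 * g - 0.071438 * b + 0.5
    y = y * contrast + contrast * (LUMA_B / LUMA_A) + brightness - 16.0 / 256.0
    return [y, u, v, alpha]
```

The alpha channel passes through unchanged because the last row of the matrix is (0, 0, 0, 1) and the last entry of the constant is 0.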
If vec2 uv = yuv.yz
And yuv.y = 0.5 + (((uv.x - 0.5) * hue_cos + (uv.y - 0.5) * hue_sin) * saturation);
yuv.z = 0.5 + (((0.5 - uv.x) * hue_sin + (uv.y - 0.5) * hue_cos) * saturation);
=> yuv.yz = vec2(0.5) + saturation*vec2(dot(vec2(yuv.y - 0.5, yuv.z - 0.5), vec2(hue_cos, hue_sin)),
dot(vec2(yuv.z - 0.5, 0.5 - yuv.y), vec2(hue_cos, hue_sin)));
=> yuv.yz = vec2(0.5) + saturation*vec2(dot(vec2(yuv.y - 0.5, yuv.z - 0.5), vec2(hue_cos, hue_sin)),
dot(vec2(yuv.y - 0.5, yuv.z - 0.5), vec2(-hue_sin, hue_cos)));
=> yuv.yz = vec2(0.5) + uv*mat2(hue_cos * saturation, hue_sin * saturation,
-hue_sin * saturation, hue_cos * saturation); // Where vec2 uv = yuv.yz - vec2(0.5)
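The same kind of check for the hue and saturation step, as a small Python sketch. It takes hue_cos and hue_sin as inputs rather than assuming how they are derived from the hue setting, and it models mat2(...) as two columns in the order the GLSL constructor takes its arguments. The function names are made up for this check:

```python
def stepwise(u, v, hue_cos, hue_sin, saturation):
    """The two scalar formulas the derivation starts from."""
    new_u = 0.5 + ((u - 0.5) * hue_cos + (v - 0.5) * hue_sin) * saturation
    new_v = 0.5 + ((0.5 - u) * hue_sin + (v - 0.5) * hue_cos) * saturation
    return new_u, new_v


def hue_sat_matrix(hue_cos, hue_sin, saturation):
    """Columns of mat2(hue_cos*s, hue_sin*s, -hue_sin*s, hue_cos*s)."""
    col0 = (hue_cos * saturation, hue_sin * saturation)
    col1 = (-hue_sin * saturation, hue_cos * saturation)
    return col0, col1


def merged(u, v, hue_cos, hue_sin, saturation):
    col0, col1 = hue_sat_matrix(hue_cos, hue_sin, saturation)
    cu, cv = u - 0.5, v - 0.5  # uv = yuv.yz - vec2(0.5)
    # Row vector * matrix: each output is the dot product with one column.
    return (0.5 + cu * col0[0] + cv * col0[1],
            0.5 + cu * col1[0] + cv * col1[1])
```

With hue_cos = 1, hue_sin = 0 and saturation = 1 the matrix is the identity, so both forms return the input chroma unchanged.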
In order to get yuva.yz = yuva.yz - vec2(0.5),
yuva_contrast_constant becomes Qt.vector4d(0.0625 * contrast + contrast * ((16.0 * 219.0 / 256.0 / 256.0) / (219.0 / 256.0)) + brightness - (16.0 / 256.0),
0,
0,
0)
I have used a 4x4 matrix instead of a 3x3 matrix so as to be able to do:
gl_FragColor = yuva * from_yuv_coeff_mat + from_yuv_bt601_offset * from_yuv_coeff_mat;
instead of:
gl_FragColor.rgb = yuv * from_yuv_coeff_mat + from_yuv_bt601_offset * from_yuv_coeff_mat;
gl_FragColor.a = rgba.a;
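To illustrate why the two forms above give the same result, here is a small Python sketch. The 3x3 coefficients and the offset are illustrative BT.601-style values chosen for this example; they are an assumption and not necessarily the values of from_yuv_coeff_mat and from_yuv_bt601_offset in this commit. The point is only that padding the matrix with a (0, 0, 0, 1) row and column carries alpha through the same multiplication:

```python
# Hypothetical stand-ins for the yuv -> rgb matrix and offset (row-vector order).
M3 = [
    [1.164, 1.164, 1.164],   # contribution of y to r, g, b
    [0.0, -0.391, 2.018],    # contribution of u
    [1.596, -0.813, 0.0],    # contribution of v
]
OFFSET3 = [-0.0625, -0.5, -0.5]

# Pad to 4x4 so that alpha maps to itself and nothing else touches it.
M4 = [row + [0.0] for row in M3] + [[0.0, 0.0, 0.0, 1.0]]
OFFSET4 = OFFSET3 + [0.0]


def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def with_3x3(yuva):
    """Convert rgb with the 3x3 matrix, then copy alpha separately."""
    rgb = [p + q for p, q in zip(vec_mat(yuva[:3], M3), vec_mat(OFFSET3, M3))]
    return rgb + [yuva[3]]


def with_4x4(yuva):
    """Single 4-component expression; alpha rides along."""
    return [p + q for p, q in zip(vec_mat(yuva, M4), vec_mat(OFFSET4, M4))]
```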
If we can remove the clamp() step inside the shader, or apply it after rgba conversion, there are more performance benefits to reap. But I am not sure what the side effects will be in that case.