New cairo overlay_rectangle premultiply function sometimes has pathological runtimes on Windows
When processing ARGB frames, we need to premultiply the RGB channels by alpha because that's the format that cairo expects, and when we output frames, we then need to unpremultiply them again. This should be really fast, and it is really fast on Linux. However, on Windows the run time varies wildly.
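For anyone not familiar with the operation, here is a minimal sketch of what premultiply and unpremultiply do per pixel. This is not the actual gstcairo code: the `sketch_*` names are made up, and the byte layout (alpha at index 0, colour channels at 1..3) is an assumption for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the gstcairo implementation.
 * Assumed layout: px[0] is alpha, px[1..3] are the colour channels. */
static void
sketch_premultiply_pixel (uint8_t * px)
{
  unsigned a = px[0];
  int i;

  for (i = 1; i < 4; i++)
    px[i] = (uint8_t) ((px[i] * a + 127) / 255);  /* rounded c * a / 255 */
}

/* Inverse of the above. With alpha == 0 the colour is unrecoverable,
 * so the channels are simply set to 0. */
static void
sketch_unpremultiply_pixel (uint8_t * px)
{
  unsigned a = px[0];
  int i;

  for (i = 1; i < 4; i++) {
    unsigned v = (a == 0) ? 0 : (px[i] * 255 + a / 2) / a;
    px[i] = (uint8_t) (v > 255 ? 255 : v);        /* clamp to 8 bits */
  }
}

/* Walk a whole frame, honouring the row stride (in bytes). */
static void
sketch_premultiply_frame (uint8_t * data, size_t width, size_t height,
    size_t stride)
{
  size_t x, y;

  for (y = 0; y < height; y++)
    for (x = 0; x < width; x++)
      sketch_premultiply_pixel (data + y * stride + x * 4);
}
```

Even in this naive form it is only a few integer operations per pixel, about 2 million pixels for a 1920x1080 frame, which is why runtimes measured in seconds look pathological.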
On my i5-4590 (3.3GHz), premultiply_3 on a single 1920x1080 frame sometimes takes 0.75s, sometimes 4.7s, and sometimes up to 9.8s. The same goes for unpremultiply_3. This makes the overlay element totally useless with ARGB frames, and is a regression from 1.14, where we output incorrect frames but at least output them on time. The same behaviour is observed with both the MinGW and MSVC compilers.
The contents of the frame do not correlate with the time taken. Memsetting the frame to 0x0 or 0xff before calling premultiply does not seem to change how long it takes.
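The measurement described above can be reproduced with something like the following sketch. All names here are hypothetical, `sketch_dummy_pass` is only a stand-in for the real premultiply_3, and it assumes a C11 runtime that provides `timespec_get`, which is wall-clock time rather than a monotonic clock.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature of the routine under test (hypothetical). */
typedef void (*sketch_frame_func) (uint8_t * data, size_t size);

/* Stand-in for premultiply_3: touches every byte of the frame once. */
static void
sketch_dummy_pass (uint8_t * data, size_t size)
{
  size_t i;

  for (i = 0; i < size; i++)
    data[i] = (uint8_t) (data[i] ^ 0x01);
}

/* Fill a frame of `size` bytes with `fill`, run `func` over it once and
 * return the elapsed wall-clock time in seconds, or -1.0 on failure. */
static double
sketch_time_frame_func (sketch_frame_func func, size_t size, int fill)
{
  struct timespec start, end;
  uint8_t *frame = malloc (size);

  if (frame == NULL)
    return -1.0;
  memset (frame, fill, size);

  if (timespec_get (&start, TIME_UTC) != TIME_UTC) {
    free (frame);
    return -1.0;
  }
  func (frame, size);
  timespec_get (&end, TIME_UTC);
  free (frame);

  return (double) (end.tv_sec - start.tv_sec)
      + (double) (end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling `sketch_time_frame_func (sketch_dummy_pass, 1920 * 1080 * 4, 0x00)` and again with `0xff` in a loop gives one timing per frame for each fill value, which is enough to see whether the spread depends on the contents.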
The generated assembly also looks correct and efficient, so more investigation is needed. For instance, it could be related to scheduling priority (this can be checked by raising the priority of that thread).
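A quick way to test the scheduling theory would be to call something like this from the thread that runs premultiply, before timing it. This is only a sketch: the function name is hypothetical, and `THREAD_PRIORITY_HIGHEST` is just one plausible level to try. `SetThreadPriority` and `GetCurrentThread` are the standard Win32 calls.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical debugging helper: raise the priority of the calling thread.
 * Returns 1 on success, 0 if the Win32 call failed, and -1 when built for
 * a non-Windows target, where this sketch does nothing. */
static int
sketch_raise_current_thread_priority (void)
{
#ifdef _WIN32
  return SetThreadPriority (GetCurrentThread (),
      THREAD_PRIORITY_HIGHEST) ? 1 : 0;
#else
  return -1;
#endif
}
```

If the runtimes stay just as erratic with the priority raised, scheduling priority is probably not the cause.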
This issue is also very difficult to debug currently because all of Cerbero is still built with MinGW, which makes it impossible to use any profiling tools on the code. I have meson ports for all the dependencies of gstcairo, but SIMD optimizations are not correctly enabled in them, so further investigation is blocked on that. Even after enabling SIMD optimizations, the problems are still seen. This needs someone to investigate.