DRM/KMS PWL API Thoughts and Questions
(I'm sending this as an email as the lowest common denominator, but I feel an issue on the color-and-hdr repo would be a better interface for productive discussion. Please pop over to < > if you agree. Hopefully we can drive the discussion there.)
I've wanted to start a thread to discuss the use of PWL APIs that were introduced by Uma a year ago and for which Bhanuprakash provided IGT tests. I have come to like the API but as we're getting closer to a real-world use of it I have a few questions and comments. As with a lot of complex APIs the devil is in the details. Some of those details are currently underspecified, or underdocumented and it's important that we all interpret the API the same way.
The API
The original patches posted by Uma: https://patchwork.freedesktop.org/series/90822/ https://patchwork.freedesktop.org/series/90826/
The IGT tests for PWL API: https://patchwork.freedesktop.org/series/96895/
I've rebased the kernel patches on a slightly more recent kernel, along with an AMD implementation: https://gitlab.freedesktop.org/hwentland/linux/-/tree/color-and-hdr
I've also rebased them on an IGT tree, but that's not too up-to-date: https://gitlab.freedesktop.org/hwentland/igt-gpu-tools/-/tree/color-and-hdr
Why do I like the API?
In order to allow HW composition of HDR planes in linear space we need the ability to program at least a per-CRTC regamma (inv_EOTF) to go from linear to wire format post-blending. Since userspace might want to apply corrections on top of a simple transfer function (such as PQ, BT.709, etc.) it would like a way to set a custom LUT.
The existing CRTC gamma LUT defines equally spaced entries. As Pekka shows in [1] equally-spaced LUTs have unacceptable error for regamma/inv_EOTF. Hence we need finer granularity of our entries near zero while coarse granularity works fine toward the brighter values.
[1] !9 (closed)
HW (at least AMD and Intel HW) implements this ability as segmented piece-wise linear LUTs. These define segments of equally spaced entries. These segments are constrained by the HW implementation. I like how the PWL API allows different drivers to specify the constraints imposed by different HW while allowing userspace a generic way of parsing the PWL. This also avoids complex calculations in the kernel driver, which might be required for other APIs one could envision. If anyone likes I can elaborate on some ideas for an alternate API, though all of them will require non-trivial transformations by the kernel driver in order to program them to HW.
Nitpicks
The API defines everything inside the segments, including flags and values that apply to the entire PWL, such as DRM_MODE_LUT_GAMMA, DRM_MODE_LUT_REFLECT_NEGATIVE, input_bpc, and output_bpc. If these don't stay constant across segments it might complicate the interpretation of segments. I suggest we consider these as effectively applying to the entire PWL. We could encode them in an overall drm_color_lut struct that includes an array of drm_color_lut_range, but that's probably not necessary, which is why I called this out as a nitpick. I would just like us to be aware of this ambiguity and document that these values apply to the entire PWL.
How to read the PWL
Let me first give a summary of how this LUT is used in userspace. If you're familiar with this please review and comment if I got things wrong. As I mentioned, a lot of this is underspecified at the moment so you're reading my interpretation.
You can see this behavior in plane_degamma_test [2] in the kms_color.c IGT test suite. I suggest looking at plane_degamma_test here instead of the test_pipe_gamma test, as the latter still has Intelisms (assumptions around Intel driver/HW behavior) and will not work for other drivers.
Iterate over all enums in PLANE_DEGAMMA_MODE and find a suitable one. How do we find the suitable one? More on that below.
Once we have the right PLANE_DEGAMMA_MODE we read the blob for the blob ID associated with the PLANE_DEGAMMA_MODE enum. We interpret the blob as an array of drm_color_lut_range. See get_segment_data [3].
We can think of our LUT/PWL as f(x) = y. For a traditional equally spaced LUT with 1024 entries x would be 0, 1, 2, ..., 1023. For a PWL LUT we need to parse the segment data provided in drm_color_lut_range.
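As a rough illustration of the blob handling described above, here is a minimal C sketch. The struct pwl_range is a local stand-in that only mirrors the fields used in this mail; the real drm_color_lut_range layout comes from the drm-uapi headers of the patch series and may differ. Both helper names are made up for this example. The entry total also assumes that every segment contributes exactly count entries, which is my reading and not something the API documents.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for drm_color_lut_range (assumed layout, not the uAPI). */
struct pwl_range {
    uint32_t flags;
    uint16_t count;
    uint8_t input_bpc, output_bpc;
    int64_t start, end;
    int64_t min, max;
};

/* Number of segments in a blob of `length` bytes. Returns 0 if the
 * length is zero or not a whole multiple of the segment struct size. */
size_t pwl_num_segments(size_t length)
{
    if (length == 0 || length % sizeof(struct pwl_range) != 0)
        return 0;
    return length / sizeof(struct pwl_range);
}

/* Total number of LUT entries, assuming each segment contributes
 * exactly `count` entries (an interpretation, see the text above). */
size_t pwl_total_entries(const struct pwl_range *segs, size_t n)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += segs[i].count;
    return total;
}
```

In a real client the byte length and data pointer would come from the property blob read for the chosen PLANE_DEGAMMA_MODE enum value.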
Let's look at the 2nd-last entry of the nonlinear_pwl definition for the AMD driver [4] (I've corrected it here and dropped the DRM_MODE_LUT_REUSE_LAST but it's still incorrect in the link) and simplify it to 4 entries for the sake of readability:
	{
		.flags = (DRM_MODE_LUT_GAMMA | DRM_MODE_LUT_REFLECT_NEGATIVE | DRM_MODE_LUT_INTERPOLATE | DRM_MODE_LUT_NON_DECREASING),
		.count = 4,
		.input_bpc = 13, .output_bpc = 18,
		.start = 1 << 12, .end = 1 << 13,
		.min = 0, .max = 1UL << 35
	},
We see we have 4 entries (16 in the original definition, before simplifying) in the region from (1 << 12) to (1 << 13). We see input_bpc is 13, so our 1.0 float value is 1 << 13.
(Is it sensible to use input_bpc as our 1.0 floating point reference? Why?)
Since this segment is not reusing the last entry (doesn't have DRM_MODE_LUT_REUSE_LAST) we divide the region between 1 << 12 and 1 << 13 into 4 equally spaced sections.
step_size = (segment->end - segment->start) / count
In our case our segment spans from 4096 to 8192 and yields a step_size of 1024.
Note that we need to calculate this in floating point, otherwise we're not guaranteed equal spacing.
This gives us these X entries for this particular segment: 4096, 5120, 6144, 7168
And normalized to 1 << 13 (input_bpc) for our 1.0 float value we get 0.5, 0.625, 0.75, 0.875
If the segment had the REUSE_LAST flag the values would look like 4096, 5461, 6826, 8192
and normalized to 0.5, 0.666626, 0.833252, 1
Though in the case of REUSE_LAST a more sensible definition might be with a count of 5 entries in this segment, instead of 4, so that the step size stays at 1024. But ultimately that's up to the driver.
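The X calculation above can be sketched in C as follows. This is only my interpretation written out, not code from the patches or from pwl-parser.c. The helper names pwl_entry_x and pwl_normalize are made up for this example, and REUSE_LAST is passed as a plain int rather than tested as a flag bit so the sketch does not depend on the flag's actual value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: input (X) position of entry i in a segment that
 * spans [start, end] with `count` entries. Computed in floating point
 * so the entries stay equally spaced. */
double pwl_entry_x(int64_t start, int64_t end, unsigned count,
                   int reuse_last, unsigned i)
{
    assert(count > (reuse_last ? 1u : 0u) && i < count);

    /* With REUSE_LAST the final entry sits on `end`, so there are only
     * count - 1 intervals; otherwise there are count intervals. */
    unsigned intervals = reuse_last ? count - 1 : count;
    double step = (double)(end - start) / intervals;

    return (double)start + i * step;
}

/* Normalize so that 1 << input_bpc maps to 1.0. */
double pwl_normalize(double x, unsigned input_bpc)
{
    return x / (double)(1ULL << input_bpc);
}
```

For the segment above, pwl_entry_x(4096, 8192, 4, 0, i) gives 4096, 5120, 6144, 7168 for i = 0..3, and with reuse_last set it gives the 4096, 5461, 6826, 8192 sequence once truncated to integers.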
I attached a simple C program that parses a PWL and helps illustrate my interpretation. You'll just need to copy-paste or include your PWL definition (instead of color_gamma.h), and point the PWL define to your PWL variable. To build it you'll need to copy the drm-uapi folder from my IGT repo into the same directory as pwl-parser.c and then just build it with "gcc -Idrm-uapi pwl-parser.c".
The above entries come from running pwl-parser with the nonlinear_pwl from [5].
We should probably document the above in detail in the DRM/KMS API docs.
[2] https://gitlab.freedesktop.org/hwentland/igt-gpu-tools/-/blob/color-and-hdr/tests/kms_color.c#L978
[6] https://gitlab.freedesktop.org/hwentland/color-and-hdr/-/tree/precision/octave
How to provide PWL entries?
To provide the PWL values for each entry we'll have to sample our (custom) curve at the respective points specified by our PWL entries and provide those samples in a blob that is passed to PLANE_DEGAMMA_LUT. It's not much different from how we provide values for an equally spaced (legacy) LUT. As for sampling our curve, I remember seeing that Weston uses an LCMS function to sample the curve at required points. Last I checked it samples the curve at evenly spaced intervals but the LCMS function seemed to provide arbitrary sampling.
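A minimal C sketch of the sampling step, under a few assumptions of mine: the positions are already normalized so that 1.0 is full scale, the output is a plain array of 64-bit values scaled to a caller-chosen out_max (how the real LUT blob scales and packs its entries is not covered here), and results are clamped to the within-range case. The names curve_fn, example_curve and pwl_sample_curve are invented for illustration, and x * x is only a stand-in for a real transfer function.

```c
#include <stddef.h>
#include <stdint.h>

typedef double (*curve_fn)(double);

/* Stand-in curve for illustration only; a real caller would pass its
 * own transfer function here. */
double example_curve(double x)
{
    return x * x;
}

/* Sample `curve` at each normalized position xs[i] and store the
 * result scaled to out_max, rounded to nearest and clamped to
 * [0, out_max]. */
void pwl_sample_curve(curve_fn curve, const double *xs, size_t n,
                      uint64_t out_max, uint64_t *out)
{
    for (size_t i = 0; i < n; i++) {
        double y = curve(xs[i]);

        if (y < 0.0)
            y = 0.0;
        if (y > 1.0)
            y = 1.0;
        out[i] = (uint64_t)(y * (double)out_max + 0.5);
    }
}
```

The xs array would hold the normalized entry positions of all segments in order, for example 0.5, 0.625, 0.75, 0.875 for the simplified segment discussed earlier.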
How to pick a suitable PWL definition?
Picking the right PWL definition out of the respective _MODE enum isn't trivial. Without further information a userspace implementer has to understand the distribution of entries in all segments and perform a bunch of math to estimate the error for given curves.
A simpler approach might be if we defined common naming for our PWL enums. We can define the commonly expected cases. The two important parameters are within-range vs overrange entries, and linear vs non-linear outputs.
Within-range PWLs would cover [0.0, 1.0] entries and overrange [0.0, 128.0] to cover PQ and probably anything else. One could think of the within-range PWL as intended for SDR content and the over-range PWL for HDR content.
Linear PWLs would be intended for linear (or near-linear) luminance outputs and can be represented by equally spaced LUTs, such as single-segment definitions. Non-linear PWLs would be intended for linear to non-linear transforms; linear PWLs for non-linear to linear transforms or OOTFs.
This gives us four enums, plus one for bypass:
DRM_MODE_LUT_BYPASS
DRM_MODE_LUT_LINEAR_SDR
DRM_MODE_LUT_NONLINEAR_SDR
DRM_MODE_LUT_LINEAR_HDR
DRM_MODE_LUT_NONLINEAR_HDR
Drivers can provide appropriate PWLs based on the HW caps. Userspace can pick an appropriate one if it's available or fall back to either a sub-optimal PWL or to using a GPU transform instead.
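To make the selection idea concrete, here is a small C sketch of how userspace might pick from such enums. The enum names are the ones proposed above, but their numeric values, the pwl_pick_mode helper, the PWL_MODE_NONE return value and the particular fallback order (same range, other entry distribution) are assumptions for illustration only.

```c
#include <stddef.h>

/* Proposed names from this mail; the numeric values are placeholders. */
enum pwl_mode {
    DRM_MODE_LUT_BYPASS,
    DRM_MODE_LUT_LINEAR_SDR,
    DRM_MODE_LUT_NONLINEAR_SDR,
    DRM_MODE_LUT_LINEAR_HDR,
    DRM_MODE_LUT_NONLINEAR_HDR,
};

/* Hypothetical "nothing suitable" result: use a GPU transform instead. */
#define PWL_MODE_NONE (-1)

static int has_mode(const enum pwl_mode *avail, size_t n, enum pwl_mode m)
{
    for (size_t i = 0; i < n; i++)
        if (avail[i] == m)
            return 1;
    return 0;
}

/* Prefer the exact match; otherwise accept the sub-optimal PWL that
 * covers the same range; otherwise report that no PWL fits. */
int pwl_pick_mode(const enum pwl_mode *avail, size_t n,
                  int hdr, int nonlinear)
{
    enum pwl_mode lin = hdr ? DRM_MODE_LUT_LINEAR_HDR
                            : DRM_MODE_LUT_LINEAR_SDR;
    enum pwl_mode nonlin = hdr ? DRM_MODE_LUT_NONLINEAR_HDR
                               : DRM_MODE_LUT_NONLINEAR_SDR;
    enum pwl_mode preferred = nonlinear ? nonlin : lin;
    enum pwl_mode fallback = nonlinear ? lin : nonlin;

    if (has_mode(avail, n, preferred))
        return preferred;
    if (has_mode(avail, n, fallback))
        return fallback;
    return PWL_MODE_NONE;
}
```

Other fallback policies are possible, for example letting an overrange PWL serve SDR content; this one just keeps the range fixed and trades away precision.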
Thoughts?
Thanks, Harry