Commit dd834d41 authored by Harry Wentland

Add section on power considerations



v3:
- Add mention that AMD display pipeline color-processing block's impact
  on power is negligible in comparison to memory bandwidth and GPU
  power (Sebastian)

v2:
- Add link to README.md (Pekka)
- Make it clear that compositors don't currently do color space conversions.
  Applications will perform their own color space conversion when required,
  such as when playing back HDR videos. (Sebastian)
- Avoid some generalizations about color-managed compositor behavior while
  highlighting that a color-managed compositor will likely have a larger GPU
  and memory bandwidth usage (Pekka)
- Clarify the importance of 3D LUT for high-quality tone-, and gamut-mapping
  (Sebastian)
- Soften language of power-saving implications of RGB 10bpc scanout instead of
  FP16 scanout (Harry)
- Add blurb on de-linearizing content post-blending when needed (Sebastian)
- Use unambiguous language to refer to the display controller, instead of the
  actual display (Pekka)
- Spell out PWL, and TF acronyms when first introduced (Pekka)
- Clarify the de-linearization and linearization performed on drm_plane and
  drm_crtc respectively for the video use case (Pekka)
Signed-off-by: Harry Wentland <harry.wentland@amd.com>
parent 4ec91e45
@@ -33,6 +33,9 @@ specifications. All content must follow the [license](LICENSE).
- [Tools](doc/tools.rst) contains a list of tools that can be used to inspect
and modify color management and display attributes on Linux.
- [Power](doc/power.rst) talks about the power implications of color-managed
and HDR desktops
## History
Originally this documentation was started to better explain how
.. Copyright 2022 Advanced Micro Devices, Inc.
Power
=====
One of the goals of modern compositors is power efficiency. A compositor
constructs a scene graph of the compositor layers and then blends them together
to achieve the final result. The blending usually happens via GPU shaders. At
each branch of the scene graph the shader reads two input buffers and writes one
output buffer. As a consequence each composition step consumes **memory
bandwidth** and **GPU power**.
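For example, blending two full-HD FP16 layers (8 bytes per pixel) reads two
1920 x 1080 buffers and writes a third, roughly 3 x 16.6 MB, or about 50 MB of
memory traffic per composition step, which adds up to roughly 3 GB/s if that
step repeats 60 times per second.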
A compositor only needs to compose surfaces that change. If no composition
surface changes the compositor will not need to recompose the output. The
display engine, on the other hand, will scan out the framebuffer each frame,
consuming memory bandwidth. If no composition occurs the GPU can remain off.
Two of the most taxing use-cases for a compositor are video playback and gaming
when the video/game window can't be promoted to direct scanout because it isn't
full-screen or because elements are overlaid on top of it. Such elements might
be an in-game overlay or video subtitles. Any composition layer that is
constantly changing will benefit from optimizations that avoid GPU composition.
If we can use a separate DRM/KMS plane for each of these elements we can avoid
GPU composition and leave the GPU powered off when no other client uses it,
saving both memory bandwidth and GPU power. In the gaming scenario we free up
GPU bandwidth for use by the game.
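
A minimal sketch of what this looks like at the DRM/KMS level is shown below. It
assumes an atomic-capable device and that fd, plane_id, crtc_id, fb_id and the
buffer dimensions were already obtained; prop_id() is a hypothetical helper that
resolves a property name to its ID via
drmModeObjectGetProperties()/drmModeGetProperty().

.. code-block:: c

   #include <stdint.h>
   #include <xf86drm.h>
   #include <xf86drmMode.h>

   /* Hypothetical helper: resolve a named property on a DRM object via
    * drmModeObjectGetProperties()/drmModeGetProperty(). */
   uint32_t prop_id(int fd, uint32_t obj_id, uint32_t obj_type, const char *name);

   static int add_plane_prop(int fd, drmModeAtomicReq *req, uint32_t plane_id,
                             const char *name, uint64_t value)
   {
       return drmModeAtomicAddProperty(req, plane_id,
           prop_id(fd, plane_id, DRM_MODE_OBJECT_PLANE, name), value);
   }

   /* Scan out fb_id on its own plane so the display engine, rather than a
    * GPU shader pass, blends it with the other planes on the CRTC. */
   int scanout_on_plane(int fd, uint32_t plane_id, uint32_t crtc_id,
                        uint32_t fb_id, uint32_t w, uint32_t h)
   {
       drmModeAtomicReq *req = drmModeAtomicAlloc();
       int ret;

       if (!req)
           return -1;

       add_plane_prop(fd, req, plane_id, "FB_ID", fb_id);
       add_plane_prop(fd, req, plane_id, "CRTC_ID", crtc_id);
       /* SRC_* coordinates are 16.16 fixed point, CRTC_* are pixels. */
       add_plane_prop(fd, req, plane_id, "SRC_X", 0);
       add_plane_prop(fd, req, plane_id, "SRC_Y", 0);
       add_plane_prop(fd, req, plane_id, "SRC_W", (uint64_t)w << 16);
       add_plane_prop(fd, req, plane_id, "SRC_H", (uint64_t)h << 16);
       add_plane_prop(fd, req, plane_id, "CRTC_X", 0);
       add_plane_prop(fd, req, plane_id, "CRTC_Y", 0);
       add_plane_prop(fd, req, plane_id, "CRTC_W", w);
       add_plane_prop(fd, req, plane_id, "CRTC_H", h);

       ret = drmModeAtomicCommit(fd, req, 0, NULL);
       drmModeAtomicFree(req);
       return ret;
   }
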
Composition of a color-managed desktop
--------------------------------------
Composing a color-managed desktop increases our GPU usage and memory bandwidth.
To understand why, let's first look at the composition of a
non-color-managed desktop, i.e. at what most Linux compositors are doing
currently.
Non-color-managed compositors and applications are assumed to use sRGB buffers.
Applications that are displaying non-sRGB content will do their own conversion
to sRGB. Compositors simply blend the sRGB buffers in sRGB space, without
linearizing/de-linearizing the buffers before/after blending.
A color-managed compositor has to ensure all buffers are scaled and blended in a
well-defined scaling/blending space. This might involve additional operations,
such as de-linearization/linearization of content, color space conversion, and
tone- and gamut-mapping. It might also involve pixel formats with larger
bit-depth, such as P010 instead of NV12, or FP16 instead of XRGB8888. The
effect tends to be higher GPU usage and possibly larger memory bandwidth
requirements.
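
To make the extra per-pixel work concrete, here is a minimal sketch of blending
a single channel in linear space, assuming sRGB-encoded inputs and a float
(FP16-like) intermediate; the helper names are illustrative and not part of any
compositor API.

.. code-block:: c

   #include <math.h>

   /* sRGB decoding (EOTF) and encoding (inverse EOTF). */
   static float srgb_to_linear(float c)
   {
       return c <= 0.04045f ? c / 12.92f
                            : powf((c + 0.055f) / 1.055f, 2.4f);
   }

   static float linear_to_srgb(float c)
   {
       return c <= 0.0031308f ? c * 12.92f
                              : 1.055f * powf(c, 1.0f / 2.4f) - 0.055f;
   }

   /* Alpha-blend one channel of two sRGB-encoded pixels in linear space.
    * A non-color-managed compositor skips both conversions and blends the
    * encoded values directly, which is cheaper but not physically correct. */
   static float blend_channel(float dst_srgb, float src_srgb, float src_alpha)
   {
       float dst = srgb_to_linear(dst_srgb);
       float src = srgb_to_linear(src_srgb);

       return linear_to_srgb(src * src_alpha + dst * (1.0f - src_alpha));
   }
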
Scanout formats and memory bandwidth
------------------------------------
Different pixel formats have different memory bandwidth usage. This is a list of
common formats.

.. list-table::
   :header-rows: 1

   * - Pixel format
     - Average Bandwidth (per pixel, in bits)
   * - P010
     - 24
   * - NV12
     - 12
   * - ARGB8888
     - 32
   * - ARGB2101010
     - 32
   * - FP16
     - 64

On non-color-managed desktops display scanout uses ARGB8888 or ARGB2101010
buffers, both using 32 bpp. To support blending linear luminance buffers and to
present HDR or wide-gamut content without artifacts, color-managed compositors
use FP16 formats. This effectively doubles the bandwidth to 64 bpp.
Display scanout happens every frame at the refresh rate of the display. 60 times
a second for a 60 Hz display, 120 times a second for a 120 Hz display. Composition on the
other hand happens less frequently in most scenarios. If we can avoid scanout of
FP16 buffers we can potentially save a large amount of memory bandwidth, even if
SW composition uses FP16 buffers.
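As a rough worked example, scanning out a 3840 x 2160 buffer at 60 Hz reads
3840 x 2160 x 8 bytes x 60, or about 4.0 GB/s for FP16, but only about 2.0 GB/s
for ARGB2101010 and about 1.5 GB/s for P010, and this cost is paid on every
frame whether or not anything on screen changed.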
To avoid FP16 scanout of the framebuffer and avoid quantization artifacts, a
compositor will need to de-linearize the buffer and pack it in a 10-bpc RGB
format for HDR content, or 8-bpc for SDR content.
If two planes are blended in the display controller, a compositor might want to
use the drm_plane degamma to linearize each plane's buffer after scanout, before
they are blended.
The compositor will need to program the drm_crtc's gamma LUT to de-linearize the
content if a linear FP16 buffer is scanned out or a drm_plane degamma LUT is
programmed to linearize the buffer before blending.
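
A sketch of the de-linearization step on the drm_crtc is shown below, assuming
an HDR10 (PQ-encoded) output. It fills the upstream GAMMA_LUT blob with the
SMPTE ST 2084 inverse EOTF; lut_size would come from the CRTC's GAMMA_LUT_SIZE
property and gamma_lut_prop from a property lookup, and how the hardware maps
linear input values to LUT entries is hardware-specific, so treat this as an
outline rather than a drop-in implementation.

.. code-block:: c

   #include <math.h>
   #include <stdint.h>
   #include <stdlib.h>
   #include <xf86drm.h>
   #include <xf86drmMode.h>

   /* SMPTE ST 2084 (PQ) inverse EOTF, taking absolute luminance in nits. */
   static double pq_inv_eotf(double nits)
   {
       const double m1 = 2610.0 / 16384.0, m2 = 2523.0 / 4096.0 * 128.0;
       const double c1 = 3424.0 / 4096.0, c2 = 2413.0 / 4096.0 * 32.0;
       const double c3 = 2392.0 / 4096.0 * 32.0;
       double y = pow(nits / 10000.0, m1);

       return pow((c1 + c2 * y) / (1.0 + c3 * y), m2);
   }

   /* Load a PQ curve into the CRTC GAMMA_LUT so linear blended pixels are
    * de-linearized before transmission to the display. */
   int add_pq_gamma_lut(int fd, uint32_t crtc_id, uint32_t gamma_lut_prop,
                        uint32_t lut_size, double max_nits, drmModeAtomicReq *req)
   {
       struct drm_color_lut *lut = calloc(lut_size, sizeof(*lut));
       uint32_t blob_id = 0;
       int ret;

       if (!lut)
           return -1;

       for (uint32_t i = 0; i < lut_size; i++) {
           double nits = max_nits * i / (lut_size - 1.0);
           uint16_t v = (uint16_t)(pq_inv_eotf(nits) * 0xffff + 0.5);

           lut[i].red = lut[i].green = lut[i].blue = v;
       }

       ret = drmModeCreatePropertyBlob(fd, lut, lut_size * sizeof(*lut), &blob_id);
       free(lut);
       if (ret)
           return ret;

       return drmModeAtomicAddProperty(req, crtc_id, gamma_lut_prop, blob_id) < 0 ? -1 : 0;
   }
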
Display Pipeline Color-processing Blocks
----------------------------------------
The power impact of enabling color-processing blocks in the display pipeline is
negligible for AMD HW. It is usually better to use DRM/KMS color properties if
it means lower GPU or memory bandwidth utilization.
This might differ on other display HW.
Video Playback
--------------
The primary power-saving goal for the video playback use-case is being able to
quiet the GPU while video playback is happening, i.e. we avoid using the GPU for
composing the video buffer and the rest of the desktop, including video overlays
such as subtitles or logos.
The secondary power-saving goal for the video playback use-case is reducing the
display scanout bandwidth as much as possible.
We can achieve this by using a DRM/KMS plane to directly scan out the video
buffer after it is decoded, and using the display engine's blending HW to
blend it with any overlay or the rest of the desktop. We will call these the
video and desktop planes, respectively.
We will assume below that the compositor wants to blend the planes in linear
space, in the display's color space, with tone-mapping applied pre-blending.
Other scenarios can be imagined and can be adapted from the outline below.
The desktop plane can be provided as an FP16 buffer, or an ARGB2101010 buffer.
An FP16 buffer provides fine-grained alpha precision and doesn't require the
compositor to de-linearize the buffer after performing GPU composition on the
scene graph that makes up the rest-of-desktop buffer (assuming the compositor
uses FP16 for GPU composition). It does come at a cost since the display engine
scanout uses 64 bpp of memory bandwidth with each scanout instead of the 32 bpp
for ARGB2101010 or ARGB8888.
An ARGB2101010 buffer has very little precision in the alpha channel and
requires the compositor to encode the buffer using a non-linear transfer
function to avoid quantization artifacts in the dark areas of the image. It
makes up for that by reducing the memory bandwidth by half, to 32 bits per
pixel, as opposed to the FP16 format, and will have a positive impact on overall
power consumption. If the desktop plane is presented using a non-linear
ARGB2101010 buffer the compositor would need to provide a degamma piece-wise
linear LUT (PWL) or transfer function (TF) on the drm_plane to perform the
inverse of the non-linear TF used to encode the buffer, i.e. to linearize the
buffer again.
An HDR10 video is encoded as a P010 buffer, accompanied by HDR metadata,
including the transfer function used to encode the content, the color space
(primaries and white point), and the mastering luminance. We want to use this
information to convert the video to a linear encoding for blending, map to the
display's color space, and perform tone mapping if desired. In order to quiet
the GPU we have to pass the raw buffer as a drm_framebuffer to the video plane and
use relevant DRM/KMS properties to transform the content into the blending
space.
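
As a partial sketch, the upstream COLOR_ENCODING and COLOR_RANGE plane
properties tell the display controller how to convert the P010 YCbCr samples to
RGB; the linearization, color space conversion and tone-mapping steps described
next would use the additional per-plane properties. The prop_id() and
enum_value() helpers below are hypothetical name-to-ID lookups.

.. code-block:: c

   #include <stdint.h>
   #include <xf86drm.h>
   #include <xf86drmMode.h>

   /* Hypothetical helpers that resolve a property ID and an enum value by
    * name via drmModeObjectGetProperties()/drmModeGetProperty(). */
   uint32_t prop_id(int fd, uint32_t obj_id, uint32_t obj_type, const char *name);
   uint64_t enum_value(int fd, uint32_t obj_id, uint32_t obj_type,
                       const char *prop, const char *value);

   /* Describe the YCbCr encoding of an HDR10 P010 video plane so the
    * display engine performs the YCbCr-to-RGB conversion during scanout. */
   int setup_hdr10_video_plane(int fd, uint32_t plane_id, drmModeAtomicReq *req)
   {
       int ret = 0;

       ret |= drmModeAtomicAddProperty(req, plane_id,
           prop_id(fd, plane_id, DRM_MODE_OBJECT_PLANE, "COLOR_ENCODING"),
           enum_value(fd, plane_id, DRM_MODE_OBJECT_PLANE,
                      "COLOR_ENCODING", "ITU-R BT.2020 YCbCr")) < 0;
       ret |= drmModeAtomicAddProperty(req, plane_id,
           prop_id(fd, plane_id, DRM_MODE_OBJECT_PLANE, "COLOR_RANGE"),
           enum_value(fd, plane_id, DRM_MODE_OBJECT_PLANE,
                      "COLOR_RANGE", "YCbCr limited range")) < 0;

       return ret ? -1 : 0;
   }
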
Linearization (for blending in linear space) will happen via the degamma PWL or
TF properties of the drm_plane. De-linearization (for transmission to a display)
will happen via the gamma LUT property of the drm_crtc.
Rudimentary color space conversion can be done by using the CTM property of the
drm_plane. This will clip color values if the input space is larger than the
output space and will lead to wrong hues in the highlights in this scenario. A
better way to do color space conversion is through the use of a 3D LUT.
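
The CTM blob carries a 3x3 matrix of sign-magnitude S31.32 fixed-point values
(struct drm_color_ctm). A sketch of packing a row-major matrix of doubles, such
as a BT.2020-to-display-primaries conversion, into such a blob might look like
this; attaching the resulting blob ID to a per-plane CTM property follows the
RFC referenced below, while upstream today the property exists on the drm_crtc.

.. code-block:: c

   #include <stdint.h>
   #include <xf86drm.h>
   #include <xf86drmMode.h>

   /* Convert a double to the sign-magnitude S31.32 fixed-point format used
    * by struct drm_color_ctm: sign in bit 63, integer part in bits 62:32,
    * fraction in bits 31:0. */
   static uint64_t ctm_fixed(double v)
   {
       uint64_t sign = 0;

       if (v < 0.0) {
           sign = 1ULL << 63;
           v = -v;
       }
       return sign | (uint64_t)(v * (double)(1ULL << 32));
   }

   /* Pack a row-major 3x3 color transformation matrix into a DRM property
    * blob that can then be attached to a CTM property. */
   int create_ctm_blob(int fd, const double m[9], uint32_t *blob_id)
   {
       struct drm_color_ctm ctm;

       for (int i = 0; i < 9; i++)
           ctm.matrix[i] = ctm_fixed(m[i]);

       return drmModeCreatePropertyBlob(fd, &ctm, sizeof(ctm), blob_id);
   }
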
The gamma PWL on the drm_plane can be used for simple tone-mapping of the
content. Note that tone- and gamut-mapping via CTM and 1D curves is rudimentary
at best and will often lead to undesirable results. A 3D LUT will provide the
ability to tone- and gamut-map content at a high quality, but as of now we don't
yet have a DRM/KMS API proposal for 3D LUT functionality. We should consider
designing and prototyping such an API.
Gaming
------
The ideal scenario to handle games, whether with HDR or without HDR support, is
via direct scanout. This works well unless the game runs in a window or we have
an overlay displayed on top of the game.
For the gaming use-case quieting the GPU is not a major consideration. We still
want to avoid GPU composition to allow the game to use as much of the GPU as
possible, and to reduce latency between the game render and display of the
frame.
Games with HDR support will render FP16 frames internally and tone-map their
content as desired. In order to reduce latency and GPU usage we should let the
game present its buffer as FP16 and feed that directly to the display engine
as a DRM/KMS plane. Most likely we won't need to do any further tone-mapping on
the DRM/KMS plane as games tend to do that internally. We will likely need to
evaluate this experimentally once FP16 support for games with direct scanout is
enabled.
A rest-of-desktop, or overlay plane can be presented as described above in the
video section via another FP16 plane.
If power is a concern for gaming it might be worthwhile to avoid FP16 for the
desktop plane, or both game and desktop planes, in order to reduce the memory
bandwidth of the display scanout. Whether FP16 or XRGB2101010 scanout is more
efficient for HDR games will need to be evaluated.
3D LUT
------
High-quality tone- and gamut-mapping requires the use of a 3D LUT.
TBD discuss and define a 3DLUT DRM/KMS interface
Scaling and alpha values
------------------------
TBD discuss scaling and pre-multiplied alpha
References
----------
- RFC for DRM/KMS API for per-plane PWLs and CTM https://patchwork.freedesktop.org/series/90826/
- RFC for IGT tests for per-plane PWLs and CTM https://patchwork.freedesktop.org/series/96895/