Allow scanning out fullscreen surfaces on secondary GPUs
This series implements a big enhancement for multi-GPU setups: allowing discrete GPUs (dGPU) to directly scan out fullscreen applications running on the dGPU. This helps realize some of the motivation behind linux dmabuf v4, and requires the application to support dynamic modifier renegotiation when the compositor advertises scanout feedback for its surface.
The target use case is to avoid overhead on dual-GPU laptops, the most common being Nvidia Optimus laptops ("Optimus" is just a product name for laptops with a discrete Nvidia card). Although I developed this series on a laptop with an Nvidia discrete GPU, nothing in this MR is Nvidia-specific, and it should apply equally to non-Nvidia dual-GPU laptops. If anyone has a non-Nvidia dual-GPU system, testing would be much appreciated, as I do not own one.
In the case of dual-GPU laptops there are a couple of scenarios this impacts. On most laptops the dGPU does not drive the integrated display, but does drive external displays through the HDMI port on the side or back of the laptop. Plugging in an external display and fullscreening an application on it is the case this MR helps. In the future we are also looking at "internal display muxing" for laptops with appropriate hardware (the product name is, confusingly, "Nvidia Advanced Optimus"). In that case, fullscreening a dGPU application on the integrated display could flip the display mux to the dGPU and perform direct scanout. This MR is a prerequisite for advanced features such as display muxing.
Sway pull request: https://github.com/swaywm/sway/pull/7509
Thanks to everyone who has answered questions along the way and given design feedback!
Flow of dynamic surface reallocation
This concept is not original to these changes; it comes from extensive discussion in the MR introducing v4 of the linux_dmabuf protocol. At a high level, surface promotion on fullscreen works like this:
- App enters fullscreen
  - wlroots sends updated dmabuf feedback to this surface, including the scanout modifiers for the output.
    - This happens in `wlr_linux_dmabuf_feedback_v1_init_with_options`, called from sway's `set_fullscreen`, specifying a `scanout_primary_output`.
  - App sees new dmabuf feedback and reallocates its buffers, choosing a scanout-capable modifier if possible.
  - wlroots gets the new buffer the App shared.
  - wlroots sees that the format/mod are included in the list of possible scanout mods for this output and performs direct scanout.
- App leaves fullscreen
  - wlroots sends updated dmabuf feedback to this surface, scanout mods are no longer present
  - wlroots may have to redraw a frame using a scanout buffer incompatible with the primary device that the App has already committed.
    - wlroots requests this buffer's `wlr_texture_set` for a texture compatible with the primary `wlr_renderer`
    - the texture set blits the buffer to a new buffer with linear layout, and imports that on the primary
    - rendered composition can now happen
    - If the App tries to commit a scanout buffer again wlroots will return an error through linux_dmabuf.
  - App reallocates buffers, no scanout mods this time.
  - wlroots imports the newly committed buffer and continues with rendered composition.
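The promotion decision in the flow above comes down to a membership test: is the committed buffer's format/modifier pair in the set advertised for scanout on this output? The following is a minimal sketch of that check, not the wlroots implementation. `struct format_modifier`, `struct scanout_feedback`, `feedback_contains`, and `choose_present_path` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; these are not wlroots structs. */
struct format_modifier {
	uint32_t format;   /* DRM fourcc code */
	uint64_t modifier; /* DRM format modifier */
};

struct scanout_feedback {
	const struct format_modifier *entries; /* pairs advertised for scanout */
	size_t len;
};

enum present_path {
	PRESENT_DIRECT_SCANOUT,
	PRESENT_COMPOSITE,
};

/* Returns true if the pair appears in the advertised scanout list. */
bool feedback_contains(const struct scanout_feedback *fb,
		uint32_t format, uint64_t modifier) {
	for (size_t i = 0; i < fb->len; i++) {
		if (fb->entries[i].format == format &&
				fb->entries[i].modifier == modifier) {
			return true;
		}
	}
	return false;
}

/* Decide how a newly committed buffer would be presented. Anything that is
 * not fullscreen, or whose pair was not advertised, is composited. */
enum present_path choose_present_path(const struct scanout_feedback *fb,
		bool fullscreen, uint32_t format, uint64_t modifier) {
	if (fullscreen && fb != NULL &&
			feedback_contains(fb, format, modifier)) {
		return PRESENT_DIRECT_SCANOUT;
	}
	return PRESENT_COMPOSITE;
}
```

In this sketch, a buffer committed after the app leaves fullscreen always takes the composition path, which mirrors the fallback described in the second half of the list.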
Wlroots-specific Design
There are two problems that any compositor has to solve before it can try to scan out surfaces on the dGPU: accessing a graphics context for each GPU and fetching a texture resident on a particular GPU. Both are integral to properly handling promotion and the fallback to rendered composition after scanout.
wlr_multi_gpu
In the first case, we need a graphics context (for wlroots this is a `wlr_renderer`) for each GPU so that we can perform cross-GPU copies. Currently wlroots does create a renderer for each GPU, but they are not all easily accessible from any location in the codebase. `wlr_multi_gpu` is an object that exposes all renderers so they can be reached from places such as `wlr_linux_dmabuf_v1`.
This is done by each `wlr_renderer` having a `wl_list` link corresponding to the list in `wlr_multi_gpu`. The `wlr_multi_gpu` object lives in `wlr_multi_backend`.
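To illustrate the intrusive-list arrangement described here, the sketch below embeds a list link in each renderer and walks the list from a shared head. It is only an approximation under stated assumptions: `struct list_link` stands in for `wl_list` so the example compiles without libwayland, and `struct sketch_renderer`, `struct sketch_multi_gpu`, the `gpu_id` field, and all function names are hypothetical rather than the types added by this MR.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal intrusive list link, standing in for wl_list. */
struct list_link {
	struct list_link *prev, *next;
};

/* Hypothetical stand-in for wlr_renderer. */
struct sketch_renderer {
	int gpu_id;            /* hypothetical identifier for the GPU */
	struct list_link link; /* membership in sketch_multi_gpu.renderers */
};

/* Hypothetical stand-in for wlr_multi_gpu. */
struct sketch_multi_gpu {
	struct list_link renderers; /* list head */
};

/* An empty list is a head that points at itself. */
void multi_gpu_init(struct sketch_multi_gpu *multi) {
	multi->renderers.prev = &multi->renderers;
	multi->renderers.next = &multi->renderers;
}

/* Append a renderer to the end of the list. */
void multi_gpu_add_renderer(struct sketch_multi_gpu *multi,
		struct sketch_renderer *renderer) {
	struct list_link *head = &multi->renderers;
	renderer->link.prev = head->prev;
	renderer->link.next = head;
	head->prev->next = &renderer->link;
	head->prev = &renderer->link;
}

/* Recover the renderer that embeds a given link. */
struct sketch_renderer *renderer_from_link(struct list_link *l) {
	return (struct sketch_renderer *)((char *)l -
		offsetof(struct sketch_renderer, link));
}

/* Walk every known renderer and return the one for gpu_id, or NULL. */
struct sketch_renderer *multi_gpu_find_renderer(
		struct sketch_multi_gpu *multi, int gpu_id) {
	struct list_link *l;
	for (l = multi->renderers.next; l != &multi->renderers; l = l->next) {
		struct sketch_renderer *r = renderer_from_link(l);
		if (r->gpu_id == gpu_id) {
			return r;
		}
	}
	return NULL;
}
```

Because the link lives inside each renderer, any code holding the list head can reach every renderer without a separate allocation per entry, which is the property the `wl_list` approach relies on.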
wlr_texture_set
The `wlr_texture_set` object does all of the heavy lifting for this feature: given a buffer, it provides a way to query a `wlr_texture` that is compatible with a specific `wlr_renderer`. This allows the rest of wlroots to not have to understand the multi-GPU and fullscreen promotion situation and instead simply call `wlr_texture_set_get_tex_for_renderer` to get a valid texture. When a texture set is first created, `wlr_texture_set` finds the renderer that can directly import the buffer; no copies are done until a texture on a non-resident renderer is requested.
`wlr_texture_set` handles cross-GPU copies internally by creating a new linear buffer and blitting the original into it. This linear buffer can then be imported into the requested GPU. The results are cached in an array of renderer/texture "pairs", which record which GPUs have had a texture requested. The "native" pair is the pair where we were able to directly import the buffer into the renderer (i.e. the buffer is resident on that GPU).
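The lazy, cached behavior of the pair array can be sketched as follows. This is a simplified model under stated assumptions, not the real `wlr_texture_set`: all struct layouts, `MAX_GPUS`, the integer renderer ids, and the `copies_done` counter are hypothetical, and the actual blit and import are replaced by a placeholder that just marks the texture as a copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_GPUS 4 /* hypothetical upper bound for this sketch */

/* Hypothetical stand-ins; the real wlr_texture_set layout may differ. */
struct sketch_texture {
	int renderer_id; /* renderer the texture lives on */
	bool is_copy;    /* true if produced by a cross-GPU linear copy */
};

struct sketch_pair {
	int renderer_id;
	bool has_texture;
	struct sketch_texture texture;
};

struct sketch_texture_set {
	struct sketch_pair pairs[MAX_GPUS];
	size_t pairs_len;
	size_t native_pair; /* index of the pair that imported the buffer directly */
	int copies_done;    /* counts cross-GPU copies, for illustration only */
};

/* Set up one pair per renderer. Only the native pair gets a texture up
 * front. Returns false if the native renderer is not in the list. */
bool texture_set_init(struct sketch_texture_set *set,
		const int *renderer_ids, size_t n, int native_renderer_id) {
	bool found_native = false;
	set->pairs_len = n < MAX_GPUS ? n : MAX_GPUS;
	set->native_pair = 0;
	set->copies_done = 0;
	for (size_t i = 0; i < set->pairs_len; i++) {
		struct sketch_pair *pair = &set->pairs[i];
		pair->renderer_id = renderer_ids[i];
		pair->has_texture = false;
		if (renderer_ids[i] == native_renderer_id && !found_native) {
			pair->texture =
				(struct sketch_texture){ native_renderer_id, false };
			pair->has_texture = true;
			set->native_pair = i;
			found_native = true;
		}
	}
	return found_native;
}

/* Return the cached texture for a renderer, creating it on first request. */
struct sketch_texture *texture_set_get_for_renderer(
		struct sketch_texture_set *set, int renderer_id) {
	for (size_t i = 0; i < set->pairs_len; i++) {
		struct sketch_pair *pair = &set->pairs[i];
		if (pair->renderer_id != renderer_id) {
			continue;
		}
		if (!pair->has_texture) {
			/* Placeholder for: blit the native buffer into a linear
			 * buffer, then import that buffer on this renderer. */
			pair->texture = (struct sketch_texture){ renderer_id, true };
			pair->has_texture = true;
			set->copies_done++;
		}
		return &pair->texture;
	}
	return NULL; /* renderer not known to this set */
}
```

In this model, requesting the native renderer's texture never triggers a copy, and a second request for a non-native renderer reuses the texture produced by the first.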
Testing
As mentioned earlier, all testing was done on a laptop with an AMD integrated GPU and an Nvidia discrete GPU. If anyone has additional testing feedback, please let me know.
- `weston-simple-dmabuf-feedback` to check that the proper scanout tranches are advertised
- Manually verified:
  - entering/leaving fullscreen on secondary GPU does use direct scanout
  - entering fullscreen on primary GPU falls back to composition
  - entering fullscreen with a buffer that is allocated on the secondary GPU but cannot be scanned out results in a fallback to composition
  - some common desktop usage, such as playing around in firefox
Caveats
Both of the following issues should be taken care of in the coming months but I wanted to go ahead and post this MR to get feedback and avoid holding these changes up. The first issue in particular is just waiting for a future release, and I plan on looking at implementing the Xwayland fix next.
- If running on Nvidia Optimus laptops as I am, there are some Nvidia internal driver changes coming in the future 535 driver release that are needed for scanout promotion to work, most importantly linux_dmabuf v4 support.
- For Xwayland clients to take full advantage of this, Issue 1380 needs to be resolved.
Breaking Changes
There is only one change to the public API which affects compositor developers: `wlr_client_buffer` now has a reference to a `wlr_texture_set` instead of a `wlr_texture`. This isn't terribly disruptive: if a compositor directly uses `wlr_client_buffer->texture`, it should instead use `wlr_client_buffer->texture_set`. Below is an example from the sway PR of querying a `wlr_texture` from the texture set of a client buffer:
    struct wlr_texture *texture = wlr_texture_set_get_tex_for_renderer(
        saved_buf->buffer->texture_set, wlr_output->renderer);
    if (!texture) {
        continue;
    }