Skip to content

ply-device-manager: De-activate and close renderer on device hot unplug

Hans de Goede requested to merge jwrdegoede/plymouth:amdgpu-fix into master

When a device gets hot unplugged, we should de-activate and close the renderer before freeing it.

Speficially this fixes a problem where plymouth's display "freezes" when used with AMD GPUs which can be handled by both the radeon or the amdgpu kms driver and the user has added kernel commandline options to prefer the amdgpu driver. In this case the following happens:

  1. radeon driver gets loaded, creates /dev/dri/card0 generating a udev "add /dev/dri/card0" event.
  2. radeon driver's "load" method checks if it should bind, comes to the conclusion it should not, causing the probe of the device to be aborted and /dev/dri/card0 to be unregistered generating a udev "remove /dev/dri/card0" event
  3. amdgpu driver loads, creates /dev/dri/card0 generating another udev "add /dev/dri/card0" event.
  4. plymouth loads while udev is still catching up with kernel events, plymouth sees the first "add" event opens /dev/dri/card0 (which is already managed by amdgpu at this time) finds outputs, shows splash
  5. plymouth gets the "remove" events tears down the renderer, but since deactivate and close were not called before this commit, plymouth keeps the master rights attached to the old-fd which is leaked at this point
  6. plymouth gets the second "add" event brings everything back up, but it cannot get master rights, since there can only be 1 master and the leaked fd is the master. This leads to lots of: "Couldn't set scan out buffer" errors and plymouth appears frozen because of this, this is esp. bad with the disk encryption unlock screen
  7. When gdm starts, it deactivates plymouth, but does not yet tell it to quit to avoid flickering. The deactivate makes plymouth drop its master rights, but this happens on the new broken fd which never got the master rights. This also makes it impossible for gdm to get master rights, also breaking gdm, with the following error in the logs: "(EE) wl_drm@4: error 0: authenicate failed"

Note in theory calling ply_renderer_close() on teardown (from ply_device_manager_free()) should be fine too, but for some reason this causes the framebuffer to be all messed up (looks like wrong pitch / tiling) when this is done after gdm has taken over the framebuffer.

Where as just exiting without calling drmModeRmFB, letting the kernel clean up behind us, does not trigger this. This feels like it is a kernel bug bug and I will file a bug for this. But for now we should not call ply_renderer_close() on teardown to avoid this issue (which would be a regression).

BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1490490 Signed-off-by: Hans de Goede hdegoede@redhat.com

Merge request reports