The devices weren't removed from dev_list.
Instead of just fixing the issue by adding:
if (*node) *node = dev->next;
after the while loop, use this opportunity to use a clearer control flow.
Fixes: 7275ef8e ("amdgpu: add amdgpu_device_initialize2")
Pierre-Eric Pelloux-Prayer (c7c3c14b) at 27 Mar 08:03
amdgpu: fix deinit logic
Tested-by: Mike Lothian <mike@fireburn.co.uk>
Reviewed-by: Michel Dänzer <mdaenzer@redhat.com>
See !354 (merged)
The devices weren't removed from dev_list.
Instead of just fixing the issue by adding:
if (*node) *node = dev->next;
after the while loop, use this opportunity to use a clearer control flow.
Fixes: 7275ef8e ("amdgpu: add amdgpu_device_initialize2")
Not sure why I can't reproduce this. Which command line / app are you using to trigger it?
Also, I think I'll change the code to become:
/* Remove dev from dev_list, if it was added there. */
for (amdgpu_device_handle node = dev_list, prev = NULL; node; prev = node, node = node->next) {
if (node == dev) {
if (prev)
prev->next = node->next;
else
dev_list = node->next;
break;
}
}
It looks clearer... WDYT?
I'll test this tonight, thanks
Thanks for the fix!
I starred a long time at the code without seeing this obvious mistake.. I'll open a MR.
E.g. this fixes it, and fixes similar symptoms as reported above by Mike for me:
diff --git a/amdgpu/amdgpu_device.c b/amdgpu/amdgpu_device.c
index 97dbfb44e..2d4beedd3 100644
--- a/amdgpu/amdgpu_device.c
+++ b/amdgpu/amdgpu_device.c
@@ -100,6 +100,8 @@ static void amdgpu_device_free_internal(amdgpu_device_handle dev)
/* Remove dev from dev_list, if it was added there. */
while (*node != dev && *node && (*node)->next)
node = &(*node)->next;
+ if (*node)
+ *node = dev->next;
close(dev->fd);
if ((dev->flink_fd >= 0) && (dev->fd != dev->flink_fd))
This no longer actually unlinks dev
from the list, does it?
Reverting that one commit gets things working
7275ef8eba7248fbad7fee079d25eff8716124ee is the first bad commit
commit 7275ef8eba7248fbad7fee079d25eff8716124ee
Author: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Date: Fri Feb 17 14:41:48 2023 +0100
amdgpu: add amdgpu_device_initialize2
Allows to opt-out from the device deduplication logic. This is not the
recommended way of using dev handles, but it's necessary for native context:
in this situation one process (eg: Qemu) will init many devices and we
want independent devices to make sure guest applications are isolated from
each other.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
amdgpu/amdgpu-symbols.txt | 1 +
amdgpu/amdgpu.h | 14 ++++++++++++++
amdgpu/amdgpu_device.c | 45 +++++++++++++++++++++++++++++++++------------
3 files changed, 48 insertions(+), 12 deletions(-)
This is strange; I'll take a look. Thanks for letting us know.
And lots of these in my journalctl:
Mar 25 16:36:01 axion.fireburn.co.uk plasmashell[1179]: _amdgpu_device_initialize: amdgpu_get_auth (2) failed (-1)
Mar 25 16:36:05 axion.fireburn.co.uk plasmashell[1179]: _amdgpu_device_initialize: amdgpu_get_auth (2) failed (-1)
Mar 25 16:36:50 axion.fireburn.co.uk kwin_wayland_wrapper[581]: _amdgpu_device_initialize: amdgpu_get_auth (2) failed (-1)
Mar 25 16:36:50 axion.fireburn.co.uk kwin_wayland_wrapper[581]: amdgpu: amdgpu_device_initialize failed.
Mar 25 16:36:51 axion.fireburn.co.uk plasmashell[1464]: _amdgpu_device_initialize: amdgpu_get_auth (2) failed (-1)
Mar 25 16:36:51 axion.fireburn.co.uk plasmashell[1464]: amdgpu: amdgpu_device_initialize failed.
Mar 25 16:36:51 axion.fireburn.co.uk plasmashell[1662]: _amdgpu_device_initialize: amdgpu_get_auth (2) failed (-1)
Mar 25 16:36:51 axion.fireburn.co.uk plasmashell[1662]: amdgpu: amdgpu_device_initialize failed.
Mar 25 16:36:51 axion.fireburn.co.uk plasmashell[1730]: _amdgpu_device_initialize: amdgpu_get_auth (2) failed (-1)
Mar 25 16:36:51 axion.fireburn.co.uk plasmashell[1730]: amdgpu: amdgpu_device_initialize failed.
Mar 25 16:36:51 axion.fireburn.co.uk plasmashell[1777]: _amdgpu_device_initialize: amdgpu_get_auth (2) failed (-1)
Mar 25 16:36:51 axion.fireburn.co.uk plasmashell[1777]: amdgpu: amdgpu_device_initialize failed.
Hi, I'm seeing some strange behaviour on my machine since this landed (confirmed by reverting)
#0 0x00007f5d1b9f5e95 n/a (libdrm_amdgpu.so.1 + 0x3e95)
#1 0x00007f5d1b9f9814 amdgpu_device_deinitialize (libdrm_amdgpu.so.1 + 0x7814)
#2 0x00007f5d19fc6aa2 n/a (radeonsi_dri.so + 0x17c6aa2)
#3 0x00007f5d19bf6598 n/a (radeonsi_dri.so + 0x13f6598)
#4 0x00007f5d18f34b0e n/a (radeonsi_dri.so + 0x734b0e)
#5 0x00007f5d200bf3bd n/a (libEGL_mesa.so.0 + 0x2f3bd)
#6 0x00007f5d200c035d n/a (libEGL_mesa.so.0 + 0x3035d)
#7 0x00007f5d200ab592 n/a (libEGL_mesa.so.0 + 0x1b592)
#8 0x00007f5d24a41b3b _ZN15QtWaylandClient34QWaylandEglClientBufferIntegrationD1Ev (libQt6WaylandEglClientHwIntegration.so.6 + 0x9b3b)
#9 0x00007f5d24a41b59 _ZN15QtWaylandClient34QWaylandEglClientBufferIntegrationD0Ev (libQt6WaylandEglClientHwIntegration.so.6 + 0x9b59)
#10 0x00007f5d239bd738 _ZN15QtWaylandClient19QWaylandIntegrationD2Ev (libQt6WaylandClient.so.6 + 0xb6738)
#11 0x00007f5d239bd759 _ZN15QtWaylandClient19QWaylandIntegrationD0Ev (libQt6WaylandClient.so.6 + 0xb6759)
#12 0x00007f5d2d310f90 _ZN22QGuiApplicationPrivateD2Ev (libQt6Gui.so.6 + 0x310f90)
#13 0x00007f5d320ecde9 _ZN19QApplicationPrivateD0Ev (libQt6Widgets.so.6 + 0x2ecde9)
#14 0x00007f5d2cd5c9cc _ZN7QObjectD2Ev (libQt6Core.so.6 + 0x35c9cc)
#15 0x00007f5d2d30b18a _ZN15QGuiApplicationD1Ev (libQt6Gui.so.6 + 0x30b18a)
#16 0x00007f5d320eeb95 _ZN12QApplicationD2Ev (libQt6Widgets.so.6 + 0x2eeb95)
#17 0x00005621824b1048 n/a (yuzu + 0xc41048)
#18 0x00007f5d2c456f46 n/a (libc.so.6 + 0x25f46)
#19 0x00007f5d2c456ff9 __libc_start_main (libc.so.6 + 0x25ff9)
#20 0x00005621822e9c25 n/a (yuzu + 0xa79c25)
I should probably add that this is a PRIME machine
Matt Turner (c45ffb1e) at 25 Mar 15:50
symbols-check: Add _fbss, _fdata, _ftext
... and 1 more commit
Pierre-Eric Pelloux-Prayer (7275ef8e) at 25 Mar 09:35
amdgpu: add amdgpu_device_initialize2
... and 3 more commits