BUG: Null pointer dereferencing when unbinding

changed the description

Can you clarify exactly what you are doing? The system comes up and amdgpu loads on both GPUs. Then you are unbinding amdgpu from one of the GPUs so you can use it for passthrough. Does the issue happen at unbinding time or are you trying to bind the driver back to the device again later and that's where the issue happens?

I am going to pick up this issue because I have the exact same problem. System loads 2 gpus, the 7700s gets unbinded by virt-manager/virsh nodedev-detach and binds to vfio-pci. GPU successfully gets binded to the VM and the dmesg errors happen upon shutdown of the vm and the release to the host. In my case, i need a hard reboot to fix and this issue happens like 2/3 times.

@tacsist i see something close on my setup (6800XT) when unbinding while my display manager (gnome / wayland is running)

If I follow the following steps I avoid the crash

Stop GDM / Display manager
Unbind GPU from amdgpu via sysfs
Start VM that binds the 6800Xt to vfio-pci (libvirt) * VM starts and runs as expected
Stop VM and rebind to amdgpu

I see that the following logs are related to the crash when the unbind happens

[ 253.194496] amdgpu 0000:03:00.0: amdgpu: failed to clear page tables on GEM object close (-19) [ 253.194504] amdgpu 0000:03:00.0: amdgpu: leaking bo va (-19) [ 253.194539] amdgpu 0000:03:00.0: amdgpu: failed to clear page tables on GEM object close (-19) [ 253.194547] amdgpu 0000:03:00.0: amdgpu: leaking bo va (-19) [ 253.194579] amdgpu 0000:03:00.0: amdgpu: failed to clear page tables on GEM object close (-19) [ 253.194585] amdgpu 0000:03:00.0: amdgpu: leaking bo va (-19) [ 253.194605] amdgpu 0000:03:00.0: amdgpu: failed to clear page tables on GEM object close (-19) [ 253.194607] amdgpu 0000:03:00.0: amdgpu: leaking bo va (-19)

This is on 6.13.3

@agd5f in my case seems the issue is starts at the unbind, with

[ 253.194496] amdgpu 0000:03:00.0: amdgpu: failed to clear page tables on GEM object close (-19) [ 253.194504] amdgpu 0000:03:00.0: amdgpu: leaking bo va (-19)

iv attached my kernel log for a similar issue

kernel.log

@bigbeeshane Thank you for chipping in. I do as well get the same logs about bo va leaks. I will see if I can verify that this happens even with successful VM boot and display or your errors occur with failed attempts only. This could be 2 separate bugs, especially because I have noticed GDM not allowing dGPU to enter d3cold if the laptop boots with a display attached to it's output (issue doesn't happen afterwards on re/connections of the display).

if you have an iGPU as well, you can try not having monitors connected to the actual GPU on boot and seeing if you still encounter bo va leaks

@tacsist yeah I have a 7950X so have a iGPU available.

I'll give that run this evening, but to be clear the steps you are suggesting are

Start with a display only connected to the iGPU.
Connect display to dGPU after desktop manager has fully started.
check if bo va leaks happen on unbind

@tacsist unfortunately here even with no display attached my dGPU / 6800Xt cannot enter d3cold. See screenshot attached

amdgpu: leaking bo va

Still happens for me when unbinding with no display attached to the dGPU after boot (or one attached after boot)

@bigbeeshane thank you, that's about right. I just sat down to test it myself. From my experience, GDM somewhat uses the monitors.xml (/var/lib/gdm/.config/monitors.xml + maybe just ~/.config/monitors.xml) to choose how to display the login screen. My assumption is that in your case it has defaulted to expect the monitor connected to the dGPU and because of that initializes it as primary. See if you get same behavior with d3cold with GDM disabled. But there are like a 100 other things that could keep your dGPU hot, I have spend too much time on weeding out all the problematic applications/configs.

@tacsist as an aside i tested tested the same process with hyprland window manager and that could rebind sometimes (it worked 2 times out of 8) vs gdm/gnome where it never works.

After digging through the rebind logs it seems like when we see the va/bo leaks amdgpu doesn't clean up everything, then on rebind of the same pci-id we see conflicts and eventually a crash

Aug 13 12:14:19.245991 Telperion kernel: [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x7480 0xF111:0x0007 0xC1). Aug 13 12:14:19.246012 Telperion kernel: [drm] register mmio base: 0x90D00000 Aug 13 12:14:19.246031 Telperion kernel: [drm] register mmio size: 1048576 Aug 13 12:14:19.246056 Telperion kernel: sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/ip_discovery'

from the original posts logs

@bigbeeshane yes, just finished my own testing. The leak errors are probably unrelated to both this bug and GDM. What I have found out is that the leaks happen when the GPU gets unbinded while a compositor is using it to display. Gnome/mutter handles this gracefully (but leaking) while hyprland/aquamarine crashes/exits to gdm and also leaks. No matter what test case i did, all of them ended with the main errors in this issue. What I also noticed that windows doesn't even need to load his own drivers. If libvirt unbinds and and immediately rebinds the GPU back to host (failed start, like missing a device req), the main error still occurs.

@tacsist unfortunately here even with no display attached my dGPU / 6800Xt cannot enter d3cold. See screenshot attached

also your GPU might just not support d3cold and uses d3hot instead, so you could be good on that front.

@tacsist it may be worth trying to ignore the error here https://elixir.bootlin.com/linux/v6.13.4/source/drivers/gpu/drm/amd/amdgpu/amdgpu_preempt_mgr.c#L100

it fails crating that sysfs entry as well, and that halts the init of the card in amdgpu_ttm_init

maybe on error try append to the target file (example mem_info_preempt_used becomes mem_info_preempt_used_1) just to see if there is anything further down the line thats not cleaned up.

I will try this weekend but wont get chance to beforehand

@agd5f it's a little late obviously (sorry ) but I'll answer in case it still matters.

The system comes up and amdgpu loads on both GPUs. Then you are unbinding amdgpu from one of the GPUs so you can use it for passthrough. Does the issue happen at unbinding time or are you trying to bind the driver back to the device again later and that's where the issue happens?

Yes and Yes (to the first two sentences). I'm pretty sure this crash happened either when amdgpu was unbound from the dGPU or when vfio was bound to it. I can't be sure which event triggered the crash because the rebinding was being done by virt-manager (or one of the tools it uses) but it seems more likely that unbinding was the trigger. In case it's not clear, I'm not trying to do anything with the iGPU.

FWIW I've never been able to get un/binding to work in a stable manner. Most of the time I'm able to unbind amdgpu, bind vfio, and play games in Windows using the GPU. But I'm never able to shut down correctly. An actual kernel bug/oops is rare; most of the time everything is fine until I try to shut down and then the kernel gets into some weird state where I have to use magic sysrq keys to recover. I didn't make a bug report for that because "my system gets weird" didn't seem like a valuable report and since it hangs during shutdown (when all the systemd service shutdown messages are scrolling the tty) I can't think of any way to debug it short of attaching a hardware debugger (which my laptop probably doesn't support). My solution is to have two bootloader entries, one of which boots normally (and binds amdgpu to both GPUs) and another that uses pciestub to block that so I can use my Windows VM and pcie passthrough without the kernel throwing a fit.

BTW I'm using SDDM and KDE.

[...] In my case, i need a hard reboot to fix and this issue happens like 2/3 times.

@tacsist That sounds exactly like what happens to me most of the time, except I don't think I've ever been able to shutdown without a hard reboot after starting the VM.

if you have an iGPU as well, you can try not having monitors connected to the actual GPU on boot and seeing if you still encounter bo va leaks

Is this something I can do with my laptop? I rarely use external monitors so the only monitor that's connected is the internal one. I'm using a custom configured kernel, compiled-in command line, and fully custom initramfs so I have a lot of control over the boot process.

@ethan.reesor Nice to see you back, haha

That sounds exactly like what happens to me most of the time, except I don't think I've ever been able to shutdown without a hard reboot after starting the VM.

I am also only using virt-manager with no extra configuration/scripts. The difficulty to shut down seems to me to stem from v-m and/or libvirtd. For me, upon guest shutdown, the GUI hard hangs while showing the guest is at "Shutting Down" and never recovers. During host shutdown, the system tries really hard to terminate the connection between libvirtd and journalctl but never succeeds. It seems that libvirtd can't recover from the failed GPU rebind to host and hangs both itself and virt-manager and journalctl. Recently, I cannot correctly shutdown 100% of times, so either I had some rare config or my mind is playing tricks on me.

Is this something I can do with my laptop? I rarely use external monitors so the only monitor that's connected is the internal one. I'm using a custom configured kernel, compiled-in command line, and fully custom initramfs so I have a lot of control over the boot process.

Honestly: no clue. I do not do anything special, the only note is that I do not have early KMS enabled, so amdgpu doesn't load in within initramfs. If I don't have a display connected to the dGPU until I am in a compositor, I need to manually wake up the dGPU (with GTK vulkan bug or just opening and app that uses the dGPU) for the compositor to start using the dGPU (hyprland and I am 70% sure same behavior happens with gnome too). This could also happen with SDDM, I do not know how it decides on what gpus/displays to use.

I'm pretty sure this crash happened either when amdgpu was unbound from the dGPU or when vfio was bound to it.

That is interesting, because for me the amdgpu unbinds from host without a problem and 0 errors, but I get the exact same errors as you only when amdgpu binds back to host. I'll get a full dmesg log in a bit.

For me, upon guest shutdown, the GUI hard hangs while showing the guest is at "Shutting Down" and never recovers.

I'm questioning my memory now. It's been a while since I tried this, I have a desktop that I dual boot (though I'd like to switch to virtualized windows) so I only use Windows on my laptop when I'm not at home and I want to play games, which is not often.

I also recall the GUI (as in the entire desktop environment) hard hanging when I try to shut down the guest. systemctl restart sddm kind of helps but the system doesn't seem healthy until a reboot. And now I'm thinking I remember saying to my friend, "The only time it works is if I shutdown Linux while the guest is still running." So now I'm thinking I was misremembering when I said Linux shutdown always hangs.

That is interesting, because for me the amdgpu unbinds from host without a problem and 0 errors, but I get the exact same errors as you only when amdgpu binds back to host. I'll get a full dmesg log in a bit.

I'll edit my statement: I'm pretty sure this crash happened when I shutdown the guest. My previous statement about whether the crash happened during binding/unbinding was ultimately conjecture. What I'm confident of is that the crash occurred as a result of shutting down the guest. I don't recall if I shutdown via virt-manager or via the guest (via Window's shutdown button).

I see now, I should try shutting down host while guest is up.

I don't recall if I shutdown via virt-manager or via the guest (via Window's shutdown button).

For me this make 0 difference, both end up exactly with same problem.

@agd5f do you know of any intended scenario where these sysfs resources are held through a unbind -> bind cycle of a pcie dGPU ?

I think the problem here is related to object lifetimes in the kernel. E.g., if the driver has shared a buffer with some other process that still holds a reference to the buffer, it can't be freed until that reference is dropped.

Is it feasible to maintain a ‘dead buffer’ list? If the shared buffer is in system memory, could the driver simply just keep it in a list of “some process has a reference to this buffer but it’s dead so we can’t do anything with it”? If the buffer is in GPU memory maybe that’s not possible.

@agd5f from some more testing it does seem the core issue is that when we see the buffer leaks amdgpu doesn't clean up the rest of the device ip blocks (for example at least 2 sysfs entries are not cleaned up)

This causes the rebind to fail.

Although the leaks are not good, I don't think they should block a rebind of the GPU

BUG: Null pointer dereferencing when unbinding

Brief summary of the problem:

Hardware description:

System information:

How to reproduce the issue:

Attached files:

Log files (for system lockups / game freezes / crashes)

Designs

Child items ...

Activity

Admin message

Admin message

BUG: Null pointer dereferencing when unbinding

Brief summary of the problem:

Hardware description:

System information:

How to reproduce the issue:

Attached files:

Log files (for system lockups / game freezes / crashes)

Activity