Sometimes when plugging in a second display (by means of a hardware DisplayPort switch), the system hits a kernel NULL pointer dereference and locks up. The problem appears to have started with kernel 6.4.
Type of Display Connection: HDMI plugged in, DP plugging in
System information:
Distro name and Version: Debian Sid
Kernel version: 6.4.7 (first seen with 6.4.3, earlier versions not tested)
How to reproduce the issue:
Not always reproducible. The crash is always preceded by the message "amdgpu 0000:0d:00.0: amdgpu: failed to get a new IB (-512)" (see log below for more context), but this error also sometimes appears when plugging in a display without being followed by the NULL pointer dereference.
Log files (for system lockups / game freezes / crashes)
amdgpu_bug.log (log from 6.4.3 but problem still exists on 6.4.7; 6.4.8 does not appear likely to contain a fix)
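To check how often the "failed to get a new IB" message is actually followed by the crash, a saved dmesg can be scanned with something like the sketch below. The function name `correlate_ib_errors` and the `window` parameter are made up for illustration, and the choice of how many lines count as "followed by" is an assumption, not something taken from the attached log.

```python
import re

# Matches the amdgpu message quoted in this report and captures the error code.
IB_ERROR = re.compile(r"amdgpu: failed to get a new IB \((-?\d+)\)")
# Start of the kernel oops line for a NULL pointer dereference.
NULL_DEREF = "BUG: kernel NULL pointer dereference"


def correlate_ib_errors(lines, window=50):
    """Return one (error_code, crashed) tuple per 'failed to get a new IB' line.

    `lines` must be a list of log lines. `crashed` is True when a NULL pointer
    dereference shows up within the next `window` lines (an arbitrary cutoff).
    """
    results = []
    for i, line in enumerate(lines):
        match = IB_ERROR.search(line)
        if match is None:
            continue
        following = lines[i + 1 : i + 1 + window]
        crashed = any(NULL_DEREF in later for later in following)
        results.append((int(match.group(1)), crashed))
    return results
```

For example, `correlate_ib_errors(open("amdgpu_bug.log").read().splitlines())` would list every IB allocation failure in the attached log together with whether a crash followed it.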
Do you have some high memory pressure when this is happening? Since you say it's "new" to 6.4, is it possible that when you previously used 6.3.y that you didn't happen to hotplug the display under a high memory pressure circumstance?
Yes, there is definitely memory pressure, as I have two Firefox profiles open at all times. I have reverted to 6.3.x as of writing this report and have yet to observe the failed to get a new IB error since, with or without the crash, so I am relatively confident the issue is new to 6.4.
For the record I've been swapping this display in and out quite aggressively on kernel 6.3 yesterday while running another 3D application in the background on top of the two Firefox profiles to create even more pressure, and I don't have a single failed to get a new IB error in my dmesg. I'm very certain the problem was introduced by 6.4 now.
Here are the faddr2line outputs for the relevant calls:
./scripts/faddr2line ./debian/build/build_amd64_none_amd64/drivers/gpu/drm/amd/amdgpu/amdgpu.ko sdma_v3_0_vm_write_pte+0x26/0xb0
sdma_v3_0_vm_write_pte+0x26/0xb0:
sdma_v3_0_vm_write_pte at debian/build/build_amd64_none_amd64/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c:963

./scripts/faddr2line ./debian/build/build_amd64_none_amd64/drivers/gpu/drm/amd/amdgpu/amdgpu.ko amdgpu_vm_pde_update+0x42/0x100
amdgpu_vm_pde_update+0x42/0x100:
amdgpu_vm_pde_update at debian/build/build_amd64_none_amd64/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c:757
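The arguments passed to faddr2line above have the form function+offset/size, as printed in the kernel call trace. As a rough illustration of how to read them, the sketch below splits such a string into its parts. `SymbolOffset` and `parse_symbol_offset` are hypothetical helper names, not part of the kernel scripts.

```python
import re
from typing import NamedTuple


class SymbolOffset(NamedTuple):
    function: str  # symbol name from the call trace
    offset: int    # byte offset of the address inside the function
    size: int      # total size of the function in bytes


_SPEC = re.compile(
    r"^(?P<func>[A-Za-z_][\w.]*)\+(?P<off>0x[0-9a-fA-F]+)/(?P<size>0x[0-9a-fA-F]+)$"
)


def parse_symbol_offset(spec):
    """Parse a 'func+0xOFF/0xSIZE' string as it appears in a kernel call trace."""
    match = _SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"not a func+offset/size string: {spec!r}")
    offset = int(match["off"], 16)
    size = int(match["size"], 16)
    if offset >= size:
        # An address past the end of the function would not belong to it.
        raise ValueError(f"offset {offset:#x} is outside function size {size:#x}")
    return SymbolOffset(match["func"], offset, size)
```

With the first trace entry, `parse_symbol_offset("sdma_v3_0_vm_write_pte+0x26/0xb0")` gives an offset of 0x26 bytes into a function that is 0xb0 bytes long.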
I'm not on that computer at the moment, but I have run into this as well. I don't think it's memory related. I have a 64 GB memory box that I use for vfio/ML and I was doing neither.
I have mostly seen this when failing to wake up from DPMS. It's never happened while I was at the computer. I have noticed something weird that might be related: when coming back from DPMS, sometimes my windows have moved a bit. Maybe a hotplug event is triggered when my monitor turns off, or something like that?
Actually, reading Anders's message, this has happened to me as well at least once. Both displays were plugged in and turned off by my window manager; when I moved the mouse, there was no response and I found the crash via SSH from another machine. My 6.4.7 log above may actually be from this incident but I can't remember anymore.
@SimonPilkington could you paste the exact code at the two line numbers below?
sdma_v3_0_vm_write_pte at debian/build/build_amd64_none_amd64/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c:963
amdgpu_vm_pde_update at debian/build/build_amd64_none_amd64/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c:757
Yeah, that explanation and the fix look reasonable to me as well.
We should probably make the suballocator interruptible at some point, but for now just changing the parameter to get the old behavior should be enough.
Well, it looks like the patch never made it to the mailing list (at least I can't find it).
Please shorten the subject line (something like "don't wait for IBs interruptible"), add a commit message describing which commit this fixes and why, add your Signed-off-by tag, and then either re-submit to the mailing list or attach it again to this report here.
I'll pick it up from either location then.
Hi, sorry, I didn't get time to look into this issue; I was completely busy with other tasks. I went through the code and thought there should be some problem with drm_suballoc_new(), since it failed with error code -512. I was looking for a Polaris card to set up the debugging environment. @SimonPilkington Thanks for debugging this issue.
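For anyone reading the log later: -512 is not one of the usual userspace errno values. Inside the kernel, 512 is ERESTARTSYS, which is what an interruptible wait returns when a signal is pending, and that fits the discussion about interruptible waits above. The sketch below decodes such codes. The small table is reproduced from memory of include/linux/errno.h, so treat it as an assumption and check it against your kernel tree. `describe_kernel_error` is a made-up helper name.

```python
import errno

# Kernel-internal error codes that never reach userspace
# (values as recalled from include/linux/errno.h; verify against your tree).
KERNEL_INTERNAL_ERRNOS = {
    512: "ERESTARTSYS",
    513: "ERESTARTNOINTR",
    514: "ERESTARTNOHAND",
    515: "ENOIOCTLCMD",
    516: "ERESTART_RESTARTBLOCK",
}


def describe_kernel_error(code):
    """Map a (usually negative) kernel return code to a symbolic name."""
    value = abs(code)
    # Prefer the kernel-internal table, then fall back to the standard errno names.
    name = KERNEL_INTERNAL_ERRNOS.get(value) or errno.errorcode.get(value)
    return name if name else f"unknown ({code})"
```

Calling `describe_kernel_error(-512)` returns "ERESTARTSYS", the code reported by the failed IB allocation in this issue.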