kernel NULL pointer dereference when plugging in display

added TTM label

Is the system under high memory pressure when this happens? Since you say it's "new" to 6.4, is it possible that you simply never happened to hotplug the display under high memory pressure while you were on 6.3.y?

Yes, there is definitely memory pressure, as I have two Firefox profiles open at all times. I have reverted to 6.3.x as of writing this report and have not observed the "failed to get a new IB" error since, with or without the crash, so I am relatively confident the issue is new to 6.4.

@arunpravin24 can you take a look?

For the record, I swapped this display in and out quite aggressively on kernel 6.3 yesterday while running another 3D application in the background, on top of the two Firefox profiles, to create even more pressure, and I don't have a single "failed to get a new IB" error in my dmesg. I'm now very certain the problem was introduced by 6.4.

@superm1 sure, I will check.

@SimonPilkington Could you please run the line below and let me know the line number it reports, so we can locate the NULL pointer dereference in your source tree?

./scripts/faddr2line drivers/gpu/drm/amd/amdgpu/amdgpu.ko sdma_v3_0_vm_write_pte+0x26/0xb0

While doing this I noticed that the crash in the 6.4.7 log is different, although it is still a page fault, so I am attaching that log as well.

amdgpu_bug_647.log

Here are the faddr2line outputs for the relevant calls:

./scripts/faddr2line ./debian/build/build_amd64_none_amd64/drivers/gpu/drm/amd/amdgpu/amdgpu.ko sdma_v3_0_vm_write_pte+0x26/0xb0
sdma_v3_0_vm_write_pte+0x26/0xb0:
sdma_v3_0_vm_write_pte at debian/build/build_amd64_none_amd64/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c:963

./scripts/faddr2line ./debian/build/build_amd64_none_amd64/drivers/gpu/drm/amd/amdgpu/amdgpu.ko amdgpu_vm_pde_update+0x42/0x100
amdgpu_vm_pde_update+0x42/0x100:
amdgpu_vm_pde_update at debian/build/build_amd64_none_amd64/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c:757

added hang/freeze label

I'm not on that computer at the moment, but I have run into this as well. I don't think it's memory related. I have a box with 64 GB of memory that I use for vfio/ML, and I was doing neither.

I have mostly seen this when the machine fails to wake up from DPMS. It has never happened while I was at the computer. I have noticed something weird that might be related: when coming back from DPMS, sometimes my windows have moved a bit. Maybe a hotplug event is triggered when my monitor turns off?

Actually, reading Anders's message, this has happened to me as well at least once. Both displays were plugged in and turned off by my window manager; when I moved the mouse, there was no response and I found the crash via SSH from another machine. My 6.4.7 log above may actually be from this incident but I can't remember anymore.

@SimonPilkington Could you copy the exact code at the two line numbers below?

sdma_v3_0_vm_write_pte at debian/build/build_amd64_none_amd64/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c:963
amdgpu_vm_pde_update at debian/build/build_amd64_none_amd64/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c:757

It's very unlikely that this is patched at all by Debian. I think it should point to this:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c?h=v6.4.7#n963
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c?h=v6.4.7#n757

Confirming, Debian does nothing special here.

Any updates here? Kernel 6.5 is still bad (produces failed to get a new IB errors when the display configuration changes).

I performed a bisect. This is the bad commit at least for the failed to get a new IB errors: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.4.y&id=c103a23f2f297c6ab2e5e74e39b655439f3524a6

I don't yet know if backing it out also fixes the NULL pointer dereference errors but it reverts cleanly from 6.5 so I will be testing it.

@superm1 @arunpravin24

The -512 error is this one: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/dma-buf/dma-fence.c?h=linux-6.4.y&id=c103a23f2f297c6ab2e5e74e39b655439f3524a6#n899

The call to dma_fence_wait_any_timeout() in amdgpu_sa_bo_new() (now drm_suballoc_new()) was changed when the suballocator was extracted from amdgpu:

                        spin_unlock(&sa_manager->wq.lock);
-                       t = dma_fence_wait_any_timeout(fences, count, false,
+                       t = dma_fence_wait_any_timeout(fences, count, intr,
                                                       MAX_SCHEDULE_TIMEOUT,
                                                       NULL);

And the call to drm_suballoc_new() in the new amdgpu_sa_bo_new() is now made with intr set to true:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c?h=linux-6.4.y&id=c103a23f2f297c6ab2e5e74e39b655439f3524a6#n84

But amdgpu_ib_get() is clearly not prepared to deal with this. I see no indication that this change was intentional.
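To illustrate the mechanism described above, here is a minimal userspace sketch in C, not the kernel code itself. The names `model_fence_wait()`, `model_suballoc_new()` and `MODEL_ERESTARTSYS` are hypothetical stand-ins, and the model assumes the wait has to sleep with a signal already pending for the calling task. The value 512 matches the -512 seen in the logs.

```c
#include <stdbool.h>

/* Assumption: mirrors the kernel-internal ERESTARTSYS value (512), which is
 * not exposed through the userspace <errno.h>. */
#define MODEL_ERESTARTSYS 512

/* Hypothetical stand-in for dma_fence_wait_any_timeout(): returns the
 * remaining timeout (> 0) on success, or -MODEL_ERESTARTSYS when the wait is
 * interruptible and a signal is pending. */
long model_fence_wait(bool intr, bool signal_pending, long timeout)
{
	if (intr && signal_pending)
		return -MODEL_ERESTARTSYS;
	return timeout;
}

/* Hypothetical stand-in for the suballocator call: returns 0 on success, or
 * the negative error code produced by the wait. */
int model_suballoc_new(bool intr, bool signal_pending)
{
	long t = model_fence_wait(intr, signal_pending, 100);

	return t < 0 ? (int)t : 0;
}
```

In this model, `model_suballoc_new(true, true)` returns -512, while `model_suballoc_new(false, true)` returns 0, which corresponds to the behaviour before the suballocator was extracted.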

In any case, changing intr back to false makes the errors go away. Here's a patch: 0001-drm-amd-Revert-unintentional-change-to-interruptible.patch

@ckoenig

This looks like the right fix. Can you generate a proper git patch with your signed-off-by?

Yeah, that explanation and the fix look reasonable to me as well.

We should probably make the suballocator interruptible at some point, but for now just changing the parameter to get the old behavior should be enough.

~~If this solution is good for now then here's a proper patch (I hope, I haven't submitted kernel patches before).~~

Can you send it to the mailing list please? https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Sorry, I don't know how to do this properly. I haven't used mailing lists before.

I tried setting up git send-email. Can you confirm it worked @superm1?

I just found out about checkpatch.pl and it doesn't like my patch. Should I resubmit?

Well, it looks like the patch never made it to the mailing list (at least I can't find it).

Please shorten the subject line (something like "don't wait for IBs interruptible"), add a commit message describing which commit this fixes and why, add your Signed-off-by tag, and then either re-submit to the mailing list or attach it again to this report.

I'll pick it up from either location then.

I tried sending again, can you see it now? Do I need to be subscribed to the list to be able to post to it?

I don't see the email in the archives, and I don't know why, because I did receive the CC. In the interest of saving time, please pick up the patch from here.

0001-drm-amd-Make-fence-wait-in-suballocator-uninterrupti.patch

Reviewed and pushed to drm-misc-fixes. I added a CC stable tag so that Greg should pick it up for 6.4.x in the next few days.

Thanks for the help.

Thanks for handling the patch. I believe this resolves the issue so closing.

FWIW, that patch also fixes a similar problem with Starfield, namely hundreds of

amdgpu: failed to get a new IB (-512)
amdgpu: failed to clear page tables on GEM object close (-512)

on exit. (It also speeds up exiting that game significantly.)

(Tested on 6.5.2.)

Hi, sorry, I didn't get time to look into this issue; I was completely busy with other tasks. I went through the code and thought there must be some problem with drm_suballoc_new(), since it failed with error code -512. I was looking for a Polaris card to set up a debugging environment. @SimonPilkington Thanks for debugging this issue.

No problem, this has been educational for me.

closed

mentioned in issue #2769 (closed)

mentioned in commit mwa/kernel@e2884fe8

mentioned in issue #2362 (closed)

mentioned in commit nouveau@ecccfc53

mentioned in commit agd5f/linux@8dc01531
