amdgpu no-retry page fault under 5.19

Can you bisect?

Yeah sure, I'll try! It might take a long time since it takes about 3 hours for the machine to lock up, so it will be hard to know if a commit is good.
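Since a commit can only be judged after hours of uptime, it may help to check saved kernel logs mechanically rather than by eye. Below is a minimal sketch, assuming the fault header contains the text `no-retry page fault` in the format of the logs posted in this issue. The function name `find_faults` and the sample lines are illustrative only and are not part of any existing tool.

```python
import re

# Matches the fault header line as it appears in the logs in this issue.
# Group 1 is the hub (gfxhub0, mmhub0, ...), group 2 the process name.
FAULT_RE = re.compile(
    r"\[(gfxhub\d+|mmhub\d+)\] no-retry page fault \(.*?for process (\S+)"
)


def find_faults(lines):
    """Return a (hub, process) tuple for every fault header found in lines."""
    hits = []
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            hits.append((match.group(1), match.group(2)))
    return hits


# Sample lines in the format seen in this issue (hostnames shortened).
SAMPLE = [
    "kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub0] no-retry page fault "
    "(src_id:0 ring:40 vmid:4 pasid:32783, for process firefox pid 62950 "
    "thread firefox:cs0 pid 63047)",
    "kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "but soft recovered",
    "[mmhub0] no-retry page fault (src_id:0 ring:40 vmid:5 pasid:32785, "
    "for process teams pid 3818 thread teams:cs0 pid 3909)",
]

for hub, process in find_faults(SAMPLE):
    print(f"fault on {hub} in process {process}")
```

An empty result after a long enough run is only weak evidence that a commit is good, since the fault can take many hours to appear.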

Identical issue on a 4500U. Xorg on my system just died with a similar error:

ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:40 vmid:4 pasid:32769, for process Xorg >
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000800103628000 from IH client 0x1b (UTCL2)
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00441051
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x1
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x1
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:40 vmid:4 pasid:32769, for process Xorg >
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000800103628000 from IH client 0x1b (UTCL2)
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CB (0x0)
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:40 vmid:4 pasid:32769, for process Xorg >
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000800103628000 from IH client 0x1b (UTCL2)
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CB (0x0)
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:40 vmid:4 pasid:32769, for process Xorg >
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000800103628000 from IH client 0x1b (UTCL2)
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CB (0x0)
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
ago 12 17:04:43 *** kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
ago 12 17:04:53 *** kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
ago 12 17:04:53 *** kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

Switching terminals just gave me a cursor on the screen, but no terminal in sight. Only way out was a reboot.

This also happens on my Vega 56, for what it's worth, but seemingly at random. This time it took over 12 hours to occur, and I had been gaming for several hours within that period (though not at the time of the crash).

Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:40 vmid:4 pasid:32783, for process firefox pid 62950 thread firefox:cs0 pid 63047)
Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x000080010a0cb000 from IH client 0x1b (UTCL2)
Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00441051
Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x1
Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x1
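The individual fields the driver prints can be recovered from the raw `VM_L2_PROTECTION_FAULT_STATUS` value, which makes it easier to compare reports. The sketch below is a hedged illustration: the bit layout is inferred from the values printed in the logs in this issue (for example, 0x00441051 decoding to client ID 0x8, PERMISSION_FAULTS 0x5, RW 0x1, vmid 4), so treat it as an assumption for this GPU generation rather than a hardware reference. Field widths that always read as zero here cannot be confirmed from the logs alone. The name `decode_fault_status` is hypothetical.

```python
# (shift, width) per field. Layout is an ASSUMPTION inferred from the
# decoded values the driver printed next to the raw register in this issue.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),       # printed as "Faulty UTCL2 client ID"
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status):
    """Split a raw status word into named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


for raw in (0x00441051, 0x00000000):
    fields = decode_fault_status(raw)
    print(f"0x{raw:08x}: " + ", ".join(f"{k}=0x{v:x}" for k, v in fields.items()))
```

Under this layout, the repeated entries with a status of 0x00000000 decode to all-zero fields, which matches the "CB (0x0)" follow-up entries in the first log.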

journalK.log

Hey,

I am slowly but surely bisecting this. As mentioned before, without a reproducible test case it's pretty tedious waiting long enough to be sure a commit is good!

However, during an attempt to shut down the machine after a lockup, I noticed this in the journal:

Sep 03 22:29:06 w7700 kernel: INFO: task signal-des:sh8:87996 blocked for more than 120 seconds.
Sep 03 22:29:06 w7700 kernel:       Not tainted 5.18.0-rc5-custom #1
Sep 03 22:29:06 w7700 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 03 22:29:06 w7700 kernel: task:signal-des:sh8  state:D stack:    0 pid:87996 ppid: 87945 flags:0x00004002
Sep 03 22:29:06 w7700 kernel: Call Trace:
Sep 03 22:29:06 w7700 kernel:  <TASK>
Sep 03 22:29:06 w7700 kernel:  __schedule+0x30a/0x13f0
Sep 03 22:29:06 w7700 kernel:  ? __slab_free+0xbf/0x310
Sep 03 22:29:06 w7700 kernel:  schedule+0x58/0xc0
Sep 03 22:29:06 w7700 kernel:  schedule_timeout+0x115/0x150
Sep 03 22:29:06 w7700 kernel:  ? amdgpu_sync_get_fence+0x68/0x100 [amdgpu]
Sep 03 22:29:06 w7700 kernel:  ? preempt_count_add+0x7c/0xc0
Sep 03 22:29:06 w7700 kernel:  dma_fence_default_wait+0x177/0x200
Sep 03 22:29:06 w7700 kernel:  ? dma_fence_free+0x30/0x30
Sep 03 22:29:06 w7700 kernel:  dma_fence_wait_timeout+0xe5/0x110
Sep 03 22:29:06 w7700 kernel:  drm_sched_entity_fini+0xf9/0x270 [gpu_sched]
Sep 03 22:29:06 w7700 kernel:  amdgpu_ctx_mgr_entity_fini+0xc6/0x1c0 [amdgpu]
Sep 03 22:29:06 w7700 kernel:  amdgpu_ctx_mgr_fini+0x32/0xc0 [amdgpu]
Sep 03 22:29:06 w7700 kernel:  amdgpu_driver_postclose_kms+0x1d3/0x2c0 [amdgpu]
Sep 03 22:29:06 w7700 kernel:  drm_file_free.part.0+0x1da/0x230 [drm]
Sep 03 22:29:06 w7700 kernel:  drm_close_helper.isra.0+0x65/0x70 [drm]
Sep 03 22:29:06 w7700 kernel:  drm_release+0x6a/0x110 [drm]
Sep 03 22:29:06 w7700 kernel:  __fput+0x9f/0x260
Sep 03 22:29:06 w7700 kernel:  ____fput+0xe/0x20
Sep 03 22:29:06 w7700 kernel:  task_work_run+0x64/0xa0
Sep 03 22:29:06 w7700 kernel:  do_exit+0x33b/0xab0
Sep 03 22:29:06 w7700 kernel:  do_group_exit+0x35/0xa0
Sep 03 22:29:06 w7700 kernel:  get_signal+0x99c/0x9c0
Sep 03 22:29:06 w7700 kernel:  arch_do_signal_or_restart+0x37/0x7a0
Sep 03 22:29:06 w7700 kernel:  ? do_futex+0x12f/0x1d0
Sep 03 22:29:06 w7700 kernel:  exit_to_user_mode_prepare+0xf2/0x1a0
Sep 03 22:29:06 w7700 kernel:  syscall_exit_to_user_mode+0x26/0x50
Sep 03 22:29:06 w7700 kernel:  do_syscall_64+0x69/0x80
Sep 03 22:29:06 w7700 kernel:  ? do_syscall_64+0x69/0x80
Sep 03 22:29:06 w7700 kernel:  ? do_syscall_64+0x69/0x80
Sep 03 22:29:06 w7700 kernel:  ? exc_page_fault+0x87/0x180
Sep 03 22:29:06 w7700 kernel:  ? asm_exc_page_fault+0x8/0x30
Sep 03 22:29:06 w7700 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
Sep 03 22:29:06 w7700 kernel: RIP: 0033:0x7fbd5d166ad3
Sep 03 22:29:06 w7700 kernel: RSP: 002b:00007fbd3e073850 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
Sep 03 22:29:06 w7700 kernel: RAX: fffffffffffffe00 RBX: 00003fb8002b6ac8 RCX: 00007fbd5d166ad3
Sep 03 22:29:06 w7700 kernel: RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00003fb8002b6af4
Sep 03 22:29:06 w7700 kernel: RBP: 00003fb8002b6aec R08: 0000000000000001 R09: 0000000000000000
Sep 03 22:29:06 w7700 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00003fb8002b6af4
Sep 03 22:29:06 w7700 kernel: R13: 0000000000000000 R14: 00003fb8002b6aa0 R15: 00000000000000e7
Sep 03 22:29:06 w7700 kernel:  </TASK>

I expect it's probably not useful, but just in case, here it is.

Only about 10 more commits to test!

Bruce

Nearly there... I'm down to these three commits to test:

bffa91dadf59 (refs/bisect/bad) drm/amdkfd: start using tlb_seq from the VM subsystem
5255e146c99a (HEAD) drm/amdgpu: rework TLB flushing
e997b82745a5 drm/amdgpu: simplify VM update tracking a bit

Bruce

Hello. I'm a Vega 56 user and I've had these no-retry page faults as well. The issue first occurred around the time I jumped from kernel 5.18.7 (released 25 Jun) to 5.18.14 (released 23 Jul) ~~or after the update of linux-firmware from 20220708 to 20220815 (released 15 Aug).~~

~~The file that changed between the two firmware versions is /lib/firmware/amdgpu/vega10_asd.bin~~

I am cautiously optimistic that I might have been able to work around it by downgrading said firmware (5 days of continuous uptime so far, while the bug usually triggered before day 3). As of now, I can't be 100% sure of success, however, as I haven't found any means to deliberately and reliably trigger the bug. So far, I've seen it happen with firefox, steamwebhelper, mpv and vlc.

This is how the bug usually presented itself in dmesg.

@bwduncan thanks for combing through this!

Update (17 Sept): I had the no-retry page faults again after 5.5 days. The downgrade of linux-firmware doesn't seem to make a difference after all. I had lots(!) of YouTube tabs open with their videos paused, then opened another one, pressed play and boom, the screen hung (switching to a tty still worked, though), with the same error messages as before in dmesg.

It seems like every time this occurred for me, applications using hardware-accelerated video decoding were involved... Maybe too many handles on the decoder, or is it still VM-related?

5255e146c99a677d4d55fdb988544bd20c539a0b is the first bad commit
commit 5255e146c99a677d4d55fdb988544bd20c539a0b
Author: Christian König <christian.koenig@amd.com>
Date:   Tue Mar 15 15:27:45 2022 +0100

    drm/amdgpu: rework TLB flushing
    
    Instead of tracking the VM updates through the dependencies just use a
    sequence counter for page table updates which indicates the need to
    flush the TLB.
    
    This reduces the need to flush the TLB drastically.
    
    v2: squash in NULL check fix (Christian)
    
    Signed-off-by: Christian König <christian.koenig@amd.com>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c   |  8 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c  |  6 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 20 -----------
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h |  2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c   | 57 ++++++++++++++++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h   | 15 +++++++++
 6 files changed, 76 insertions(+), 32 deletions(-)
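To make the commit message concrete, here is a toy model of the idea it describes: a sequence counter is bumped on page table updates, and a flush is issued only when the counter has moved since the last flush. This is a hedged sketch of the general technique, not the driver's code. The class and method names (`ToyVM`, `ToySlot`, `update_page_tables`, `submit`) are hypothetical.

```python
class ToyVM:
    """Toy address space with a counter of updates that need a TLB flush."""

    def __init__(self):
        self.tlb_seq = 0

    def update_page_tables(self, needs_flush=True):
        # Only updates classified as needing a flush bump the counter.
        if needs_flush:
            self.tlb_seq += 1


class ToySlot:
    """Toy hardware slot that remembers the counter value it last flushed at."""

    def __init__(self):
        self.flushed_seq = 0

    def submit(self, vm):
        """Return True if a flush is issued before this submission."""
        if self.flushed_seq != vm.tlb_seq:
            self.flushed_seq = vm.tlb_seq
            return True
        return False


vm, slot = ToyVM(), ToySlot()
vm.update_page_tables()
vm.update_page_tables()
print(slot.submit(vm))  # two updates are covered by a single flush
print(slot.submit(vm))  # nothing changed since, so no flush
vm.update_page_tables(needs_flush=False)
print(slot.submit(vm))  # this update is invisible to the counter
```

The last call shows where such a scheme is fragile in principle: any update that changes translations but is classified as not needing a flush leaves stale entries in place. Whether that is what happens in this bug is not established here.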

Trying to revert only this commit on the latest mainline produces a bunch of conflicts which I am not capable of resolving. Is it obvious to the authors whether this commit is responsible for the page faults and whether it is fixable?

Thanks! Bruce

Thank you so much for doing this, Bruce!

@ckoenig any ideas?

Unfortunately not really. This must be some corner case we missed.

It might be worth pointing out that I still don't have a reliable test case for this. A Teams call will often reproduce it after about ten minutes, but it's possible that I didn't leave my laptop running for enough days to be sure that a commit was really good. If there is some other debugging we could do I'd be happy to try that.

In particular, if you could create a patch that lets me revert that commit, I would have some confidence after a few weeks that this really is the cause.

Thanks,

Bruce

That won't be easy, since a lot of later changes depend on that commit.

What we could maybe do is to add the old tracking in parallel and then print a warning whenever the old behavior would have flushed the TLB while the new one doesn't.

Going to give that a try, but it might take me a day or two.
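The debugging approach described above, running the old decision next to the new one and warning on disagreement, can be sketched generically. This is only an illustration of the technique under stated assumptions: the event fields (`pt_changed`, `seq_bumped`), the event kinds and the two policy functions are all hypothetical stand-ins and do not reflect how either tracking scheme is implemented in the driver.

```python
def missed_flushes(events, old_would_flush, new_would_flush):
    """Return indices of events where the old policy flushes but the new one does not."""
    missed = []
    for index, event in enumerate(events):
        if old_would_flush(event) and not new_would_flush(event):
            print(f"warning: event {index} ({event['kind']}): "
                  "old tracking would flush, new tracking does not")
            missed.append(index)
    return missed


# Hypothetical event stream; the fields are invented for this sketch.
events = [
    {"kind": "map", "pt_changed": True, "seq_bumped": True},
    {"kind": "unmap", "pt_changed": True, "seq_bumped": True},
    {"kind": "move", "pt_changed": True, "seq_bumped": False},
    {"kind": "idle", "pt_changed": False, "seq_bumped": False},
]

result = missed_flushes(
    events,
    old_would_flush=lambda e: e["pt_changed"],
    new_would_flush=lambda e: e["seq_bumped"],
)
```

Only disagreements in one direction are reported, because a flush that the new scheme adds on top of the old one is harmless for correctness.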

As a quick test, can you try this change and see if it makes a difference:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 59cac347baa3..0cab7ac93140 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -972,7 +972,7 @@ int amdgpu_vm_bo_update(struct amdgpu_device *adev, struct amdgpu_bo_va *bo_va,
        dma_addr_t *pages_addr = NULL;
        struct ttm_resource *mem;
        struct dma_fence **last_update;
-       bool flush_tlb = clear;
+       bool flush_tlb = true;
        struct dma_resv *resv;
        uint64_t vram_base;
        uint64_t flags;

Kernel compiled and running. Will let you know how it goes next week!

Sadly no...

[mmhub0] no-retry page fault (src_id:0 ring:40 vmid:5 pasid:32785, for process teams pid 3818 thread teams:cs0 pid 3909)
  in page starting at address 0x0000800108420000 from IH client 0x12 (VMC)
VM_L2_PROTECTION_FAULT_STATUS:0x00540050
         Faulty UTCL2 client ID: MP1 (0x0)
         MORE_FAULTS: 0x0
         WALKER_ERROR: 0x0
         PERMISSION_FAULTS: 0x5
         MAPPING_ERROR: 0x0
         RW: 0x1

Mhm, that narrows down the problem quite a bit. Give me a moment to finish my other testing patch.

Here is another patch to try. If this doesn't work I'm pretty much running out of ideas and we really need to do the big patch to hunt down this bug.

0001-drm-amdgpu-use-a-cb-to-inc-tlb-seq-on-PDE-updates-as.patch

Thanks! This patch doesn't apply to the latest mainline or to v5.19.10. Can you tell me which branch to apply it to?

:) bduncan@w7700:~/tmp/mainline$ git apply ~/Downloads/0001-drm-amdgpu-use-a-cb-to-inc-tlb-seq-on-PDE-updates-as.patch
error: patch failed: drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:702
error: drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c: patch does not apply
:( 1 bduncan@w7700:~/tmp/mainline$ git describe
v6.0-rc6-45-gdc164f4fb00a

Here is the same patch rebased on top of v5.19.10.

0001-drm-amdgpu-use-a-cb-to-inc-tlb-seq-on-PDE-updates-as.patch

Hi Christian,

It compiled and booted on the latest mainline 6.0-rc6, but the result was the same:

[mmhub0] no-retry page fault (src_id:0 ring:40 vmid:1 pasid:32782, for process teams pid 4055 thread teams:cs0 pid 4135)
  in page starting at address 0x000080010de20000 from IH client 0x12 (VMC)
VM_L2_PROTECTION_FAULT_STATUS:0x00140050
         Faulty UTCL2 client ID: MP1 (0x0)
         MORE_FAULTS: 0x0
         WALKER_ERROR: 0x0
         PERMISSION_FAULTS: 0x5
         MAPPING_ERROR: 0x0
         RW: 0x1

Bruce
