RX 7900 XTX crashes
Brief summary of the problem:
When running GPU-intensive applications, the GPU hangs and then resets. In my case the load comes from PyTorch on ROCm, which uses most of the GPU's resources. You might be inclined to blame ROCm, especially since it refuses to do anything after the reset, but I honestly think the fault lies in amdgpu and is somehow triggered by heavy ROCm use. It might also be related to Xwayland, but even in that case I'd still consider it a driver issue, since no program should be able to crash the driver or the card.
I don't know of any other programs that use the GPU in a way that might trigger the issue. If you have any ideas, or want any specific logs, I will provide them as soon as the next crash happens.
Hardware description:
OS: Manjaro Linux x86_64
Kernel: 6.7.0-rc4-1-amd-drm-fixes-ga4236c4b4108
Uptime: 42 mins
Packages: 1581 (pacman)
Shell: bash 5.2.21
Resolution: 3840x2160
DE: Plasma 5.27.10
WM: kwin
Terminal: konsole
CPU: AMD Ryzen 9 5900X (24) @ 3.700GHz
GPU: AMD ATI Radeon RX 7900 XT/7900 XTX
Memory: 7324MiB / 64206MiB
Type of Display Connection: DP (HDMI for the second screen, but I avoid using it since wayland can't recover after the crash with 2 screens)
How to reproduce the issue:
In my case, it is enough to train a model with a large number of batches and, while it is running, use Firefox or even just the Wayland session itself.
Log files (for system lockups / game freezes / crashes)
After the crash I find the following in dmesg. In this instance the process named is Xwayland, but other times it is firefox.
[ 1347.333848] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=84047, emitted seq=84049
[ 1347.334160] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xwayland pid 2639 thread Xwayland:cs0 pid 2651
[ 1347.334370] amdgpu 0000:0a:00.0: amdgpu: GPU reset begin!
[ 1347.463359] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1347.463519] amdgpu 0000:0a:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1002
[ 1347.463521] amdgpu 0000:0a:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[ 1347.463522] amdgpu 0000:0a:00.0: amdgpu: Failed to evict queue 1
[ 1347.463523] amdgpu: Failed to evict process queues
[ 1347.463524] amdgpu: Failed to suspend process 0x8012
[ 1348.470162] amdgpu 0000:0a:00.0: amdgpu: IP block:gfx_v11_0 is hung!
[ 1348.470645] amdgpu 0000:0a:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 1348.470652] amdgpu 0000:0a:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[ 1348.470655] amdgpu 0000:0a:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00040B53
[ 1348.470656] amdgpu 0000:0a:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[ 1348.470658] amdgpu 0000:0a:00.0: amdgpu: MORE_FAULTS: 0x1
[ 1348.470660] amdgpu 0000:0a:00.0: amdgpu: WALKER_ERROR: 0x1
[ 1348.470661] amdgpu 0000:0a:00.0: amdgpu: PERMISSION_FAULTS: 0x5
[ 1348.470662] amdgpu 0000:0a:00.0: amdgpu: MAPPING_ERROR: 0x1
[ 1348.470664] amdgpu 0000:0a:00.0: amdgpu: RW: 0x1
[ 1348.470668] amdgpu 0000:0a:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 1348.470671] amdgpu 0000:0a:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[ 1348.470673] amdgpu 0000:0a:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 1348.470674] amdgpu 0000:0a:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
[ 1348.470675] amdgpu 0000:0a:00.0: amdgpu: MORE_FAULTS: 0x0
[ 1348.470676] amdgpu 0000:0a:00.0: amdgpu: WALKER_ERROR: 0x0
[ 1348.470677] amdgpu 0000:0a:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 1348.470678] amdgpu 0000:0a:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 1348.470680] amdgpu 0000:0a:00.0: amdgpu: RW: 0x0
[ 1348.470684] amdgpu 0000:0a:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 1348.470686] amdgpu 0000:0a:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[ 1348.470687] amdgpu 0000:0a:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 1348.470688] amdgpu 0000:0a:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
[ 1348.470690] amdgpu 0000:0a:00.0: amdgpu: MORE_FAULTS: 0x0
[ 1348.470691] amdgpu 0000:0a:00.0: amdgpu: WALKER_ERROR: 0x0
[ 1348.470692] amdgpu 0000:0a:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 1348.470693] amdgpu 0000:0a:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 1348.470694] amdgpu 0000:0a:00.0: amdgpu: RW: 0x0
[ 1348.824603] Failed to wait all pipes clean
[ 1348.824606] amdgpu 0000:0a:00.0: amdgpu: soft reset failed, will fallback to full reset!
[ 1349.136398] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1349.136557] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1349.264619] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1349.264772] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1349.392805] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1349.392955] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1349.521012] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1349.521162] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1349.649214] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1349.649363] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1349.777439] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1349.777591] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1349.905669] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1349.905820] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1350.033888] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1350.034039] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1350.162108] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1350.162258] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1350.428901] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
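
Since the process named in the timeout changes between crashes (Xwayland here, firefox at other times), a small script along these lines could help compare crashes by pulling the ring-timeout entries out of a saved dmesg dump. This is only a sketch: the function name find_ring_timeouts and both regular expressions are my own assumptions, modelled on the two amdgpu_job_timedout lines in the log above, and the message format may differ on other kernel versions.

```python
import re

# Pattern for the ring-timeout line, modelled on the log in this report
# (assumption: other kernel versions may word this message differently).
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s*\[drm:amdgpu_job_timedout \[amdgpu\]\] \*ERROR\* "
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)
# Pattern for the "Process information" line that follows a timeout.
PROCESS_RE = re.compile(
    r"Process information: process (?P<process>.+?) pid (?P<pid>\d+) thread"
)


def find_ring_timeouts(log_text):
    """Return one dict per ring timeout found in log_text.

    The text is scanned as a whole rather than line by line, so it also
    works when the log has been pasted as a single long line.
    """
    matches = list(TIMEOUT_RE.finditer(log_text))
    events = []
    for i, m in enumerate(matches):
        # Only look for the process info between this timeout and the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(log_text)
        proc = PROCESS_RE.search(log_text, m.end(), end)
        events.append({
            "timestamp": float(m.group("ts")),
            "ring": m.group("ring"),
            "signaled": int(m.group("signaled")),
            "emitted": int(m.group("emitted")),
            "process": proc.group("process") if proc else None,
            "pid": int(proc.group("pid")) if proc else None,
        })
    return events


# Example using the first two lines of the log from this report.
sample = (
    "[ 1347.333848] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 "
    "timeout, signaled seq=84047, emitted seq=84049\n"
    "[ 1347.334160] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process "
    "information: process Xwayland pid 2639 thread Xwayland:cs0 pid 2651\n"
)
for event in find_ring_timeouts(sample):
    print(event["ring"], event["process"], event["emitted"] - event["signaled"])
```

The difference between the emitted and signaled sequence numbers gives the number of jobs that were still outstanding on the ring when the timeout fired. To run it on a real crash, read the saved dmesg output into a string and pass that to find_ring_timeouts instead of sample.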