recovery issues on sm8350
from mesa/mesa#9969 (comment 2120483)
2023-10-09 15:00:44.668273: [ 5482.080907] adreno 3d00000.gpu: [drm:a6xx_recover] *ERROR* cx gdsc didn't collapse
2023-10-09 15:00:44.668282: [ 5482.098765] platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
2023-10-09 15:00:44.668292: [ 5483.106978] [drm:adreno_idle] *ERROR* A660: timeout waiting to drain ringbuffer 0 rptr/wptr = 0/9
2023-10-09 15:00:44.668301: [ 5483.126568] platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
2023-10-09 15:00:44.668310: [ 5484.134984] [drm:adreno_idle] *ERROR* A660: timeout waiting to drain ringbuffer 0 rptr/wptr = 0/9
Not sure if this is a6xx gen4 specific, or just sm8350 specific
Another example, https://gitlab.freedesktop.org/mesa/mesa/-/jobs/51064962 .. possibly
KHR-GL46.shader_image_load_store.non-layered_binding
is where it starts to go bad.

Hmm, I noticed the a660 runners are on a 6.4.12 kernel, so they are missing some possibly relevant patches.
So a bit more of the relevant part of dmesg from the job linked from mesa/mesa#9969:
2023-10-09 13:54:01.043902: [ 1479.055278] *** gpu fault: ttbr0=000000021edd6000 iova=00000003d0247cc0 dir=READ type=TRANSLATION source=UCHE (0,0,0,2)
2023-10-09 13:54:01.044025: [ 1479.066410] *** gpu fault: ttbr0=000000021edd6000 iova=00000003d0247cc0 dir=READ type=UNKNOWN source=UCHE (0,0,0,2)
2023-10-09 13:54:01.044053: [ 1479.077146] *** gpu fault: ttbr0=000000021edd6000 iova=00000003d0247cc0 dir=READ type=UNKNOWN source=UCHE (0,0,0,2)
2023-10-09 13:54:01.044068: [ 1479.087882] *** gpu fault: ttbr0=000000021edd6000 iova=00000003d0247d00 dir=READ type=TRANSLATION source=UCHE (0,0,0,2)
2023-10-09 13:54:01.044083: [ 1479.098981] *** gpu fault: ttbr0=000000021edd6000 iova=00000003d0247d00 dir=READ type=UNKNOWN source=UCHE (0,0,0,2)
2023-10-09 13:54:01.044098: [ 1479.109714] *** gpu fault: ttbr0=000000021edd6000 iova=00000002c0290b80 dir=READ type=TRANSLATION source=UCHE (0,0,0,2)
2023-10-09 13:54:01.044113: [ 1479.112346] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg.constprop.0] *ERROR* Message HFI_H2F_MSG_GX_BW_PERF_VOTE id 591 timed out waiting for response
2023-10-09 13:54:01.044129: [ 1479.120803] *** gpu fault: ttbr0=000000021edd6000 iova=00000002c0290b80 dir=READ type=UNKNOWN source=UCHE (0,0,0,2)
2023-10-09 13:54:01.044144: [ 1479.234945] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg.constprop.0] *ERROR* Unexpected message id 591 on the response queue
2023-10-09 13:54:01.044159: [ 1479.356024] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg.constprop.0] *ERROR* The HFI response queue is unexpectedly empty
2023-10-09 13:54:01.044174: [ 1479.367846] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg.constprop.0] *ERROR* Unexpected message id 593 on the response queue
2023-10-09 13:54:01.044188: [ 1479.475017] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg.constprop.0] *ERROR* The HFI response queue is unexpectedly empty
2023-10-09 13:54:01.044203: [ 1479.486400] platform 3d6a000.gmu: [drm:a6xx_hfi_stop] *ERROR* HFI queue 1 is not empty
2023-10-09 13:54:01.044220: [ 1480.502160] platform 3d6a000.gmu: [drm:a6xx_rpmh_start] *ERROR* Unable to power on the GPU RSC
2023-10-09 13:54:01.044235: [ 1480.521117] platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
2023-10-09 13:54:01.044282: [ 1481.530970] [drm:adreno_idle] *ERROR* A660: timeout waiting to drain ringbuffer 0 rptr/wptr = 0/9
2023-10-09 13:54:01.044300: [ 1481.550224] platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
I can repro something similar on x1-85 with blender plus an older mesa (with a since-fixed bug causing iova faults):
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=TRANSLATION source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=UNKNOWN source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=UNKNOWN source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=UNKNOWN source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=UNKNOWN source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=UNKNOWN source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=UNKNOWN source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=UNKNOWN source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=UNKNOWN source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=UNKNOWN source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg.constprop.0 [msm]] *ERROR* Message HFI_H2F_MSG_GX_BW_PERF_VOTE id 2942 timed out waiting for response
Dec 13 07:58:51 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: hangcheck detected gpu lockup rb 2!
Dec 13 07:58:51 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: completed fence: 7728543
Dec 13 07:58:51 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: submitted fence: 7728551
Dec 13 07:58:51 yoda kernel: msm_dpu ae01000.display-controller: [drm:recover_worker [msm]] *ERROR* 67.5.12.1: hangcheck recover!
Dec 13 07:58:51 yoda kernel: msm_dpu ae01000.display-controller: [drm:recover_worker [msm]] *ERROR* 67.5.12.1: offending task: blender (/usr/bin/blender)
Dec 13 07:58:51 yoda kernel: revision: 0 (67.5.12.1)
Dec 13 07:58:51 yoda kernel: rb 0: fence: 593789/593794
Dec 13 07:58:51 yoda kernel: rptr: 36
Dec 13 07:58:51 yoda kernel: rb wptr: 528
Dec 13 07:58:51 yoda kernel: rb 1: fence: -256/-256
Dec 13 07:58:51 yoda kernel: rptr: 0
Dec 13 07:58:51 yoda kernel: rb wptr: 0
Dec 13 07:58:51 yoda kernel: rb 2: fence: 7728544/7728551
Dec 13 07:58:51 yoda kernel: rptr: 120
Dec 13 07:58:51 yoda kernel: rb wptr: 833
Dec 13 07:58:51 yoda kernel: rb 3: fence: -256/-256
Dec 13 07:58:51 yoda kernel: rptr: 0
Dec 13 07:58:51 yoda kernel: rb wptr: 0
Dec 13 07:58:51 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
Dec 13 07:58:51 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
Dec 13 07:58:51 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG2: 0
Dec 13 07:58:51 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
Dec 13 07:58:51 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG4: 3736059565
Dec 13 07:58:51 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG5: 3736059565
Dec 13 07:58:51 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG6: 3736059565
Dec 13 07:58:51 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG7: 3736059565
Dec 13 07:58:51 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg.constprop.0 [msm]] *ERROR* Unexpected message id 2942 on the response queue
Dec 13 07:58:51 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg.constprop.0 [msm]] *ERROR* The HFI response queue is unexpectedly empty
Dec 13 07:58:51 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_rpmh_start [msm]] *ERROR* Unable to power on the GPU RSC
Dec 13 07:58:51 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob [msm]] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
Dec 13 07:58:51 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob [msm]] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
Dec 13 07:58:54 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: hangcheck detected gpu lockup rb 2!
Dec 13 07:58:54 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: completed fence: 7728544
Dec 13 07:58:54 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: submitted fence: 7728552
Dec 13 07:58:54 yoda kernel: msm_dpu ae01000.display-controller: [drm:recover_worker [msm]] *ERROR* 67.5.12.1: hangcheck recover!
Dec 13 07:58:54 yoda kernel: msm_dpu ae01000.display-controller: [drm:recover_worker [msm]] *ERROR* 67.5.12.1: offending task: blender (/usr/bin/blender)
Dec 13 07:58:54 yoda kernel: revision: 0 (67.5.12.1)
Dec 13 07:58:54 yoda kernel: rb 0: fence: 593789/593794
Dec 13 07:58:54 yoda kernel: rptr: 0
Dec 13 07:58:54 yoda kernel: rb wptr: 991
Dec 13 07:58:54 yoda kernel: rb 1: fence: -256/-256
Dec 13 07:58:54 yoda kernel: rptr: 0
Dec 13 07:58:54 yoda kernel: rb wptr: 0
Dec 13 07:58:54 yoda kernel: rb 2: fence: 7728545/7728552
Dec 13 07:58:54 yoda kernel: rptr: 0
Dec 13 07:58:54 yoda kernel: rb wptr: 1509
Dec 13 07:58:54 yoda kernel: rb 3: fence: -256/-256
Dec 13 07:58:54 yoda kernel: rptr: 0
Dec 13 07:58:54 yoda kernel: rb wptr: 0
Dec 13 07:58:54 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
Dec 13 07:58:54 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
Dec 13 07:58:54 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG2: 0
Dec 13 07:58:54 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
Dec 13 07:58:54 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG4: 0
Dec 13 07:58:54 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG5: 0
Dec 13 07:58:54 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG6: 0
Dec 13 07:58:54 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG7: 0
Dec 13 07:58:54 yoda gnome-character[27057]: JS LOG: Characters Application exiting
Dec 13 07:58:54 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob [msm]] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
Dec 13 07:58:54 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob [msm]] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
Dec 13 07:58:56 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: hangcheck detected gpu lockup rb 2!
Dec 13 07:58:56 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: completed fence: 7728545
Dec 13 07:58:56 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: submitted fence: 7728553
Dec 13 07:58:56 yoda kernel: msm_dpu ae01000.display-controller: [drm:recover_worker [msm]] *ERROR* 67.5.12.1: hangcheck recover!
Dec 13 07:58:56 yoda kernel: msm_dpu ae01000.display-controller: [drm:recover_worker [msm]] *ERROR* 67.5.12.1: offending task: blender (/usr/bin/blender)
Dec 13 07:58:56 yoda kernel: revision: 0 (67.5.12.1)
Dec 13 07:58:56 yoda kernel: rb 0: fence: 593789/593794
Dec 13 07:58:56 yoda kernel: rptr: 0
Dec 13 07:58:56 yoda kernel: rb wptr: 1454
Dec 13 07:58:56 yoda kernel: rb 1: fence: -256/-256
Dec 13 07:58:56 yoda kernel: rptr: 0
Dec 13 07:58:56 yoda kernel: rb wptr: 0
Dec 13 07:58:56 yoda kernel: rb 2: fence: 7728546/7728553
Dec 13 07:58:56 yoda kernel: rptr: 0
Dec 13 07:58:56 yoda kernel: rb wptr: 2173
Dec 13 07:58:56 yoda kernel: rb 3: fence: -256/-256
Dec 13 07:58:56 yoda kernel: rptr: 0
Dec 13 07:58:56 yoda kernel: rb wptr: 0
Dec 13 07:58:56 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
Dec 13 07:58:56 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
Dec 13 07:58:56 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG2: 0
Dec 13 07:58:56 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
Dec 13 07:58:56 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG4: 0
Dec 13 07:58:56 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG5: 0
Dec 13 07:58:56 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG6: 0
Dec 13 07:58:56 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG7: 0
Dec 13 07:58:57 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] *ERROR* cx gdsc didn't collapse
Dec 13 07:58:57 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob [msm]] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
Dec 13 07:58:57 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob [msm]] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
Dec 13 07:58:58 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: hangcheck detected gpu lockup rb 2!
The common pattern is that it starts with iova faults, followed by an HFI_H2F_MSG_GX_BW_PERF_VOTE timeout. Since I could reproduce it easily enough, I did some poking around and realized that the HFI_H2F_MSG_GX_BW_PERF_VOTE is triggered from devfreq. So I think what is happening is:
- We take an smmu fault, and leave translation suspended while fault_worker collects GPU state for devcoredump.
- The devfreq sampling period elapses and it decides to update the GPU freq, leading to it sending HFI_H2F_MSG_GX_BW_PERF_VOTE to the GMU.
- I guess the GMU's context bank also has translation suspended, leaving the GMU in suspended animation and eventually leading to the timeout.