recovery issues on sm8350
from mesa/mesa#9969 (comment 2120483)
2023-10-09 15:00:44.668273: [ 5482.080907] adreno 3d00000.gpu: [drm:a6xx_recover] *ERROR* cx gdsc didn't collapse
2023-10-09 15:00:44.668282: [ 5482.098765] platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
2023-10-09 15:00:44.668292: [ 5483.106978] [drm:adreno_idle] *ERROR* A660: timeout waiting to drain ringbuffer 0 rptr/wptr = 0/9
2023-10-09 15:00:44.668301: [ 5483.126568] platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
2023-10-09 15:00:44.668310: [ 5484.134984] [drm:adreno_idle] *ERROR* A660: timeout waiting to drain ringbuffer 0 rptr/wptr = 0/9
Not sure if this is a6xx gen4 specific, or just sm8350 specific
Another example, https://gitlab.freedesktop.org/mesa/mesa/-/jobs/51064962 .. possibly
KHR-GL46.shader_image_load_store.non-layered_binding
is where it starts to go bad.

Hmm, I noticed the a660 runners are on a 6.4.12 kernel, so they are missing some possibly relevant patches.
So a bit more of the relevant part of dmesg from the job linked from mesa/mesa#9969:
2023-10-09 13:54:01.043902: [ 1479.055278] *** gpu fault: ttbr0=000000021edd6000 iova=00000003d0247cc0 dir=READ type=TRANSLATION source=UCHE (0,0,0,2)
2023-10-09 13:54:01.044025: [ 1479.066410] *** gpu fault: ttbr0=000000021edd6000 iova=00000003d0247cc0 dir=READ type=UNKNOWN source=UCHE (0,0,0,2)
2023-10-09 13:54:01.044053: [ 1479.077146] *** gpu fault: ttbr0=000000021edd6000 iova=00000003d0247cc0 dir=READ type=UNKNOWN source=UCHE (0,0,0,2)
2023-10-09 13:54:01.044068: [ 1479.087882] *** gpu fault: ttbr0=000000021edd6000 iova=00000003d0247d00 dir=READ type=TRANSLATION source=UCHE (0,0,0,2)
2023-10-09 13:54:01.044083: [ 1479.098981] *** gpu fault: ttbr0=000000021edd6000 iova=00000003d0247d00 dir=READ type=UNKNOWN source=UCHE (0,0,0,2)
2023-10-09 13:54:01.044098: [ 1479.109714] *** gpu fault: ttbr0=000000021edd6000 iova=00000002c0290b80 dir=READ type=TRANSLATION source=UCHE (0,0,0,2)
2023-10-09 13:54:01.044113: [ 1479.112346] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg.constprop.0] *ERROR* Message HFI_H2F_MSG_GX_BW_PERF_VOTE id 591 timed out waiting for response
2023-10-09 13:54:01.044129: [ 1479.120803] *** gpu fault: ttbr0=000000021edd6000 iova=00000002c0290b80 dir=READ type=UNKNOWN source=UCHE (0,0,0,2)
2023-10-09 13:54:01.044144: [ 1479.234945] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg.constprop.0] *ERROR* Unexpected message id 591 on the response queue
2023-10-09 13:54:01.044159: [ 1479.356024] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg.constprop.0] *ERROR* The HFI response queue is unexpectedly empty
2023-10-09 13:54:01.044174: [ 1479.367846] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg.constprop.0] *ERROR* Unexpected message id 593 on the response queue
2023-10-09 13:54:01.044188: [ 1479.475017] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg.constprop.0] *ERROR* The HFI response queue is unexpectedly empty
2023-10-09 13:54:01.044203: [ 1479.486400] platform 3d6a000.gmu: [drm:a6xx_hfi_stop] *ERROR* HFI queue 1 is not empty
2023-10-09 13:54:01.044220: [ 1480.502160] platform 3d6a000.gmu: [drm:a6xx_rpmh_start] *ERROR* Unable to power on the GPU RSC
2023-10-09 13:54:01.044235: [ 1480.521117] platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
2023-10-09 13:54:01.044282: [ 1481.530970] [drm:adreno_idle] *ERROR* A660: timeout waiting to drain ringbuffer 0 rptr/wptr = 0/9
2023-10-09 13:54:01.044300: [ 1481.550224] platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
I can repro something similar on x1-85 with blender plus an older mesa (with a since-fixed bug causing iova faults):
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=TRANSLATION source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=UNKNOWN source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=UNKNOWN source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=UNKNOWN source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=UNKNOWN source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=UNKNOWN source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=UNKNOWN source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=UNKNOWN source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=UNKNOWN source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: *** gpu fault: ttbr0=0000000e3529b000 iova=0000000000000040 dir=WRITE type=UNKNOWN source=CCU (3736059565,3736059565,3736059565,3736059565)
Dec 13 07:58:48 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg.constprop.0 [msm]] *ERROR* Message HFI_H2F_MSG_GX_BW_PERF_VOTE id 2942 timed out waiting for response
Dec 13 07:58:51 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: hangcheck detected gpu lockup rb 2!
Dec 13 07:58:51 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: completed fence: 7728543
Dec 13 07:58:51 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: submitted fence: 7728551
Dec 13 07:58:51 yoda kernel: msm_dpu ae01000.display-controller: [drm:recover_worker [msm]] *ERROR* 67.5.12.1: hangcheck recover!
Dec 13 07:58:51 yoda kernel: msm_dpu ae01000.display-controller: [drm:recover_worker [msm]] *ERROR* 67.5.12.1: offending task: blender (/usr/bin/blender)
Dec 13 07:58:51 yoda kernel: revision: 0 (67.5.12.1)
Dec 13 07:58:51 yoda kernel: rb 0: fence: 593789/593794
Dec 13 07:58:51 yoda kernel: rptr: 36
Dec 13 07:58:51 yoda kernel: rb wptr: 528
Dec 13 07:58:51 yoda kernel: rb 1: fence: -256/-256
Dec 13 07:58:51 yoda kernel: rptr: 0
Dec 13 07:58:51 yoda kernel: rb wptr: 0
Dec 13 07:58:51 yoda kernel: rb 2: fence: 7728544/7728551
Dec 13 07:58:51 yoda kernel: rptr: 120
Dec 13 07:58:51 yoda kernel: rb wptr: 833
Dec 13 07:58:51 yoda kernel: rb 3: fence: -256/-256
Dec 13 07:58:51 yoda kernel: rptr: 0
Dec 13 07:58:51 yoda kernel: rb wptr: 0
Dec 13 07:58:51 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
Dec 13 07:58:51 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
Dec 13 07:58:51 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG2: 0
Dec 13 07:58:51 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
Dec 13 07:58:51 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG4: 3736059565
Dec 13 07:58:51 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG5: 3736059565
Dec 13 07:58:51 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG6: 3736059565
Dec 13 07:58:51 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG7: 3736059565
Dec 13 07:58:51 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg.constprop.0 [msm]] *ERROR* Unexpected message id 2942 on the response queue
Dec 13 07:58:51 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg.constprop.0 [msm]] *ERROR* The HFI response queue is unexpectedly empty
Dec 13 07:58:51 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_rpmh_start [msm]] *ERROR* Unable to power on the GPU RSC
Dec 13 07:58:51 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob [msm]] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
Dec 13 07:58:51 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob [msm]] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
Dec 13 07:58:54 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: hangcheck detected gpu lockup rb 2!
Dec 13 07:58:54 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: completed fence: 7728544
Dec 13 07:58:54 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: submitted fence: 7728552
Dec 13 07:58:54 yoda kernel: msm_dpu ae01000.display-controller: [drm:recover_worker [msm]] *ERROR* 67.5.12.1: hangcheck recover!
Dec 13 07:58:54 yoda kernel: msm_dpu ae01000.display-controller: [drm:recover_worker [msm]] *ERROR* 67.5.12.1: offending task: blender (/usr/bin/blender)
Dec 13 07:58:54 yoda kernel: revision: 0 (67.5.12.1)
Dec 13 07:58:54 yoda kernel: rb 0: fence: 593789/593794
Dec 13 07:58:54 yoda kernel: rptr: 0
Dec 13 07:58:54 yoda kernel: rb wptr: 991
Dec 13 07:58:54 yoda kernel: rb 1: fence: -256/-256
Dec 13 07:58:54 yoda kernel: rptr: 0
Dec 13 07:58:54 yoda kernel: rb wptr: 0
Dec 13 07:58:54 yoda kernel: rb 2: fence: 7728545/7728552
Dec 13 07:58:54 yoda kernel: rptr: 0
Dec 13 07:58:54 yoda kernel: rb wptr: 1509
Dec 13 07:58:54 yoda kernel: rb 3: fence: -256/-256
Dec 13 07:58:54 yoda kernel: rptr: 0
Dec 13 07:58:54 yoda kernel: rb wptr: 0
Dec 13 07:58:54 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
Dec 13 07:58:54 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
Dec 13 07:58:54 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG2: 0
Dec 13 07:58:54 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
Dec 13 07:58:54 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG4: 0
Dec 13 07:58:54 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG5: 0
Dec 13 07:58:54 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG6: 0
Dec 13 07:58:54 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG7: 0
Dec 13 07:58:54 yoda gnome-character[27057]: JS LOG: Characters Application exiting
Dec 13 07:58:54 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob [msm]] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
Dec 13 07:58:54 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob [msm]] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
Dec 13 07:58:56 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: hangcheck detected gpu lockup rb 2!
Dec 13 07:58:56 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: completed fence: 7728545
Dec 13 07:58:56 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: submitted fence: 7728553
Dec 13 07:58:56 yoda kernel: msm_dpu ae01000.display-controller: [drm:recover_worker [msm]] *ERROR* 67.5.12.1: hangcheck recover!
Dec 13 07:58:56 yoda kernel: msm_dpu ae01000.display-controller: [drm:recover_worker [msm]] *ERROR* 67.5.12.1: offending task: blender (/usr/bin/blender)
Dec 13 07:58:56 yoda kernel: revision: 0 (67.5.12.1)
Dec 13 07:58:56 yoda kernel: rb 0: fence: 593789/593794
Dec 13 07:58:56 yoda kernel: rptr: 0
Dec 13 07:58:56 yoda kernel: rb wptr: 1454
Dec 13 07:58:56 yoda kernel: rb 1: fence: -256/-256
Dec 13 07:58:56 yoda kernel: rptr: 0
Dec 13 07:58:56 yoda kernel: rb wptr: 0
Dec 13 07:58:56 yoda kernel: rb 2: fence: 7728546/7728553
Dec 13 07:58:56 yoda kernel: rptr: 0
Dec 13 07:58:56 yoda kernel: rb wptr: 2173
Dec 13 07:58:56 yoda kernel: rb 3: fence: -256/-256
Dec 13 07:58:56 yoda kernel: rptr: 0
Dec 13 07:58:56 yoda kernel: rb wptr: 0
Dec 13 07:58:56 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
Dec 13 07:58:56 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
Dec 13 07:58:56 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG2: 0
Dec 13 07:58:56 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
Dec 13 07:58:56 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG4: 0
Dec 13 07:58:56 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG5: 0
Dec 13 07:58:56 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG6: 0
Dec 13 07:58:56 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG7: 0
Dec 13 07:58:57 yoda kernel: adreno 3d00000.gpu: [drm:a6xx_recover [msm]] *ERROR* cx gdsc didn't collapse
Dec 13 07:58:57 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob [msm]] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
Dec 13 07:58:57 yoda kernel: platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob [msm]] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
Dec 13 07:58:58 yoda kernel: msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 67.5.12.1: hangcheck detected gpu lockup rb 2!
The common pattern is that it starts with iova faults, followed by an HFI_H2F_MSG_GX_BW_PERF_VOTE timeout. Since I could reproduce it easily enough, I did some poking around and realized that the HFI_H2F_MSG_GX_BW_PERF_VOTE is triggered from devfreq. So I think what is happening is:
- We take an smmu fault, and leave translation suspended while fault_worker collects GPU state for devcoredump.
- The devfreq sampling period elapses and it decides to update the GPU freq, leading to it sending HFI_H2F_MSG_GX_BW_PERF_VOTE to the GMU.
- I guess the GMU's context bank also has translation suspended, leaving the GMU in suspended animation and eventually leading to the timeout.