vangogh: Spurious SDMA hang when launching Elden Ring - suspected firmware issue
Hi,
When launching Elden Ring on Steam Deck, there is a small (~10%) chance that it triggers an SDMA ring hang:
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=71859, emitted seq=71860
There are no other errors logged (i.e. no page faults or the like). The command buffer currently in-flight looks valid as well (verified with umr ring dumps).
One can prevent the ring from timing out either by causing more work to be submitted on top (e.g. by mashing the Steam button to repeatedly bring up/hide the Steam UI), or simply by reading the SDMA0_GFX_RB_RPTR or SDMA0_GFX_RB_WPTR registers, for example via umr.
I suspect this is a race condition/deadlock in the SDMA firmware: I'd guess the hang happens whenever the ring's wptr is updated/the doorbell is written exactly when the command processor is done parsing the previous set of commands and is going idle.
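To make the suspected race concrete, here is a toy model of a "lost wakeup". It is only a sketch under assumptions: ToyEngine, its fields and its methods are hypothetical names invented for illustration, and nothing here reflects the real SDMA firmware or register behaviour. The assumed bug is that a doorbell write arriving between the idle check and the point where wakeups are armed is silently dropped.

```python
from dataclasses import dataclass


@dataclass
class ToyEngine:
    """Hypothetical ring consumer that sleeps when idle (not real firmware)."""
    rptr: int = 0          # last ring position consumed by the engine
    wptr: int = 0          # last ring position published by the driver
    asleep: bool = False   # engine went idle and waits for a wakeup
    armed: bool = False    # doorbell wakeups are only noticed while armed

    def idle_check(self) -> bool:
        """Step 1 of going idle: sample the pointers."""
        return self.rptr == self.wptr

    def go_idle(self) -> None:
        """Step 2 of going idle: arm the wakeup, then sleep."""
        self.armed = True
        self.asleep = True

    def ring_doorbell(self, new_wptr: int) -> None:
        """Driver side: publish new work and ring the doorbell."""
        self.wptr = new_wptr
        if self.armed:
            self.armed = False
            self.asleep = False
        # Assumed bug: if not armed yet, the wakeup is silently dropped.

    def poke(self) -> None:
        """Any other event (e.g. a register read) makes the engine re-check."""
        if self.asleep and self.rptr != self.wptr:
            self.armed = False
            self.asleep = False

    def run(self) -> None:
        """Consume all pending work, but only if the engine is awake."""
        if not self.asleep:
            self.rptr = self.wptr

    def hung(self) -> bool:
        """Work is pending but the engine is asleep."""
        return self.asleep and self.rptr != self.wptr


def racy_interleaving() -> ToyEngine:
    """Doorbell lands after the idle check but before wakeups are armed."""
    engine = ToyEngine()
    saw_empty = engine.idle_check()   # engine decides the ring is empty
    engine.ring_doorbell(1)           # driver submits: wakeup is dropped
    if saw_empty:
        engine.go_idle()              # engine sleeps with work pending
    engine.run()
    return engine


if __name__ == "__main__":
    engine = racy_interleaving()
    print("hung after race:", engine.hung())
    engine.poke()                     # stands in for reading RB_RPTR/RB_WPTR
    engine.run()
    print("hung after poke:", engine.hung())
```

In this model both observed workarounds un-stick the engine: a later ring_doorbell() arrives while wakeups are armed, and poke() forces a re-check of the pointers. That matches the symptoms, but it does not prove the firmware actually works this way.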
Setting ring->use_doorbell to false in sdma_v5_2.c seems to work around the issue, although I suspect this is because the non-doorbell path writes two registers (GFX_RB_WPTR and GFX_RB_WPTR_HI) separately. That gives the firmware two possible wakeup points, and the chance that both registers are written in the timeframe where the CP won't register the writes is even lower (see #3440 (comment 2461511)).
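A back-of-the-envelope way to see why two separate writes would make the hang rarer without removing it: assume (purely hypothetically) a "blind window" of a few ticks in which the engine ignores writes, and count how many start times put every write inside that window. The period, the window width, and the one-tick spacing between writes are made-up parameters, not measurements.

```python
from fractions import Fraction


def missed_fraction(period: int, window: int, writes: int) -> Fraction:
    """Fraction of start ticks for which every write lands in the blind window.

    Ticks 0..window-1 of each period are assumed blind; the writes are
    assumed to happen on consecutive ticks beginning at `start`.
    """
    missed = 0
    for start in range(period):
        ticks = [(start + i) % period for i in range(writes)]
        if all(t < window for t in ticks):
            missed += 1
    return Fraction(missed, period)


if __name__ == "__main__":
    # One doorbell write versus two separate register writes.
    print("single write:", missed_fraction(100, 3, 1))
    print("two writes:  ", missed_fraction(100, 3, 2))
```

With a window only one tick wide, the two-write case can never be fully missed in this model; with a wider window it is merely less likely. That would be consistent with the workaround hiding the problem rather than fixing it.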
In the same file, there is this comment (https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c#L1642):
SDMA 5.2.3 (RMB) FW doesn't seem to properly disallow GFXOFF in some cases leading to hangs in SDMA. Disallow GFXOFF while SDMA is active.
Is it possible that these issues are related, and that GFXOFF is erroneously allowed because the SDMA engine thinks it's idle?