[amdgpu] Steam Deck *ERROR* ring sdma0 timeout
Brief summary of the problem:
The Steam Deck's gamescope session crashes with a ring sdma0 timeout error. The game sometimes stayed frozen on the screen for a while. Opening the SteamOS menus and pressing buttons again sometimes stopped the game. The gamescope session would recover later, after a timeout.
Reproducing the issue was quite difficult at first: it would occur once every two or three days, or after 3-5 hours of use. I found a way to reproduce it by recalling one particular scenario in which the crash had occurred: taking the Steam Deck out of its case, booting it (not waking it from standby) and starting the game. Shutting down the Steam Deck, turning it on and then starting the game turned out to be a good way to trigger the bug. Going through this sequence, amdgpu crashed once every 3-10 attempts.
I built a SteamOS kernel from Valve's kernel sources with several patches applied to it. The final kernel, which still runs into this failure quite often, included the following patches:
- https://gitlab.freedesktop.org/drm/amd/uploads/ecfb67b0ae46e95d7ab30c49c932c95f/0001-drm-amdgpu-add-wmb-barrier-for-sdma-timeout-issue-te.patch
- https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v6.6.8&id=3aae4ef4d799fb3d0381157640fdb251008cf0ae
- https://gitlab.freedesktop.org/drm/amd/uploads/b77399cdff3f6e7206dba43527804978/0001-drm-amdgpu-adjust-SDMA-timeout.patch
- https://gitlab.freedesktop.org/drm/amd/uploads/73b85010a2d5bcca0ad2ae90325ae0f2/0002-drm-amdgpu-adjust-max-usec-timeout.patch
The provided sdma0 ring dump was grabbed while running the fully patched kernel. Please let me know if you wish me to send you the kernel package I've built.
USB devices connected to the Steam Deck didn't make a difference. amdgpu crashed with a keyboard connected, with the Dock and without any USB devices connected to it.
The crash occurred on battery and while connected to the Valve provided charger. The presence or absence of the charger didn't seem to make a difference.
The bug is much easier to reproduce after a cold boot than after a reboot, and rebooting may not help at all. After the first crash following a cold boot, amdgpu didn't seem to crash again. Crashing it after running a game for a while may also be possible.
Something else which may be relevant: this crash also occurred with the beta SteamOS 3.5.0 on a different Steam Deck unit. One such crash on that device left the image persistently corrupted with an alternating bars pattern, although the OS still ran properly. The pattern went away after a reboot and a power cycle. The reboot on its own didn't seem to help that unit.
Hardware description:
- HW: Steam Deck LCD
- CPU: Steam Deck's APU
- GPU: Steam Deck's RDNA2 iGPU
- System Memory: 16 GB
- Display(s): Steam Deck's integrated LCD display
- Type of Display Connection: -
System information:
- Distro name and Version: SteamOS 3.5.13
- Kernel version: 6.1.52.valve14
- Custom kernel: 6.1.52.valve14 (with the mentioned patches)
How to reproduce the issue:
- grab a Steam Deck with Steam OS 3.5.13
- set up the password for the deck user & enable ssh
- install Elden Ring
- configure the game to use Proton Experimental (it's what I was using, probably doesn't matter)
- start the game once or twice
- make sure it runs
- shut down the Steam Deck
- turn on the Steam Deck
- start Elden Ring from the gamescope session without running any other app/game
- if it crashes after the game has loaded, the screen freezes on an in-game image, or sometimes goes to a black screen when gamescope goes down
- if it crashes while the game is loading, the screen freezes on a black screen (with a frame counter or the complete mangohud overlay, if enabled)
- connect via SSH to check if amdgpu has crashed
- repeat from step 6 if it didn't crash at all
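The "check if amdgpu has crashed" step can be automated once the kernel log text is available on another machine. Below is a minimal sketch in Python, assuming the log has already been fetched as a string (for example from `dmesg` run over SSH; how it is fetched is not covered here). The function name `find_ring_timeouts` is hypothetical, and the pattern is based only on the timeout lines quoted in this report.

```python
import re

# Pattern modelled on the amdgpu timeout lines quoted in this report.
# Other kernel versions may format this message differently (assumption).
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"\[drm:amdgpu_job_timedout \[amdgpu\]\] "
    r"\*ERROR\* ring (?P<ring>\S+) timeout"
)


def find_ring_timeouts(kernel_log: str) -> list[tuple[float, str]]:
    """Return (seconds since boot, ring name) for each ring timeout found."""
    return [
        (float(m["ts"]), m["ring"])
        for m in TIMEOUT_RE.finditer(kernel_log)
    ]


if __name__ == "__main__":
    sample = (
        "[ 49.163676] [drm] Failed to add display topology, "
        "DTM TA is not initialized.\n"
        "[ 81.648713] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 "
        "timeout, signaled seq=5623, emitted seq=5627\n"
    )
    hits = find_ring_timeouts(sample)
    if hits:
        for ts, ring in hits:
            print(f"ring {ring} timed out {ts:.1f}s after boot")
    else:
        print("no ring timeout found, repeat the boot cycle")
```

An empty result means the attempt did not trigger the bug and the shutdown / power-on cycle can be repeated.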
Attached files:
Log files (for system lockups / game freezes / crashes)
Log from the patched kernel (identical to the unpatched one)
[ 20.158985] [drm] Failed to add display topology, DTM TA is not initialized.
[ 49.163676] [drm] Failed to add display topology, DTM TA is not initialized.
[ 81.648713] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=5623, emitted seq=5627
[ 81.649066] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[ 81.649297] amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
[ 81.742639] amdgpu 0000:04:00.0: amdgpu: MODE2 reset
[ 81.752816] amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
These are the logs generated when recovery was disabled:
[ 42.268799] [drm] Failed to add display topology, DTM TA is not initialized.
[ 69.609641] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=5518, emitted seq=5522
[ 69.610193] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[ 69.610702] amdgpu 0000:04:00.0: amdgpu: GPU recovery disabled.
There are no other relevant log messages, amdgpu-related errors or stack traces. The only messages omitted here are the usual amdgpu initialization messages for the Steam Deck's RDNA2 iGPU.
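The sequence numbers in the two timeout lines above can be compared directly. The sketch below, in Python, extracts the `signaled seq` and `emitted seq` values and computes their difference, which I take to be the number of fences emitted on the ring but not yet signaled when the timeout fired (that interpretation is my assumption). The helper name `pending_fences` is hypothetical.

```python
import re
from typing import Optional

# Matches the sequence numbers in the timeout lines quoted in this report.
SEQ_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def pending_fences(line: str) -> Optional[int]:
    """Return emitted - signaled for a timeout line, or None if no match."""
    m = SEQ_RE.search(line)
    if m is None:
        return None
    return int(m["emitted"]) - int(m["signaled"])


if __name__ == "__main__":
    lines = [
        "[ 81.648713] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 "
        "timeout, signaled seq=5623, emitted seq=5627",
        "[ 69.609641] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 "
        "timeout, signaled seq=5518, emitted seq=5522",
    ]
    for line in lines:
        print(pending_fences(line))
```

For both logs in this report the difference is 4, with and without GPU recovery enabled.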