Steam Deck - amdgpu doesn't load on newer 6.0.18 - 6.1.X kernels
Brief summary of the problem:
amdgpu fails to load properly on the Steam Deck. The screen freezes, locking up the terminal and preventing X11/Wayland from initializing; however, I can still log into the system over SSH.
--- UPDATE --- It seems that this issue is specifically tied to Steam Deck BIOS 113. Users with BIOS 110 do not seem to have this issue.
Hardware description:
512GB Steam Deck (Purchased late October of 2022)
- CPU: Custom APU 0405
- GPU: [AMD/ATI] VanGogh [AMD Custom GPU 0405] (rev ae)
- System Memory: 16 GB LPDDR5
- Display(s): Built-In
- Type of Display Connection: Internal
System information:
- Distro name and Version: Tested on Linux Mint Cinnamon 21.1, Arch/Manjaro 22.01, and Fedora Kinoite 37.20230131.0
- Kernel version: Tested on 6.0.18 - 6.1.8 (log below captured from 6.1.8-200.fc37.x86_64)
- Custom kernel: No
How to reproduce the issue:
Attempt to boot any Linux distribution with a 6.0.18 or later kernel and load the graphics driver. This includes any installation medium, such as a recent Manjaro image. I have to use SSH to capture logs because the display is unusable. Booting from GRUB with the nomodeset parameter does give me terminal access on some distros, but prevents the amdgpu module from loading altogether.
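As a quick way to tell whether a given kernel falls in the affected range, here is a minimal sketch based only on the versions tested in this report: 6.0.9 and lower load amdgpu, 6.0.18 and later fail, and 6.0.10 through 6.0.17 were not tested. The function names `version_le` and `classify_kernel` are made up for illustration, the comparison assumes GNU `sort -V`, and anything newer than 6.1.8 is labeled "bad" purely by extrapolation.

```shell
# Classify a kernel release string against the range observed in this report.
# Assumptions (from this report only): <= 6.0.9 loads amdgpu, >= 6.0.18 fails,
# and 6.0.10 through 6.0.17 were never tested.

# version_le A B: succeeds when version A sorts at or before version B.
# Relies on the -V (version sort) option of GNU sort.
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# classify_kernel RELEASE: prints good, untested, or bad.
classify_kernel() {
    # Strip any distro suffix, e.g. 6.1.8-200.fc37.x86_64 -> 6.1.8
    ck_version=${1%%-*}
    if version_le "$ck_version" "6.0.9"; then
        echo "good"
    elif version_le "6.0.18" "$ck_version"; then
        echo "bad"
    else
        echo "untested"
    fi
}
```

Running `classify_kernel "$(uname -r)"` on the booted system (or over SSH) reports which bucket the current kernel falls into.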
Origin Of Problem
NOTE: Throughout all of my testing, the original SteamOS installation has been left untouched. It runs on an earlier kernel, and appears to function as intended.
The problem did not present itself at first. I originally installed Manjaro 22.01 on an SD card as a proof of concept for a work environment that I could use on my Steam Deck. At first, everything seemed to work. I followed some outdated information in the official Arch docs, installed the linux-steamos-neptune and linux-firmware-neptune packages, and rebooted. At this point, I was unable to initialize the amdgpu driver properly. I originally thought that I might have corrupted the operating system in some way, so I attempted to reinstall Manjaro using the same USB flash drive; however, the installer interface could no longer initialize. I tried re-flashing the install image onto a different USB drive, with the same results. Several other distro installers failed the same way. The rest is as described above. It is unclear whether Valve's firmware had anything to do with it, or whether the timing is purely coincidental.
Some Things That I Tried and Further Thoughts
- I was able to install both Linux Mint 21.1 and Fedora Kinoite because they both ship an older kernel version (which is also missing full support for some Steam Deck hardware). Thanks to Ubuntu's mainline kernel repository, I was able to easily test various kernel versions. Versions 6.0.9 and lower seem to load the amdgpu module without issue. Unfortunately, the mainline kernel repository has a gap in its package versions, so the next version I was able to test was 6.0.18, which is where the issue seems to begin. Every kernel from 6.0.18 onward seems to produce the failure logged below.
- I tried pulling the latest linux-firmware git repository and copying the latest /amdgpu/ contents to my local /usr/lib/firmware/amdgpu/ installation. I also tried pulling all of the /amdgpu/vangogh_* files from the factory SteamOS image into my local installation, to see if that might work.
- Others online are not having this issue; could it be a newer hardware revision? My Deck was purchased recently.
- Is it possible that the hardware has entered some persistent state that needs to be cleared or otherwise re-initialized?
- Are there any module parameters that might help with the problem?
- Could this indicate a hardware problem that has developed?
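For anyone repeating the firmware swap from the list above, this is a rough sketch of how the copy could be done with a backup of whatever gets overwritten. The function name `install_vangogh_firmware` and the `ivf_` variable names are hypothetical, and both directories are supplied by the caller rather than taken from this report. Some distros ship compressed firmware files and embed firmware in the initramfs, so this is not necessarily the complete procedure.

```shell
# install_vangogh_firmware SRC_DIR DST_DIR
# Copies every vangogh_* file from SRC_DIR into DST_DIR, first saving any
# file it would overwrite as <name>.bak. Both directory arguments are
# placeholders chosen by the caller; nothing is hard-coded here.
install_vangogh_firmware() {
    ivf_src=$1
    ivf_dst=$2
    for ivf_file in "$ivf_src"/vangogh_*; do
        # Skip the unexpanded pattern when SRC_DIR has no matching files.
        [ -e "$ivf_file" ] || continue
        ivf_name=$(basename "$ivf_file")
        # Keep a copy of the file that is about to be replaced.
        if [ -e "$ivf_dst/$ivf_name" ]; then
            cp -p "$ivf_dst/$ivf_name" "$ivf_dst/$ivf_name.bak" || return 1
        fi
        cp -p "$ivf_file" "$ivf_dst/$ivf_name" || return 1
    done
}
```

An invocation might look like `install_vangogh_firmware ./linux-firmware/amdgpu /usr/lib/firmware/amdgpu`, run as root; the source path here is only an example of where a linux-firmware checkout could live.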
Log files (for system lockups / game freezes / crashes)
$ dmesg | grep amdgpu
...
[ 4.446414] amdgpu 0000:04:00.0: amdgpu: Will use PSP to load VCN firmware
[ 4.565631] amdgpu 0000:04:00.0: amdgpu: SMU is initialized successfully!
[ 4.960230] amdgpu 0000:04:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring vcn_dec_0 test failed (-110)
[ 4.960702] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* hw_init of IP block <vcn_v3_0> failed -110
[ 4.961246] amdgpu 0000:04:00.0: amdgpu: amdgpu_device_ip_init failed
[ 4.961249] amdgpu 0000:04:00.0: amdgpu: Fatal error during GPU init
[ 4.961280] amdgpu 0000:04:00.0: amdgpu: amdgpu: finishing device.
[ 4.972221] amdgpu 0000:04:00.0: amdgpu: free PSP TMR buffer
[ 5.007124] amdgpu: probe of 0000:04:00.0 failed with error -110
[ 5.007306] amdgpu_fence_driver_sw_fini+0xc4/0xd0 [amdgpu]
[ 5.007964] amdgpu_device_fini_sw+0x1f/0x3c0 [amdgpu]
[ 5.008547] amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
...
Everything else looks more or less the same as a successful load on an earlier kernel.
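To pull the failing IP block out of a longer log, a small filter like the following may help. It is only a sketch: the function name `failed_ip_blocks` is made up for illustration, and the pattern is matched against the exact "hw_init of IP block <...> failed" wording seen in the log above, which may differ on other kernel versions. On Linux, error code -110 corresponds to -ETIMEDOUT, so the VCN decode ring test appears to be timing out.

```shell
# failed_ip_blocks: read kernel log text on stdin and print one line per
# "hw_init of IP block <name> failed <code>" message, formatted "name code".
# Lines that do not match the pattern are dropped.
failed_ip_blocks() {
    sed -n 's/.*hw_init of IP block <\([^>]*\)> failed \(-\{0,1\}[0-9][0-9]*\).*/\1 \2/p'
}
```

Typical use would be `dmesg | failed_ip_blocks`, or feeding it a log file saved over SSH.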