pcieport "Unable to change power state from D3cold to D0" causes system freeze
Brief summary of the problem:
(Issue formerly known as amdgpu full system freeze)
I have set up a new system and am running into daily full system lockups. About once per day my entire UI freezes, audio starts to repeat and stutter, and the only way out is to hit the reset button.
The system itself is new and was set up only recently. Everything works perfectly fine, except for that one full system freeze every 8 hours or so.
The freeze happens both under load and at idle. It can happen when I start any program. It can also happen when the system is idling and technically doing nothing (though a background process might trigger it as well).
In addition, I notice infrequent (about once per hour) full system stutters. These feel like a couple of lost frames and are not debilitating, though they are annoying. I suspect they might be a symptom or precursor of the full system freeze.
Hardware description:
- CPU: AMD Ryzen 7 9800X3D
- GPU: 03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT/7900 XTX/7900 GRE/7900M] [1002:744c] (rev cc)
- System Memory: 64 GB DDR5-6000 CL30
- Display(s): two 2K and one Full HD, all running at 60 Hz
- Type of Display Connection: The 2K displays are connected via DP, the FullHD panel is connected via HDMI
System information:
- Arch Linux running KDE Plasma (the freeze also happens under X11)
- Kernel version: Linux aurorus 6.10.10-arch1-1 #1 SMP PREEMPT_DYNAMIC Mon, 23 Dec 2024 00:26:17 +0000 x86_64 GNU/Linux
- Custom kernel: no kernel customization; I did roll back my kernel, which didn't help
- AMD official driver version: mesa?
How to reproduce the issue:
The issue is sadly not directly reproducible. It can happen at any point in time.
Attached files:
I have multiple dmesg kernel logs. I have already tried a couple of things.
I tried turning off power management with amdgpu.runpm=0:
- dmesg.amdgpu.runpm.0.log: this is with amdgpu.runpm=0, still freezes. In this scenario my PC was just idling without me interacting with it (I was sleeping xD)
- dmesg.amdgpu.runpm.1.log: this is with amdgpu.runpm=1, still freezes
From the logs, the most interesting part would be here:
Dec 26 16:38:49 aurorus kernel: [drm] Fence fallback timer expired on ring comp_1.1.0
Dec 26 16:38:50 aurorus kernel: [drm] Fence fallback timer expired on ring sdma0
Dec 26 16:38:53 aurorus kernel: [drm] Fence fallback timer expired on ring sdma1
Dec 26 16:38:53 aurorus kernel: [drm] Fence fallback timer expired on ring sdma0
Dec 26 16:38:53 aurorus kernel: [drm] Fence fallback timer expired on ring comp_1.2.1
Dec 26 16:38:54 aurorus kernel: pcieport 0000:18:00.0: Unable to change power state from D3cold to D0, device inaccessible
Dec 26 16:38:55 aurorus kernel: [drm] Fence fallback timer expired on ring gfx_0.0.0
Dec 26 16:38:57 aurorus kernel: [drm] Fence fallback timer expired on ring sdma0
Dec 26 16:38:58 aurorus kernel: [drm] Fence fallback timer expired on ring gfx_0.0.0
Dec 26 16:38:58 aurorus kernel: [drm] Fence fallback timer expired on ring gfx_0.0.0
Dec 26 16:38:59 aurorus kernel: [drm] Fence fallback timer expired on ring gfx_0.0.0
Dec 26 16:38:59 aurorus kernel: [drm] Fence fallback timer expired on ring sdma0
Dec 26 16:39:00 aurorus kernel: [drm] Fence fallback timer expired on ring gfx_0.0.0
Dec 26 16:39:00 aurorus kernel: [drm] Fence fallback timer expired on ring gfx_0.0.0
Dec 26 16:39:02 aurorus kernel: [drm] Fence fallback timer expired on ring gfx_0.0.0
Dec 26 16:39:04 aurorus kernel: [drm] Fence fallback timer expired on ring gfx_0.0.0
Dec 26 16:39:05 aurorus kernel: [drm] Fence fallback timer expired on ring sdma0
Dec 26 16:39:05 aurorus kernel: [drm] Fence fallback timer expired on ring sdma1
Dec 26 16:39:21 aurorus kernel: [drm] Fence fallback timer expired on ring sdma0
Dec 26 16:39:21 aurorus kernel: [drm] Fence fallback timer expired on ring comp_1.3.0
Dec 26 16:39:21 aurorus kernel: [drm] Fence fallback timer expired on ring sdma0
Dec 26 16:39:23 aurorus kernel: [drm] Fence fallback timer expired on ring sdma1
Dec 26 16:39:30 aurorus kernel: [drm] Fence fallback timer expired on ring sdma0
Dec 26 16:39:30 aurorus kernel: [drm] Fence fallback timer expired on ring sdma1
Dec 26 16:39:30 aurorus kernel: [drm] Fence fallback timer expired on ring gfx_0.0.0
Dec 26 16:39:31 aurorus kernel: [drm] Fence fallback timer expired on ring sdma0
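To make excerpts like the one above easier to compare across boots, here is a minimal sketch in Python that counts the fence timeouts per ring and lists any PCIe port that failed to wake from D3cold. The two regular expressions are copied from the message text in my logs; the function name `summarize` and the `SAMPLE` list are only for illustration and are not part of any tool.

```python
import re
from collections import Counter

# Patterns taken from the message text seen in the excerpt above.
FENCE_RE = re.compile(r"Fence fallback timer expired on ring (\S+)")
D3COLD_RE = re.compile(
    r"pcieport (\S+): Unable to change power state from D3cold to D0"
)


def summarize(lines):
    """Return (fence timeouts per ring, PCIe ports that failed to wake)."""
    rings = Counter()
    ports = []
    for line in lines:
        fence = FENCE_RE.search(line)
        if fence:
            rings[fence.group(1)] += 1
            continue
        d3cold = D3COLD_RE.search(line)
        if d3cold:
            ports.append(d3cold.group(1))
    return rings, ports


# A few lines copied from the log excerpt, used as sample input.
SAMPLE = [
    "Dec 26 16:38:53 aurorus kernel: [drm] Fence fallback timer expired on ring sdma1",
    "Dec 26 16:38:53 aurorus kernel: [drm] Fence fallback timer expired on ring sdma0",
    "Dec 26 16:38:54 aurorus kernel: pcieport 0000:18:00.0: Unable to change power state from D3cold to D0, device inaccessible",
    "Dec 26 16:38:55 aurorus kernel: [drm] Fence fallback timer expired on ring gfx_0.0.0",
    "Dec 26 16:38:57 aurorus kernel: [drm] Fence fallback timer expired on ring sdma0",
]

rings, ports = summarize(SAMPLE)
for ring, count in rings.most_common():
    print(f"{ring}: {count} fence timeout(s)")
for port in ports:
    print(f"port {port} could not leave D3cold")
```

To run it on a full log, the lines of a saved dmesg file can be passed instead of `SAMPLE`, for example `summarize(open("dmesg.amdgpu.runpm.0.log"))`.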
What I do not understand is why there is even an attempt to power down the GPU. It's clearly in use and should not be turned off.
Could this be a KDE issue? Maybe some overly esoteric power scheduling? It's neither a Wayland nor an X11 issue, as I can observe it on both.
I tried turning power management off.
I also see this message in conjunction with the freeze:
Dec 27 02:19:08 aurorus kernel: sched: RT throttling activated
I tried turning that off too, though I do not think I have yet run into the freeze while it was turned off (this is still being tested).
edit: (update after a long period of analysis)
Root cause analysis
After about two weeks of trying multiple things I finally arrived at a workaround that does seem to work. Apparently the PCIe port power management is causing the system freeze. It seems that changing the power state from D3cold to D0 is not supported in my system configuration. It is still unclear to me why my system decides to switch power state like that.
Turning off PCIe port power management seems to be a feasible workaround.
I have been searching around for more information about this. I found a mailing list discussion about this topic here:
https://lore.kernel.org/all/CADnq5_PCTjUNwRHwb7sAynqRF98w=e09eYHbck3SFsvC-CgPzQ@mail.gmail.com/T/
However, I am not sure whether this patch ever made it into mainline.
Workaround
It seems PCIe port power management is causing the issue.
Setting the following kernel parameter stabilizes the system:
pcie_port_pm=off
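After editing the boot loader configuration it is worth confirming that the parameter actually reached the kernel. The following is a small sketch, assuming a Linux system where `/proc/cmdline` holds the boot command line; the helper name `has_kernel_param` is made up for this example.

```python
from pathlib import Path


def has_kernel_param(cmdline: str, param: str) -> bool:
    """Check whether `param` appears as a whole word on the kernel command line."""
    return param in cmdline.split()


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        active = has_kernel_param(proc_cmdline.read_text(), "pcie_port_pm=off")
        print("pcie_port_pm=off is", "active" if active else "NOT active")
    else:
        # Not a Linux system (or /proc is not mounted); nothing to check.
        print("/proc/cmdline not found")
```

Matching on whole words avoids a false positive if some other parameter merely contains the same text.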
I am renaming the issue to reflect the PCIe port problem.