s0ix suspend regression using 5.16-rc (Cezanne)
Brief summary of the problem:
I'm experiencing a very high rate of suspend failures while testing the upcoming 5.16-rc kernels. 5.15.y has been fairly reliable for a month or two: I might see one suspend failure in 50+ attempts. With the 5.16-rcs the failure rate is probably closer to 1 in 3.
All 5.16-rc kernels tested so far are affected.
- ASUS Zephyrus G15 (GA503QR, Cezanne platform)
- CPU: 5900HS
- GPU: 01:00.0 VGA compatible controller: NVIDIA Corporation GA104M [GeForce RTX 3070 Mobile / Max-Q] [10de:249d] (rev a1)
- GPU: 07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [1002:1638] (rev c4)
- System Memory: 16GB
- Display(s): Onboard laptop display
- Type of Display Connection: eDP
- Distro name and Version: Arch Linux
- Kernel version: 5.16-rc1, -rc2, -rc3
- Custom kernel: mainline kernel.org 5.16-rc
- AMD official driver version: in-kernel amdgpu driver
How to reproduce the issue:
Cold boot the machine with a 5.16-rc kernel, wait for it to settle after boot, suspend, wait a few seconds (5 s or 10 s), then wake. Repeat suspend/wake until the machine crashes, usually within about 3-4 suspend attempts.
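The loop above could be scripted along these lines. This is a minimal sketch, not the procedure actually used for this report: it assumes util-linux `rtcwake` is installed and uses `rtcwake -m freeze -s N` to enter suspend-to-idle and wake on the RTC alarm after N seconds. The function names, the 30 s settle time and the cycle count are illustrative choices, and the script needs to run as root.

```python
import subprocess
import time


def suspend_once(wake_after_s):
    """Enter suspend-to-idle and wake via the RTC alarm.

    Assumption: util-linux rtcwake is available and we are root.
    """
    cmd = ["rtcwake", "-m", "freeze", "-s", str(wake_after_s)]
    return subprocess.run(cmd).returncode == 0


def run_cycles(max_cycles, wake_after_s=10, settle_s=30,
               suspend=suspend_once, sleep=time.sleep):
    """Repeat suspend/wake and return the number of cycles that completed.

    If the machine locks up on resume the script simply stops, so the
    last "cycle N" line printed identifies the failing attempt.
    """
    completed = 0
    for cycle in range(1, max_cycles + 1):
        sleep(settle_s)  # let the machine settle before suspending
        print(f"cycle {cycle}: suspending for {wake_after_s}s", flush=True)
        if not suspend(wake_after_s):
            break
        completed += 1
    return completed
```

Calling `run_cycles(max_cycles=10, wake_after_s=5)` as root would attempt ten suspend/wake cycles. The `suspend` and `sleep` arguments are injectable only so the loop logic can be exercised without actually suspending.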
Log files (for system lockups / game freezes / crashes)
All of my logged failures show the same error:
Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resuming...
Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: amdgpu: dpm has been disabled
Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resumed successfully!
Nov 28 23:43:56 arch-zephyrus kernel: [drm] DMUB hardware initialized: version=0x0101001C
Nov 28 23:43:56 arch-zephyrus kernel: nvme nvme0: 16/0/0 default/read/poll queues
Nov 28 23:43:56 arch-zephyrus kernel: nvme nvme1: 16/0/0 default/read/poll queues
Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 test failed (-110)
Nov 28 23:43:56 arch-zephyrus kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* resume of IP block <sdma_v4_0> failed -110
Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: amdgpu: amdgpu_device_ip_resume failed (-110).
Nov 28 23:43:56 arch-zephyrus kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0x180 returns -110
Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: PM: failed to resume async: error -110
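For anyone comparing their own logs against this signature, the two amdgpu error lines can be picked out of a saved journal with a short script. This is only a sketch: the regular expressions are derived from the messages quoted above, and `find_resume_failures` is a hypothetical helper name. The error code -110 in these lines corresponds to -ETIMEDOUT on Linux.

```python
import re

# Patterns derived from the amdgpu messages quoted in this report.
RING_FAIL = re.compile(r"ring (\S+) test failed \((-?\d+)\)")
BLOCK_FAIL = re.compile(r"resume of IP block <(\w+)> failed (-?\d+)")


def find_resume_failures(log_text):
    """Return (kind, name, error_code) tuples for each matching log line."""
    hits = []
    for line in log_text.splitlines():
        m = RING_FAIL.search(line)
        if m:
            hits.append(("ring", m.group(1), int(m.group(2))))
        m = BLOCK_FAIL.search(line)
        if m:
            hits.append(("ip_block", m.group(1), int(m.group(2))))
    return hits
```

The input could be, for example, the text saved from `journalctl -k -b -1` after a failed resume and reboot.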
The attached tarball contains the kernel journal and pre/post-suspend STB captures for several failures.
Under 2020-11-28-5.16-suspend-fail/good/ there are STB captures from two successful 5.15.y suspends for comparison.
All boots include the following parameters, even if they are not shown on the kernel command line in the journal:
pm_debug_messages amd_pmc.enable_stb=1 amd_pmc.dyndbg="+p" acpi.dyndbg="file drivers/acpi/x86/s2idle.c +p"
Those parameters are baked into the kernel config directly using:
CONFIG_CMDLINE_BOOL=y
CONFIG_CMDLINE="pm_debug_messages amd_pmc.enable_stb=1 amd_pmc.dyndbg=\"+p\" acpi.dyndbg=\"file drivers/acpi/x86/s2idle.c +p\""
# CONFIG_CMDLINE_OVERRIDE is not set
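Anyone reproducing with the same debug parameters passed through the boot loader instead may want to check that all of them are present on a given command line string. The following is a rough sketch under stated assumptions: `missing_params` is a hypothetical helper, and Python's `shlex` quoting rules are used as an approximation of how the kernel splits quoted parameters, which is adequate for the four parameters listed here but not guaranteed in general.

```python
import shlex

# The debug parameters listed in this report.
REQUIRED = (
    'pm_debug_messages amd_pmc.enable_stb=1 amd_pmc.dyndbg="+p" '
    'acpi.dyndbg="file drivers/acpi/x86/s2idle.c +p"'
)


def missing_params(cmdline, required=REQUIRED):
    """Return the required parameters that do not appear in cmdline.

    Quotes are stripped by shlex, so amd_pmc.dyndbg="+p" is compared
    as amd_pmc.dyndbg=+p on both sides.
    """
    present = set(shlex.split(cmdline))
    return [p for p in shlex.split(required) if p not in present]
```

The `cmdline` argument could be the contents of `/proc/cmdline` or the command line printed in the journal. Whether a built-in `CONFIG_CMDLINE` shows up in either place is not something this sketch assumes.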
I've tested with both linux-firmware 2021-10-27 and linux-firmware-git 2021-11-23.
The exact same system configuration suspends (mostly) reliably using 5.15.y.