s0ix suspend regression using 5.16-rc (Cezanne)
Brief summary of the problem:
I'm seeing a very high rate of suspend failures while testing the upcoming 5.16-rc kernels. 5.15.y has been fairly reliable for a month or two: I might see one suspend failure in 50+ attempts. With the 5.16-rcs the failure rate is probably closer to 1 in 3.
All 5.16-rc kernels tested so far are affected.
Hardware description:
- ASUS Zephyrus G15 (GA503QR, Cezanne platform)
- CPU: 5900HS
- GPU:
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA104M [GeForce RTX 3070 Mobile / Max-Q] [10de:249d] (rev a1)
07:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [1002:1638] (rev c4)
- System Memory: 16GB
- Display(s): Onboard laptop display
- Type of Display Connection: eDP
System information:
- Distro name and Version: Arch Linux
- Kernel version: 5.16-rc1, -rc2, -rc3
- Custom kernel: mainline kernel.org 5.16-rc
- AMD official driver version: in-kernel amdgpu driver
How to reproduce the issue:
Cold boot the machine with a 5.16-rc kernel, wait for it to settle after boot, suspend, wait 5 to 10 seconds, then wake. Repeat the suspend/wake cycle until the machine crashes, usually within about 3-4 suspend attempts.
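For anyone who wants to automate the cycle above, here is a minimal sketch. It is not how I ran my tests (I suspended and woke the machine by hand). It assumes that `rtcwake` from util-linux is installed, that the script runs as root, and that an RTC alarm can wake this platform from s2idle (`-m freeze`). The helper names `rtcwake_cmd` and `run_cycles` and the 30 second settle time are made up for illustration.

```python
import subprocess
import time


def rtcwake_cmd(sleep_seconds):
    """Build an rtcwake command for suspend-to-idle with an RTC wake alarm."""
    # "-m freeze" requests s2idle; "-s" sets the wake alarm in seconds.
    return ["rtcwake", "-m", "freeze", "-s", str(sleep_seconds)]


def run_cycles(count, sleep_seconds=10, settle_seconds=30,
               runner=subprocess.run, pause=time.sleep):
    """Run up to `count` suspend/wake cycles and return how many completed.

    Stops early if rtcwake exits non-zero. If resume hangs the machine,
    the loop never returns, which is itself the failure indication.
    """
    completed = 0
    for _ in range(count):
        result = runner(rtcwake_cmd(sleep_seconds))
        if result.returncode != 0:
            break
        completed += 1
        # Assumed settle time between cycles; adjust to taste.
        pause(settle_seconds)
    return completed
```

Calling `run_cycles(10, sleep_seconds=5)` as root would attempt ten cycles with a 5 second sleep each. The `runner` and `pause` parameters only exist so the loop can be exercised without actually suspending.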
Attached files:
Log files (for system lockups / game freezes / crashes)
All of my logged failures show the same error on resume, where the sdma0 ring test fails with -110 (ETIMEDOUT) and the amdgpu resume is aborted:
Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resuming...
Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: amdgpu: dpm has been disabled
Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resumed successfully!
Nov 28 23:43:56 arch-zephyrus kernel: [drm] DMUB hardware initialized: version=0x0101001C
Nov 28 23:43:56 arch-zephyrus kernel: nvme nvme0: 16/0/0 default/read/poll queues
Nov 28 23:43:56 arch-zephyrus kernel: nvme nvme1: 16/0/0 default/read/poll queues
Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 test failed (-110)
Nov 28 23:43:56 arch-zephyrus kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* resume of IP block <sdma_v4_0> failed -110
Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: amdgpu: amdgpu_device_ip_resume failed (-110).
Nov 28 23:43:56 arch-zephyrus kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0x180 returns -110
Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: PM: failed to resume async: error -110
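To check other journals for the same signature, something like the following sketch could be used. The two patterns are taken only from the excerpt above and may not match other kernel versions or other failure modes. The names `RING_FAIL`, `IP_FAIL`, and `find_resume_failures` are illustrative.

```python
import re

# Patterns derived from the journal excerpt above. -110 is ETIMEDOUT.
RING_FAIL = re.compile(r"\*ERROR\* ring (\S+) test failed \((-?\d+)\)")
IP_FAIL = re.compile(r"\*ERROR\* resume of IP block <(\S+?)> failed (-?\d+)")


def find_resume_failures(lines):
    """Return (kind, name, error_code) tuples for matching journal lines."""
    hits = []
    for line in lines:
        m = RING_FAIL.search(line)
        if m:
            hits.append(("ring", m.group(1), int(m.group(2))))
            continue
        m = IP_FAIL.search(line)
        if m:
            hits.append(("ip_block", m.group(1), int(m.group(2))))
    return hits


# Three lines copied from the excerpt above, used as a quick self-check.
SAMPLE = [
    "Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resumed successfully!",
    "Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 test failed (-110)",
    "Nov 28 23:43:56 arch-zephyrus kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* resume of IP block <sdma_v4_0> failed -110",
]
```

The function accepts any iterable of lines, so it can be fed an open file containing the output of `journalctl -k` saved from the failing boot.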
The attached tarball contains the kernel journal and STB captures taken before and after suspend for several failures.
Under 2020-11-28-5.16-suspend-fail/good/ there are STB captures from two successful 5.15.y suspends for comparison.
Notes:
All boots include pm_debug_messages amd_pmc.enable_stb=1 amd_pmc.dyndbg="+p" acpi.dyndbg="file drivers/acpi/x86/s2idle.c +p"
even if not shown on the kernel command line in the journal. Those parameters are baked into the kernel config directly using:
CONFIG_CMDLINE_BOOL=y
CONFIG_CMDLINE="pm_debug_messages amd_pmc.enable_stb=1 amd_pmc.dyndbg=\"+p\" acpi.dyndbg=\"file drivers/acpi/x86/s2idle.c +p\""
# CONFIG_CMDLINE_OVERRIDE is not set
I've tested with both linux-firmware 2021-10-27 and linux-firmware-git 2021-11-23.
The exact same system configuration suspends (mostly) reliably using 5.15.y.