drm / amd / Issues / #1821

Created Dec 04, 2021 by Scott Bruce (@smbruce)

s0ix suspend regression using 5.16-rc (Cezanne)

Brief summary of the problem:

I'm experiencing a very high rate of suspend failures while testing the upcoming 5.16-rc kernels. 5.15.y has been fairly reliable for a month or two: I might see one suspend failure in 50+ attempts. With the 5.16-rcs the failure rate is probably closer to 1 in 3.

All 5.16-rc kernels tested so far are affected.

Hardware description:

  • ASUS Zephyrus G15 (GA503QR, Cezanne platform)
  • CPU: 5900HS
  • GPU:
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA104M [GeForce RTX 3070 Mobile / Max-Q] [10de:249d] (rev a1)
07:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [1002:1638] (rev c4)
  • System Memory: 16GB
  • Display(s): Onboard laptop display
  • Type of Display Connection: eDP

System information:

  • Distro name and Version: Arch Linux
  • Kernel version: 5.16-rc1, -rc2, -rc3
  • Custom kernel: mainline kernel.org 5.16-rc
  • AMD official driver version: in-kernel amdgpu driver

How to reproduce the issue:

Cold boot the machine with a 5.16-rc kernel, wait for it to settle after boot, suspend, wait a few seconds (5 s or 10 s), then wake. Repeat suspend/wake until the machine crashes, usually within 3-4 suspend attempts.
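For anyone who wants to automate the suspend/wake loop above, here is a rough sketch. It makes assumptions that are not part of the original report: it uses util-linux `rtcwake -m freeze -s N` to enter suspend-to-idle and wake on an RTC alarm (the report does not say how suspend and wake were triggered), it needs root, and the function names and the `settle_seconds` default are made up for illustration. If the machine locks up on resume, the last cycle number printed is the failing attempt.

```python
import subprocess
import time


def rtcwake_cmd(sleep_seconds):
    # Assumption: "freeze" is rtcwake's name for suspend-to-idle (s2idle).
    return ["rtcwake", "-m", "freeze", "-s", str(sleep_seconds)]


def suspend_cycles(max_cycles, sleep_seconds=10, settle_seconds=30,
                   run=subprocess.run, sleep=time.sleep):
    """Suspend and wake repeatedly.

    Returns the 1-based cycle number at which rtcwake reported an error,
    or None if every cycle completed. `run` and `sleep` are injectable
    so the loop can be exercised without actually suspending.
    """
    for cycle in range(1, max_cycles + 1):
        sleep(settle_seconds)  # let the machine settle before suspending
        print(f"suspend attempt {cycle}", flush=True)
        result = run(rtcwake_cmd(sleep_seconds))
        if result.returncode != 0:
            return cycle
    return None
```

Calling `suspend_cycles(20, sleep_seconds=5)` as root would attempt up to 20 suspends of 5 seconds each. A hard hang on resume will not return at all, so the journal from the previous boot still has to be checked afterwards.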

Attached files:

Log files (for system lockups / game freezes / crashes)

All of my logged failures show the same error:

  Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resuming...
  Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: amdgpu: dpm has been disabled
  Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resumed successfully!
  Nov 28 23:43:56 arch-zephyrus kernel: [drm] DMUB hardware initialized: version=0x0101001C
  Nov 28 23:43:56 arch-zephyrus kernel: nvme nvme0: 16/0/0 default/read/poll queues
  Nov 28 23:43:56 arch-zephyrus kernel: nvme nvme1: 16/0/0 default/read/poll queues
  Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 test failed (-110)
  Nov 28 23:43:56 arch-zephyrus kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* resume of IP block <sdma_v4_0> failed -110
  Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: amdgpu: amdgpu_device_ip_resume failed (-110).
  Nov 28 23:43:56 arch-zephyrus kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0x180 returns -110
  Nov 28 23:43:56 arch-zephyrus kernel: amdgpu 0000:07:00.0: PM: failed to resume async: error -110
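To check a saved journal for the same signature, a small scan like the following may help. It is only a sketch: the regular expressions are written against the exact message text quoted above and may need adjusting for other kernel versions, and the function names are made up for illustration. On Linux, error code -110 is -ETIMEDOUT, i.e. the sdma0 ring test timed out.

```python
import re

# Linux errno values relevant to the log above; 110 is ETIMEDOUT.
ERRNO_NAMES = {110: "ETIMEDOUT"}

# Patterns written against the message text quoted in this report.
RING_RE = re.compile(r"\*ERROR\* ring (\S+) test failed \((-?\d+)\)")
BLOCK_RE = re.compile(r"\*ERROR\* resume of IP block <(\w+)> failed (-?\d+)")


def find_resume_failures(lines):
    """Return (kind, name, code) tuples for amdgpu resume errors."""
    failures = []
    for line in lines:
        m = RING_RE.search(line)
        if m:
            failures.append(("ring", m.group(1), int(m.group(2))))
            continue
        m = BLOCK_RE.search(line)
        if m:
            failures.append(("ip_block", m.group(1), int(m.group(2))))
    return failures


def describe(code):
    """Map a negative kernel return code to an errno name, if known."""
    return ERRNO_NAMES.get(-code, "unknown")
```

Feeding it the output of `journalctl -k -b -1` (the kernel log from the previous boot) line by line should list the failing ring and IP block if the previous boot ended in this failure.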

Attached tarball contains kernel journal and STB captures pre/post suspend for several failures:

suspend-failures-5.16.tgz

Under 2020-11-28-5.16-suspend-fail/good/ there are STB captures from two successful 5.15.y suspends for comparison.

Notes:

All boots include pm_debug_messages amd_pmc.enable_stb=1 amd_pmc.dyndbg="+p" acpi.dyndbg="file drivers/acpi/x86/s2idle.c +p" even if not shown on the kernel command line in the journal. Those parameters are baked into the kernel config directly using:

CONFIG_CMDLINE_BOOL=y
CONFIG_CMDLINE="pm_debug_messages amd_pmc.enable_stb=1 amd_pmc.dyndbg=\"+p\" acpi.dyndbg=\"file drivers/acpi/x86/s2idle.c +p\""
# CONFIG_CMDLINE_OVERRIDE is not set
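Because the Kconfig string escapes its inner quotes, it can be hard to see at a glance which parameters the kernel actually receives. The sketch below unescapes a `CONFIG_CMDLINE` line and splits it into individual parameters. It is an approximation: `shlex.split` is used as a stand-in for the kernel's own quote handling, and the function name is made up for illustration.

```python
import re
import shlex

# Matches only the string option, not CONFIG_CMDLINE_BOOL or _OVERRIDE.
CMDLINE_RE = re.compile(r'^CONFIG_CMDLINE="(.*)"$')


def builtin_cmdline_params(config_line):
    """Split a CONFIG_CMDLINE=... line into individual boot parameters.

    Returns an empty list if the line is not a CONFIG_CMDLINE string.
    """
    m = CMDLINE_RE.match(config_line.strip())
    if not m:
        return []
    # Kconfig strings escape embedded quotes and backslashes with a backslash.
    unescaped = re.sub(r'\\(["\\])', r"\1", m.group(1))
    # Approximation: shlex groups quoted values much like the kernel does,
    # dropping the surrounding quotes.
    return shlex.split(unescaped)
```

Applied to the line above, this yields four parameters: `pm_debug_messages`, `amd_pmc.enable_stb=1`, `amd_pmc.dyndbg=+p`, and `acpi.dyndbg=file drivers/acpi/x86/s2idle.c +p`.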

I've tested with both linux-firmware 2021-10-27 and linux-firmware-git 2021-11-23.

The exact same system configuration suspends (mostly) reliably using 5.15.y.
