Dell G5 15 SE (R7 4800H/Renoir/RX 5600M) sometimes crashes when going down for suspend, even after acpi-d3.patch

Brief summary of the problem:

I have a Dell G5 15 SE running most 5.9-rc and 5.9 release kernels with the acpi-d3.patch posted in other issues. The laptop occasionally hits a kernel crash while going down for suspend, and then won't wake back up when resumed via the keyboard or otherwise. This happens intermittently, so it is difficult to reproduce; usually I can issue a suspend from the command line or the window manager and the system suspends gracefully and can be resumed.

Hardware description:

CPU: Ryzen 7 4800H
GPU: Renoir iGPU / RX 5600M dGPU
System Memory: 64GB
Display(s): Laptop flat panel + 4K ext display
Type of Display Connection: miniDP

System information:

Distro name and Version: Arch
Kernels: 5.9 linux-mainline (using amdgpu.runpm=0, as well as patched with acpi-d3.patch and no amdgpu.runpm setting); I additionally see this on amd-staging-drm-next as of commit d79019080b9699e1f157b1b2ae790036cc40adfb with the same acpi-d3.patch applied
AMD package version: No package, just using open source drivers
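
Since the two kernel configurations above differ in whether amdgpu.runpm is set, it can help to record which one a given boot actually used. A minimal sketch, assuming the parameter is passed on the kernel command line (the function name runpm_from_cmdline is made up for illustration):

```python
def runpm_from_cmdline(cmdline: str):
    """Return the amdgpu.runpm value found in a kernel command line, or None.

    Only looks at the command-line text it is given; it does not know about
    values set elsewhere (for example via modprobe configuration).
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "amdgpu.runpm" and sep:
            value = val  # keep the last occurrence if the option is repeated
    return value
```

On a running system the input would typically be the contents of /proc/cmdline; a result of None means the option was not given there.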

How to reproduce the issue:

I've set the window manager (Enlightenment) to suspend about 15 seconds after screen blanking (which occurs after 5 minutes). The system is occasionally unresponsive when I try to resume it with a mouse movement or key press. I was able to get the kernel messages from journalctl -x -b -1 and observed the following stack trace:

 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 47766 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1750 dm_suspend+0x1a7/0x1c0 [amdgpu]
 Modules linked in: fuse ebtable_filter ebtables ip6table_filter ip6_tables ccm iptable_nat nf_nat nf_conntrack cmac algif_hash nf_defrag_ipv6 algif_skcipher nf_defrag_ipv4 libcrc32c crc32c_generic af_alg iptable_mangle bnep iptable_filter amdgpu iwlmvm mac80211 snd_soc_dmic snd_acp3x_pdm_dma snd_acp3x_rn snd_soc_core snd_compress edac_mce_amd dell_wmi ac97_bus alienware_wmi sparse_keymap wmi_bmof snd_pcm_dmaengine kvm_amd snd_hda_codec_realtek libarc4 iwlwifi snd_hda_codec_generic snd_hda_codec_hdmi uvcvideo kvm btusb snd_hda_intel videobuf2_vmalloc gpu_sched videobuf2_memops btrtl snd_intel_dspcfg dell_laptop videobuf2_v4l2 i2c_algo_bit btbcm ttm videobuf2_common ledtrig_audio snd_hda_codec snd_usb_audio btintel dell_smbios snd_usbmidi_lib dell_wmi_descriptor bluetooth irqbypass cfg80211 videodev hid_multitouch crct10dif_pclmul drm_kms_helper snd_hda_core r8169 dcdbas nls_iso8859_1 snd_rawmidi nls_cp437 realtek vfat crc32_pclmul snd_hwdep fat crc32c_intel snd_seq_device
  snd_pcm mdio_devres ghash_clmulni_intel of_mdio psmouse aesni_intel cec fixed_phy crypto_simd cryptd snd_timer tpm_crb libphy glue_helper mc ecdh_generic ccp rc_core snd ucsi_acpi tpm_tis rapl ecc typec_ucsi tpm_tis_core crc16 syscopyarea input_leds tpm sysfillrect sp5100_tco joydev mousedev wmi snd_rn_pci_acp3x sysimgblt k10temp typec fb_sys_fops snd_pci_acp3x evdev soundcore mac_hid i2c_piix4 battery dell_rbtn i2c_hid rfkill rng_core acpi_tad acpi_cpufreq pinctrl_amd ac drm agpgart crypto_user ip_tables x_tables hid_generic usbhid hid zfs(POE) zunicode(POE) zzstd(OE) zlua(POE) zavl(POE) icp(POE) xhci_pci xhci_pci_renesas xhci_hcd serio_raw atkbd zcommon(POE) znvpair(POE) libps2 spl(OE) i8042 serio
 CPU: 0 PID: 47766 Comm: kworker/0:1 Tainted: P           OE     5.9.0-rc2-amdgpu-13534-g70b4e7dd361b #22
 Hardware name: Dell Inc. G5 5505/06WDJ9, BIOS 1.3.0 06/11/2020
 Workqueue: pm pm_runtime_work
 RIP: 0010:dm_suspend+0x1a7/0x1c0 [amdgpu]
 Code: ff 31 d2 4c 89 e6 4c 89 ff e8 85 54 10 00 83 f8 01 74 1e 89 c2 48 c7 c6 e0 10 1d c2 48 c7 c7 90 1b 25 c2 e8 0b 35 a8 fe eb c2 <0f> 0b e9 8e fe ff ff 4c 89 e6 4c 89 ff e8 47 9a 0f 00 eb ae e8 c0
 RSP: 0018:ffffb0c62c06bc40 EFLAGS: 00010282
 RAX: 0000000000000000 RBX: ffff92400ecb6320 RCX: 0000000000000000
 RDX: 000000000000000a RSI: 0000000000000ff8 RDI: ffff92400eca0000
 RBP: ffff92400eca0000 R08: 0000000000000000 R09: ffff923f73d13f2c
 R10: 0000000000000018 R11: 0000000000000018 R12: ffff92400eca0000
 R13: 0000000000000001 R14: 0000000000000000 R15: ffff92408f42bd30
 FS:  0000000000000000(0000) GS:ffff92408f400000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00000d6ccb11c000 CR3: 0000000f51b04000 CR4: 0000000000350ef0
 Call Trace:
  ? nv_common_set_clockgating_state+0xd2/0x220 [amdgpu]
  amdgpu_device_ip_suspend_phase1+0x75/0xd0 [amdgpu]
  amdgpu_device_suspend+0x89/0x2b0 [amdgpu]
  amdgpu_pmops_runtime_suspend+0x9d/0x140 [amdgpu]
  pci_pm_runtime_suspend+0x5e/0x170
  ? vga_switcheroo_runtime_resume+0x60/0x60
  vga_switcheroo_runtime_suspend+0x22/0xb0
  ? vga_switcheroo_runtime_resume+0x60/0x60
  ? vga_switcheroo_runtime_resume+0x60/0x60
  __rpm_callback+0x7b/0x130
  ? vga_switcheroo_runtime_resume+0x60/0x60
  rpm_callback+0x1f/0x70
  ? vga_switcheroo_runtime_resume+0x60/0x60
  rpm_suspend+0x177/0x6e0
  pm_runtime_work+0x94/0xa0
  process_one_work+0x1da/0x3d0
  worker_thread+0x4d/0x3d0
  ? rescuer_thread+0x410/0x410
  kthread+0x142/0x160
  ? __kthread_bind_mask+0x60/0x60
  ret_from_fork+0x22/0x30
 ---[ end trace f6cd26d8ac2ece46 ]---

The following kernel messages lead up to the crash. They appear at about the point where the suspend would kick in, roughly 5 minutes before the stack trace listed above:

 [drm] PSP is resuming...
 [drm] reserve 0x900000 from 0x800f400000 for PSP TMR
 amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
 amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
 amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
 amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
 [drm] kiq ring mec 2 pipe 1 q 0
 [drm] VCN decode and encode initialized successfully(under DPG Mode).
 [drm] JPEG decode initialized successfully.
 amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
 amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
 amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
 amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
 amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
 amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
 amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
 amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
 amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
 amdgpu 0000:03:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
 amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
 amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
 amdgpu 0000:03:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 1
 amdgpu 0000:03:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 1
 amdgpu 0000:03:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 1
 amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
 [drm] free PSP TMR buffer
 [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
 [drm] PSP is resuming...
 [drm] reserve 0x900000 from 0x800f400000 for PSP TMR
 amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
 amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
 amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
 amdgpu 0000:03:00.0: amdgpu: failed send message:     RunBtc (58)         param: 0x00000000 response 0xffffffc2
 amdgpu 0000:03:00.0: amdgpu: RunBtc failed!
 amdgpu 0000:03:00.0: amdgpu: Failed to setup smc hw!
 [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <smu> failed -62
 amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
 snd_hda_intel 0000:03:00.1: refused to change power state from D3hot to D0
 snd_hda_intel 0000:03:00.1: CORB reset timeout#2, CORBRP = 65535

I cannot confirm this, but it is possible that this sequence occurs on the first suspend attempt: the hardware suspend hits a failure and the system resumes, then the window manager hits its suspend timeout again, the system tries to suspend a second time, and that attempt hits the crash posted near the top. I'm not sure, but looking at the timestamps, that would explain the nearly 5-minute delay between the "refused to change power state" error and the stack trace posted above.
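
To check the ordering described above on other boots, the saved kernel log can be scanned for the specific messages quoted in this report. This is only a rough sketch: the marker names and the find_markers helper are made up for illustration, and the patterns are copied from the log lines above, so they may not match other kernel versions.

```python
import re

# Patterns taken from the messages quoted in this report; the keys are
# illustrative labels, not anything the kernel itself prints.
MARKERS = {
    "runbtc_failed": re.compile(r"RunBtc failed!"),
    "smu_resume_failed": re.compile(r"resume of IP block <smu> failed"),
    "d3hot_refused": re.compile(r"refused to change power state from D3hot to D0"),
    "dm_suspend_warning": re.compile(r"WARNING: .*dm_suspend"),
}


def find_markers(lines):
    """Return (line_index, marker_name) pairs in the order they appear."""
    hits = []
    for idx, line in enumerate(lines):
        for name, pattern in MARKERS.items():
            if pattern.search(line):
                hits.append((idx, name))
    return hits
```

Feeding it the lines of something like journalctl -k -b -1 -o short-monotonic would keep a timestamp on each matching line, which makes the gap between the SMU resume failure and the dm_suspend warning easy to read off.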

Attached files:

dmesg-amd-staging-drm-next.txt dmesg

Edited Oct 14, 2020 by Coleman Kane