amdgpu crashes when monitor goes to sleep
I'm experiencing crashes in amdgpu when the screensaver blanks the screen and the monitor goes to sleep. This started with kernel 5.7 and is still present in 5.8-rc1.
The machine is a POWER9-based Talos II with a Radeon Pro WX 4100.
Kernel log from 5.8-rc1:
...
čen 16 07:28:22 talos.danny.cz kernel: snd_hda_intel 0000:01:00.1: refused to change power state from D0 to D3hot
čen 16 07:28:43 talos.danny.cz kernel: snd_hda_intel 0000:01:00.1: refused to change power state from D0 to D3hot
čen 16 08:01:20 talos.danny.cz kernel: broken atomic modeset userspace detected, disabling atomic
čen 16 08:01:25 talos.danny.cz kernel: snd_hda_intel 0000:01:00.1: refused to change power state from D0 to D3hot
čen 16 08:01:49 talos.danny.cz kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=65044, emitted seq=65046
čen 16 08:01:49 talos.danny.cz kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
čen 16 08:01:49 talos.danny.cz kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
čen 16 08:01:50 talos.danny.cz kernel: amdgpu 0000:01:00.0: amdgpu: GPU BACO reset
čen 16 08:01:51 talos.danny.cz kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset succeeded, trying to resume
čen 16 08:01:51 talos.danny.cz kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
čen 16 08:01:51 talos.danny.cz kernel: [drm] VRAM is lost due to GPU reset!
čen 16 08:01:51 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.1 test failed (-110)
čen 16 08:01:51 talos.danny.cz kernel: [drm] UVD and UVD ENC initialized successfully.
čen 16 08:01:51 talos.danny.cz kernel: [drm] VCE initialized successfully.
čen 16 08:01:51 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on uvd (-110).
čen 16 08:01:51 talos.danny.cz kernel: amdgpu 0000:01:00.0: amdgpu: ib ring test failed (-110).
čen 16 08:02:10 talos.danny.cz kernel: [TTM] Buffer eviction failed
čen 16 08:02:10 talos.danny.cz kernel: amdgpu: Trying to disable SCLK DPM when DPM is disabled
čen 16 08:02:10 talos.danny.cz kernel: amdgpu: Trying to disable voltage DPM when DPM is disabled
čen 16 08:02:11 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
čen 16 08:02:11 talos.danny.cz kernel: [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
čen 16 08:02:11 talos.danny.cz kernel: ------------[ cut here ]------------
čen 16 08:02:11 talos.danny.cz kernel: WARNING: CPU: 29 PID: 341 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1005 amdgpu_bo_unpin+0x1a4/0x200 [amdgpu]
čen 16 08:02:11 talos.danny.cz kernel: Modules linked in: kvm_hv kvm xt_CHECKSUM xt_MASQUERADE tun nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache nf_nat_tftp nf_conntrack_tftp xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bridge stp llc rfkill ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter i2c_dev dm_crypt snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_seq ftdi_sio snd_seq_device snd_pcm snd_timer at24 snd soundcore ses regmap_i2c enclosure vmx_crypto scsi_transport_sas ofpart powernv_flash i2c_opal mtd ipmi_powernv ipmi_devintf rtc_opal ipmi_msghandler opal_prd crct10dif_vpmsum ip_tables amdgpu
čen 16 08:02:11 talos.danny.cz kernel: raid1 mfd_core gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm crc32c_vpmsum nvme tg3 nvme_core drm_panel_orientation_quirks i2c_core aacraid xhci_pci xhci_pci_renesas fuse
čen 16 08:02:11 talos.danny.cz kernel: CPU: 29 PID: 341 Comm: kworker/29:1 Not tainted 5.8.0-0.rc1.1.fc33.ppc64le #1
čen 16 08:02:11 talos.danny.cz kernel: Workqueue: pm pm_runtime_work
čen 16 08:02:11 talos.danny.cz kernel: NIP: c008000007d948fc LR: c008000007d96ef0 CTR: c0080000068091d8
čen 16 08:02:11 talos.danny.cz kernel: REGS: c0000007f74f74d0 TRAP: 0700 Not tainted (5.8.0-0.rc1.1.fc33.ppc64le)
čen 16 08:02:11 talos.danny.cz kernel: MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28424422 XER: 00000000
čen 16 08:02:11 talos.danny.cz kernel: CFAR: c008000007d9479c IRQMASK: 0
GPR00: c008000007d96ef0 c0000007f74f7760 c0080000082df000 c0000007e5441c00
GPR04: 0000000000000000 0000000000000000 0000000000000000 c0000007f7438100
GPR08: 0000000000000000 0000000000000000 0000000000000000 c0080000081a2c30
GPR12: c0080000068091d8 c0000007fffde600 c000000001df6880 c000000001f93a00
GPR16: c00000000205c1d8 0000000000000000 c000200725b5a248 0000000000000000
GPR20: c000200725b5a1e0 0000000000000000 0000000000000000 00000000000f4240
GPR24: 0000000000000003 0000000000000000 0000000000000000 c0000007dc770000
GPR28: c0000007dc764f30 c0000007dc764f30 0000000000000000 c0000007e5441c00
čen 16 08:02:11 talos.danny.cz kernel: NIP [c008000007d948fc] amdgpu_bo_unpin+0x1a4/0x200 [amdgpu]
čen 16 08:02:11 talos.danny.cz kernel: LR [c008000007d96ef0] amdgpu_gart_table_vram_unpin+0x78/0x150 [amdgpu]
čen 16 08:02:11 talos.danny.cz kernel: Call Trace:
čen 16 08:02:11 talos.danny.cz kernel: [c0000007f74f7800] [c008000007d96ef0] amdgpu_gart_table_vram_unpin+0x78/0x150 [amdgpu]
čen 16 08:02:11 talos.danny.cz kernel: [c0000007f74f7880] [c008000007e49cf4] gmc_v8_0_gart_disable+0xcc/0xf0 [amdgpu]
čen 16 08:02:11 talos.danny.cz kernel: [c0000007f74f78b0] [c008000007e49db4] gmc_v8_0_suspend+0x3c/0x60 [amdgpu]
čen 16 08:02:11 talos.danny.cz kernel: [c0000007f74f78e0] [c008000007d712e0] amdgpu_device_ip_suspend_phase2+0xc8/0x1b0 [amdgpu]
čen 16 08:02:11 talos.danny.cz kernel: [c0000007f74f7970] [c008000007d7796c] amdgpu_device_suspend+0x3e4/0x520 [amdgpu]
čen 16 08:02:11 talos.danny.cz kernel: [c0000007f74f7a30] [c008000007d707dc] amdgpu_pmops_runtime_suspend+0xf4/0x1f0 [amdgpu]
čen 16 08:02:11 talos.danny.cz kernel: [c0000007f74f7a80] [c00000000098cad4] pci_pm_runtime_suspend+0x84/0x2d0
čen 16 08:02:11 talos.danny.cz kernel: [c0000007f74f7b10] [c000000000ae27f8] __rpm_callback+0x128/0x260
čen 16 08:02:11 talos.danny.cz kernel: [c0000007f74f7b60] [c000000000ae13cc] rpm_suspend+0x45c/0xa70
čen 16 08:02:11 talos.danny.cz kernel: [c0000007f74f7c30] [c000000000ae4fd8] pm_runtime_work+0x158/0x160
čen 16 08:02:11 talos.danny.cz kernel: [c0000007f74f7c60] [c00000000019bf00] process_one_work+0x300/0x5b0
čen 16 08:02:11 talos.danny.cz kernel: [c0000007f74f7d00] [c00000000019c28c] worker_thread+0xdc/0x780
čen 16 08:02:11 talos.danny.cz kernel: [c0000007f74f7db0] [c0000000001a7c34] kthread+0x1f4/0x200
čen 16 08:02:11 talos.danny.cz kernel: [c0000007f74f7e20] [c00000000000cca8] ret_from_kernel_thread+0x5c/0x74
čen 16 08:02:11 talos.danny.cz kernel: Instruction dump:
čen 16 08:02:11 talos.danny.cz kernel: 7d0051ad 40c2fff4 4803a45d 60000000 393e5938 7d4048a8 7d435050 7d4049ad
čen 16 08:02:11 talos.danny.cz kernel: 40c2fff4 4bffff20 7c0802a6 f80100b0 <0fe00000> e87db0d0 3c820000 e8848fd0
čen 16 08:02:11 talos.danny.cz kernel: ---[ end trace 6ef6f5775069b661 ]---
čen 16 08:02:11 talos.danny.cz kernel: amdgpu 0000:01:00.0: amdgpu: 000000004b99cce7 unpin not necessary
čen 16 08:02:42 talos.danny.cz kernel: [TTM] Buffer eviction failed
čen 16 08:02:42 talos.danny.cz kernel: amdgpu 0000:01:00.0: refused to change power state from D0 to D3hot
čen 16 08:02:43 talos.danny.cz kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
čen 16 08:02:43 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
čen 16 08:02:43 talos.danny.cz kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -110
čen 16 08:02:43 talos.danny.cz kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-110).
čen 16 08:02:43 talos.danny.cz kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!