Desktop hangs on Linux 5.6.10 with RX 5700 XT
I'm running Linux 5.6.10 on Arch Linux with Mesa 20.0.6 and libdrm 2.4.101, in a Sway session (Sway ae3ec745, wlroots 61d6408f). Seemingly at random, the screen locks up and starts glitching. Sometimes I can switch to another TTY, kill sway, and resume my session, but not always.
Here's my dmesg, and here is what looks like the important part:
[23501.176136] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[23502.242818] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[23506.296152] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2952765, emitted seq=2952767
[23506.296274] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=279299, emitted seq=279301
[23506.296386] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[23506.296508] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process sway pid 1029 thread sway:cs0 pid 1050
[23506.296509] amdgpu 0000:03:00.0: GPU reset begin!
[23506.296516] amdgpu 0000:03:00.0: GPU reset begin!
[23506.296519] [drm] Bailing on TDR for s_job:2a0ee2, as another already in progress
[23506.297194] ------------[ cut here ]------------
[23506.297326] WARNING: CPU: 2 PID: 75375 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn20/dcn20_resource.c:3080 dcn20_validate_bandwidth+0x87/0xe0 [amdgpu]
[23506.297327] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq uinput rfcomm fuse xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge overlay cfg80211 cmac algif_hash algif_skcipher af_alg 8021q bnep garp mrp stp llc btrfs blake2b_generic xor intel_rapl_msr intel_rapl_common nls_iso8859_1 raid6_pq nls_cp437 vfat libcrc32c fat x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_realtek kvm_intel snd_hda_codec_generic iTCO_wdt kvm mei_hdcp iTCO_vendor_support ledtrig_audio snd_hda_codec_hdmi eeepc_wmi asus_wmi btusb battery uvcvideo wmi_bmof sparse_keymap btrtl loop snd_usb_audio mxm_wmi irqbypass snd_hda_intel videobuf2_vmalloc btbcm btintel videobuf2_memops snd_intel_dspcfg snd_usbmidi_lib bluetooth ir_rc5_decoder snd_hda_codec intel_cstate videobuf2_v4l2 videobuf2_common intel_uncore snd_hda_core snd_rawmidi videodev intel_rapl_perf rc_streamzap
[23506.297361] snd_hwdep snd_seq_device i2c_i801 pcspkr streamzap ecdh_generic joydev rfkill input_leds ecc mousedev mc snd_pcm snd_timer xpad mei_me ff_memless e1000e snd lpc_ich mei soundcore ie31200_edac evdev mac_hid wmi sg crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 dm_crypt hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd sr_mod cryptd cdrom glue_helper xhci_pci ehci_pci ehci_hcd xhci_hcd amdgpu gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm agpgart
[23506.297391] CPU: 2 PID: 75375 Comm: kworker/2:0 Tainted: G W 5.6.6-arch1-1 #1
[23506.297392] Hardware name: System manufacturer System Product Name/SABERTOOTH Z77, BIOS 1805 12/19/2012
[23506.297396] Workqueue: events drm_sched_job_timedout [gpu_sched]
[23506.297523] RIP: 0010:dcn20_validate_bandwidth+0x87/0xe0 [amdgpu]
[23506.297526] Code: 2d 44 22 a5 e8 1d 00 00 75 26 f2 0f 11 85 a8 21 00 00 31 d2 48 89 ee 4c 89 ef e8 a4 f5 ff ff 41 89 c4 22 85 e8 1d 00 00 75 4a <0f> 0b eb 02 75 d1 f2 0f 10 14 24 f2 0f 11 95 a8 21 00 00 e8 e1 71
[23506.297527] RSP: 0018:ffffb02088eeba80 EFLAGS: 00010246
[23506.297529] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000002a85802
[23506.297530] RDX: 0000000002a85602 RSI: 5b61bcadf153cdfa RDI: 00000000000321a0
[23506.297531] RBP: ffff97c7e0d60000 R08: 0000000000000006 R09: 0000000000000000
[23506.297533] R10: 0000000100000000 R11: 0000000100000001 R12: 0000000000000001
[23506.297534] R13: ffff97c983eb0000 R14: 0000000000000000 R15: ffff97c983ad4400
[23506.297536] FS: 0000000000000000(0000) GS:ffff97c98ec80000(0000) knlGS:0000000000000000
[23506.297537] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[23506.297538] CR2: 0000564170f4d300 CR3: 000000029aa0a006 CR4: 00000000001606e0
[23506.297539] Call Trace:
[23506.297663] dc_validate_global_state+0x28a/0x310 [amdgpu]
[23506.297694] ? drm_modeset_lock+0x31/0xb0 [drm]
[23506.297818] amdgpu_dm_atomic_check+0xea1/0xfc0 [amdgpu]
[23506.297845] drm_atomic_check_only+0x578/0x800 [drm]
[23506.297869] drm_atomic_commit+0x13/0x50 [drm]
[23506.297882] drm_atomic_helper_disable_all+0x175/0x190 [drm_kms_helper]
[23506.297894] drm_atomic_helper_suspend+0x73/0x120 [drm_kms_helper]
[23506.298017] dm_suspend+0x1c/0x60 [amdgpu]
[23506.298093] amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
[23506.298101] ? _raw_spin_lock+0x13/0x30
[23506.298177] amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[23506.298303] amdgpu_device_pre_asic_reset+0x191/0x1a4 [amdgpu]
[23506.298448] amdgpu_device_gpu_recover.cold+0x41c/0xbb6 [amdgpu]
[23506.298601] amdgpu_job_timedout+0x103/0x130 [amdgpu]
[23506.298609] drm_sched_job_timedout+0x6e/0xc0 [gpu_sched]
[23506.298617] process_one_work+0x1da/0x3d0
[23506.298623] worker_thread+0x4a/0x3d0
[23506.298628] kthread+0xfb/0x130
[23506.298632] ? process_one_work+0x3d0/0x3d0
[23506.298635] ? kthread_park+0x90/0x90
[23506.298640] ret_from_fork+0x35/0x40
[23506.298647] ---[ end trace f2cf1fff49435791 ]---
[23506.427324] snd_hda_codec_hdmi hdaudioC1D0: HDMI: ELD buf size is 0, force 128
[23506.427334] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD data byte 0
[23506.493202] snd_hda_codec_hdmi hdaudioC1D0: HDMI: ELD buf size is 0, force 128
[23506.493214] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD data byte 0
[23506.714007] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[23506.877925] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[23507.042271] [drm:gfx_v10_0_cp_gfx_enable [amdgpu]] *ERROR* failed to halt cp gfx
[23510.225184] amdgpu 0000:03:00.0: GPU reset succeeded, trying to resume
[23510.225250] [drm] PCIE GART of 512M enabled (table at 0x0000008000900000).
[23510.228441] [drm] PSP is resuming...
[23510.406091] [drm] reserve 0xa00000 from 0x81fe400000 for PSP TMR
[23510.566049] amdgpu 0000:03:00.0: RAS: ras ta ucode is not available
[23510.586048] amdgpu: [powerplay] SMU is resuming...
[23510.589178] amdgpu: [powerplay] SMU is resumed successfully!
[23510.941870] [drm] kiq ring mec 2 pipe 1 q 0
[23510.949911] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[23510.949972] [drm] JPEG decode initialized successfully.
[23510.949976] amdgpu 0000:03:00.0: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[23510.949977] amdgpu 0000:03:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[23510.949978] amdgpu 0000:03:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[23510.949979] amdgpu 0000:03:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[23510.949980] amdgpu 0000:03:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[23510.949981] amdgpu 0000:03:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[23510.949982] amdgpu 0000:03:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[23510.949983] amdgpu 0000:03:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[23510.949984] amdgpu 0000:03:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[23510.949985] amdgpu 0000:03:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[23510.949986] amdgpu 0000:03:00.0: ring sdma0 uses VM inv eng 12 on hub 0
[23510.949987] amdgpu 0000:03:00.0: ring sdma1 uses VM inv eng 13 on hub 0
[23510.949988] amdgpu 0000:03:00.0: ring vcn_dec uses VM inv eng 0 on hub 1
[23510.949989] amdgpu 0000:03:00.0: ring vcn_enc0 uses VM inv eng 1 on hub 1
[23510.949990] amdgpu 0000:03:00.0: ring vcn_enc1 uses VM inv eng 4 on hub 1
[23510.949991] amdgpu 0000:03:00.0: ring jpeg_dec uses VM inv eng 5 on hub 1
[23510.953184] [drm] recover vram bo from shadow start
[23510.962888] [drm] recover vram bo from shadow done
[23510.962890] [drm] Skip scheduling IBs!
[23510.962896] [drm] Skip scheduling IBs!
[23510.962905] [drm] Skip scheduling IBs!
[23510.962906] [drm] Skip scheduling IBs!
[23510.962916] [drm] Skip scheduling IBs!
[23510.962918] [drm] Skip scheduling IBs!
[23510.962920] [drm] Skip scheduling IBs!
[23510.962922] [drm] Skip scheduling IBs!
[23510.962923] [drm] Skip scheduling IBs!
[23510.962925] [drm] Skip scheduling IBs!
[23510.962927] [drm] Skip scheduling IBs!
[23510.962929] [drm] Skip scheduling IBs!
[23510.962931] [drm] Skip scheduling IBs!
[23510.962932] [drm] Skip scheduling IBs!
[23510.962934] [drm] Skip scheduling IBs!
[23510.962945] [drm] Skip scheduling IBs!
[23510.962946] [drm] Skip scheduling IBs!
[23510.962948] [drm] Skip scheduling IBs!
[23510.962950] [drm] Skip scheduling IBs!
[23510.962953] [drm] Skip scheduling IBs!
[23510.962962] [drm] Skip scheduling IBs!
[23510.962970] [drm] Skip scheduling IBs!
[23510.962976] amdgpu 0000:03:00.0: GPU reset(1) succeeded!
[23510.962976] [drm] Skip scheduling IBs!
[23510.962984] [drm] Skip scheduling IBs!
[23510.962989] [drm] Skip scheduling IBs!
[23510.962995] [drm] Skip scheduling IBs!
[23510.963000] [drm] Skip scheduling IBs!
[23510.963002] [drm] Skip scheduling IBs!
[23510.963005] [drm] Skip scheduling IBs!
[23510.963008] [drm] Skip scheduling IBs!
[23510.963011] [drm] Skip scheduling IBs!
[23510.963015] [drm] Skip scheduling IBs!
[23510.963017] [drm] Skip scheduling IBs!
[23510.963021] [drm] Skip scheduling IBs!
[23510.963116] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[23510.963117] [drm] Skip scheduling IBs!
[23510.963120] [drm] Skip scheduling IBs!
[23510.963123] [drm] Skip scheduling IBs!
[23510.963126] [drm] Skip scheduling IBs!
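
For anyone who wants to pull the same lines out of their own log, here is a rough sketch of how the excerpt above could be filtered. It assumes the default dmesg timestamp format (`[seconds.micros]`) and the message wording seen in this log; the function names `error_lines` and `ring_timeouts` are made up for illustration and are not part of any tool.

```python
import re

# Matches ring-timeout lines as they appear in the log above, e.g.
# [23506.296152] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0
#   timeout, signaled seq=2952765, emitted seq=2952767
# Assumption: other kernel versions may word this message differently.
RING_TIMEOUT = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s.*\*ERROR\* ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def error_lines(lines):
    """Keep only lines flagged *ERROR* or mentioning a GPU reset."""
    return [line for line in lines if "*ERROR*" in line or "GPU reset" in line]


def ring_timeouts(lines):
    """Return one dict per ring-timeout line: timestamp, ring name, and
    how many emitted fences had not been signaled yet."""
    found = []
    for line in lines:
        m = RING_TIMEOUT.match(line)
        if m:
            found.append({
                "time": float(m["ts"]),
                "ring": m["ring"],
                "pending": int(m["emitted"]) - int(m["signaled"]),
            })
    return found
```

Feeding it the lines of a saved dmesg file (for example `open("dmesg.txt").read().splitlines()`, where `dmesg.txt` is a placeholder name) gives a quick view of which rings timed out and how far behind they were. In the log above, both gfx_0.0.0 and sdma1 were two fences behind when the timeout fired.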