RX 5500 XT Ubuntu 20.10 Instability, Crashing ([drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!)
I've recently built an all-AMD system, with a Ryzen 7 3700X CPU and an RX 5500 XT Phantom D Gaming GPU. I have a Gigabyte Aorus Pro Wifi Motherboard and 32GB of Trident Z Neo RAM, with XMP enabled.
I'm running Ubuntu 20.10, with the 5.6.13-050613-generic kernel.
I've been having repeated issues with the amdgpu drivers freezing GNOME and all the windows on the screen, but not the mouse. A power cycle is needed to fix the issue, although SSH'ing into the machine works fine (so the kernel isn't hung).
Here is an excerpt of a kernel log from that crash:
635:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0: [gfxhub] page fault (src_id:0 ring:40 vmid:0 pasid:0, for process pid 0 thread pid 0)
636:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0: in page starting at address 0x0000000000888000 from client 27
637:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00041C50
638:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0: MORE_FAULTS: 0x0
639:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0: WALKER_ERROR: 0x0
640:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0: PERMISSION_FAULTS: 0x5
641:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0: MAPPING_ERROR: 0x0
642:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0: RW: 0x1
645:May 17 16:29:19 arctic kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
646:May 17 16:29:19 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=10870, emitted seq=10872
647:May 17 16:29:19 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
648:May 17 16:29:19 arctic kernel: amdgpu 0000:0b:00.0: GPU reset begin!
649:May 17 16:29:21 arctic kernel: amdgpu 0000:0b:00.0: GPU reset succeeded, trying to resume
654:May 17 16:29:21 arctic kernel: amdgpu: [powerplay] SMU is resuming...
655:May 17 16:29:21 arctic kernel: amdgpu: [powerplay] SMU is resumed successfully!
659:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
660:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
661:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
662:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
663:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
664:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
665:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
666:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
667:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
668:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
669:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring sdma0 uses VM inv eng 12 on hub 0
670:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring sdma1 uses VM inv eng 13 on hub 0
671:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring vcn_dec uses VM inv eng 0 on hub 1
672:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring vcn_enc0 uses VM inv eng 1 on hub 1
673:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring vcn_enc1 uses VM inv eng 4 on hub 1
674:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring jpeg_dec uses VM inv eng 5 on hub 1
680:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: GPU reset(1) succeeded!
688:May 17 16:29:22 arctic /usr/lib/gdm3/gdm-x-session[2329]: amdgpu: amdgpu_cs_query_fence_status failed.
689:May 17 16:29:22 arctic gnome-shell[2678]: amdgpu: amdgpu_cs_query_fence_status failed.
709:May 17 16:33:23 arctic kernel: [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
728:May 17 16:39:00 arctic kernel: [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
852:May 17 16:49:44 arctic kernel: [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
917:May 17 20:12:32 arctic kernel: [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
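For anyone cross-checking the page-fault lines above: the individual fields the driver prints can be reproduced from the raw GCVM_L2_PROTECTION_FAULT_STATUS value with plain shell arithmetic. This is only a rough sketch. The bit positions and the function name decode_fault_status are my assumptions, picked because they reproduce the values in the log (PERMISSION_FAULTS 0x5, RW 0x1); they are not taken from a register specification.

```shell
#!/bin/sh
# Decode the fields that amdgpu prints for GCVM_L2_PROTECTION_FAULT_STATUS.
# ASSUMPTION: bit layout inferred from the logged values, not from a spec:
#   bit 0 MORE_FAULTS, bits 1-3 WALKER_ERROR, bits 4-7 PERMISSION_FAULTS,
#   bit 8 MAPPING_ERROR, bit 18 RW.
decode_fault_status() {
  status=$(( $1 ))
  printf 'MORE_FAULTS: 0x%x\n'       $(( status & 0x1 ))
  printf 'WALKER_ERROR: 0x%x\n'      $(( (status >> 1) & 0x7 ))
  printf 'PERMISSION_FAULTS: 0x%x\n' $(( (status >> 4) & 0xf ))
  printf 'MAPPING_ERROR: 0x%x\n'     $(( (status >> 8) & 0x1 ))
  printf 'RW: 0x%x\n'                $(( (status >> 18) & 0x1 ))
}

# Raw status value from the May 17 crash.
decode_fault_status 0x00041C50
```

Feeding it 0x00041C50 gives the same five field values the kernel logged for the May 17 fault.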
Here is a similar crash on 5.6.13:
May 18 03:41:05 arctic kernel: [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
May 18 03:41:05 arctic kernel: amdgpu 0000:0b:00.0: [gfxhub] page fault (src_id:0 ring:40 vmid:0 pasid:0, for process pid 0 thread pid 0)
May 18 03:41:05 arctic kernel: amdgpu 0000:0b:00.0: in page starting at address 0x00000000008fc000 from client 27
May 18 03:41:05 arctic kernel: amdgpu 0000:0b:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00041A50
May 18 03:41:05 arctic kernel: amdgpu 0000:0b:00.0: MORE_FAULTS: 0x0
May 18 03:41:05 arctic kernel: amdgpu 0000:0b:00.0: WALKER_ERROR: 0x0
May 18 03:41:05 arctic kernel: amdgpu 0000:0b:00.0: PERMISSION_FAULTS: 0x5
May 18 03:41:05 arctic kernel: amdgpu 0000:0b:00.0: MAPPING_ERROR: 0x0
May 18 03:41:05 arctic kernel: amdgpu 0000:0b:00.0: RW: 0x1
May 18 03:41:16 arctic kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
May 18 03:41:16 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=6205, emitted seq=6208
May 18 03:41:16 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
May 18 03:41:16 arctic kernel: amdgpu 0000:0b:00.0: GPU reset begin!
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: GPU reset succeeded, trying to resume
May 18 03:41:18 arctic kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000E10000).
May 18 03:41:18 arctic kernel: [drm] VRAM is lost due to GPU reset!
May 18 03:41:18 arctic kernel: [drm] PSP is resuming...
May 18 03:41:18 arctic kernel: [drm] reserve 0xa00000 from 0x81fe400000 for PSP TMR
May 18 03:41:18 arctic kernel: amdgpu: [powerplay] SMU is resuming...
May 18 03:41:18 arctic kernel: amdgpu: [powerplay] SMU is resumed successfully!
May 18 03:41:18 arctic kernel: [drm] kiq ring mec 2 pipe 1 q 0
May 18 03:41:18 arctic kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
May 18 03:41:18 arctic kernel: [drm] JPEG decode initialized successfully.
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring sdma0 uses VM inv eng 12 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring sdma1 uses VM inv eng 13 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring vcn_dec uses VM inv eng 0 on hub 1
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring vcn_enc0 uses VM inv eng 1 on hub 1
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring vcn_enc1 uses VM inv eng 4 on hub 1
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring jpeg_dec uses VM inv eng 5 on hub 1
May 18 03:41:18 arctic kernel: [drm] recover vram bo from shadow start
May 18 03:41:18 arctic kernel: [drm] recover vram bo from shadow done
May 18 03:41:18 arctic kernel: [drm] Skip scheduling IBs!
Here are some logs (from a mix of kernel versions; sorry, I'm not sure exactly which ones came from which kernel):
I've upgraded from kernel 5.4 to 5.5.19 to 5.6.13, and issues are still present.
Here is a crash log from a time the display just randomly disconnected (kernel 5.6.13):
May 18 02:30:57 arctic kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
May 18 02:30:57 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=167698, emitted seq=167700
May 18 02:30:57 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 2090 thread Xorg:cs0 pid 2091
May 18 02:30:57 arctic kernel: amdgpu 0000:0b:00.0: GPU reset begin!
May 18 02:30:59 arctic kernel: amdgpu: [powerplay] failed send message: DisallowGfxOff (42) param: 0x00000000 response 0xffffffc2
May 18 02:31:02 arctic /usr/lib/gdm3/gdm-x-session[2090]: (II) event12 - Logitech MX Master 3000: SYN_DROPPED event - some input events have been lost.
May 18 02:31:02 arctic kernel: amdgpu: [powerplay] Msg issuing pre-check failed and SMU may be not in the right state!
May 18 02:31:02 arctic /usr/lib/gdm3/gdm-x-session[2090]: (EE) client bug: timer event12 debounce: scheduled expiry is in the past (-194ms), your system is too slow
May 18 02:31:02 arctic kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
May 18 02:31:02 arctic kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
May 18 02:31:04 arctic kernel: amdgpu: [powerplay] Msg issuing pre-check failed and SMU may be not in the right state!
May 18 02:31:04 arctic kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -62
May 18 02:31:07 arctic kernel: amdgpu: [powerplay] Msg issuing pre-check failed and SMU may be not in the right state!
May 18 02:31:07 arctic kernel: [drm:amdgpu_device_gpu_recover.cold [amdgpu]] *ERROR* ASIC reset failed with error, -62 for drm dev, 0000:0b:00.0
May 18 02:31:07 arctic kernel: amdgpu 0000:0b:00.0: GPU reset(1) failed
May 18 02:31:07 arctic kernel: amdgpu 0000:0b:00.0: GPU reset end with ret = -62
May 18 02:31:12 arctic /usr/lib/gdm3/gdm-x-session[2090]: (EE) client bug: timer event12 debounce short: scheduled expiry is in the past (-5ms), your system is too slow
May 18 02:31:17 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=167700, emitted seq=167700
May 18 02:31:17 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 2090 thread Xorg:cs0 pid 2091
May 18 02:31:17 arctic kernel: amdgpu 0000:0b:00.0: GPU reset begin!
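Since the timeouts land on different rings across these crashes (sdma1, sdma0, gfx_0.0.0), a quick tally per ring may help when comparing kernels. Below is a minimal sketch, assuming the log text arrives on stdin; the function name summarize_ring_timeouts is made up for illustration, and the sample input is just the four timeout lines quoted in this report.

```shell
#!/bin/sh
# Count amdgpu "ring <name> timeout" messages per ring from log text on stdin.
# summarize_ring_timeouts is a hypothetical helper, not part of any tool.
summarize_ring_timeouts() {
  grep -o 'ring [A-Za-z0-9_.]* timeout' |
    awk '{ count[$2]++ } END { for (ring in count) print ring, count[ring] }' |
    sort
}

# Sample input: the timeout lines from the three excerpts in this report.
summarize_ring_timeouts <<'EOF'
May 17 16:29:19 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=10870, emitted seq=10872
May 18 03:41:16 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=6205, emitted seq=6208
May 18 02:30:57 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=167698, emitted seq=167700
May 18 02:31:17 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=167700, emitted seq=167700
EOF
```

On a live system the same function could be fed from the kernel journal, for example `journalctl -k | summarize_ring_timeouts`.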
I've set AMD_DEBUG=nodma,nongg, but it doesn't help. I can update the BIOS on my motherboard, although I'm only one version behind the most recent one, and that update only provides "Memory enhancements." I can also try the proprietary amdgpu-pro drivers instead of the open-source amdgpu drivers. But I can't think of anything else. I've tried 3 separate kernels already... Anyone have ideas?
$ glxinfo | grep "OpenGL version"
OpenGL version string: 4.6 (Compatibility Profile) Mesa 20.0.6
Question on askUbuntu: https://askubuntu.com/q/1240879/1082990
@agd5f I've heard you're my only hope. Please help me not have to buy an nvidia card.