amdgpu rx590 [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout
Playing "Black Mesa" through Steam, I get frequent crashes (within 5 minutes) that leave the entire graphics card in a broken state.
Even rebooting doesn't bring it back up; I have to power the machine off and on again to get a video signal.
Mesa 19.3.2 with LLVM 9.0.1, kernel 5.5.4. The system is otherwise stable with a composited desktop and various other games.
The following log is from the first time it happened. I subsequently tried a few more times, and it now happens immediately. I also tried with AMD_DEBUG="nodma,nongg", which made no difference.
[2347661.350664] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=1968597, emitted seq=1968598
[2347661.350668] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[2347661.350694] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[2347661.350697] amdgpu 0000:08:00.0: GPU reset begin!
[2347661.350816] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[2347661.350817] amdgpu: [powerplay]
last message was failed ret is 65535
[2347661.350818] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[2347661.350820] amdgpu: [powerplay]
last message was failed ret is 65535
[2347661.350821] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[2347661.350823] amdgpu: [powerplay]
last message was failed ret is 65535
[2347661.350824] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[2347661.350825] amdgpu: [powerplay]
last message was failed ret is 65535
[2347661.350827] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[2347661.350828] amdgpu: [powerplay]
last message was failed ret is 65535
[2347661.350830] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[2347661.350831] amdgpu: [powerplay]
last message was failed ret is 65535
[2347661.350832] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[2347661.350834] amdgpu: [powerplay]
last message was failed ret is 65535
[2347661.350835] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[2347661.350837] amdgpu: [powerplay]
last message was failed ret is 65535
[2347661.350838] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[2347661.350839] amdgpu: [powerplay]
last message was failed ret is 65535
[2347661.350841] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[2347661.350842] amdgpu: [powerplay]
last message was failed ret is 65535
[2347661.350843] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[2347661.350845] amdgpu: [powerplay]
last message was failed ret is 65535
[2347661.350846] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[2347661.350847] amdgpu: [powerplay]
last message was failed ret is 65535
[2347661.350849] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[2347661.350850] amdgpu: [powerplay]
last message was failed ret is 65535
[2347661.350851] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[2347661.350853] amdgpu: [powerplay]
last message was failed ret is 65535
[2347661.350854] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[2347661.350855] amdgpu: [powerplay]
last message was failed ret is 65535
[2347661.350857] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[2347661.350858] amdgpu: [powerplay]
last message was failed ret is 65535
[2347661.350859] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[2347661.350861] amdgpu: [powerplay]
last message was failed ret is 65535
[2347661.350862] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[2347671.590664] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[2347671.590699] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[2347681.830638] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:45:plane-5] flip_done timed out
[2347681.830668] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[2347681.830713] amdgpu: [powerplay]
last message was failed ret is 65535
[2347681.830714] amdgpu: [powerplay]
failed to send message 306 ret is 65535
[2347681.830715] amdgpu: [powerplay]
last message was failed ret is 65535
[2347681.830717] amdgpu: [powerplay]
failed to send message 5e ret is 65535
[2347681.830718] amdgpu: [powerplay]
last message was failed ret is 65535
[2347681.830719] amdgpu: [powerplay]
failed to send message 145 ret is 65535
[2347681.830721] amdgpu: [powerplay]
last message was failed ret is 65535
[2347681.830722] amdgpu: [powerplay]
failed to send message 146 ret is 65535
[2347681.830724] amdgpu: [powerplay]
last message was failed ret is 65535
[2347681.830725] amdgpu: [powerplay]
failed to send message 148 ret is 65535
[2347681.830727] amdgpu: [powerplay]
last message was failed ret is 65535
[2347681.830728] amdgpu: [powerplay]
failed to send message 145 ret is 65535
[2347681.830729] amdgpu: [powerplay]
last message was failed ret is 65535
[2347681.830730] amdgpu: [powerplay]
failed to send message 146 ret is 65535
[2347681.862786] [drm] REG_WAIT timeout 10us * 3000 tries - dce110_stream_encoder_dp_blank line:954
[2347681.862795] ------------[ cut here ]------------
[2347681.862819] WARNING: CPU: 13 PID: 2425 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:332 generic_reg_wait.cold+0x27/0x2e [amdgpu]
[2347681.862820] Modules linked in: tun tcp_diag inet_diag fuse nct6775 bridge stp llc binfmt_misc dm_crypt sd_mod dm_mod usb_storage kvm_amd kvm irqbypass hwmon_vid amdgpu mousedev snd_hda_codec_generic gpu_sched ttm evdev snd_hda_intel snd_intel_dspcfg aesni_intel igb glue_helper drm_kms_helper snd_hda_codec libaes crypto_simd syscopyarea i2c_algo_bit snd_hwdep sysfillrect cryptd sysimgblt snd_hda_core fb_sys_fops drm snd_pcm snd_timer i2c_core snd soundcore k10temp button efivarfs [last unloaded: nct6775]
[2347681.862831] CPU: 13 PID: 2425 Comm: kworker/13:1 Not tainted 5.5.4 #1
[2347681.862831] Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 1405 11/19/2019
[2347681.862834] Workqueue: events drm_sched_job_timedout [gpu_sched]
[2347681.862856] RIP: 0010:generic_reg_wait.cold+0x27/0x2e [amdgpu]
[2347681.862857] Code: 34 fe ff 44 8b 44 24 24 48 8b 4c 24 18 44 89 fa 89 ee 48 c7 c7 28 4f df c0 e8 27 8b d2 cd 41 83 7c 24 20 01 0f 84 a6 3e fe ff <0f> 0b e9 9f 3e fe ff e8 56 26 ef ff 48 c7 c7 00 40 e8 c0 e8 ea b4
[2347681.862857] RSP: 0018:ffff9fc44bfe7708 EFLAGS: 00010297
[2347681.862858] RAX: 0000000000000052 RBX: 0000000000000010 RCX: 0000000000000000
[2347681.862859] RDX: 0000000000000000 RSI: ffff90b87f158248 RDI: ffff90b87f158248
[2347681.862859] RBP: 000000000000000a R08: 0000000000000001 R09: 0000000000001651
[2347681.862859] R10: 0000000000000001 R11: 0000000000000000 R12: ffff90b86ce39a80
[2347681.862860] R13: 0000000000004ca4 R14: 0000000000000bb9 R15: 0000000000000bb8
[2347681.862860] FS: 0000000000000000(0000) GS:ffff90b87f140000(0000) knlGS:0000000000000000
[2347681.862861] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2347681.862861] CR2: 0000355e79c9b000 CR3: 00000005b8fde000 CR4: 00000000003406e0
[2347681.862862] Call Trace:
[2347681.862889] dce110_stream_encoder_dp_blank+0xe5/0x140 [amdgpu]
[2347681.862910] core_link_disable_stream+0x65/0x330 [amdgpu]
[2347681.862936] ? smu7_upload_dpm_level_enable_mask+0x53/0x70 [amdgpu]
[2347681.862961] ? smu7_force_dpm_level+0x2bc/0x340 [amdgpu]
[2347681.862982] dce110_reset_hw_ctx_wrap+0xba/0x250 [amdgpu]
[2347681.863003] dce110_apply_ctx_to_hw+0x45/0x4d0 [amdgpu]
[2347681.863023] ? amdgpu_pm_compute_clocks+0xc8/0x600 [amdgpu]
[2347681.863049] ? dm_pp_apply_display_requirements+0x19a/0x1c0 [amdgpu]
[2347681.863069] dc_commit_state+0x299/0x5e0 [amdgpu]
[2347681.863101] amdgpu_dm_atomic_commit_tail+0xaba/0x1d10 [amdgpu]
[2347681.863103] ? va_format.isra.0+0x6e/0xa0
[2347681.863105] ? up+0xd/0x50
[2347681.863106] ? vprintk_store+0x120/0x1f0
[2347681.863108] ? __irq_work_queue_local+0x4b/0x50
[2347681.863109] ? irq_work_queue+0x20/0x40
[2347681.863109] ? wake_up_klogd+0x32/0x50
[2347681.863110] ? vprintk_emit+0x10d/0x1e0
[2347681.863111] ? printk+0x53/0x6a
[2347681.863114] ? drm_atomic_helper_wait_for_dependencies+0x1d0/0x1e0 [drm_kms_helper]
[2347681.863117] ? drm_err+0x6d/0x90 [drm]
[2347681.863120] commit_tail+0x8d/0x120 [drm_kms_helper]
[2347681.863123] drm_atomic_helper_commit+0x111/0x140 [drm_kms_helper]
[2347681.863125] drm_atomic_helper_disable_all+0x16e/0x180 [drm_kms_helper]
[2347681.863127] drm_atomic_helper_suspend+0x63/0x110 [drm_kms_helper]
[2347681.863153] dm_suspend+0x17/0x50 [amdgpu]
[2347681.863171] amdgpu_device_ip_suspend_phase1+0x7c/0xd0 [amdgpu]
[2347681.863190] amdgpu_device_ip_suspend+0x17/0x60 [amdgpu]
[2347681.863212] amdgpu_device_pre_asic_reset+0x18c/0x19f [amdgpu]
[2347681.863233] amdgpu_device_gpu_recover+0x2df/0xa2a [amdgpu]
[2347681.863258] amdgpu_job_timedout+0xfe/0x120 [amdgpu]
[2347681.863260] drm_sched_job_timedout+0x39/0x80 [gpu_sched]
[2347681.863261] process_one_work+0x1b2/0x2f0
[2347681.863263] worker_thread+0x45/0x3c0
[2347681.863264] kthread+0xfb/0x130
[2347681.863265] ? current_work+0x30/0x30
[2347681.863265] ? kthread_park+0x80/0x80
[2347681.863267] ret_from_fork+0x22/0x40
[2347681.863268] ---[ end trace 2ed48ba81185804c ]---
[2347686.866656] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
[2347686.866676] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DBA0 (len 824, WS 0, PS 0) @ 0xDD20
[2347686.866695] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DA5A (len 326, WS 0, PS 0) @ 0xDB4A
[2347686.866720] [drm:dce110_link_encoder_disable_output [amdgpu]] *ERROR* dce110_link_encoder_disable_output: Failed to execute VBIOS command table!
[2347686.866728] ------------[ cut here ]------------
[2347686.866753] WARNING: CPU: 13 PID: 2425 at drivers/gpu/drm/amd/amdgpu/../display/dc/dce/dce_link_encoder.c:1099 dce110_link_encoder_disable_output+0x13a/0x150 [amdgpu]
[2347686.866753] Modules linked in: tun tcp_diag inet_diag fuse nct6775 bridge stp llc binfmt_misc dm_crypt sd_mod dm_mod usb_storage kvm_amd kvm irqbypass hwmon_vid amdgpu mousedev snd_hda_codec_generic gpu_sched ttm evdev snd_hda_intel snd_intel_dspcfg aesni_intel igb glue_helper drm_kms_helper snd_hda_codec libaes crypto_simd syscopyarea i2c_algo_bit snd_hwdep sysfillrect cryptd sysimgblt snd_hda_core fb_sys_fops drm snd_pcm snd_timer i2c_core snd soundcore k10temp button efivarfs [last unloaded: nct6775]
[2347686.866765] CPU: 13 PID: 2425 Comm: kworker/13:1 Tainted: G W 5.5.4 #1
[2347686.866766] Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 1405 11/19/2019
[2347686.866768] Workqueue: events drm_sched_job_timedout [gpu_sched]
[2347686.866793] RIP: 0010:dce110_link_encoder_disable_output+0x13a/0x150 [amdgpu]
[2347686.866794] Code: 44 24 38 65 48 33 04 25 28 00 00 00 75 20 48 83 c4 40 5b 5d 41 5c c3 48 c7 c6 20 fc da c0 48 c7 c7 38 09 df c0 e8 a6 3e 69 ff <0f> 0b eb d0 e8 2d 11 d5 cd 66 66 2e 0f 1f 84 00 00 00 00 00 66 90
[2347686.866795] RSP: 0018:ffff9fc44bfe7720 EFLAGS: 00010246
[2347686.866795] RAX: 0000000000000000 RBX: ffff90b8766e8060 RCX: 0000000000000007
[2347686.866796] RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff90b87f158240
[2347686.866796] RBP: 0000000000000020 R08: 0000000000000001 R09: 000000000000168d
[2347686.866796] R10: 0000000000000001 R11: 0000000000000000 R12: ffff9fc44bfe7724
[2347686.866797] R13: ffff90b0fc0e0000 R14: ffff90b870eb6000 R15: ffff90b876730c00
[2347686.866797] FS: 0000000000000000(0000) GS:ffff90b87f140000(0000) knlGS:0000000000000000
[2347686.866798] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2347686.866798] CR2: 0000355e79c9b000 CR3: 00000005b8fde000 CR4: 00000000003406e0
[2347686.866799] Call Trace:
[2347686.866821] dp_disable_link_phy+0x6a/0xf0 [amdgpu]
[2347686.866842] core_link_disable_stream+0x103/0x330 [amdgpu]
[2347686.866868] ? smu7_upload_dpm_level_enable_mask+0x53/0x70 [amdgpu]
[2347686.866893] ? smu7_force_dpm_level+0x2bc/0x340 [amdgpu]
[2347686.866914] dce110_reset_hw_ctx_wrap+0xba/0x250 [amdgpu]
[2347686.866936] dce110_apply_ctx_to_hw+0x45/0x4d0 [amdgpu]
[2347686.866955] ? amdgpu_pm_compute_clocks+0xc8/0x600 [amdgpu]
[2347686.866980] ? dm_pp_apply_display_requirements+0x19a/0x1c0 [amdgpu]
[2347686.867001] dc_commit_state+0x299/0x5e0 [amdgpu]
[2347686.867027] amdgpu_dm_atomic_commit_tail+0xaba/0x1d10 [amdgpu]
[2347686.867029] ? va_format.isra.0+0x6e/0xa0
[2347686.867031] ? up+0xd/0x50
[2347686.867032] ? vprintk_store+0x120/0x1f0
[2347686.867034] ? __irq_work_queue_local+0x4b/0x50
[2347686.867035] ? irq_work_queue+0x20/0x40
[2347686.867036] ? wake_up_klogd+0x32/0x50
[2347686.867036] ? vprintk_emit+0x10d/0x1e0
[2347686.867037] ? printk+0x53/0x6a
[2347686.867040] ? drm_atomic_helper_wait_for_dependencies+0x1d0/0x1e0 [drm_kms_helper]
[2347686.867044] ? drm_err+0x6d/0x90 [drm]
[2347686.867047] commit_tail+0x8d/0x120 [drm_kms_helper]
[2347686.867049] drm_atomic_helper_commit+0x111/0x140 [drm_kms_helper]
[2347686.867052] drm_atomic_helper_disable_all+0x16e/0x180 [drm_kms_helper]
[2347686.867054] drm_atomic_helper_suspend+0x63/0x110 [drm_kms_helper]
[2347686.867080] dm_suspend+0x17/0x50 [amdgpu]
[2347686.867098] amdgpu_device_ip_suspend_phase1+0x7c/0xd0 [amdgpu]
[2347686.867117] amdgpu_device_ip_suspend+0x17/0x60 [amdgpu]
[2347686.867138] amdgpu_device_pre_asic_reset+0x18c/0x19f [amdgpu]
[2347686.867160] amdgpu_device_gpu_recover+0x2df/0xa2a [amdgpu]
[2347686.867185] amdgpu_job_timedout+0xfe/0x120 [amdgpu]
[2347686.867186] drm_sched_job_timedout+0x39/0x80 [gpu_sched]
[2347686.867188] process_one_work+0x1b2/0x2f0
[2347686.867189] worker_thread+0x45/0x3c0
[2347686.867190] kthread+0xfb/0x130
[2347686.867192] ? current_work+0x30/0x30
[2347686.867192] ? kthread_park+0x80/0x80
[2347686.867193] ret_from_fork+0x22/0x40
[2347686.867195] ---[ end trace 2ed48ba81185804d ]---
[2347691.870649] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
[2347691.870669] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C536 (len 62, WS 0, PS 0) @ 0xC552
[2347691.899891] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vce_v3_0> failed -110
[2347691.958661] amdgpu: [powerplay]
last message was failed ret is 65535
[2347691.958664] amdgpu: [powerplay]
failed to send message 133 ret is 65535
[2347691.958668] amdgpu: [powerplay]
last message was failed ret is 65535
[2347691.958669] amdgpu: [powerplay]
failed to send message 306 ret is 65535
[2347691.958670] amdgpu: [powerplay]
last message was failed ret is 65535
[2347691.958671] amdgpu: [powerplay]
failed to send message 5e ret is 65535
[2347691.958673] amdgpu: [powerplay]
last message was failed ret is 65535
[2347691.958674] amdgpu: [powerplay]
failed to send message 145 ret is 65535
[2347691.958675] amdgpu: [powerplay]
last message was failed ret is 65535
[2347691.958677] amdgpu: [powerplay]
failed to send message 146 ret is 65535
[2347691.958678] amdgpu: [powerplay]
last message was failed ret is 65535
[2347691.958680] amdgpu: [powerplay]
failed to send message 148 ret is 65535
[2347691.958681] amdgpu: [powerplay]
last message was failed ret is 65535
[2347691.958682] amdgpu: [powerplay]
failed to send message 145 ret is 65535
[2347691.958683] amdgpu: [powerplay]
last message was failed ret is 65535
[2347691.958685] amdgpu: [powerplay]
failed to send message 146 ret is 65535
[2347691.958687] amdgpu: [powerplay]
last message was failed ret is 65535
[2347691.958688] amdgpu: [powerplay]
failed to send message 16a ret is 65535
[2347691.958689] amdgpu: [powerplay]
last message was failed ret is 65535
[2347691.958690] amdgpu: [powerplay]
failed to send message 186 ret is 65535
[2347691.958692] amdgpu: [powerplay]
last message was failed ret is 65535
[2347691.958693] amdgpu: [powerplay]
failed to send message 54 ret is 65535
[2347692.070662] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[2347692.305340] amdgpu: [powerplay]
last message was failed ret is 65535
[2347692.305342] amdgpu: [powerplay]
failed to send message 26b ret is 65535
[2347692.305344] amdgpu: [powerplay]
last message was failed ret is 65535
[2347692.305346] amdgpu: [powerplay]
failed to send message 13d ret is 65535
[2347692.305347] amdgpu: [powerplay]
last message was failed ret is 65535
[2347692.305348] amdgpu: [powerplay]
failed to send message 14f ret is 65535
[2347692.305350] amdgpu: [powerplay]
last message was failed ret is 65535
[2347692.305351] amdgpu: [powerplay]
failed to send message 151 ret is 65535
[2347692.305352] amdgpu: [powerplay]
last message was failed ret is 65535
[2347692.305354] amdgpu: [powerplay]
failed to send message 135 ret is 65535
[2347692.305354] amdgpu: [powerplay]
last message was failed ret is 65535
[2347692.305356] amdgpu: [powerplay]
failed to send message 190 ret is 65535
[2347692.305357] amdgpu: [powerplay]
last message was failed ret is 65535
[2347692.305358] amdgpu: [powerplay]
failed to send message 63 ret is 65535
[2347692.305363] amdgpu: [powerplay]
last message was failed ret is 65535
[2347692.305364] amdgpu: [powerplay]
failed to send message 84 ret is 65535
[2347692.305366] amdgpu: [powerplay] Failed to force to switch arbf0!
[2347692.305366] amdgpu: [powerplay] [disable_dpm_tasks] Failed to disable DPM!
[2347692.305390] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -22
[2347692.478863] amdgpu 0000:08:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[2347692.478888] [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[2347692.823050] cp is busy, skip halt cp
[2347692.992338] rlc is busy, skip halt rlc
[2347702.310662] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[2347712.550698] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[2347722.794679] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[2347733.030659] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[2347743.270698] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[2347753.510659] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[2347756.282649] amdgpu 0000:08:00.0: GPU BACO reset
[2347761.286647] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
[2347761.286671] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C536 (len 62, WS 0, PS 0) @ 0xC552
[2347761.286690] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing ADA2 (len 156, WS 0, PS 8) @ 0xADBD
[2347761.286691] [drm] asic atom init failed!
[2347761.286692] amdgpu 0000:08:00.0: GPU reset succeeded, trying to resume
[2347761.459378] amdgpu 0000:08:00.0: Wait for MC idle timedout !
[2347761.631911] amdgpu 0000:08:00.0: Wait for MC idle timedout !
[2347761.633367] [drm] PCIE GART of 256M enabled (table at 0x000000F400300000).
[2347761.633373] [drm] VRAM is lost due to GPU reset!
[2347761.642382] amdgpu: [powerplay] Failed to send Message.
[2347761.642388] amdgpu: [powerplay] SMC address must be 4 byte aligned.
[2347761.642389] amdgpu: [powerplay] [AVFS][Polaris10_SetupGfxLvlStruct] Problems copying VRConfig value over to SMC
[2347761.642389] amdgpu: [powerplay] [AVFS][Polaris10_AVFSEventMgr] Could not Copy Graphics Level table over to SMU
[2347761.642392] amdgpu: [powerplay]
last message was failed ret is 65535
[2347761.642394] amdgpu: [powerplay]
failed to send message 252 ret is 65535
[2347761.642395] amdgpu: [powerplay]
last message was failed ret is 65535
[2347761.642396] amdgpu: [powerplay]
failed to send message 253 ret is 65535
[2347761.642398] amdgpu: [powerplay]
last message was failed ret is 65535
[2347761.642399] amdgpu: [powerplay]
failed to send message 250 ret is 65535
[2347761.642401] amdgpu: [powerplay]
last message was failed ret is 65535
[2347761.642402] amdgpu: [powerplay]
failed to send message 251 ret is 65535
[2347761.642403] amdgpu: [powerplay]
last message was failed ret is 65535
[2347761.642404] amdgpu: [powerplay]
failed to send message 254 ret is 65535
[2347761.815082] [drm] Timeout wait for RLC serdes 0,0
[2347761.989410] amdgpu 0000:08:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[2347761.989429] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -110
[2347761.989448] amdgpu 0000:08:00.0: GPU reset(11) failed
[2347761.989486] amdgpu 0000:08:00.0: GPU reset end with ret = -110
[2347772.198657] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[2347772.198696] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=1968598, emitted seq=1968598
[2347772.198733] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[2347772.198736] amdgpu 0000:08:00.0: GPU reset begin!
[2347813.186599] sysrq: HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p) show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) force-fb(V) show-blocked-tasks(w)
[2347830.717304] [drm] schedsdma0 is not ready, skipping
[2347830.717305] [drm] schedsdma1 is not ready, skipping
[2347830.717310] amdgpu 0000:08:00.0: failed to clear page tables on GEM object close (-2)
[2347830.717333] BUG: kernel NULL pointer dereference, address: 0000000000000008
[2347830.717335] #PF: supervisor read access in kernel mode
[2347830.717336] #PF: error_code(0x0000) - not-present page
[2347830.717337] PGD 87702a067 P4D 87702a067 PUD bda185067 PMD 0
[2347830.717339] Oops: 0000 [#1] PREEMPT SMP NOPTI
[2347830.717341] CPU: 13 PID: 14901 Comm: SaveJob0 Tainted: G W 5.5.4 #1
[2347830.717342] Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 1405 11/19/2019
[2347830.717370] RIP: 0010:amdgpu_vm_sdma_commit+0x55/0x140 [amdgpu]
[2347830.717372] Code: 8b b2 88 01 00 00 4c 8b a8 80 00 00 00 48 8d a8 b8 00 00 00 48 05 68 01 00 00 80 7f 10 00 41 8b 56 08 48 0f 44 e8 48 8b 45 10 <48> 8b 40 08 48 8d 78 88 85 d2 0f 84 c4 00 00 00 48 8b 40 90 4c 89
[2347830.717373] RSP: 0018:ffff9fc4499379d0 EFLAGS: 00010246
[2347830.717374] RAX: 0000000000000000 RBX: ffff9fc449937a10 RCX: 0000000000000600
[2347830.717374] RDX: 0000000000000014 RSI: ffff9fc449937aa8 RDI: ffff9fc449937a10
[2347830.717375] RBP: ffff90b876fa9968 R08: 0000000000001000 R09: 0000000000000011
[2347830.717376] R10: 000000000000000f R11: 000000000000000d R12: ffff9fc449937aa8
[2347830.717376] R13: ffff90a995003800 R14: ffff90b7a8838df8 R15: ffff90b870da0000
[2347830.717377] FS: 0000000000000000(0000) GS:ffff90b87f140000(0000) knlGS:0000000000000000
[2347830.717378] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[2347830.717378] CR2: 0000000000000008 CR3: 000000011423a000 CR4: 00000000003406e0
[2347830.717379] Call Trace:
[2347830.717404] amdgpu_vm_bo_update_mapping+0xdd/0xf0 [amdgpu]
[2347830.717427] amdgpu_vm_clear_freed+0xbb/0x200 [amdgpu]
[2347830.717449] amdgpu_gem_object_close+0x122/0x190 [amdgpu]
[2347830.717455] ? drm_vma_offset_remove+0xf/0x70 [drm]
[2347830.717459] drm_gem_object_release_handle+0x2b/0x90 [drm]
[2347830.717463] ? drm_gem_object_handle_put_unlocked+0x90/0x90 [drm]
[2347830.717466] idr_for_each+0x5e/0xd0
[2347830.717470] drm_gem_release+0x17/0x20 [drm]
[2347830.717473] drm_file_free.part.0+0x206/0x260 [drm]
[2347830.717477] drm_release+0x95/0xd0 [drm]
[2347830.717479] __fput+0xa9/0x230
[2347830.717482] task_work_run+0x8e/0xb0
[2347830.717484] do_exit+0x303/0xa10
[2347830.717485] do_group_exit+0x35/0x90
[2347830.717487] get_signal+0xf9/0x740
[2347830.717490] do_signal+0x2b/0x630
[2347830.717492] ? __ia32_sys_futex_time32+0x13a/0x168
[2347830.717493] exit_to_usermode_loop+0x53/0xa0
[2347830.717494] do_fast_syscall_32+0x205/0x280
[2347830.717496] entry_SYSCALL_compat_after_hwframe+0x45/0x4d
[2347830.717497] Modules linked in: tun tcp_diag inet_diag fuse nct6775 bridge stp llc binfmt_misc dm_crypt sd_mod dm_mod usb_storage kvm_amd kvm irqbypass hwmon_vid amdgpu mousedev snd_hda_codec_generic gpu_sched ttm evdev snd_hda_intel snd_intel_dspcfg aesni_intel igb glue_helper drm_kms_helper snd_hda_codec libaes crypto_simd syscopyarea i2c_algo_bit snd_hwdep sysfillrect cryptd sysimgblt snd_hda_core fb_sys_fops drm snd_pcm snd_timer i2c_core snd soundcore k10temp button efivarfs [last unloaded: nct6775]
[2347830.717510] CR2: 0000000000000008
[2347830.717511] ---[ end trace 2ed48ba81185804e ]---
[2347830.949589] RIP: 0010:amdgpu_vm_sdma_commit+0x55/0x140 [amdgpu]
[2347830.949593] Code: 8b b2 88 01 00 00 4c 8b a8 80 00 00 00 48 8d a8 b8 00 00 00 48 05 68 01 00 00 80 7f 10 00 41 8b 56 08 48 0f 44 e8 48 8b 45 10 <48> 8b 40 08 48 8d 78 88 85 d2 0f 84 c4 00 00 00 48 8b 40 90 4c 89
[2347830.949594] RSP: 0018:ffff9fc4499379d0 EFLAGS: 00010246
[2347830.949595] RAX: 0000000000000000 RBX: ffff9fc449937a10 RCX: 0000000000000600
[2347830.949596] RDX: 0000000000000014 RSI: ffff9fc449937aa8 RDI: ffff9fc449937a10
[2347830.949597] RBP: ffff90b876fa9968 R08: 0000000000001000 R09: 0000000000000011
[2347830.949598] R10: 000000000000000f R11: 000000000000000d R12: ffff9fc449937aa8
[2347830.949599] R13: ffff90a995003800 R14: ffff90b7a8838df8 R15: ffff90b870da0000
[2347830.949600] FS: 0000000000000000(0000) GS:ffff90b87f140000(0000) knlGS:0000000000000000
[2347830.949601] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[2347830.949602] CR2: 0000000000000008 CR3: 000000011423a000 CR4: 00000000003406e0
[2347830.949603] Fixing recursive fault but reboot is needed!