[NAVI14] Screen freeze then black screen after error "failed send message: TransferTableSmu2Dram (18)"
I have had multiple crashes inside the amdgpu driver, always after an error with the TransferTableSmu2Dram message.
First, the initial error for the TransferTableSmu2Dram message:
[93047.579372] failed send message: TransferTableSmu2Dram (18) param: 0x00000006 response 0xffffffc2
[93047.579375] Failed to export SMU metrics table!
[93049.194680] Msg issuing pre-check failed and SMU may be not in the right state!
[93049.194684] Failed to export SMU metrics table!
[93050.809268] Msg issuing pre-check failed and SMU may be not in the right state!
[93050.809271] Failed to export SMU metrics table!
[93052.580637] Msg issuing pre-check failed and SMU may be not in the right state!
[93052.580640] Failed to export SMU metrics table!
[93054.122533] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=1401670, emitted seq=1401672
[93054.122591] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process chrome pid 39576 thread chrome:cs0 pid 39601
[93054.195936] Msg issuing pre-check failed and SMU may be not in the right state!
[93054.195939] Failed to export SMU metrics table!
[93054.195960] amdgpu 0000:03:00.0: GPU reset begin!
[93055.810397] Msg issuing pre-check failed and SMU may be not in the right state!
[93055.810402] Failed to export SMU metrics table!
[93057.382428] Msg issuing pre-check failed and SMU may be not in the right state!
[93058.985010] Msg issuing pre-check failed and SMU may be not in the right state!
[93058.985014] Failed to export SMU metrics table!
[93059.754884] show_signal_msg: 42 callbacks suppressed
[93059.754886] GpuWatchdog[39613]: segfault at 0 ip 00005593d75d84e0 sp 00007f2813d954a0 error 6 in chrome[5593d326c000+734a000]
[93059.754890] Code: 3d 00 58 fb fa be 01 00 00 00 ba 07 00 00 00 e8 16 fa 71 fe 48 8d 3d e8 95 fc fa be 01 00 00 00 ba 03 00 00 00 e8 00 fa 71 fe <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 e6 bf 96 03 01 80 7d 87 00
[93060.556352] Msg issuing pre-check failed and SMU may be not in the right state!
[93060.771239] amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[93060.771293] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[93060.962088] amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[93060.962144] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[93062.199162] Msg issuing pre-check failed and SMU may be not in the right state!
[93062.199166] Failed to export SMU metrics table!
[93063.772028] Msg issuing pre-check failed and SMU may be not in the right state!
[93063.772035] Failed to get smu version.
[93063.772102] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -62
[93065.342793] Msg issuing pre-check failed and SMU may be not in the right state!
[93065.342796] Failed to export SMU metrics table!
[93066.920058] Msg issuing pre-check failed and SMU may be not in the right state!
[93066.920062] Failed to export SMU metrics table!
[93068.490766] Msg issuing pre-check failed and SMU may be not in the right state!
[93068.490770] Failed to export SMU metrics table!
[93070.061527] Msg issuing pre-check failed and SMU may be not in the right state!
[93070.061530] Failed to export SMU metrics table!
Followed by an ASIC reset error:
[93071.631875] Msg issuing pre-check failed and SMU may be not in the right state!
[93071.641826] [drm:amdgpu_do_asic_reset [amdgpu]] *ERROR* ASIC reset failed with error, -62 for drm dev, 0000:03:00.0
[93071.641865] amdgpu 0000:03:00.0: GPU reset(1) failed
[93071.641866] amdgpu 0000:03:00.0: couldn't schedule ib on ring <sdma0>
[93071.641930] [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
[93071.641939] amdgpu 0000:03:00.0: couldn't schedule ib on ring <sdma0>
[93071.642010] [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
[93071.642011] amdgpu 0000:03:00.0: GPU reset end with ret = -62
[93071.642015] amdgpu 0000:03:00.0: couldn't schedule ib on ring <sdma0>
[93071.642072] [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
[93071.642075] amdgpu 0000:03:00.0: couldn't schedule ib on ring <sdma0>
[93071.642130] [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
[93071.643527] [drm] scheduler sdma0 is not ready, skipping
[93071.643528] [drm] scheduler sdma1 is not ready, skipping
[93071.643577] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
Then, the crash (24s after the initial error):
[93071.643603] BUG: kernel NULL pointer dereference, address: 0000000000000008
[93071.643604] #PF: supervisor read access in kernel mode
[93071.643605] #PF: error_code(0x0000) - not-present page
[93071.643606] PGD 0 P4D 0
[93071.643608] Oops: 0000 [#1] PREEMPT SMP PTI
[93071.643610] CPU: 0 PID: 2384 Comm: Xorg Kdump: loaded Not tainted 5.7.0-050700rc5-lowlatency #202005101931
[93071.643611] Hardware name: ASUS All Series/Z87-PLUS, BIOS 2103 08/15/2014
[93071.643657] RIP: 0010:amdgpu_vm_sdma_commit+0x55/0x200 [amdgpu]
[93071.643659] Code: 47 20 80 7f 10 00 4c 8b a0 88 01 00 00 48 8b 47 08 4c 8d b0 e0 00 00 00 75 07 4c 8d b0 98 01 00 00 49 8b 46 10 41 8b 54 24 08 <48> 8b 40 08 48 8d 78 88 85 d2 0f 84 47 01 00 00 48 8b 40 90 4c 89
[93071.643660] RSP: 0018:ffffb5e94257fb20 EFLAGS: 00010246
[93071.643661] RAX: 0000000000000000 RBX: ffffb5e94257fb80 RCX: 0000000800105000
[93071.643662] RDX: 0000000000000020 RSI: ffffb5e94257fc28 RDI: ffffb5e94257fb80
[93071.643663] RBP: ffffb5e94257fb50 R08: ffff9d0cde740888 R09: 0000000000000000
[93071.643664] R10: 000000000000001d R11: 000000000000000d R12: ffff9d0af0e54de8
[93071.643664] R13: ffffb5e94257fc28 R14: ffff9d0ccc16c198 R15: ffff9d0ccc16c000
[93071.643666] FS: 00007fda99fb0a80(0000) GS:ffff9d0ce6c00000(0000) knlGS:0000000000000000
[93071.643666] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[93071.643667] CR2: 0000000000000008 CR3: 000000021dfda003 CR4: 00000000001606f0
[93071.643668] Call Trace:
[93071.643713] amdgpu_vm_bo_update_mapping+0x1b6/0x1e0 [amdgpu]
[93071.643755] amdgpu_vm_clear_freed+0xdb/0x220 [amdgpu]
[93071.643796] amdgpu_gem_va_ioctl+0x3d0/0x4e0 [amdgpu]
[93071.643836] ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[93071.643854] drm_ioctl_kernel+0xae/0xf0 [drm]
[93071.643864] drm_ioctl+0x234/0x3d0 [drm]
[93071.643902] ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[93071.643906] ? __fget_files+0x6a/0x90
[93071.643943] amdgpu_drm_ioctl+0x4e/0x80 [amdgpu]
[93071.643946] ksys_ioctl+0x9d/0xd0
[93071.643948] __x64_sys_ioctl+0x1a/0x20
[93071.643950] do_syscall_64+0x57/0x1d0
[93071.643954] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[93071.643955] RIP: 0033:0x7fda9a30e37b
[93071.643956] Code: 0f 1e fa 48 8b 05 15 3b 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e5 3a 0d 00 f7 d8 64 89 01 48
[93071.643957] RSP: 002b:00007fff800ae6f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[93071.643959] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fda9a30e37b
[93071.643960] RDX: 00007fff800ae750 RSI: 00000000c0286448 RDI: 000000000000000d
[93071.643960] RBP: 00007fff800ae750 R08: ffff800104e00000 R09: 000000000000000e
[93071.643961] R10: 0000000000000fff R11: 0000000000000246 R12: 00000000c0286448
[93071.643962] R13: 000000000000000d R14: 0000000000000002 R15: 00007fff800ae7d0
[93071.643964] Modules linked in: cdc_acm rpcsec_gss_krb5 nfsv4 nfs fscache hid_logitech_hidpp nls_iso8859_1 intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_rapl_perf snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio uvcvideo snd_hda_codec_hdmi videobuf2_vmalloc videobuf2_memops joydev hid_logitech_dj videobuf2_v4l2 videobuf2_common snd_hda_intel snd_intel_dspcfg videodev snd_hda_codec snd_usb_audio snd_hda_core snd_usbmidi_lib snd_hwdep mc cp210x usbserial snd_pcm eeepc_wmi asus_wmi wmi_bmof sparse_keymap snd_seq_midi efi_pstore snd_seq_midi_event i2c_i801 snd_rawmidi snd_seq snd_seq_device snd_timer snd e1000e mei_me soundcore mei lpc_ich sch_fq_codel nfsd nct6775 hwmon_vid parport_pc ppdev auth_rpcgss binfmt_misc nfs_acl lp parport lockd grace sunrpc ip_tables x_tables autofs4 input_leds hid_generic usbhid hid amdgpu mxm_wmi aesni_intel amd_iommu_v2 glue_helper
[93071.643988] gpu_sched crypto_simd i2c_algo_bit cryptd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec ahci rc_core libahci drm wmi video mac_hid
[93071.643995] CR2: 0000000000000008
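The 24 s gap between the first SMU error and the NULL pointer dereference can be checked directly from the dmesg timestamps. Below is a minimal sketch in Python, assuming the log is available as a list of lines in the default dmesg format with a `[seconds.microseconds]` prefix. The helper names (`first_timestamp`, `seconds_between`) are made up for illustration and are not part of any tool.

```python
import re

# Matches the "[seconds.micros]" prefix that dmesg puts on each line.
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")

def first_timestamp(lines, needle):
    """Return the timestamp (in seconds) of the first line containing needle, or None."""
    for line in lines:
        m = TS_RE.match(line)
        if m and needle in m.group(2):
            return float(m.group(1))
    return None

def seconds_between(lines, start_needle, end_needle):
    """Return the elapsed seconds between the first occurrences of two messages."""
    start = first_timestamp(lines, start_needle)
    end = first_timestamp(lines, end_needle)
    if start is None or end is None:
        return None
    return end - start

# Three lines copied from the log in this report.
sample = [
    "[93047.579372] failed send message: TransferTableSmu2Dram (18) param: 0x00000006 response 0xffffffc2",
    "[93054.195960] amdgpu 0000:03:00.0: GPU reset begin!",
    "[93071.643603] BUG: kernel NULL pointer dereference, address: 0000000000000008",
]

delta = seconds_between(
    sample,
    "failed send message: TransferTableSmu2Dram",
    "BUG: kernel NULL pointer dereference",
)
print(f"{delta:.1f} s between first SMU error and kernel oops")
```

The same helper can be pointed at a saved `dmesg` dump to compare the delay across several crashes, for example to see whether the GPU reset always starts a few seconds after the first failed SMU message.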
Some package versions:
- Low-latency kernel from Ubuntu mainline repo: 5.7.0-050700rc5.202005101931 and 5.7.0-050700rc4.202005051752
- Mesa daily from oibaf repo: 20.2 git2005140730.cf21b7oibaf~f
- AMD amdgpu Pro OpenCL: 20.10-1048554