NULL pointer dereference, amdgpu_dm_atomic_commit_tail, dc_resource_state_copy_construct
I've had some amdgpu-related crashes today. First a KFENCE warning, then (after upgrading the kernel and firmware) a NULL pointer dereference on the latest kernel. Both were followed by a GUI lockup, but the machine stayed otherwise responsive (over SSH).
Could this be related to #1700 (closed)?
GPU is a PowerColor Radeon RX 550, on an ASUS PRIME X570-PRO with a 5900X CPU. Three different displays are connected. Running sway (Wayland). The machine is relatively new; I have been using it since the beginning of August, without any issues until today.
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] (rev c7) (prog-if 00 [VGA controller])
I've been on linux 5.13.10.arch1-1 since Aug 16th, and on linux-firmware 20210716.b7c134f-1 since before that.
I did not have a single kernel issue until this morning, when the whole GUI suddenly hung while I was using Thunderbird on one workspace. One of the other displays was running Chromium with a couple of windowed video feeds; the third monitor was showing some random websites in Chromium. I was able to SSH into the machine and found this in the journal:
Sep 09 07:31:22 johan-amd kernel: WARNING: CPU: 18 PID: 45102 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x39/0xc0
Sep 09 07:31:22 johan-amd kernel: Modules linked in: veth nf_tables tcp_diag inet_diag v4l2loopback(OE) tun snd_seq_dummy snd_hrtimer snd_seq xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl_msr eeepc_wmi asus_wmi sparse_keymap rfkill video mxm_wmi wmi_bmof intel_rapl_common edac_mce_amd kvm_amd snd_hda_codec_realtek amdgpu kvm snd_hda_codec_generic ledtrig_audio irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel snd_hda_codec_hdmi aesni_intel crypto_simd snd_hda_intel cryptd gpu_sched snd_intel_dspcfg drm_ttm_helper snd_intel_sdw_acpi rapl snd_usb_audio ttm snd_hda_codec snd_usbmidi_lib drm_kms_helper snd_hda_core snd_rawmidi snd_hwdep snd_seq_device uvcvideo snd_pcm cec igb videobuf2_vmalloc snd_timer ccp videobuf2_memops syscopyarea videobuf2_v4l2 sysfillrect snd joydev sp5100_tco sysimgblt i2c_algo_bit pcspkr k10temp
Sep 09 07:31:22 johan-amd kernel: i2c_piix4 rng_core fb_sys_fops soundcore videobuf2_common cp210x mousedev dca wmi pinctrl_amd mac_hid acpi_cpufreq videodev mc drm nct6775 hwmon_vid fuse agpgart bpf_preload ip_tables x_tables hid_logitech_hidpp hid_logitech_dj usbhid uas usb_storage zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) xhci_pci spl(OE) xhci_pci_renesas [last unloaded: v4l2loopback]
Sep 09 07:31:22 johan-amd kernel: CPU: 18 PID: 45102 Comm: kworker/u64:5 Tainted: P OE 5.13.10-arch1-1 #1
Sep 09 07:31:22 johan-amd kernel: Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 4002 06/15/2021
Sep 09 07:31:22 johan-amd kernel: Workqueue: events_unbound commit_work [drm_kms_helper]
Sep 09 07:31:22 johan-amd kernel: RIP: 0010:kfence_protect_page+0x39/0xc0
Sep 09 07:31:22 johan-amd kernel: Code: 25 28 00 00 00 48 89 44 24 08 31 c0 48 8d 74 24 04 c7 44 24 04 00 00 00 00 e8 e3 64 db ff 48 85 c0 74 07 83 7c 24 04 01 74 06 <0f> 0b 31 c0 eb 4c 48 8b 38 48 89 c2 84 db 75 59 48 89 f8 0f 1f 40
Sep 09 07:31:22 johan-amd kernel: RSP: 0018:ffffb6d4c5697928 EFLAGS: 00010046
Sep 09 07:31:22 johan-amd kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffbda10000
Sep 09 07:31:22 johan-amd kernel: RDX: ffffb6d4c569792c RSI: 0000000000000000 RDI: ffffffffbda10000
Sep 09 07:31:22 johan-amd kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Sep 09 07:31:22 johan-amd kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
Sep 09 07:31:22 johan-amd kernel: R13: ffffb6d4c56979e8 R14: 0000000000000002 R15: 0000000000000000
Sep 09 07:31:22 johan-amd kernel: FS: 0000000000000000(0000) GS:ffff99286ee80000(0000) knlGS:0000000000000000
Sep 09 07:31:22 johan-amd kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 09 07:31:22 johan-amd kernel: CR2: 0000000000000390 CR3: 0000000429446000 CR4: 0000000000750ee0
Sep 09 07:31:22 johan-amd kernel: PKRU: 55555554
Sep 09 07:31:22 johan-amd kernel: Call Trace:
Sep 09 07:31:22 johan-amd kernel: kfence_unprotect+0x13/0x30
Sep 09 07:31:22 johan-amd kernel: page_fault_oops+0x9d/0x2d0
Sep 09 07:31:22 johan-amd kernel: exc_page_fault+0x78/0x180
Sep 09 07:31:22 johan-amd kernel: asm_exc_page_fault+0x1e/0x30
Sep 09 07:31:22 johan-amd kernel: RIP: 0010:dc_stream_retain+0x11/0x40 [amdgpu]
Sep 09 07:31:22 johan-amd kernel: Code: 00 00 c3 c7 87 3c 03 00 00 01 00 00 00 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 4c 8d 87 90 03 00 00 b8 01 00 00 00 <f0> 0f c1 87 90 03 00 00 85 c0 74 15 8d 50 01 09 c2 78 01 c3 be 01
Sep 09 07:31:22 johan-amd kernel: RSP: 0018:ffffb6d4c5697a90 EFLAGS: 00010246
Sep 09 07:31:22 johan-amd kernel: RAX: 0000000000000001 RBX: ffff99248c080000 RCX: ffff99248c082068
Sep 09 07:31:22 johan-amd kernel: RDX: 0000000000000000 RSI: ffff9926659b2468 RDI: 0000000000000000
Sep 09 07:31:22 johan-amd kernel: RBP: ffff99248c080000 R08: 0000000000000390 R09: 0000000000000006
Sep 09 07:31:22 johan-amd kernel: R10: 00000000000190f2 R11: 0000000000000020 R12: ffff99248c080000
Sep 09 07:31:22 johan-amd kernel: R13: 0000000000000001 R14: 0000000000000000 R15: ffff991b9f1eeb80
Sep 09 07:31:22 johan-amd kernel: dc_resource_state_copy_construct+0xea/0x130 [amdgpu]
Sep 09 07:31:22 johan-amd kernel: amdgpu_dm_atomic_commit_tail+0x1f20/0x2690 [amdgpu]
Sep 09 07:31:22 johan-amd kernel: commit_tail+0x94/0x120 [drm_kms_helper]
Sep 09 07:31:22 johan-amd kernel: process_one_work+0x1e3/0x3b0
Sep 09 07:31:22 johan-amd kernel: worker_thread+0x50/0x3b0
Sep 09 07:31:22 johan-amd kernel: ? process_one_work+0x3b0/0x3b0
Sep 09 07:31:22 johan-amd kernel: kthread+0x133/0x160
Sep 09 07:31:22 johan-amd kernel: ? set_kthread_struct+0x40/0x40
Sep 09 07:31:22 johan-amd kernel: ret_from_fork+0x22/0x30
Sep 09 07:31:22 johan-amd kernel: ---[ end trace e6999ddb21ea307c ]---
Sep 09 07:31:22 johan-amd kernel: ------------[ cut here ]------------
Sep 09 07:31:22 johan-amd kernel: WARNING: CPU: 18 PID: 45102 at mm/kfence/core.c:135 kfence_unprotect+0x18/0x30
Sep 09 07:31:22 johan-amd kernel: Modules linked in: veth nf_tables tcp_diag inet_diag v4l2loopback(OE) tun snd_seq_dummy snd_hrtimer snd_seq xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl_msr eeepc_wmi asus_wmi sparse_keymap rfkill video mxm_wmi wmi_bmof intel_rapl_common edac_mce_amd kvm_amd snd_hda_codec_realtek amdgpu kvm snd_hda_codec_generic ledtrig_audio irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel snd_hda_codec_hdmi aesni_intel crypto_simd snd_hda_intel cryptd gpu_sched snd_intel_dspcfg drm_ttm_helper snd_intel_sdw_acpi rapl snd_usb_audio ttm snd_hda_codec snd_usbmidi_lib drm_kms_helper snd_hda_core snd_rawmidi snd_hwdep snd_seq_device uvcvideo snd_pcm cec igb videobuf2_vmalloc snd_timer ccp videobuf2_memops syscopyarea videobuf2_v4l2 sysfillrect snd joydev sp5100_tco sysimgblt i2c_algo_bit pcspkr k10temp
Sep 09 07:31:22 johan-amd kernel: i2c_piix4 rng_core fb_sys_fops soundcore videobuf2_common cp210x mousedev dca wmi pinctrl_amd mac_hid acpi_cpufreq videodev mc drm nct6775 hwmon_vid fuse agpgart bpf_preload ip_tables x_tables hid_logitech_hidpp hid_logitech_dj usbhid uas usb_storage zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) xhci_pci spl(OE) xhci_pci_renesas [last unloaded: v4l2loopback]
Sep 09 07:31:22 johan-amd kernel: CPU: 18 PID: 45102 Comm: kworker/u64:5 Tainted: P W OE 5.13.10-arch1-1 #1
Sep 09 07:31:22 johan-amd kernel: Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 4002 06/15/2021
Sep 09 07:31:22 johan-amd kernel: Workqueue: events_unbound commit_work [drm_kms_helper]
Sep 09 07:31:22 johan-amd kernel: RIP: 0010:kfence_unprotect+0x18/0x30
Sep 09 07:31:22 johan-amd kernel: Code: 05 ec fe 92 01 00 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 81 e7 00 f0 ff ff 31 f6 e8 fd fe ff ff 84 c0 74 01 c3 <0f> 0b c6 05 bf fe 92 01 00 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f
Sep 09 07:31:22 johan-amd kernel: RSP: 0018:ffffb6d4c5697950 EFLAGS: 00010046
Sep 09 07:31:22 johan-amd kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffbda10000
Sep 09 07:31:22 johan-amd kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffbda10000
Sep 09 07:31:22 johan-amd kernel: RBP: 0000000000000390 R08: 0000000000000000 R09: 0000000000000000
Sep 09 07:31:22 johan-amd kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
Sep 09 07:31:22 johan-amd kernel: R13: ffffb6d4c56979e8 R14: 0000000000000002 R15: 0000000000000000
Sep 09 07:31:22 johan-amd kernel: FS: 0000000000000000(0000) GS:ffff99286ee80000(0000) knlGS:0000000000000000
Sep 09 07:31:22 johan-amd kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 09 07:31:22 johan-amd kernel: CR2: 0000000000000390 CR3: 0000000429446000 CR4: 0000000000750ee0
Sep 09 07:31:22 johan-amd kernel: PKRU: 55555554
Sep 09 07:31:22 johan-amd kernel: Call Trace:
Sep 09 07:31:22 johan-amd kernel: page_fault_oops+0x9d/0x2d0
Sep 09 07:31:22 johan-amd kernel: exc_page_fault+0x78/0x180
Sep 09 07:31:22 johan-amd kernel: asm_exc_page_fault+0x1e/0x30
Sep 09 07:31:22 johan-amd kernel: RIP: 0010:dc_stream_retain+0x11/0x40 [amdgpu]
Sep 09 07:31:22 johan-amd kernel: Code: 00 00 c3 c7 87 3c 03 00 00 01 00 00 00 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 4c 8d 87 90 03 00 00 b8 01 00 00 00 <f0> 0f c1 87 90 03 00 00 85 c0 74 15 8d 50 01 09 c2 78 01 c3 be 01
Sep 09 07:31:22 johan-amd kernel: RSP: 0018:ffffb6d4c5697a90 EFLAGS: 00010246
Sep 09 07:31:22 johan-amd kernel: RAX: 0000000000000001 RBX: ffff99248c080000 RCX: ffff99248c082068
Sep 09 07:31:22 johan-amd kernel: RDX: 0000000000000000 RSI: ffff9926659b2468 RDI: 0000000000000000
Sep 09 07:31:22 johan-amd kernel: RBP: ffff99248c080000 R08: 0000000000000390 R09: 0000000000000006
Sep 09 07:31:22 johan-amd kernel: R10: 00000000000190f2 R11: 0000000000000020 R12: ffff99248c080000
Sep 09 07:31:22 johan-amd kernel: R13: 0000000000000001 R14: 0000000000000000 R15: ffff991b9f1eeb80
Sep 09 07:31:22 johan-amd kernel: dc_resource_state_copy_construct+0xea/0x130 [amdgpu]
Sep 09 07:31:22 johan-amd kernel: amdgpu_dm_atomic_commit_tail+0x1f20/0x2690 [amdgpu]
Sep 09 07:31:22 johan-amd kernel: commit_tail+0x94/0x120 [drm_kms_helper]
Sep 09 07:31:22 johan-amd kernel: process_one_work+0x1e3/0x3b0
Sep 09 07:31:22 johan-amd kernel: worker_thread+0x50/0x3b0
Sep 09 07:31:22 johan-amd kernel: ? process_one_work+0x3b0/0x3b0
Sep 09 07:31:22 johan-amd kernel: kthread+0x133/0x160
Sep 09 07:31:22 johan-amd kernel: ? set_kthread_struct+0x40/0x40
Sep 09 07:31:22 johan-amd kernel: ret_from_fork+0x22/0x30
Sep 09 07:31:22 johan-amd kernel: ---[ end trace e6999ddb21ea307d ]---
Sep 09 07:31:22 johan-amd kernel: BUG: kernel NULL pointer dereference, address: 0000000000000390
Sep 09 07:31:22 johan-amd kernel: #PF: supervisor write access in kernel mode
Sep 09 07:31:22 johan-amd kernel: #PF: error_code(0x0002) - not-present page
Sep 09 07:31:22 johan-amd kernel: PGD 435d8b067 P4D 435d8b067 PUD 0
Sep 09 07:31:22 johan-amd kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
Sep 09 07:31:22 johan-amd kernel: CPU: 18 PID: 45102 Comm: kworker/u64:5 Tainted: P W OE 5.13.10-arch1-1 #1
Sep 09 07:31:22 johan-amd kernel: Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 4002 06/15/2021
Sep 09 07:31:22 johan-amd kernel: Workqueue: events_unbound commit_work [drm_kms_helper]
Sep 09 07:31:22 johan-amd kernel: RIP: 0010:dc_stream_retain+0x11/0x40 [amdgpu]
Sep 09 07:31:22 johan-amd kernel: Code: 00 00 c3 c7 87 3c 03 00 00 01 00 00 00 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 4c 8d 87 90 03 00 00 b8 01 00 00 00 <f0> 0f c1 87 90 03 00 00 85 c0 74 15 8d 50 01 09 c2 78 01 c3 be 01
Sep 09 07:31:22 johan-amd kernel: RSP: 0018:ffffb6d4c5697a90 EFLAGS: 00010246
Sep 09 07:31:22 johan-amd kernel: RAX: 0000000000000001 RBX: ffff99248c080000 RCX: ffff99248c082068
Sep 09 07:31:22 johan-amd kernel: RDX: 0000000000000000 RSI: ffff9926659b2468 RDI: 0000000000000000
Sep 09 07:31:22 johan-amd kernel: RBP: ffff99248c080000 R08: 0000000000000390 R09: 0000000000000006
Sep 09 07:31:22 johan-amd kernel: R10: 00000000000190f2 R11: 0000000000000020 R12: ffff99248c080000
Sep 09 07:31:22 johan-amd kernel: R13: 0000000000000001 R14: 0000000000000000 R15: ffff991b9f1eeb80
Sep 09 07:31:22 johan-amd kernel: FS: 0000000000000000(0000) GS:ffff99286ee80000(0000) knlGS:0000000000000000
Sep 09 07:31:22 johan-amd kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 09 07:31:22 johan-amd kernel: CR2: 0000000000000390 CR3: 0000000429446000 CR4: 0000000000750ee0
Sep 09 07:31:22 johan-amd kernel: PKRU: 55555554
Sep 09 07:31:22 johan-amd kernel: Call Trace:
Sep 09 07:31:22 johan-amd kernel: dc_resource_state_copy_construct+0xea/0x130 [amdgpu]
Sep 09 07:31:22 johan-amd kernel: amdgpu_dm_atomic_commit_tail+0x1f20/0x2690 [amdgpu]
Sep 09 07:31:22 johan-amd kernel: commit_tail+0x94/0x120 [drm_kms_helper]
Sep 09 07:31:22 johan-amd kernel: process_one_work+0x1e3/0x3b0
Sep 09 07:31:22 johan-amd kernel: worker_thread+0x50/0x3b0
Sep 09 07:31:22 johan-amd kernel: ? process_one_work+0x3b0/0x3b0
Sep 09 07:31:22 johan-amd kernel: kthread+0x133/0x160
Sep 09 07:31:22 johan-amd kernel: ? set_kthread_struct+0x40/0x40
Sep 09 07:31:22 johan-amd kernel: ret_from_fork+0x22/0x30
Sep 09 07:31:22 johan-amd kernel: Modules linked in: veth nf_tables tcp_diag inet_diag v4l2loopback(OE) tun snd_seq_dummy snd_hrtimer snd_seq xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl_msr eeepc_wmi asus_wmi sparse_keymap rfkill video mxm_wmi wmi_bmof intel_rapl_common edac_mce_amd kvm_amd snd_hda_codec_realtek amdgpu kvm snd_hda_codec_generic ledtrig_audio irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel snd_hda_codec_hdmi aesni_intel crypto_simd snd_hda_intel cryptd gpu_sched snd_intel_dspcfg drm_ttm_helper snd_intel_sdw_acpi rapl snd_usb_audio ttm snd_hda_codec snd_usbmidi_lib drm_kms_helper snd_hda_core snd_rawmidi snd_hwdep snd_seq_device uvcvideo snd_pcm cec igb videobuf2_vmalloc snd_timer ccp videobuf2_memops syscopyarea videobuf2_v4l2 sysfillrect snd joydev sp5100_tco sysimgblt i2c_algo_bit pcspkr k10temp
Sep 09 07:31:22 johan-amd kernel: i2c_piix4 rng_core fb_sys_fops soundcore videobuf2_common cp210x mousedev dca wmi pinctrl_amd mac_hid acpi_cpufreq videodev mc drm nct6775 hwmon_vid fuse agpgart bpf_preload ip_tables x_tables hid_logitech_hidpp hid_logitech_dj usbhid uas usb_storage zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) xhci_pci spl(OE) xhci_pci_renesas [last unloaded: v4l2loopback]
Sep 09 07:31:22 johan-amd kernel: CR2: 0000000000000390
Sep 09 07:31:22 johan-amd kernel: ---[ end trace e6999ddb21ea307e ]---
Sep 09 07:31:22 johan-amd kernel: RIP: 0010:dc_stream_retain+0x11/0x40 [amdgpu]
Sep 09 07:31:22 johan-amd kernel: Code: 00 00 c3 c7 87 3c 03 00 00 01 00 00 00 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 4c 8d 87 90 03 00 00 b8 01 00 00 00 <f0> 0f c1 87 90 03 00 00 85 c0 74 15 8d 50 01 09 c2 78 01 c3 be 01
Sep 09 07:31:22 johan-amd kernel: RSP: 0018:ffffb6d4c5697a90 EFLAGS: 00010246
Sep 09 07:31:22 johan-amd kernel: RAX: 0000000000000001 RBX: ffff99248c080000 RCX: ffff99248c082068
Sep 09 07:31:22 johan-amd kernel: RDX: 0000000000000000 RSI: ffff9926659b2468 RDI: 0000000000000000
Sep 09 07:31:22 johan-amd kernel: RBP: ffff99248c080000 R08: 0000000000000390 R09: 0000000000000006
Sep 09 07:31:22 johan-amd kernel: R10: 00000000000190f2 R11: 0000000000000020 R12: ffff99248c080000
Sep 09 07:31:22 johan-amd kernel: R13: 0000000000000001 R14: 0000000000000000 R15: ffff991b9f1eeb80
Sep 09 07:31:22 johan-amd kernel: FS: 0000000000000000(0000) GS:ffff99286ee80000(0000) knlGS:0000000000000000
Sep 09 07:31:22 johan-amd kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 09 07:31:22 johan-amd kernel: CR2: 0000000000000390 CR3: 0000000429446000 CR4: 0000000000750ee0
Sep 09 07:31:22 johan-amd kernel: PKRU: 55555554
Sep 09 07:31:37 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 07:31:47 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 07:31:57 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 07:32:07 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 07:32:17 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 07:32:27 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 07:32:37 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 07:32:47 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 07:32:57 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
So I dug a bit and found some potentially related changes in linux-firmware and in the latest kernel updates (to be honest, I'm no longer sure about the kernel part, but I think so at least). I updated linux to 5.13.13.arch1-1 and linux-firmware to 20210818.c46b8c3-1, rebooted, and all seemed happy.
After a few hours, while I was in a video meeting (in Chromium again, although I was actually focused on another workspace at that moment, scrolling in a JVM-based application), all GPU output hung again. Audio still worked both ways, but apparently my camera froze. I was able to SSH in again, and this time found this NULL pointer dereference:
Sep 09 10:46:56 johan-amd kernel: BUG: kernel NULL pointer dereference, address: 0000000000000390
Sep 09 10:46:56 johan-amd kernel: #PF: supervisor write access in kernel mode
Sep 09 10:46:56 johan-amd kernel: #PF: error_code(0x0002) - not-present page
Sep 09 10:46:56 johan-amd kernel: PGD 0 P4D 0
Sep 09 10:46:56 johan-amd kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
Sep 09 10:46:56 johan-amd kernel: CPU: 15 PID: 270 Comm: kworker/u64:7 Tainted: P OE 5.13.13-arch1-1 #1
Sep 09 10:46:56 johan-amd kernel: Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 4002 06/15/2021
Sep 09 10:46:56 johan-amd kernel: Workqueue: events_unbound commit_work [drm_kms_helper]
Sep 09 10:46:56 johan-amd kernel: RIP: 0010:dc_stream_retain+0x11/0x40 [amdgpu]
Sep 09 10:46:56 johan-amd kernel: Code: 00 00 c3 c7 87 3c 03 00 00 01 00 00 00 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 4c 8d 87 90 03 00 00 b8 01 00 00 00 <f0> 0f c1 87 90 03 00 00 85 c0 74 15 8d 50 01 09 c2 78 01 c3 be 01
Sep 09 10:46:56 johan-amd kernel: RSP: 0018:ffffb273c0d87a90 EFLAGS: 00010246
Sep 09 10:46:56 johan-amd kernel: RAX: 0000000000000001 RBX: ffff9e0aa93c0000 RCX: ffff9e0aa93c2068
Sep 09 10:46:56 johan-amd kernel: RDX: 0000000000000000 RSI: ffff9e0aa65b2468 RDI: 0000000000000000
Sep 09 10:46:56 johan-amd kernel: RBP: ffff9e0aa93c0000 R08: 0000000000000390 R09: 0000000000000006
Sep 09 10:46:56 johan-amd kernel: R10: 0000000000007bc2 R11: 0000000000000020 R12: ffff9e0aa93c0000
Sep 09 10:46:56 johan-amd kernel: R13: 0000000000000001 R14: 0000000000000000 R15: ffff9e0609552200
Sep 09 10:46:56 johan-amd kernel: FS: 0000000000000000(0000) GS:ffff9e14eedc0000(0000) knlGS:0000000000000000
Sep 09 10:46:56 johan-amd kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 09 10:46:56 johan-amd kernel: CR2: 0000000000000390 CR3: 00000005a74e0000 CR4: 0000000000750ee0
Sep 09 10:46:56 johan-amd kernel: PKRU: 55555554
Sep 09 10:46:56 johan-amd kernel: Call Trace:
Sep 09 10:46:56 johan-amd kernel: dc_resource_state_copy_construct+0xea/0x130 [amdgpu]
Sep 09 10:46:56 johan-amd kernel: amdgpu_dm_atomic_commit_tail+0x1f20/0x2690 [amdgpu]
Sep 09 10:46:56 johan-amd kernel: ? cpumask_next_and+0x1f/0x20
Sep 09 10:46:56 johan-amd kernel: ? update_sd_lb_stats.constprop.0+0xf1/0x7c0
Sep 09 10:46:56 johan-amd kernel: ? find_busiest_group+0x41/0x310
Sep 09 10:46:56 johan-amd kernel: commit_tail+0x94/0x120 [drm_kms_helper]
Sep 09 10:46:56 johan-amd kernel: process_one_work+0x1e3/0x3b0
Sep 09 10:46:56 johan-amd kernel: worker_thread+0x50/0x3b0
Sep 09 10:46:56 johan-amd kernel: ? process_one_work+0x3b0/0x3b0
Sep 09 10:46:56 johan-amd kernel: kthread+0x133/0x160
Sep 09 10:46:56 johan-amd kernel: ? set_kthread_struct+0x40/0x40
Sep 09 10:46:56 johan-amd kernel: ret_from_fork+0x22/0x30
Sep 09 10:46:56 johan-amd kernel: Modules linked in: tun snd_seq_dummy snd_hrtimer snd_seq xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter xt_addrtype nft_compat nf_tables libcrc32c crc32c_generic nfnetlink br_netfilter bridge stp llc intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel eeepc_wmi aesni_intel asus_wmi sparse_keymap snd_hda_codec_realtek crypto_simd cryptd rfkill video wmi_bmof mxm_wmi snd_hda_codec_generic amdgpu ledtrig_audio rapl snd_hda_codec_hdmi snd_hda_intel gpu_sched snd_intel_dspcfg drm_ttm_helper snd_intel_sdw_acpi snd_usb_audio ttm snd_hda_codec drm_kms_helper snd_usbmidi_lib snd_hda_core snd_rawmidi uvcvideo snd_hwdep snd_seq_device snd_pcm cec videobuf2_vmalloc videobuf2_memops syscopyarea videobuf2_v4l2 snd_timer sysfillrect ccp pcspkr sp5100_tco snd joydev sysimgblt k10temp i2c_piix4 rng_core fb_sys_fops videobuf2_common
Sep 09 10:46:56 johan-amd kernel: soundcore mousedev igb i2c_algo_bit dca wmi pinctrl_amd mac_hid acpi_cpufreq v4l2loopback(OE) videodev mc drm nct6775 hwmon_vid fuse agpgart bpf_preload ip_tables x_tables hid_logitech_hidpp hid_logitech_dj usbhid uas usb_storage zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) xhci_pci xhci_pci_renesas
Sep 09 10:46:56 johan-amd kernel: CR2: 0000000000000390
Sep 09 10:46:56 johan-amd kernel: ---[ end trace ad2f666fc8a871eb ]---
Sep 09 10:46:56 johan-amd kernel: RIP: 0010:dc_stream_retain+0x11/0x40 [amdgpu]
Sep 09 10:46:56 johan-amd kernel: Code: 00 00 c3 c7 87 3c 03 00 00 01 00 00 00 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 4c 8d 87 90 03 00 00 b8 01 00 00 00 <f0> 0f c1 87 90 03 00 00 85 c0 74 15 8d 50 01 09 c2 78 01 c3 be 01
Sep 09 10:46:56 johan-amd kernel: RSP: 0018:ffffb273c0d87a90 EFLAGS: 00010246
Sep 09 10:46:56 johan-amd kernel: RAX: 0000000000000001 RBX: ffff9e0aa93c0000 RCX: ffff9e0aa93c2068
Sep 09 10:46:56 johan-amd kernel: RDX: 0000000000000000 RSI: ffff9e0aa65b2468 RDI: 0000000000000000
Sep 09 10:46:56 johan-amd kernel: RBP: ffff9e0aa93c0000 R08: 0000000000000390 R09: 0000000000000006
Sep 09 10:46:56 johan-amd kernel: R10: 0000000000007bc2 R11: 0000000000000020 R12: ffff9e0aa93c0000
Sep 09 10:46:56 johan-amd kernel: R13: 0000000000000001 R14: 0000000000000000 R15: ffff9e0609552200
Sep 09 10:46:56 johan-amd kernel: FS: 0000000000000000(0000) GS:ffff9e14eedc0000(0000) knlGS:0000000000000000
Sep 09 10:46:56 johan-amd kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 09 10:46:56 johan-amd kernel: CR2: 0000000000000390 CR3: 00000005a74e0000 CR4: 0000000000750ee0
Sep 09 10:46:56 johan-amd kernel: PKRU: 55555554
Sep 09 10:47:11 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 10:47:21 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 10:47:31 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 10:47:41 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 10:47:51 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 10:48:01 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 10:48:11 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 10:48:21 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 10:48:31 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 10:48:41 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 10:48:51 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 10:49:01 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
Sep 09 10:49:11 johan-amd kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:51:crtc-0] hw_done or flip_done timed out
...
I have Prometheus monitoring the machine's node_exporter output. The node_hwmon_temp_celsius value for the GPU was about 50 °C, but the last reported value was at 10:26:31, a bit before the NULL pointer dereference. Looking at the history, the temperature seems to ramp up and down between 50 and 60 °C.
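As a reading aid for the two oopses above, the `#PF: error_code(0x0002)` line and the fault address can be decoded mechanically. The following is only a minimal sketch, assuming the standard x86 page-fault error-code bit layout; the function names are hypothetical and not part of any kernel tooling. The `0x390` displacement is read off the `Code:` line, where the faulting bytes `<f0> 0f c1 87 90 03 00 00` appear to be a locked `xadd` at offset `0x390` from RDI. With RDI at 0 in both traces, that would be consistent with `dc_stream_retain` being called with a NULL stream pointer, but this is an interpretation of the log, not a confirmed root cause.

```python
# Decode the x86 page-fault error code printed in "#PF: error_code(...)".
# Each entry: (bit mask, meaning if set, meaning if clear or None to omit).
PF_BITS = [
    (0x01, "protection violation", "not-present page"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "supervisor mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error_code(code: int) -> list[str]:
    """Return human-readable descriptions for a page-fault error code."""
    out = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            out.append(if_set)
        elif if_clear is not None:
            out.append(if_clear)
    return out


def fault_matches_null_base(cr2: int, base: int, displacement: int) -> bool:
    """True if the fault address is exactly NULL base + member offset."""
    return base == 0 and cr2 == base + displacement


if __name__ == "__main__":
    # Values taken from the journal: error_code 0x0002, CR2 0x390, RDI 0.
    print(decode_pf_error_code(0x0002))
    print(fault_matches_null_base(cr2=0x390, base=0x0, displacement=0x390))
```

Both traces (5.13.10 and 5.13.13) give the same result for these inputs, which suggests the same faulting path in both crashes.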