Null pointer dereference and freeze with kernel 5.18.1
I’ve found this and this, but I’m not sure whether they are related.
Jun 06 07:15:51 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Jun 06 07:15:51 kernel: #PF: supervisor read access in kernel mode
Jun 06 07:15:51 kernel: #PF: error_code(0x0000) - not-present page
Jun 06 07:15:51 kernel: PGD 0 P4D 0
Jun 06 07:15:51 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jun 06 07:15:51 kernel: CPU: 21 PID: 745 Comm: kworker/21:1H Tainted: P OE 5.18.1-arch1-1-zen2 #1 5bdf949406d275a886d90ed593ef80489ae62843
Jun 06 07:15:51 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570 Creator, BIOS P3.50 04/21/2021
Jun 06 07:15:51 kernel: Workqueue: events_highpri dm_irq_work_func [amdgpu]
Jun 06 07:15:51 kernel: RIP: 0010:dcn20_reset_hw_ctx_wrap+0x2f6/0x370 [amdgpu]
Jun 06 07:15:51 kernel: Code: 40 ad ee de 48 8b 54 24 18 48 8b ba e0 02 00 00 48 8b 07 48 8b 40 30 e8 28 ad ee de 31 f6 48 8b 54 24 18 48 8b ba e0 02 00 00 <48> 8b 07 48 8b 80 58 01 00 00 e8 0b ad ee de 48 8b 54 24 18 48 8b
Jun 06 07:15:51 kernel: RSP: 0018:ffffb5cc4290b918 EFLAGS: 00010246
Jun 06 07:15:51 kernel: RAX: 0000000000000001 RBX: 0000000000000002 RCX: ffffffffc0cff3b0
Jun 06 07:15:51 kernel: RDX: ffff975c01241020 RSI: 0000000000000000 RDI: 0000000000000000
Jun 06 07:15:51 kernel: RBP: ffff975c01240000 R08: 0000000000000210 R09: 0000000000000001
Jun 06 07:15:51 kernel: R10: 0000000000000300 R11: 0000000000000000 R12: 0000000000000002
Jun 06 07:15:51 kernel: R13: ffff975c01241208 R14: ffff975555fc0000 R15: 0000000000001208
Jun 06 07:15:51 kernel: FS: 0000000000000000(0000) GS:ffff97740ef40000(0000) knlGS:0000000000000000
Jun 06 07:15:51 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 06 07:15:51 kernel: CR2: 0000000000000000 CR3: 000080095ab36000 CR4: 0000000000350ee0
Jun 06 07:15:51 kernel: Call Trace:
Jun 06 07:15:51 kernel: <TASK>
Jun 06 07:15:51 kernel: dce110_apply_ctx_to_hw+0x66/0x720 [amdgpu 36c4441bb4fee6df23663bbf4b7d2f527de6e3f4]
Jun 06 07:15:51 kernel: ? __free_pages_ok+0x29a/0x540
Jun 06 07:15:51 kernel: dc_commit_state+0x377/0xaa0 [amdgpu 36c4441bb4fee6df23663bbf4b7d2f527de6e3f4]
Jun 06 07:15:51 kernel: amdgpu_dm_atomic_commit_tail+0x3a0/0x21e0 [amdgpu 36c4441bb4fee6df23663bbf4b7d2f527de6e3f4]
Jun 06 07:15:51 kernel: ? dcn20_fast_validate_bw+0x36d/0x410 [amdgpu 36c4441bb4fee6df23663bbf4b7d2f527de6e3f4]
Jun 06 07:15:51 kernel: ? dcn20_validate_bandwidth_internal+0xfd/0x2e0 [amdgpu 36c4441bb4fee6df23663bbf4b7d2f527de6e3f4]
Jun 06 07:15:51 kernel: ? dc_fpu_end+0x94/0xb0 [amdgpu 36c4441bb4fee6df23663bbf4b7d2f527de6e3f4]
Jun 06 07:15:51 kernel: ? dcn20_validate_bandwidth+0x47/0x50 [amdgpu 36c4441bb4fee6df23663bbf4b7d2f527de6e3f4]
Jun 06 07:15:51 kernel: ? dc_validate_global_state+0x309/0x3d0 [amdgpu 36c4441bb4fee6df23663bbf4b7d2f527de6e3f4]
Jun 06 07:15:51 kernel: ? dm_plane_helper_prepare_fb+0x207/0x2f0 [amdgpu 36c4441bb4fee6df23663bbf4b7d2f527de6e3f4]
Jun 06 07:15:51 kernel: ? __wait_for_common+0x197/0x1c0
Jun 06 07:15:51 kernel: ? usleep_range_state+0x90/0x90
Jun 06 07:15:51 kernel: commit_tail+0x92/0x120
Jun 06 07:15:51 kernel: drm_atomic_helper_commit+0x113/0x140
Jun 06 07:15:51 kernel: dm_restore_drm_connector_state+0xec/0x160 [amdgpu 36c4441bb4fee6df23663bbf4b7d2f527de6e3f4]
Jun 06 07:15:51 kernel: handle_hpd_irq_helper+0x14b/0x190 [amdgpu 36c4441bb4fee6df23663bbf4b7d2f527de6e3f4]
Jun 06 07:15:51 kernel: process_one_work+0x1c7/0x380
Jun 06 07:15:51 kernel: worker_thread+0x51/0x3a0
Jun 06 07:15:51 kernel: ? rescuer_thread+0x3a0/0x3a0
Jun 06 07:15:51 kernel: ? rescuer_thread+0x3a0/0x3a0
Jun 06 07:15:51 kernel: kthread+0xde/0x110
Jun 06 07:15:51 kernel: ? kthread_complete_and_exit+0x20/0x20
Jun 06 07:15:51 kernel: ret_from_fork+0x22/0x30
Jun 06 07:15:51 kernel: </TASK>
Jun 06 07:15:51 kernel: Modules linked in: iwlmvm iwlwifi iwlmei mac80211 libarc4 cfg80211 mei cdc_acm tcp_diag inet_diag vhost_net tap tun vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_type1 vfio vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock ecb ecryptfs ax88179_178a usbnet mii ses enclosure nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) smartpqi scsi_transport_sas ccm bridge dummy nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct cmac nft_nat algif_hash algif_skcipher af_alg bnep nft_chain_nat nf_nat nf_conntrack vfat fat nf_defrag_ipv6 nf_defrag_ipv4 8021q garp nf_tables mrp stp llc nct6683 nfnetlink lm92 raid1 md_mod mousedev joydev snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi typec_displayport intel_rapl_msr mxm_wmi intel_wmi_thunderbolt wmi_bmof snd_hda_intel snd_intel_dspcfg uvcvideo snd_intel_sdw_acpi snd_usb_audio videobuf2_vmalloc snd_hda_codec videobuf2_memops snd_usbmidi_lib intel_rapl_common btusb amd64_edac btrtl
Jun 06 07:15:51 kernel: videobuf2_v4l2 btbcm snd_hda_core snd_rawmidi edac_mce_amd videobuf2_common btintel snd_hwdep snd_seq_device usbip_host videodev btmtk kvm_amd razeraccessory(OE) razerkbd(OE) mc razermouse(OE) usbip_core kvm bluetooth snd_pcm ucsi_ccg snd_timer atlantic irqbypass typec_ucsi ecdh_generic snd sp5100_tco igb typec rfkill rapl pcspkr i2c_piix4 roles zenpower(OE) soundcore crc16 dca macsec pinctrl_amd mac_hid wmi acpi_cpufreq dm_multipath ipmi_devintf ipmi_msghandler sg crypto_user it87 hwmon_vid fuse xxhash_generic btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq dm_crypt cbc encrypted_keys trusted asn1_encoder tee uas usb_storage usbhid dm_mod amdgpu crct10dif_pclmul tpm_crb crc32_pclmul crc32c_intel ghash_clmulni_intel drm_ttm_helper aesni_intel ttm tpm_tis crypto_simd gpu_sched nvme tpm_tis_core cryptd ccp drm_dp_helper xhci_pci tpm nvme_core xhci_pci_renesas rng_core thunderbolt [last unloaded: cfg80211]
Jun 06 07:15:51 kernel: CR2: 0000000000000000
Jun 06 07:15:51 kernel: ---[ end trace 0000000000000000 ]---
Jun 06 07:15:51 kernel: RIP: 0010:dcn20_reset_hw_ctx_wrap+0x2f6/0x370 [amdgpu]
Jun 06 07:15:51 kernel: Code: 40 ad ee de 48 8b 54 24 18 48 8b ba e0 02 00 00 48 8b 07 48 8b 40 30 e8 28 ad ee de 31 f6 48 8b 54 24 18 48 8b ba e0 02 00 00 <48> 8b 07 48 8b 80 58 01 00 00 e8 0b ad ee de 48 8b 54 24 18 48 8b
Jun 06 07:15:51 kernel: RSP: 0018:ffffb5cc4290b918 EFLAGS: 00010246
Jun 06 07:15:51 kernel: RAX: 0000000000000001 RBX: 0000000000000002 RCX: ffffffffc0cff3b0
Jun 06 07:15:51 kernel: RDX: ffff975c01241020 RSI: 0000000000000000 RDI: 0000000000000000
Jun 06 07:15:51 kernel: RBP: ffff975c01240000 R08: 0000000000000210 R09: 0000000000000001
Jun 06 07:15:51 kernel: R10: 0000000000000300 R11: 0000000000000000 R12: 0000000000000002
Jun 06 07:15:51 kernel: R13: ffff975c01241208 R14: ffff975555fc0000 R15: 0000000000001208
Jun 06 07:15:51 kernel: FS: 0000000000000000(0000) GS:ffff97740ef40000(0000) knlGS:0000000000000000
Jun 06 07:15:51 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 06 07:15:51 kernel: CR2: 0000000000000000 CR3: 000080095ab36000 CR4: 0000000000350ee0
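The key fields of the oops above can be pulled out mechanically. The following is a minimal Python sketch and not part of the original report: `parse_oops`, `RIP_RE`, `CODE_RE` and `SAMPLE` are hypothetical names made up for illustration, and the regular expressions assume the journal line format shown above.

```python
import re

# Matches e.g. "RIP: 0010:dcn20_reset_hw_ctx_wrap+0x2f6/0x370 [amdgpu]".
RIP_RE = re.compile(
    r"RIP: [0-9a-f]{4}:(?P<symbol>[\w.]+)"
    r"\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?: \[(?P<module>\w+)\])?"
)
# In a "Code:" line the byte at the faulting RIP is wrapped in <..>.
CODE_RE = re.compile(r"Code: (?P<bytes>(?:<?[0-9a-f]{2}>? ?)+)")

# Two lines copied from the log above, used as sample input.
SAMPLE = [
    "Jun 06 07:15:51 kernel: RIP: 0010:dcn20_reset_hw_ctx_wrap+0x2f6/0x370 [amdgpu]",
    "Jun 06 07:15:51 kernel: Code: 40 ad ee de 48 8b 54 24 18 48 8b ba e0 02 00 00 "
    "48 8b 07 48 8b 40 30 e8 28 ad ee de 31 f6 48 8b 54 24 18 48 8b ba e0 02 00 00 "
    "<48> 8b 07 48 8b 80 58 01 00 00 e8 0b ad ee de 48 8b 54 24 18 48 8b",
]


def parse_oops(lines):
    """Return the first RIP symbol/offset and the Code bytes from RIP onwards."""
    result = {}
    for line in lines:
        if "symbol" not in result:
            m = RIP_RE.search(line)
            if m:
                result.update(
                    symbol=m["symbol"],
                    offset=int(m["offset"], 16),
                    size=int(m["size"], 16),
                    module=m["module"],  # None for built-in code
                )
        if "faulting_bytes" not in result:
            m = CODE_RE.search(line)
            if m:
                tokens = m["bytes"].split()
                marked = [i for i, t in enumerate(tokens) if t.startswith("<")]
                if marked:
                    # Keep the marked byte and everything after it.
                    result["faulting_bytes"] = [
                        t.strip("<>") for t in tokens[marked[0]:]
                    ]
    return result


if __name__ == "__main__":
    print(parse_oops(SAMPLE))
```

On the two sample lines this reports dcn20_reset_hw_ctx_wrap at offset 0x2f6 in amdgpu, with the faulting bytes starting 48 8b 07. As far as I can tell, that encodes mov (%rdi),%rax, which fits RDI and CR2 both being 0 in the register dump.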
Brief summary of the problem:
This causes a freeze but not a reboot, likely because it’s not a panic. The machine stops responding to network and all inputs … or so it seems. Strangely enough, virtual machines keep running, to the point that a virtual machine using a dedicated eGPU (an NVidia, not the AMD in the internal PCIe slot) can be unlocked, and its forwarded USB keyboard works, even though keyboards used by the physical machine don’t even respond to CapsLock. But it cannot sync; that will freeze too.
Hardware description:
The hardware is described in this post in more detail, the only difference being that it now runs the latest non-beta firmware 3.50.
- CPU: AMD Ryzen 3950X
- GPU: The NVidia is in a Thunderbolt eGPU, dedicated to VMs etc. The AMD is used as the “main” GPU and drives monitors.
  *-display
       description: VGA compatible controller
       product: Navi 10 [Radeon Pro W5700] [1002:7312]
       vendor: Advanced Micro Devices, Inc. [AMD/ATI] [1002]
       physical id: 0
       bus info: pci@0000:7e:00.0
       version: 00
       width: 64 bits
       clock: 33MHz
       capabilities: pm pciexpress msi vga_controller bus_master cap_list rom
       configuration: driver=amdgpu latency=0
       resources: iomemory:7c0-7bf iomemory:7e0-7df irq:131 memory:7c00000000-7dffffffff memory:7e00000000-7e0fffffff ioport:f000(size=256) memory:ef800000-ef87ffff memory:ef880000-ef89ffff
  *-display
       description: VGA compatible controller
       product: GP104GL [Quadro P5000] [10DE:1BB0]
       vendor: NVIDIA Corporation [10DE]
       physical id: 0
       bus info: pci@0000:3d:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=vfio-pci latency=0
       resources: iomemory:240-23f iomemory:240-23f irq:45 memory:e1000000-e1ffffff memory:2450000000-245fffffff memory:2460000000-2461ffffff ioport:0(size=128) memory:e2000000-e207ffff
- System Memory: 128 GB
- Display(s): 2 HP Z27q monitors (5120×2880), i.e. 4 tiles 2560×2880 each
- Type of Display Connection: 4× DisplayPort 1.2
System information:
- Distro name and Version: Arch (What is a Version?)
- Kernel version:
Linux ******************* 5.18.1-arch1-1-zen2 #1 SMP PREEMPT_DYNAMIC Sun, 05 Jun 2022 13:49:12 +0000 x86_64 GNU/Linux
- Custom kernel: Kind of. Built with -march=znver2, but using stock Arch kernel config.
- AMD official driver version: N/A (I don’t understand this question. The amdgpu module is bundled with the kernel. What is an “AMD official driver version”?)
How to reproduce the issue:
Just let it run with monitors in a low-power state. It will happen, eventually, in a few tens of minutes to hours.
As a side note, there is also a possibly GPU-related issue with screen blanking (in Gnome, on the same GPU configuration), described in this long-standing bug report. I think it is probably unrelated, partly because the Gnome crash is recoverable and has existed for ages, whereas this one is not recoverable and only appeared with 5.18 kernels.
Attached files:
Log files (for system lockups / game freezes / crashes)
- Dmesg log (full log). The NVidia API mismatch in the log is due to a system upgrade done during that particular uptime. However, the null pointer dereference keeps happening during low-power states of the AMD GPU, whether or not the NVidia eGPU is connected.
- Xorg log: N/A (I don’t use Xorg.)
- Any other log: N/A