RDNA4 NULL pointer dereference
Brief summary of the problem:
After playing The Witcher 3 for a while, I get the following crash/hang in amdgpu. The GPU does not recover and I can no longer control the PC. It is not reliably reproducible, but my wife hits it roughly once per hour of play (most consistently when opening crates, but also when switching menus or the HUD).
Hardware description:
- CPU: AMD Ryzen 5950X
- GPU: AMD RX 9070 XT
- System Memory: 64 GiB DDR4 3600 MHz
- Display(s): Eizo FORIS FS2735
- Type of Display Connection: DP (FreeSync)
- Resizable BAR: On
- IOMMU: On
System information:
- Distro name and Version: Gentoo Linux
- Kernel version: 6.14-rc6 (config copied from 6.13.6 gentoo-dist)
- Custom kernel: N/A
- AMD official driver version: N/A
- linux-firmware: 20250311
- MESA: mesa-git (mesa-9999)
How to reproduce the issue:
- Have Witcher 3 NG installed in Wine 10.3 + dxvk + vkd3d-proton.
- Play the game using DirectX 12, without RT, using TAAU, for a while. Within a few hours, amdgpu crashes with the following report in dmesg:
Mar 12 21:19:39 arcadia kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Mar 12 21:19:39 arcadia kernel: #PF: supervisor read access in kernel mode
Mar 12 21:19:39 arcadia kernel: #PF: error_code(0x0000) - not-present page
Mar 12 21:19:39 arcadia kernel: PGD 308b8a067 P4D 308b8a067 PUD 0
Mar 12 21:19:39 arcadia kernel: Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
Mar 12 21:19:39 arcadia kernel: CPU: 22 UID: 0 PID: 142070 Comm: kworker/u130:1 Tainted: G U 6.14.0-rc6-gentoo-dist #1
Mar 12 21:19:39 arcadia kernel: Tainted: [U]=USER
Mar 12 21:19:39 arcadia kernel: Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ULTRA/X570 AORUS ULTRA, BIOS F39d 09/02/2024
Mar 12 21:19:39 arcadia kernel: Workqueue: events_unbound commit_work [drm_kms_helper]
Mar 12 21:19:39 arcadia kernel: RIP: 0010:calculate_mcache_setting+0x516/0xbe0 [amdgpu]
Mar 12 21:19:39 arcadia kernel: Code: 0f 2a c0 e8 0c 70 04 00 48 8b 93 90 00 00 00 f2 48 0f 2c c0 0f af 85 08 46 00 00 42 89 04 a2 48 8b 83 80 00 00 00 49 83 c4 01 <8b> 00 83 e8 01 41 39 c4 72 aa 8b 95 18 46 00 00 48 c1 e0 02 48 8b
Mar 12 21:19:39 arcadia kernel: RSP: 0018:ffffada1da5f7540 EFLAGS: 00010202
Mar 12 21:19:39 arcadia kernel: RAX: 0000000000000000 RBX: ffff94231ba0d028 RCX: 0000000000000000
Mar 12 21:19:39 arcadia kernel: RDX: ffff94231ba064d0 RSI: ffffffffc266cc10 RDI: ffffffffc271afb0
Mar 12 21:19:39 arcadia kernel: RBP: ffff94231ba08de8 R08: ffff94231ba04f4c R09: 0000000000000000
Mar 12 21:19:39 arcadia kernel: R10: ffffada1da5f7520 R11: ffff94231ba04740 R12: 0000000000001af8
Mar 12 21:19:39 arcadia kernel: R13: ffff94231ba06494 R14: ffff94231ba00040 R15: ffff94231ba038f0
Mar 12 21:19:39 arcadia kernel: FS: 0000000000000000(0000) GS:ffff94253eb00000(0000) knlGS:0000000000000000
Mar 12 21:19:39 arcadia kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 12 21:19:39 arcadia kernel: CR2: 0000000000000000 CR3: 000000026caf6000 CR4: 0000000000f50ef0
Mar 12 21:19:39 arcadia kernel: PKRU: 55555554
Mar 12 21:19:39 arcadia kernel: Call Trace:
Mar 12 21:19:39 arcadia kernel: <TASK>
Mar 12 21:19:39 arcadia kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Mar 12 21:19:39 arcadia kernel: ? show_trace_log_lvl+0x255/0x2f0
Mar 12 21:19:39 arcadia kernel: ? show_trace_log_lvl+0x255/0x2f0
Mar 12 21:19:39 arcadia kernel: ? dml_core_mode_support+0x7ee1/0x17550 [amdgpu]
Mar 12 21:19:39 arcadia kernel: ? __die_body.cold+0x8/0x12
Mar 12 21:19:39 arcadia kernel: ? page_fault_oops+0x148/0x160
Mar 12 21:19:39 arcadia kernel: ? exc_page_fault+0x7e/0x1a0
Mar 12 21:19:39 arcadia kernel: ? asm_exc_page_fault+0x26/0x30
Mar 12 21:19:39 arcadia kernel: ? calculate_mcache_setting+0x516/0xbe0 [amdgpu]
Mar 12 21:19:39 arcadia kernel: dml_core_mode_support+0x7ee1/0x17550 [amdgpu]
Mar 12 21:19:39 arcadia kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Mar 12 21:19:39 arcadia kernel: ? calculate_vactive_det_fill_latency+0x15a/0x2d0 [amdgpu]
Mar 12 21:19:39 arcadia kernel: ? dml2_core_calcs_mode_support_ex+0x31/0x100 [amdgpu]
Mar 12 21:19:39 arcadia kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Mar 12 21:19:39 arcadia kernel: dml2_core_calcs_mode_support_ex+0x31/0x100 [amdgpu]
Mar 12 21:19:39 arcadia kernel: core_dcn4_mode_support+0x76/0xb00 [amdgpu]
Mar 12 21:19:39 arcadia kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Mar 12 21:19:39 arcadia kernel: dml2_top_optimization_perform_optimization_phase+0x1c3/0x280 [amdgpu]
Mar 12 21:19:39 arcadia kernel: ? core_dcn4_mode_support+0x6d0/0xb00 [amdgpu]
Mar 12 21:19:39 arcadia kernel: dml2_top_soc15_build_mode_programming+0x2f5/0x800 [amdgpu]
Mar 12 21:19:39 arcadia kernel: dml21_mode_check_and_programming+0x110/0x1b0 [amdgpu]
Mar 12 21:19:39 arcadia kernel: update_planes_and_stream_state+0x285/0x5c0 [amdgpu]
Mar 12 21:19:39 arcadia kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Mar 12 21:19:39 arcadia kernel: ? complete+0x1c/0x80
Mar 12 21:19:39 arcadia kernel: update_planes_and_stream_v3+0x54/0x1b0 [amdgpu]
Mar 12 21:19:39 arcadia kernel: dc_update_planes_and_stream+0x43/0x100 [amdgpu]
Mar 12 21:19:39 arcadia kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Mar 12 21:19:39 arcadia kernel: ? sort+0x31/0x50
Mar 12 21:19:39 arcadia kernel: amdgpu_dm_commit_planes+0x590/0x1480 [amdgpu]
Mar 12 21:19:39 arcadia kernel: amdgpu_dm_atomic_commit_tail+0x9d9/0x10b0 [amdgpu]
Mar 12 21:19:39 arcadia kernel: commit_tail+0xaf/0x160 [drm_kms_helper]
Mar 12 21:19:39 arcadia kernel: process_one_work+0x179/0x330
Mar 12 21:19:39 arcadia kernel: worker_thread+0x252/0x390
Mar 12 21:19:39 arcadia kernel: ? __pfx_worker_thread+0x10/0x10
Mar 12 21:19:39 arcadia kernel: kthread+0xef/0x230
Mar 12 21:19:39 arcadia kernel: ? __pfx_kthread+0x10/0x10
Mar 12 21:19:39 arcadia kernel: ret_from_fork+0x34/0x50
Mar 12 21:19:39 arcadia kernel: ? __pfx_kthread+0x10/0x10
Mar 12 21:19:39 arcadia kernel: ret_from_fork_asm+0x1a/0x30
Mar 12 21:19:39 arcadia kernel: </TASK>
Mar 12 21:19:39 arcadia kernel: Modules linked in: uinput veth nf_conntrack_netlink rfcomm snd_seq_dummy snd_hrtimer snd_seq ip6table_nat ip6table_filter ip6_tables 8021q garp mrp vhost_net vhost vhost_iotlb tap tun bridge stp llc overlay xt_nat xt_MASQUERADE xt_addrtype iptable_nat nf_nat xt_conntrack nf_conntrac>
Mar 12 21:19:39 arcadia kernel: soundcore gigabyte_wmi wmi_bmof i2c_algo_bit i2c_smbus pcspkr k10temp mxm_wmi rfkill dca grace nfs_localio sunrpc fuse loop nfnetlink dm_crypt polyval_clmulni nvme polyval_generic nvme_core ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 sp5100_tco nvme_auth wmi virtio_bal>
Mar 12 21:19:39 arcadia kernel: CR2: 0000000000000000
Mar 12 21:19:39 arcadia kernel: ---[ end trace 0000000000000000 ]---
Mar 12 21:19:39 arcadia kernel: RIP: 0010:calculate_mcache_setting+0x516/0xbe0 [amdgpu]
Mar 12 21:19:39 arcadia kernel: Code: 0f 2a c0 e8 0c 70 04 00 48 8b 93 90 00 00 00 f2 48 0f 2c c0 0f af 85 08 46 00 00 42 89 04 a2 48 8b 83 80 00 00 00 49 83 c4 01 <8b> 00 83 e8 01 41 39 c4 72 aa 8b 95 18 46 00 00 48 c1 e0 02 48 8b
Mar 12 21:19:39 arcadia kernel: RSP: 0018:ffffada1da5f7540 EFLAGS: 00010202
Mar 12 21:19:39 arcadia kernel: RAX: 0000000000000000 RBX: ffff94231ba0d028 RCX: 0000000000000000
Mar 12 21:19:39 arcadia kernel: RDX: ffff94231ba064d0 RSI: ffffffffc266cc10 RDI: ffffffffc271afb0
Mar 12 21:19:39 arcadia kernel: RBP: ffff94231ba08de8 R08: ffff94231ba04f4c R09: 0000000000000000
Mar 12 21:19:39 arcadia kernel: R10: ffffada1da5f7520 R11: ffff94231ba04740 R12: 0000000000001af8
Mar 12 21:19:39 arcadia kernel: R13: ffff94231ba06494 R14: ffff94231ba00040 R15: ffff94231ba038f0
Mar 12 21:19:39 arcadia kernel: FS: 0000000000000000(0000) GS:ffff94253eb00000(0000) knlGS:0000000000000000
Mar 12 21:19:39 arcadia kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 12 21:19:39 arcadia kernel: CR2: 0000000000000000 CR3: 000000026caf6000 CR4: 0000000000f50ef0
Mar 12 21:19:39 arcadia kernel: PKRU: 55555554
Mar 12 21:19:39 arcadia kernel: note: kworker/u130:1[142070] exited with irqs disabled
Mar 12 22:04:34 arcadia kernel: [drm:do_aquire_global_lock.isra.0 [amdgpu]] *ERROR* [CRTC:79:crtc-0] hw_done or flip_done timed out
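In case it helps triage, here is a minimal Python sketch that pulls the faulting symbol and the faulting instruction bytes out of the RIP and Code lines of the log above. The helper names `parse_rip` and `split_code` are made up for illustration and are not part of any kernel tooling. Hand-decoding the bytes around the `<8b>` marker, my reading is that `48 8b 83 80 00 00 00` is `mov 0x80(%rbx),%rax` and `8b 00` is `mov (%rax),%eax`, which would match RAX and CR2 both being zero, i.e. a pointer loaded from offset 0x80 of the structure in RBX was NULL. Please treat that as an assumption and confirm it with the kernel's `scripts/decodecode`.

```python
import re

# Both lines are copied verbatim from the dmesg output in this report.
RIP_LINE = "RIP: 0010:calculate_mcache_setting+0x516/0xbe0 [amdgpu]"
CODE_LINE = (
    "Code: 0f 2a c0 e8 0c 70 04 00 48 8b 93 90 00 00 00 f2 48 0f 2c c0 "
    "0f af 85 08 46 00 00 42 89 04 a2 48 8b 83 80 00 00 00 49 83 c4 01 "
    "<8b> 00 83 e8 01 41 39 c4 72 aa 8b 95 18 46 00 00 48 c1 e0 02 48 8b"
)

# Hypothetical helper: matches "RIP: <cs>:<symbol>+0x<off>/0x<size> [<module>]".
RIP_RE = re.compile(
    r"RIP: [0-9a-f]{4}:([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?: \[(\w+)\])?"
)


def parse_rip(line):
    """Return (symbol, offset, function_size, module) from an oops RIP line."""
    m = RIP_RE.search(line)
    if m is None:
        raise ValueError("not a RIP line")
    symbol, offset, size, module = m.groups()
    return symbol, int(offset, 16), int(size, 16), module


def split_code(line):
    """Split a Code line at the <xx> marker, which flags the faulting byte.

    Returns (bytes before the faulting instruction, bytes from it onward).
    """
    tokens = line.split("Code:", 1)[1].split()
    idx = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    before = bytes(int(tok, 16) for tok in tokens[:idx])
    after = bytes(int(tok.strip("<>"), 16) for tok in tokens[idx:])
    return before, after


if __name__ == "__main__":
    symbol, offset, size, module = parse_rip(RIP_LINE)
    before, after = split_code(CODE_LINE)
    print(f"fault in {symbol}+{offset:#x} (function size {size:#x}), module {module}")
    print("bytes leading up to the fault:", before[-11:].hex(" "))
    print("faulting instruction starts with:", after[:2].hex(" "))
```

The script only slices the log text; it does not disassemble anything, so the instruction decoding above still needs to be checked against the actual amdgpu.ko build.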
I have tried many different kernels, firmware versions, and even BIOS versions and settings, and I have run into a lot of different issues (including pageflip and ring timeouts). This bug stands out because it is usually the first one to hit and it comes with a stack trace. Hopefully that makes it reasonably easy to fix.