[regression] amdgpu kernel NULL pointer dereference in 5.15.118
Brief summary of the problem:
Today I updated my desktop (AMD Navi 22 Radeon RX 6700XT, Slackware 15.0) from a custom 5.15.94 kernel to 5.15.118. After a restart and a successful boot with no additional kernel parameters, I logged in and started Xorg, at which point a kernel bug occurred: BUG: kernel NULL pointer dereference, address: 00000000000002b0.
This seems to be a regression caused by commit 1cc40dccad76 ("drm/amdgpu: fix Null pointer dereference error in amdgpu_device_recover_vram"), which was a fix for 6c032c37ac3e ("drm/amdgpu: Fix vram recover doesn't work after whole GPU reset (v2)").
As a workaround, I applied the patch in reverse (patch -R) to the 5.15.118 kernel source, built, installed, restarted, and then was able to successfully launch Xorg.
Kernel changelog entry:
commit 1cc40dccad76dda40f1a295173bcba0991af7c6f
Author: Horatio Zhang <Hongkun.Zhang@amd.com>
Date: Mon May 29 14:23:37 2023 -0400

drm/amdgpu: fix Null pointer dereference error in amdgpu_device_recover_vram

[ Upstream commit 2a1eb1a343208ce7d6839b73d62aece343e693ff ]

Use the function of amdgpu_bo_vm_destroy to handle the resource release of shadow bo. During the amdgpu_mes_self_test, shadow bo released, but vmbo->shadow_list was not, which caused a null pointer reference error in amdgpu_device_recover_vram when GPU reset.

Fixes: 6c032c37ac3e ("drm/amdgpu: Fix vram recover doesn't work after whole GPU reset (v2)")
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Signed-off-by: Horatio Zhang <Hongkun.Zhang@amd.com>
Acked-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Hardware description:
- CPU: Intel i5-9600K @ 3.70GHz
- GPU:
  - AMD Radeon RX 6700XT 1002:73df (12 GB dedicated)
  - Intel UHD Graphics 630 (via PRIME Z390-A)
  - drm i915 is loaded, but it is not in use by any display.
- System Memory: 16 GB
- Display(s):
  - Acer Model 8b7 (Monitor name: K243Y)
  - GSM Model 5b55 (Monitor name: LG FULL HD)
- Type of Display Connection:
  - DP-1
  - HDMI-A-2
System information:
- Distro name and Version: Slackware 15.0 64-bit + multilib
- Kernel version: 5.15.118
- Custom kernel: .config attached. I am not using huge or generic.
- AMD official driver version: N/A
- Mesa version: 21.3.5
How to reproduce the issue:
I don't know how much this helps.
- Compiled and installed kernel 5.15.118 with the drm/amdgpu module, using the attached .config from the previous 5.15.94 kernel after running make oldconfig.
- Restart.
- Login as non-root user.
- startx
Attached files:
Screenshots/video files
Log files (for system lockups / game freezes / crashes)
Kernel bug stack trace:
Jun 22 08:45:40 kochi kernel: BUG: kernel NULL pointer dereference, address: 00000000000002b0
Jun 22 08:45:40 kochi kernel: #PF: supervisor read access in kernel mode
Jun 22 08:45:40 kochi kernel: #PF: error_code(0x0000) - not-present page
Jun 22 08:45:40 kochi kernel: Oops: 0000 [#1] SMP NOPTI
Jun 22 08:45:40 kochi kernel: CPU: 2 PID: 1336 Comm: Xorg.wrap Not tainted 5.15.118 #1
Jun 22 08:45:40 kochi kernel: Hardware name: System manufacturer System Product Name/PRIME Z390-A, BIOS 1105 06/06/2019
Jun 22 08:45:40 kochi kernel: RIP: 0010:amdgpu_bo_vm_destroy+0x15/0x80 [amdgpu]
Jun 22 08:45:40 kochi kernel: Code: c7 83 e8 01 00 00 00 00 00 00 48 89 ef 5b 5d e9 c1 40 0b f2 90 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 53 48 8b 9f e8 01 00 00 <48> 8b 83 b0 02 00 00 4c 8d a3 b0 02 00 00 49 39 c4 74 41 48 8b 87
Jun 22 08:45:40 kochi kernel: RSP: 0018:ffffb146c0f03dd0 EFLAGS: 00010246
Jun 22 08:45:40 kochi kernel: RAX: ffffffffc11821a0 RBX: 0000000000000000 RCX: 00000000000000a4
Jun 22 08:45:40 kochi kernel: RDX: 0000000000000000 RSI: ffffb146c0f03d78 RDI: ffff96482b158058
Jun 22 08:45:40 kochi kernel: RBP: ffff96482b158058 R08: 0000000000000000 R09: ffff964811d65d40
Jun 22 08:45:40 kochi kernel: R10: ffff964808eae6a0 R11: fffff6318447fac0 R12: ffff9648035d27b0
Jun 22 08:45:40 kochi kernel: R13: ffff964803438000 R14: 0000000000008001 R15: ffff9648034380d0
Jun 22 08:45:40 kochi kernel: FS: 00007f0db1547b80(0000) GS:ffff964b6dc80000(0000) knlGS:0000000000000000
Jun 22 08:45:40 kochi kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 22 08:45:40 kochi kernel: CR2: 00000000000002b0 CR3: 0000000137728005 CR4: 00000000003706e0
Jun 22 08:45:40 kochi kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 22 08:45:40 kochi kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jun 22 08:45:40 kochi kernel: Call Trace:
Jun 22 08:45:40 kochi kernel: <TASK>
Jun 22 08:45:40 kochi kernel: ? __die+0x59/0x9c
Jun 22 08:45:40 kochi kernel: ? page_fault_oops+0xae/0x250
Jun 22 08:45:40 kochi kernel: ? kfree+0xc2/0x250
Jun 22 08:45:40 kochi kernel: ? kvfree_call_rcu+0x69/0x2a0
Jun 22 08:45:40 kochi kernel: ? exc_page_fault+0x406/0x770
Jun 22 08:45:40 kochi kernel: ? asm_exc_page_fault+0x22/0x30
Jun 22 08:45:40 kochi kernel: ? amdgpu_bo_destroy+0x70/0x70 [amdgpu]
Jun 22 08:45:40 kochi kernel: ? amdgpu_bo_vm_destroy+0x15/0x80 [amdgpu]
Jun 22 08:45:40 kochi kernel: amdgpu_bo_unref+0x1a/0x30 [amdgpu]
Jun 22 08:45:40 kochi kernel: amdgpu_driver_postclose_kms+0x18f/0x240 [amdgpu]
Jun 22 08:45:40 kochi kernel: drm_file_free.part.0+0x1d1/0x220 [drm]
Jun 22 08:45:40 kochi kernel: drm_release+0x65/0x110 [drm]
Jun 22 08:45:40 kochi kernel: __fput+0x89/0x250
Jun 22 08:45:40 kochi kernel: task_work_run+0x63/0xa0
Jun 22 08:45:40 kochi kernel: exit_to_user_mode_prepare+0x13f/0x150
Jun 22 08:45:40 kochi kernel: syscall_exit_to_user_mode+0x1d/0x40
Jun 22 08:45:40 kochi kernel: ? __x64_sys_close+0xd/0x50
Jun 22 08:45:40 kochi kernel: do_syscall_64+0x48/0xc0
Jun 22 08:45:40 kochi kernel: entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jun 22 08:45:40 kochi kernel: RIP: 0033:0x7f0db1739463
Jun 22 08:45:40 kochi kernel: Code: 8b 15 69 ab 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 83 ec 18 89 7c 24 0c e8
Jun 22 08:45:40 kochi kernel: RSP: 002b:00007fff49dceaa8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
Jun 22 08:45:40 kochi kernel: RAX: 0000000000000000 RBX: 0000000000000002 RCX: 00007f0db1739463
Jun 22 08:45:40 kochi kernel: RDX: 00007fff49dcead0 RSI: 00000000c04064a0 RDI: 0000000000000003
Jun 22 08:45:40 kochi kernel: RBP: 0000000000000002 R08: 0000000000000000 R09: 00007fff49dce950
Jun 22 08:45:40 kochi kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000402075
Jun 22 08:45:40 kochi kernel: R13: 00007fff49dceb10 R14: 000000000040206c R15: 0000000000000001
Jun 22 08:45:40 kochi kernel: </TASK>
Jun 22 08:45:40 kochi kernel: Modules linked in: bnep cfg80211 bridge stp llc ipv6 hid_logitech_hidpp joydev hid_logitech hid_logitech_dj hid_generic snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio btusb btrtl btbcm uvcvideo btintel videobuf2_vmalloc snd_usb_audio videobuf2_memops videobuf2_v4l2 usbhid bluetooth snd_usbmidi_lib videobuf2_common ecdh_generic snd_rawmidi ecc hid rfkill videodev snd_seq_device mc amdgpu i915 coretemp mfd_core intel_tcc_cooling x86_pkg_temp_thermal gpu_sched snd_hda_codec_hdmi intel_powerclamp drm_ttm_helper prime_numbers ttm snd_hda_intel snd_intel_dspcfg drm_kms_helper snd_hda_codec kvm_intel snd_hwdep syscopyarea wmi_bmof mxm_wmi sysfillrect snd_hda_core sysimgblt fb_sys_fops snd_pcm nvme e1000e drm intel_gtt snd_timer kvm nvme_core snd xhci_pci mei_me agpgart hwmon soundcore irqbypass ptp mei i2c_i801 xhci_hcd pps_core i2c_smbus evdev fan thermal wmi video acpi_pad button loop
Jun 22 08:45:40 kochi kernel: CR2: 00000000000002b0
Jun 22 08:45:40 kochi kernel: ---[ end trace ae6d30bc1cedef9e ]---
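As a sanity check on the trace above, the faulting instruction can be decoded by hand. The following is a minimal sketch, not something taken from the kernel tooling: it extracts the bytes at the <48> marker in the kernel-side Code: line and decodes them, assuming the instruction is a plain 64-bit MOV load with a 32-bit displacement. The helper names faulting_bytes and decode_mov_load are made up for illustration, and the decoder deliberately handles only that one encoding.

```python
# Kernel-side "Code:" line from the oops above (timestamp prefix dropped).
CODE_LINE = (
    "Code: c7 83 e8 01 00 00 00 00 00 00 48 89 ef 5b 5d e9 c1 40 0b f2 "
    "90 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 53 48 8b 9f e8 01 00 00 "
    "<48> 8b 83 b0 02 00 00 4c 8d a3 b0 02 00 00 49 39 c4 74 41 48 8b 87"
)

# x86-64 register numbering for the low eight 64-bit registers (no REX.R/REX.B).
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def faulting_bytes(code_line):
    """Return the bytes starting at the <xx> marker, i.e. at the faulting RIP."""
    tokens = code_line.split()[1:]  # drop the leading "Code:" label
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return bytes(int(tok.strip("<>"), 16) for tok in tokens[start:])


def decode_mov_load(insn):
    """Decode only 'REX.W 8B /r' with mod=10 (base register + disp32, no SIB).

    Returns (destination register, base register, displacement).
    Any other encoding is rejected rather than guessed at.
    """
    rex, opcode, modrm = insn[0], insn[1], insn[2]
    if rex != 0x48 or opcode != 0x8B:
        raise ValueError("not a plain 64-bit MOV r64, r/m64")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b10 or rm == 0b100:
        raise ValueError("addressing form not handled by this sketch")
    disp = int.from_bytes(insn[3:7], "little", signed=True)
    return REG64[reg], REG64[rm], disp


dest, base, disp = decode_mov_load(faulting_bytes(CODE_LINE))
rbx = 0x0  # RBX value from the register dump in the trace
print(f"mov {dest}, [{base}+{disp:#x}] -> address {rbx + disp:#018x}")
```

With RBX shown as 0 in the register dump, the decoded load from RBX+0x2b0 matches the reported fault address (CR2) of 00000000000002b0. That is consistent with amdgpu_bo_vm_destroy reading a field at offset 0x2b0 from a NULL base pointer.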
- messages
- Xorg log (unfortunately not saved; I would need to recompile a clean 5.15.118 and recapture it later)