Kernel NULL pointer dereference in amdgpu_gmc_set_pte_pde when enabling both Clover and AMD ROCm 5.7.0 OpenCL ICDs
Hello,
As I mentioned on IRC, I have a Lenovo ThinkPad P15v Gen3 with an AMD Ryzen 7 PRO 6850H with integrated graphics and a discrete NVIDIA GPU.
System information
NOTE: the inxi output was produced under X after loading the NVIDIA module, but the issue is also reproducible on the framebuffer console, before loading any graphical interface, and with the NVIDIA GPU modules blacklisted.
System:
Host: oblomov Kernel: 6.5.0-3-amd64 arch: x86_64 bits: 64 compiler: gcc
v: 13.2.0 Desktop: awesome v: 4.3 dm: SDDM Distro: Debian GNU/Linux
trixie/sid
CPU:
Info: 8-core model: AMD Ryzen 7 PRO 6850H with Radeon Graphics bits: 64
type: MT MCP arch: Zen 3+ rev: 1 cache: L1: 512 KiB L2: 4 MiB L3: 16 MiB
Speed (MHz): avg: 524 high: 1397 min/max: 400/4785 cores: 1: 400 2: 400
3: 400 4: 400 5: 400 6: 400 7: 400 8: 400 9: 1397 10: 400 11: 1397 12: 400
13: 400 14: 400 15: 400 16: 400 bogomips: 102200
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
Device-1: NVIDIA GA107GLM [RTX A2000 Mobile] vendor: Lenovo driver: nvidia
v: 525.125.06 arch: Ampere pcie: speed: 16 GT/s lanes: 8 bus-ID: 01:00.0
chip-ID: 10de:25b8
Device-2: AMD Rembrandt [Radeon 680M] vendor: Lenovo driver: amdgpu
v: kernel arch: RDNA-2 pcie: speed: 16 GT/s lanes: 16 ports: active: eDP-1
empty: DP-1, DP-2, DP-3, DP-4, DP-5, HDMI-A-1 bus-ID: 66:00.0
chip-ID: 1002:1681 temp: 39.0 C
Device-3: Bison Integrated Camera driver: uvcvideo type: USB rev: 2.0
speed: 480 Mb/s lanes: 1 bus-ID: 5-1:2 chip-ID: 5986:9106
Display: server: X.Org v: 1.21.1.9 with: Xwayland v: 23.2.2 driver: X:
loaded: modesetting,nvidia dri: radeonsi gpu: amdgpu display-ID: :0
screens: 1
Screen-1: 0 s-res: 3840x2160 s-dpi: 284
Monitor-1: eDP-1 model-id: CSO 0x1508 res: 3840x2160 dpi: 284
diag: 395mm (15.5")
API: EGL v: 1.5 platforms: device: 0 drv: nvidia device: 1 drv: radeonsi
device: 3 drv: swrast gbm: drv: nvidia surfaceless: drv: nvidia x11:
drv: radeonsi inactive: wayland,device-2
API: OpenGL v: 4.6 vendor: amd mesa v: 23.2.1-1 glx-v: 1.4 es-v: 3.2
direct-render: yes renderer: AMD Radeon Graphics (rembrandt LLVM 16.0.6 DRM
3.54 6.5.0-3-amd64) device-ID: 1002:1681
API: Vulkan v: 1.3.250 surfaces: xcb,xlib device: 0 type: integrated-gpu
driver: mesa radv device-ID: 1002:1681 device: 1 type: discrete-gpu
driver: nvidia device-ID: 10de:25b8 device: 2 type: cpu
driver: mesa llvmpipe device-ID: 10005:0000
The lspci -nnvv output for the AMD GPU in question is:
66:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Rembrandt [Radeon 680M] [1002:1681] (rev d8) (prog-if 00 [VGA controller])
Subsystem: Lenovo Rembrandt [Radeon 680M] [17aa:2300]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 83
IOMMU group: 19
Region 0: Memory at a70000000 (64-bit, prefetchable) [size=256M]
Region 2: Memory at a80000000 (64-bit, prefetchable) [size=2M]
Region 4: I/O ports at 1000 [size=256]
Region 5: Memory at b1900000 (32-bit, non-prefetchable) [size=512K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s, Width x16
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-
AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
AtomicOpsCtl: ReqEn+
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [a0] MSI: Enable- Count=1/4 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [c0] MSI-X: Enable+ Count=4 Masked-
Vector table: BAR=5 offset=00042000
PBA: BAR=5 offset=00043000
Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [270 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [2a0 v1] Access Control Services
ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
Capabilities: [2b0 v1] Address Translation Service (ATS)
ATSCap: Invalidate Queue Depth: 00
ATSCtl: Enable+, Smallest Translation Unit: 00
Capabilities: [2c0 v1] Page Request Interface (PRI)
PRICtl: Enable- Reset-
PRISta: RF- UPRGI- Stopped+
Page Request Capacity: 00000100, Page Request Allocation: 00000000
Capabilities: [2d0 v1] Process Address Space ID (PASID)
PASIDCap: Exec+ Priv+, Max PASID Width: 10
PASIDCtl: Enable- Exec- Priv-
Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [450 v1] Lane Margining at the Receiver <?>
Kernel driver in use: amdgpu
Kernel modules: amdgpu
Describe the issue
I am seeing a kernel (Linux 6.5.8) NULL pointer dereference when both the Mesa Clover 23.2 and AMD ROCm 5.7.0 OpenCL ICDs are enabled, even for something as simple as clinfo -l, which only enumerates platforms and devices.
There is no fault when only rocm and rusticl are enabled, or when only clover and rusticl are enabled.
The only combination that seems to trigger the error is clover + rocm.
This also happens with only the AMD GPU modules loaded, with neither nouveau nor the proprietary NVIDIA kernel modules. It also happens on the framebuffer console (no Xorg, no Wayland). It's consistently reproducible.
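For anyone who wants to retest the pairwise combinations described above, here is a minimal sketch of how each pair could be isolated. It assumes the ocl-icd loader, which as far as I know honours the OCL_ICD_VENDORS environment variable as an alternative vendors directory. The .icd file names passed in are hypothetical and depend on the distribution. The helper only builds the environment and command line; it does not run clinfo, since the clover + rocm pair is expected to oops the kernel.

```python
import itertools
import os
import tempfile
from pathlib import Path


def icd_pairs(names):
    """Return every unordered pair of ICD names, in sorted order."""
    return list(itertools.combinations(sorted(names), 2))


def make_vendor_dir(icd_files, root):
    """Create a fresh directory under root with symlinks to the given .icd files only."""
    vendor_dir = Path(tempfile.mkdtemp(dir=root))
    for icd in icd_files:
        icd = Path(icd)
        os.symlink(icd.resolve(), vendor_dir / icd.name)
    return vendor_dir


def clinfo_command(vendor_dir):
    """Build (but do not run) the environment and argv for one test.

    Assumption: the installed loader is ocl-icd and reads OCL_ICD_VENDORS.
    """
    env = dict(os.environ, OCL_ICD_VENDORS=str(vendor_dir))
    return env, ["clinfo", "-l"]
```

Passing the returned env and argv to subprocess.run would perform the actual test. Doing that for the clover + rocm directory should only be attempted on a machine where a kernel oops is acceptable.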
Regression
I just got this machine so I don't know if other version combinations would work.
Log files as attachment
This is the relevant part of the dmesg output:
[ 135.520536] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 135.522275] #PF: supervisor write access in kernel mode
[ 135.523736] #PF: error_code(0x0002) - not-present page
[ 135.524489] PGD 0 P4D 0
[ 135.525184] Oops: 0002 [#1] PREEMPT SMP NOPTI
[ 135.526528] CPU: 10 PID: 2172 Comm: clinfo Tainted: G OE 6.5.0-3-amd64 #1 Debian 6.5.8-1
[ 135.526528] Hardware name: LENOVO 21EMS02300/21EMS02300, BIOS N3KET34W (1.12 ) 12/19/2022
[ 135.528841] RIP: 0010:amdgpu_gmc_set_pte_pde+0x23/0x30 [amdgpu]
[ 135.529207] Code: 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff ff 00 00 48 21 c1 8d 04 d5 00 00 00 00 4c 09 c1 48 01 c6 <48> 89 0e 31 c0 e9 73 9e 94 d5 0f 1f 00 90 90 90 90 90 90 90 90 90
[ 135.529207] RSP: 0018:ffffa2bf855ef978 EFLAGS: 00010246
[ 135.529207] RAX: 0000000000000000 RBX: 0000000000200000 RCX: 0040000000000480
[ 135.533330] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff96ef5d200000
[ 135.533330] RBP: ffffa2bf855efae0 R08: 0040000000000480 R09: 0000000000200000
[ 135.533330] R10: 0040000000000480 R11: 0000000000000009 R12: 0000000000200000
[ 135.533330] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
[ 135.536844] FS: 00007fadb1a3a740(0000) GS:ffff96f67e880000(0000) knlGS:0000000000000000
[ 135.536844] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 135.536844] CR2: 0000000000000000 CR3: 000000010d60a000 CR4: 0000000000750ee0
[ 135.536844] PKRU: 55555554
[ 135.541706] Call Trace:
[ 135.541706] <TASK>
[ 135.541706] ? __die+0x23/0x70
[ 135.544848] ? page_fault_oops+0x171/0x4f0
[ 135.545301] ? put_pages_list+0xc8/0xf0
[ 135.545301] ? exc_page_fault+0x7f/0x180
[ 135.545301] ? asm_exc_page_fault+0x26/0x30
[ 135.545301] ? amdgpu_gmc_set_pte_pde+0x23/0x30 [amdgpu]
[ 135.545301] ? srso_alias_return_thunk+0x5/0x7f
[ 135.545301] amdgpu_vm_cpu_update+0x92/0x110 [amdgpu]
[ 135.545301] amdgpu_vm_ptes_update+0x32c/0x930 [amdgpu]
[ 135.545301] amdgpu_vm_update_range+0x241/0x740 [amdgpu]
[ 135.545301] amdgpu_vm_clear_freed+0x116/0x250 [amdgpu]
[ 135.545301] amdgpu_gem_va_ioctl+0x43f/0x590 [amdgpu]
[ 135.545301] ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
[ 135.545301] drm_ioctl_kernel+0xcd/0x170 [drm]
[ 135.545301] drm_ioctl+0x26d/0x4b0 [drm]
[ 135.545301] ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
[ 135.545301] amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
[ 135.545301] __x64_sys_ioctl+0x97/0xd0
[ 135.545301] do_syscall_64+0x60/0xc0
[ 135.545301] ? srso_alias_return_thunk+0x5/0x7f
[ 135.564849] ? exit_to_user_mode_prepare+0x40/0x1d0
[ 135.565150] ? srso_alias_return_thunk+0x5/0x7f
[ 135.565150] ? syscall_exit_to_user_mode+0x2b/0x40
[ 135.566816] ? srso_alias_return_thunk+0x5/0x7f
[ 135.566816] ? do_syscall_64+0x6c/0xc0
[ 135.566816] ? up_read+0x3b/0x80
[ 135.566816] ? srso_alias_return_thunk+0x5/0x7f
[ 135.566816] ? do_user_addr_fault+0x18c/0x640
[ 135.566816] ? srso_alias_return_thunk+0x5/0x7f
[ 135.566816] ? fpregs_assert_state_consistent+0x26/0x50
[ 135.566816] ? srso_alias_return_thunk+0x5/0x7f
[ 135.566816] ? exit_to_user_mode_prepare+0x40/0x1d0
[ 135.566816] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 135.566816] RIP: 0033:0x7fadb1b3a51b
[ 135.566816] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 135.566816] RSP: 002b:00007ffd5f560db0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 135.566816] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fadb1b3a51b
[ 135.566816] RDX: 00007ffd5f560e50 RSI: 00000000c0286448 RDI: 0000000000000006
[ 135.566816] RBP: 00007ffd5f560e50 R08: ffff800100000000 R09: 000000000000000e
[ 135.566816] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000c0286448
[ 135.566816] R13: 0000000000000006 R14: 0000000000200000 R15: 0000000000000002
[ 135.566816] </TASK>
[ 135.566816] Modules linked in: rfcomm snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel xfrm_interface xfrm6_tunnel pppox tunnel6 ppp_generic tunnel4 slhc xfrm_user xfrm_algo ctr ccm michael_mic xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 cmac algif_hash algif_skcipher af_alg nf_tables nfnetlink bridge stp llc qrtr_mhi nfc cpufreq_ondemand cpufreq_powersave cpufreq_userspace cpufreq_conservative ipmi_devintf ipmi_msghandler bnep uinput intel_rapl_msr intel_rapl_common binfmt_misc edac_mce_amd btusb btrtl btbcm btintel kvm_amd snd_ctl_led btmtk bluetooth snd_hda_codec_realtek qrtr kvm snd_hda_codec_generic ath11k_pci snd_hda_codec_hdmi ath11k irqbypass sha3_generic uvcvideo jitterentropy_rng qmi_helpers snd_hda_intel videobuf2_vmalloc uvc videobuf2_memops ghash_clmulni_intel snd_intel_dspcfg sha512_ssse3
[ 135.566816] videobuf2_v4l2 snd_soc_dmic snd_soc_acp6x_mach snd_acp6x_pdm_dma snd_intel_sdw_acpi nls_ascii sha512_generic mac80211 videodev snd_soc_core nls_cp437 drbg snd_hda_codec vfat videobuf2_common aesni_intel ansi_cprng fat snd_compress mc libarc4 snd_hda_core crypto_simd ecdh_generic cryptd ecc think_lmi snd_hwdep rapl snd_pci_acp6x firmware_attributes_class wmi_bmof snd_pci_acp5x snd_pcm_oss cfg80211 snd_mixer_oss snd_rn_pci_acp3x ucsi_acpi sp5100_tco snd_acp_config typec_ucsi snd_pcm snd_soc_acpi roles k10temp watchdog ccp mhi snd_pci_acp3x snd_timer typec thinkpad_acpi nvram ledtrig_audio platform_profile snd soundcore rfkill ac amd_pmc acpi_tad joydev evdev serio_raw drivetemp scsi_mod scsi_common msr ecryptfs parport_pc nfsd ppdev lp parport auth_rpcgss nfs_acl loop lockd grace fuse efi_pstore dm_mod configfs sunrpc ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c
[ 135.585209] crc32c_generic raid1 raid0 multipath linear md_mod amdgpu amdxcp drm_buddy gpu_sched i2c_algo_bit xhci_pci drm_suballoc_helper drm_display_helper nvme xhci_hcd cec nvme_core rc_core drm_ttm_helper ttm t10_pi sdhci_pci usbcore cqhci crc64_rocksoft drm_kms_helper r8169 crc64 crc_t10dif sdhci realtek crct10dif_generic mdio_devres crc32_pclmul crct10dif_pclmul drm thunderbolt psmouse crc32c_intel libphy mmc_core i2c_piix4 usb_common crct10dif_common video fan battery wmi i2c_scmi button
[ 135.593360] CR2: 0000000000000000
[ 135.593360] ---[ end trace 0000000000000000 ]---
[ 135.918248] RIP: 0010:amdgpu_gmc_set_pte_pde+0x23/0x30 [amdgpu]
[ 135.943334] Code: 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff ff 00 00 48 21 c1 8d 04 d5 00 00 00 00 4c 09 c1 48 01 c6 <48> 89 0e 31 c0 e9 73 9e 94 d5 0f 1f 00 90 90 90 90 90 90 90 90 90
[ 135.945228] RSP: 0018:ffffa2bf855ef978 EFLAGS: 00010246
[ 135.945228] RAX: 0000000000000000 RBX: 0000000000200000 RCX: 0040000000000480
[ 135.945228] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff96ef5d200000
[ 135.945228] RBP: ffffa2bf855efae0 R08: 0040000000000480 R09: 0000000000200000
[ 135.945228] R10: 0040000000000480 R11: 0000000000000009 R12: 0000000000200000
[ 135.945228] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
[ 135.945228] FS: 00007fadb1a3a740(0000) GS:ffff96f67e880000(0000) knlGS:0000000000000000
[ 135.945228] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 135.945228] CR2: 0000000000000000 CR3: 000000010d60a000 CR4: 0000000000750ee0
[ 135.945228] PKRU: 55555554
[ 135.945228] note: clinfo[2172] exited with irqs disabled
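To make the oops above easier to read, here is a small sketch that decodes three values from it: the page-fault error code (0x0002), the ioctl request number in the user-space RSI (0xc0286448), and the store at the faulting instruction. The store model is my own reading of the bytes in the Code: line (and rcx with a 48-bit page mask, lea eax from rdx*8, or rcx with r8, add rsi with rax, then mov to [rsi]). The parameter names are labels I chose for illustration, not taken from the log, so treat this as an interpretation rather than the driver's actual source.

```python
# Immediate loaded by the "48 b8 00 f0 ff ff ff ff 00 00" bytes in the Code: line.
PTE_ADDR_MASK = 0x0000FFFFFFFFF000


def decode_pf_error_code(code):
    """Decode the three low bits of an x86 page-fault error code."""
    return {
        "present": bool(code & 0x1),  # False means a not-present page
        "write": bool(code & 0x2),    # True means a write access
        "user": bool(code & 0x4),     # False means supervisor (kernel) mode
    }


def decode_ioctl(request):
    """Split a Linux ioctl request number into its _IOC fields."""
    return {
        "dir": (request >> 30) & 0x3,      # 3 = read and write
        "size": (request >> 16) & 0x3FFF,  # argument struct size in bytes
        "type": chr((request >> 8) & 0xFF),
        "nr": request & 0xFF,
    }


def faulting_store(cpu_pt_addr, gpu_page_idx, addr, flags):
    """Model the store as read from the Code: bytes (an assumption).

    Returns (target address, 64-bit value written).
    """
    value = (addr & PTE_ADDR_MASK) | flags
    offset = (gpu_page_idx * 8) & 0xFFFFFFFF  # lea writes the 32-bit eax
    target = (cpu_pt_addr + offset) & 0xFFFFFFFFFFFFFFFF
    return target, value


print(decode_pf_error_code(0x0002))
print(decode_ioctl(0xC0286448))
# RSI = 0 and RDX = 0 in the dump; addr = 0 is a placeholder consistent with RCX == R8.
print(faulting_store(0x0, 0, 0x0, 0x0040000000000480))
```

With the register values from the dump, the modelled store targets address 0, which matches CR2, and the value equals RCX. The ioctl fields come out as type 'd', number 0x48 and a 40-byte argument, which is consistent with amdgpu_gem_va_ioctl appearing in the call trace. If this reading is right, the page-table CPU pointer passed down from amdgpu_vm_cpu_update was NULL.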