Xe kernel NULL pointer dereference at virtual address 0000000000000000
While debugging the xe driver on a Raspberry Pi 5 (kernel built from the 6.12 branch) with a PCIe HAT, I get the following NULL pointer dereference when an Intel Arc A770 or Intel Arc A380 is installed:
[ 7.024176] xe 0000:03:00.0: enabling device (0000 -> 0002)
[ 7.028448] xe 0000:03:00.0: [drm] Found DG2/G10 (device ID 56a0) display version 13.00 stepping C0
[ 7.039286] xe 0000:03:00.0: [drm] Using GuC firmware from i915/dg2_guc_70.bin version 70.29.2
[ 7.047758] xe 0000:03:00.0: [drm] *ERROR* GT0: load failed: status = 0x400000A0, time = 4ms, freq = 2400MHz (req 2400MHz), done = -1
[ 7.047765] xe 0000:03:00.0: [drm] *ERROR* GT0: load failed: status: Reset = 0, BootROM = 0x50, UKernel = 0x00, MIA = 0x00, Auth = 0x01
[ 7.047768] xe 0000:03:00.0: [drm] *ERROR* GT0: firmware signature verification failed
[ 7.047773] xe 0000:03:00.0: [drm] *ERROR* CRITICAL: Xe has declared device 0000:03:00.0 as wedged.
IOCTLs and executions are blocked. Only a rebind may clear the failure
Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new
[ 7.047786] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 7.056611] Mem abort info:
[ 7.059407] ESR = 0x0000000096000004
[ 7.063163] EC = 0x25: DABT (current EL), IL = 32 bits
[ 7.068246] [drm] Initialized vc4 0.0.0 for axi:gpu on minor 1
[ 7.068491] SET = 0, FnV = 0
[ 7.071549] EA = 0, S1PTW = 0
[ 7.074695] FSC = 0x04: level 0 translation fault
[ 7.079586] Data abort info:
[ 7.082469] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[ 7.087970] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 7.093035] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 7.098363] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000105c17000
[ 7.104826] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[ 7.111640] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[ 7.117929] Modules linked in: spidev xe(+) vc4(+) brcmfmac hci_uart brcmutil snd_soc_hdmi_codec drm_dma_helper btbcm joydev cfg80211 bluetooth snd_soc_core i2c_algo_bit cec drm_buddy rpivid_hevc(C) drm_suballoc_helper hid_logitech_dj aes_ce_blk drm_gpuvm aes_ce_cipher pisp_be drm_exec v4l2_mem2mem ghash_ce drm_display_helper gf128mul sha2_ce ecdh_generic snd_compress videobuf2_dma_contig sha256_arm64 ecc v3d snd_pcm_dmaengine sha1_ce videobuf2_memops videobuf2_v4l2 libaes rfkill drm_ttm_helper snd_pcm videodev sha1_generic ttm snd_timer snd gpu_sched videobuf2_common raspberrypi_hwmon mc i2c_brcmstb drm_shmem_helper gpio_keys spi_bcm2835 drm_kms_helper pwm_fan rp1_pio raspberrypi_gpiomem rp1_mailbox nvmem_rmem rp1 rp1_adc uio_pdrv_genirq uio drm fuse drm_panel_orientation_quirks backlight dm_mod ip_tables x_tables ipv6
[ 7.190971] CPU: 2 UID: 0 PID: 331 Comm: (udev-worker) Tainted: G C 6.12.1-v8-16k+ #1
[ 7.190975] Tainted: [C]=CRAP
[ 7.190976] Hardware name: Raspberry Pi 5 Model B Rev 1.0 (DT)
[ 7.190977] pstate: 804000c9 (Nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 7.190980] pc : xe_gt_tlb_invalidation_reset+0x68/0xf0 [xe]
[ 7.191077] fbcon_init: detected unhandled fb_set_par error, error code -16
[ 7.191107] lr : xe_gt_tlb_invalidation_reset+0x4c/0xf0 [xe]
[ 7.191159] Console: switching to colour frame buffer device 240x67
[ 7.191205] sp : ffff8000809b35b0
[ 7.191206] x29: ffff8000809b35b0 x28: 0000000000000000 x27: 431bde82d7b634db
[ 7.191209] x26: 00000000001b0001 x25: ffff0001031cc328 x24: ffff0001031cce08
[ 7.191211] x23: 00000000ffffffff x22: ffff0001031cd770 x21: ffff0001031cc318
[ 7.191214] x20: ffff0001031cc080 x19: ffff0001031cc080 x18: ffffffffffffffff
[ 7.191216] x17: 7461636966697265 x16: ffffa07c4e4c4260 x15: 2f3a737074746820
[ 7.191218] x14: 74612074726f7065 x13: 2f3a737074746820 x12: 74612074726f7065
[ 7.191220] x11: 77656e2f73657573 x10: 0000000000000000 x9 : ffffa07c4e4c4168
[ 7.191223] x8 : 0000000000000000 x7 : ffffffffffffffff x6 : ffff8000809b3550
[ 7.191225] x5 : 0000000000000010 x4 : ffff8000809b3550 x3 : 0000000000000000
[ 7.191227] x2 : 00000000000fffff x1 : 0000000000000000 x0 : 00000000ffffffff
[ 7.191230] Call trace:
[ 7.191231] xe_gt_tlb_invalidation_reset+0x68/0xf0 [xe]
[ 7.191327] xe_gt_declare_wedged+0x2c/0x48 [xe]
[ 7.191422] xe_device_declare_wedged+0xec/0x160 [xe]
[ 7.191520] __xe_guc_upload+0x48c/0x650 [xe]
[ 7.191612] xe_guc_min_load_for_hwconfig+0x4c/0xd8 [xe]
[ 7.191720] xe_uc_init_hwconfig+0x34/0x48 [xe]
[ 7.191820] xe_gt_init_hwconfig+0x7c/0xd8 [xe]
[ 7.191915] xe_device_probe+0x204/0x570 [xe]
[ 7.192009] xe_pci_probe+0x69c/0xa20 [xe]
[ 7.192102] local_pci_probe+0x48/0xb8
[ 7.192109] pci_device_probe+0xc0/0x1c0
[ 7.192113] really_probe+0xc4/0x2a8
[ 7.192116] __driver_probe_device+0x80/0x140
[ 7.192117] driver_probe_device+0x44/0x170
[ 7.192119] __driver_attach+0x9c/0x1b0
[ 7.192121] bus_for_each_dev+0x80/0xe8
[ 7.192126] driver_attach+0x2c/0x40
[ 7.192128] bus_add_driver+0xec/0x218
[ 7.192130] driver_register+0x68/0x138
[ 7.192132] __pci_register_driver+0x54/0x68
[ 7.192135] xe_register_pci_driver+0x30/0x48 [xe]
[ 7.192229] xe_init+0x3c/0xb8 [xe]
[ 7.201506] vc4-drm axi:gpu: [drm] fb0: vc4drmfb frame buffer device
[ 7.204477] do_one_initcall+0x60/0x2a0
[ 7.244144] Bluetooth: hci0: BCM: chip id 107
[ 7.245292] do_init_module+0x68/0x250
[ 7.245496] Bluetooth: hci0: BCM: features 0x2f
[ 7.252467] load_module+0x1fe0/0x20c8
[ 7.252469] __do_sys_init_module+0x180/0x1f0
[ 7.252471] __arm64_sys_init_module+0x24/0x38
[ 7.253517] Bluetooth: hci0: BCM4345C0
[ 7.259644] invoke_syscall+0x50/0x120
[ 7.259655] Bluetooth: hci0: BCM4345C0 (003.001.025) build 0000
[ 7.259662] el0_svc_common.constprop.0+0xc8/0xf0
[ 7.469972] do_el0_svc+0x24/0x38
[ 7.469975] el0_svc+0x30/0xd0
[ 7.476921] el0t_64_sync_handler+0x100/0x130
[ 7.476923] el0t_64_sync+0x190/0x198
[ 7.485513] Code: aa0303e1 71000400 1a821000 b9029680 (f85b8422)
[ 7.485515] ---[ end trace 0000000000000000 ]---
[ 7.485516] note: (udev-worker)[331] exited with irqs disabled
[ 7.485549] note: (udev-worker)[331] exited with preempt_count 1
After some digging, I came up with the following workaround:
--- drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c.orig 2024-12-18 12:29:45.344716928 +0100
+++ drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c 2024-12-18 12:30:05.320006026 +0100
@@ -151,9 +151,11 @@
WRITE_ONCE(gt->tlb_invalidation.seqno_recv, pending_seqno);
- list_for_each_entry_safe(fence, next,
- &gt->tlb_invalidation.pending_fences, link)
- invalidation_fence_signal(gt_to_xe(gt), fence);
+ if(pending_seqno != -1){
+ list_for_each_entry_safe(fence, next,
+ &gt->tlb_invalidation.pending_fences, link)
+ invalidation_fence_signal(gt_to_xe(gt), fence);
+ }
spin_unlock_irq(&gt->tlb_invalidation.pending_lock);
mutex_unlock(&gt->uc.guc.ct.lock);
Because of the NULL pointer dereference, the onboard GPU is not able to show any output. With my workaround applied, the driver still marks the device as wedged (probably due to #3594), but the NULL pointer dereference is gone. Looking at the code (https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/fba0f039affdd0c8767f24e41d5dbef49addea78/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c#L158), I assume that my workaround works because gt->tlb_invalidation.seqno is still 0 at this point, so the else-branch is taken and pending_seqno becomes -1, while the pending_fences list has no entries.
This is probably not a complete fix, because pending_fences is presumably supposed to have entries, but it avoids the NULL pointer dereference in the kernel when the list_for_each_entry_safe loop runs.
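To make the reasoning above easier to follow, here is a minimal userspace sketch in C. It is not the driver code: struct list_head is a simplified stand-in for the kernel type, SEQNO_MAX is an assumed value, and pending_seqno_for, init_list_head and count_entries_guarded are hypothetical helpers made up for illustration. It also assumes, as a hypothesis this report does not confirm, that the list head is still zero-filled when the reset path runs early during probe.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Assumed wrap-around limit; the real constant is defined by the xe driver. */
#define SEQNO_MAX 0x100000

/* Mirrors the if/else seqno arithmetic in the linked reset function. */
static int pending_seqno_for(int seqno)
{
	if (seqno == 1)
		return SEQNO_MAX - 1;
	return seqno - 1; /* seqno == 0 (never initialized) gives -1 */
}

/* An initialized, empty list head points at itself. */
static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* A zero-filled head has NULL links and must not be walked. */
static bool list_head_is_initialized(const struct list_head *head)
{
	return head->next != NULL && head->prev != NULL;
}

/*
 * Counts the entries on the list. Returns -1 instead of walking a
 * zero-filled head, which is where an unguarded loop would follow
 * a NULL next pointer.
 */
static int count_entries_guarded(const struct list_head *head)
{
	int n = 0;

	if (!list_head_is_initialized(head))
		return -1;
	for (const struct list_head *pos = head->next; pos != head; pos = pos->next)
		n++;
	return n;
}
```

In this sketch, pending_seqno_for(0) returns -1, which is the value the workaround tests for. count_entries_guarded returns -1 for a zero-filled head and 0 once init_list_head has been called on it. If the hypothesis holds, checking the list state directly would be a more targeted guard than checking pending_seqno, but that is a guess and not a reviewed fix.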