'BUG: kernel NULL pointer dereference' when loading Xe KMD without firmware blobs
I know that Xe KMD requires firmware to work, but when firmware is not available the driver load should fail cleanly, without causing a kernel NULL pointer dereference
or the other errors it currently triggers.
- Make sure it doesn't crash when loading without GuC
- Make sure it doesn't crash when loading without HuC
- Make sure it doesn't crash when loading without GSC
Activity
- José Roberto de Souza added Improvement label
- Developer
@zehortigoza, do you have a log? Most of the kernel crashes I have seen are coming from memory management code not coping with the fact that earlier parts of the driver init have correctly returned error codes.
- Author Developer
I don't have it anymore; the SR-IOV refactors have probably changed the dmesg output.
- José Roberto de Souza closed
- José Roberto de Souza reopened
- Author Developer
Just reproduced this kernel crash with drm-tip from 2024-07-23. LNL had GuC firmware but did not have HuC firmware:
[ 930.052107] xe 0000:00:02.0: [drm] Found LUNARLAKE (device ID 64a0) display version 20.00 [ 930.052118] xe 0000:00:02.0: [drm:xe_pci_probe [xe]] LUNARLAKE 64a0:0004 dgfx:0 gfx:Xe2_LPG / Xe2_HPG (20.04) media:Xe2_LPM / Xe2_HPM (20.00) display:yes dma_m_s:46 tc:1 gscfi:0 cscfi:0 [ 930.052220] xe 0000:00:02.0: [drm:xe_pci_probe [xe]] Stepping = (G:B0, M:B0, D:**, B:**) [ 930.052306] xe 0000:00:02.0: [drm:xe_pci_probe [xe]] SR-IOV support: no (mode: none) [ 930.054598] xe 0000:00:02.0: [drm] Using GuC firmware from intel-ci/xe/lnl_guc_70.bin version 70.26.5 [ 930.057201] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 0] = 0x0024efd3 [ 930.057288] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 1] = 0x00000000 [ 930.057364] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 2] = 0x00000000 [ 930.057433] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 3] = 0x00000003 [ 930.057497] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 4] = 0x000004ca [ 930.057556] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 5] = 0x64a00004 [ 930.057613] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 6] = 0x00000000 [ 930.057665] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 7] = 0x00000000 [ 930.057719] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 8] = 0x00000000 [ 930.057777] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 9] = 0x00000000 [ 930.057847] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[10] = 0x00000000 [ 930.057897] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[11] = 0x00000000 [ 930.057937] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[12] = 0x00000000 [ 930.057975] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[13] = 0x00000000 [ 930.058014] xe 0000:00:02.0: [drm:xe_wopcm_init [xe]] WOPCM: 2048K [ 930.058110] xe 0000:00:02.0: [drm:xe_wopcm_init [xe]] GuC WOPCM is already locked [6144K, 
832K) [ 930.059369] xe 0000:00:02.0: [drm:__xe_guc_upload.isra.0 [xe]] GT0: load still in progress, timeouts = 0, freq = 1950MHz (req 1950MHz), status = 0x00000064 [0x32/00] [ 930.059426] xe 0000:00:02.0: [drm:__xe_guc_upload.isra.0 [xe]] GT0: load still in progress, timeouts = 0, freq = 1950MHz (req 1950MHz), status = 0x00000072 [0x39/00] [ 930.064382] xe 0000:00:02.0: [drm:__xe_guc_upload.isra.0 [xe]] GT0: load still in progress, timeouts = 0, freq = 1950MHz (req 1950MHz), status = 0x80000534 [0x1A/05] [ 930.067022] xe 0000:00:02.0: [drm:__xe_guc_upload.isra.0 [xe]] GT0: init took 7ms, freq = 1950MHz (req = 1950MHz), before = 1950MHz, status = 0x8002F034, timeouts = 0 [ 930.067597] xe 0000:00:02.0: [drm:xe_guc_ct_enable [xe]] GT0: GuC CT communication channel enabled [ 930.067647] xe 0000:00:02.0: [drm:xe_guc_ct_enable [xe]] GT0: GuC CT safe-mode enabled [ 930.067697] xe 0000:00:02.0: [drm:xe_gt_topology_init [xe]] GT topology dss mask (geometry): 00000000,00000000,000000ff [ 930.067745] xe 0000:00:02.0: [drm:xe_gt_topology_init [xe]] GT topology dss mask (compute): 00000000,00000000,000000ff [ 930.067804] xe 0000:00:02.0: [drm:xe_gt_topology_init [xe]] GT topology EU mask per DSS: 0000ffff [ 930.067849] xe 0000:00:02.0: [drm:xe_gt_topology_init [xe]] GT topology L3 bank mask: 00000000,000000ff [ 930.069477] xe 0000:00:02.0: [drm] Using GuC firmware from intel-ci/xe/lnl_guc_70.bin version 70.26.5 [ 930.071759] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 0] = 0x00b34fd3 [ 930.071810] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 1] = 0x00000000 [ 930.071847] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 2] = 0x00000000 [ 930.071883] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 3] = 0x00000003 [ 930.071918] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 4] = 0x00001696 [ 930.071954] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 5] = 0x64a00004 [ 930.071989] xe 0000:00:02.0: 
[drm:guc_print_params [xe]] GT1: GuC param[ 6] = 0x00000000 [ 930.072023] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 7] = 0x00000000 [ 930.072057] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 8] = 0x00000000 [ 930.072091] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 9] = 0x00000000 [ 930.072126] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[10] = 0x00000000 [ 930.072162] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[11] = 0x00000000 [ 930.072196] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[12] = 0x00000000 [ 930.072231] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[13] = 0x00000000 [ 930.072424] xe 0000:00:02.0: Direct firmware load for intel-ci/xe/lnl_huc_gsc_9.4.6.bin failed with error -2 [ 930.072439] xe 0000:00:02.0: [drm] HuC firmware intel-ci/xe/lnl_huc_gsc_9.4.6.bin: fetch failed with error -2 [ 930.072441] xe 0000:00:02.0: [drm] HuC firmware(s) can be downloaded from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git [ 930.072444] xe 0000:00:02.0: [drm] *ERROR* GT1: HuC: initialization failed: -ENOENT [ 930.080197] xe 0000:00:02.0: [drm] *ERROR* GT1: Failed to initialize uC (-ENOENT) [ 930.087777] xe 0000:00:02.0: probe with driver xe failed with error -2 [ 930.094467] xe 0000:00:02.0: [drm:xe_guc_ct_disable [xe]] GT0: GuC CT safe-mode disabled [ 930.097148] BUG: unable to handle page fault for address: 000000000000a188 [ 930.104102] #PF: supervisor write access in kernel mode [ 930.109388] #PF: error_code(0x0002) - not-present page [ 930.114592] PGD 0 P4D 0 [ 930.117168] Oops: Oops: 0002 [#1] PREEMPT SMP [ 930.121586] CPU: 1 PID: 6036 Comm: modprobe Not tainted 6.10.0-rc7-zeh-xe+ #1382 [ 930.129070] Hardware name: Intel Corporation Lunar Lake Client Platform/LNL-M LP5 RVP1, BIOS LNLMFWI1.R00.3220.D92.2407090817 07/09/2024 [ 930.141444] RIP: 0010:xe_mmio_write32+0x6d/0x2a0 [xe] [ 930.146634] Code: 05 e8 e0 5a e2 0f 82 cb 00 00 
00 41 89 ee 41 c1 ee 18 f7 c5 00 00 00 40 0f 84 83 00 00 00 45 84 f6 78 78 49 8b 47 28 4c 01 e0 <44> 89 28 48 83 c4 58 5b 5d 41 5c 41 5d 41 5e 41 5f c3 65 8b 05 b6 [ 930.165555] RSP: 0018:ffffc9000160f880 EFLAGS: 00010006 [ 930.170834] RAX: 000000000000a188 RBX: ffff888244da0028 RCX: ffff88811ac65d90 [ 930.178046] RDX: 0000000000010001 RSI: ffffffff823f2ebd RDI: ffffffff823f6e4c [ 930.185259] RBP: 000000000000a188 R08: 00000000000005ec R09: 0000000000000001 [ 930.192471] R10: ffff888244d88000 R11: ffff888244d88000 R12: 000000000000a188 [ 930.199677] R13: 0000000000010001 R14: 0000000000000000 R15: ffff888244d8a308 [ 930.206899] FS: 00007fcec0d0cc40(0000) GS:ffff88885e480000(0000) knlGS:0000000000000000 [ 930.215065] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 930.220883] CR2: 000000000000a188 CR3: 00000002a26c0002 CR4: 0000000000770ef0 [ 930.228108] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 930.235321] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400 [ 930.242536] PKRU: 55555554 [ 930.245288] Call Trace: [ 930.247776] <TASK> [ 930.249921] ? __die_body.cold+0x19/0x21 [ 930.253896] ? page_fault_oops+0x9d/0x230 [ 930.257964] ? do_user_addr_fault+0x5f/0x700 [ 930.262290] ? stack_trace_save+0x45/0x70 [ 930.266359] ? exc_page_fault+0x68/0x210 [ 930.270348] ? asm_exc_page_fault+0x22/0x30 [ 930.274578] ? xe_mmio_write32+0x6d/0x2a0 [xe] [ 930.279151] xe_force_wake_get+0xc3/0x2c0 [xe] [ 930.283726] xe_gt_tlb_invalidation_ggtt+0x91/0x2b0 [xe] [ 930.289161] xe_ggtt_invalidate+0x19/0x40 [xe] [ 930.293719] xe_ggtt_remove_node+0xda/0xf0 [xe] [ 930.298357] xe_ttm_bo_destroy+0x117/0x210 [xe] [ 930.303011] drm_managed_release+0x99/0x150 [ 930.307250] devm_drm_dev_init_release+0x45/0x60 [ 930.311925] release_nodes+0x2b/0xf0 [ 930.315556] devres_release_all+0x87/0xc0 [ 930.319623] device_unbind_cleanup+0x9/0x70 [ 930.323875] really_probe+0x20d/0x320 [ 930.327598] ? pm_runtime_barrier+0x4b/0x80 [ 930.331839] ? 
__device_attach_driver+0xf0/0xf0 [ 930.336433] __driver_probe_device+0x73/0x110 [ 930.340846] driver_probe_device+0x1a/0x90 [ 930.344996] __driver_attach+0xaa/0x1b0 [ 930.348886] bus_for_each_dev+0x75/0xc0 [ 930.352779] bus_add_driver+0x108/0x1f0 [ 930.356666] driver_register+0x69/0xb0 [ 930.360478] xe_init+0x11/0x40 [xe] [ 930.364070] ? xe_hw_fence_module_init+0x30/0x30 [xe] [ 930.369239] do_one_initcall+0x56/0x280 [ 930.373133] ? kmalloc_trace_noprof+0x24f/0x300 [ 930.377721] do_init_module+0x5b/0x1f0 [ 930.381523] init_module_from_file+0x81/0xc0 [ 930.385847] idempotent_init_module+0x10c/0x2a0 [ 930.390436] __x64_sys_finit_module+0x55/0xb0 [ 930.394856] do_syscall_64+0x64/0x130 [ 930.398575] entry_SYSCALL_64_after_hwframe+0x4b/0x53 [ 930.403692] RIP: 0033:0x7fcec051e88d [ 930.407318] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 b5 0f 00 f7 d8 64 89 01 48 [ 930.426245] RSP: 002b:00007ffcd14beef8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 [ 930.433894] RAX: ffffffffffffffda RBX: 000055d8222bca30 RCX: 00007fcec051e88d [ 930.441099] RDX: 0000000000000000 RSI: 000055d8007f1cd2 RDI: 000000000000000d [ 930.448318] RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000002 [ 930.455532] R10: 000000000000000d R11: 0000000000000246 R12: 000055d8007f1cd2 [ 930.462751] R13: 000055d8222bcb60 R14: 000055d8222bc590 R15: 000055d8222c64b0 [ 930.469975] </TASK> [ 930.472200] Modules linked in: xe(+) drm_ttm_helper gpu_sched drm_suballoc_helper drm_gpuvm drm_exec i2c_algo_bit drm_buddy video drm_display_helper ttm mei_gsc_proxy wmi_bmof x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core kvm_intel mei_me snd_pcm mei e1000e ptp pps_core intel_pmc_core intel_vsec pmt_telemetry wmi pmt_class fuse [last unloaded: ttm] [ 930.512215] CR2: 000000000000a188 [ 
930.515580] ---[ end trace 0000000000000000 ]--- [ 930.538579] RIP: 0010:xe_mmio_write32+0x6d/0x2a0 [xe] [ 930.543770] Code: 05 e8 e0 5a e2 0f 82 cb 00 00 00 41 89 ee 41 c1 ee 18 f7 c5 00 00 00 40 0f 84 83 00 00 00 45 84 f6 78 78 49 8b 47 28 4c 01 e0 <44> 89 28 48 83 c4 58 5b 5d 41 5c 41 5d 41 5e 41 5f c3 65 8b 05 b6 [ 930.562700] RSP: 0018:ffffc9000160f880 EFLAGS: 00010006 [ 930.567986] RAX: 000000000000a188 RBX: ffff888244da0028 RCX: ffff88811ac65d90 [ 930.575199] RDX: 0000000000010001 RSI: ffffffff823f2ebd RDI: ffffffff823f6e4c [ 930.582420] RBP: 000000000000a188 R08: 00000000000005ec R09: 0000000000000001 [ 930.589634] R10: ffff888244d88000 R11: ffff888244d88000 R12: 000000000000a188 [ 930.596845] R13: 0000000000010001 R14: 0000000000000000 R15: ffff888244d8a308 [ 930.604065] FS: 00007fcec0d0cc40(0000) GS:ffff88885e480000(0000) knlGS:0000000000000000 [ 930.612244] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 930.618053] CR2: 000000000000a188 CR3: 00000002a26c0002 CR4: 0000000000770ef0 [ 930.625266] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 930.632480] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400 [ 930.639692] PKRU: 55555554 [ 930.642452] note: modprobe[6036] exited with irqs disabled [ 930.648031] note: modprobe[6036] exited with preempt_count 1
- Author Developer
Then, with HuC added but GSC missing:
[ 202.184470] xe 0000:00:02.0: [drm] Found LUNARLAKE (device ID 64a0) display version 20.00 [ 202.184484] xe 0000:00:02.0: [drm:xe_pci_probe [xe]] LUNARLAKE 64a0:0004 dgfx:0 gfx:Xe2_LPG / Xe2_HPG (20.04) media:Xe2_LPM / Xe2_HPM (20.00) display:yes dma_m_s:46 tc:1 gscfi:0 cscfi:0 [ 202.184599] xe 0000:00:02.0: [drm:xe_pci_probe [xe]] Stepping = (G:B0, M:B0, D:**, B:**) [ 202.184671] xe 0000:00:02.0: [drm:xe_pci_probe [xe]] SR-IOV support: no (mode: none) [ 202.188611] xe 0000:00:02.0: [drm] Using GuC firmware from intel-ci/xe/lnl_guc_70.bin version 70.26.5 [ 202.191465] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 0] = 0x0024efd3 [ 202.191568] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 1] = 0x00000000 [ 202.191635] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 2] = 0x00000000 [ 202.191687] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 3] = 0x00000003 [ 202.191747] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 4] = 0x000004ca [ 202.191794] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 5] = 0x64a00004 [ 202.191840] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 6] = 0x00000000 [ 202.191886] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 7] = 0x00000000 [ 202.191931] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 8] = 0x00000000 [ 202.191977] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[ 9] = 0x00000000 [ 202.192022] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[10] = 0x00000000 [ 202.192068] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[11] = 0x00000000 [ 202.192113] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[12] = 0x00000000 [ 202.192159] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT0: GuC param[13] = 0x00000000 [ 202.192206] xe 0000:00:02.0: [drm:xe_wopcm_init [xe]] WOPCM: 2048K [ 202.192312] xe 0000:00:02.0: [drm:xe_wopcm_init [xe]] GuC WOPCM is already locked [6144K, 
832K) [ 202.193775] xe 0000:00:02.0: [drm:__xe_guc_upload.isra.0 [xe]] GT0: load still in progress, timeouts = 0, freq = 1950MHz (req 1950MHz), status = 0x00000072 [0x39/00] [ 202.199021] xe 0000:00:02.0: [drm:__xe_guc_upload.isra.0 [xe]] GT0: load still in progress, timeouts = 0, freq = 1950MHz (req 1950MHz), status = 0x80000534 [0x1A/05] [ 202.201675] xe 0000:00:02.0: [drm:__xe_guc_upload.isra.0 [xe]] GT0: init took 7ms, freq = 1950MHz (req = 1950MHz), before = 1950MHz, status = 0x8002F034, timeouts = 0 [ 202.202413] xe 0000:00:02.0: [drm:xe_guc_ct_enable [xe]] GT0: GuC CT communication channel enabled [ 202.202492] xe 0000:00:02.0: [drm:xe_guc_ct_enable [xe]] GT0: GuC CT safe-mode enabled [ 202.202556] xe 0000:00:02.0: [drm:xe_gt_topology_init [xe]] GT topology dss mask (geometry): 00000000,00000000,000000ff [ 202.202613] xe 0000:00:02.0: [drm:xe_gt_topology_init [xe]] GT topology dss mask (compute): 00000000,00000000,000000ff [ 202.202662] xe 0000:00:02.0: [drm:xe_gt_topology_init [xe]] GT topology EU mask per DSS: 0000ffff [ 202.202709] xe 0000:00:02.0: [drm:xe_gt_topology_init [xe]] GT topology L3 bank mask: 00000000,000000ff [ 202.204668] xe 0000:00:02.0: [drm] Using GuC firmware from intel-ci/xe/lnl_guc_70.bin version 70.26.5 [ 202.207174] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 0] = 0x00b34fd3 [ 202.207243] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 1] = 0x00000000 [ 202.207295] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 2] = 0x00000000 [ 202.207343] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 3] = 0x00000003 [ 202.207389] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 4] = 0x00001696 [ 202.207436] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 5] = 0x64a00004 [ 202.207482] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 6] = 0x00000000 [ 202.207528] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 7] = 0x00000000 [ 202.207575] xe 
0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 8] = 0x00000000 [ 202.207620] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[ 9] = 0x00000000 [ 202.207666] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[10] = 0x00000000 [ 202.207713] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[11] = 0x00000000 [ 202.207769] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[12] = 0x00000000 [ 202.207816] xe 0000:00:02.0: [drm:guc_print_params [xe]] GT1: GuC param[13] = 0x00000000 [ 202.212750] xe 0000:00:02.0: [drm] Using HuC firmware from intel-ci/xe/lnl_huc_gsc_9.4.6.bin version 9.4.6 [ 202.213519] xe 0000:00:02.0: Direct firmware load for intel-ci/xe/lnl_gsc_1.bin failed with error -2 [ 202.213544] xe 0000:00:02.0: [drm] GSC firmware intel-ci/xe/lnl_gsc_1.bin: fetch failed with error -2 [ 202.213547] xe 0000:00:02.0: [drm] GSC firmware(s) can be downloaded from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git [ 202.213550] xe 0000:00:02.0: [drm] *ERROR* GT1: GSC init failed with -2 [ 202.220275] xe 0000:00:02.0: [drm] *ERROR* GT1: Failed to initialize uC (-ENOENT) [ 202.227854] xe 0000:00:02.0: probe with driver xe failed with error -2 [ 202.234544] xe 0000:00:02.0: [drm:xe_guc_ct_disable [xe]] GT0: GuC CT safe-mode disabled [ 202.236907] BUG: unable to handle page fault for address: 0000000000380d8c [ 202.243850] #PF: supervisor write access in kernel mode [ 202.249136] #PF: error_code(0x0002) - not-present page [ 202.254328] PGD 0 P4D 0 [ 202.256894] Oops: Oops: 0002 [#1] PREEMPT SMP [ 202.261295] CPU: 4 PID: 1107 Comm: modprobe Not tainted 6.10.0-rc7-zeh-xe+ #1382 [ 202.268754] Hardware name: Intel Corporation Lunar Lake Client Platform/LNL-M LP5 RVP1, BIOS LNLMFWI1.R00.3220.D92.2407090817 07/09/2024 [ 202.281108] RIP: 0010:xe_mmio_write32+0x6d/0x2a0 [xe] [ 202.286303] Code: 05 e8 00 5c e2 0f 82 cb 00 00 00 41 89 ee 41 c1 ee 18 f7 c5 00 00 00 40 0f 84 83 00 00 00 45 84 f6 78 78 49 8b 47 28 
4c 01 e0 <44> 89 28 48 83 c4 58 5b 5d 41 5c 41 5d 41 5e 41 5f c3 65 8b 05 b6 [ 202.305212] RSP: 0018:ffffc900018e7900 EFLAGS: 00010202 [ 202.310488] RAX: 0000000000380d8c RBX: ffff88811e588028 RCX: 0000000000000000 [ 202.317687] RDX: 0000000000000000 RSI: ffffffff823f2ebd RDI: ffffffff823f6e4c [ 202.324885] RBP: 0000000000000d8c R08: 00000000000005f3 R09: 00000000000110d6 [ 202.332086] R10: 0000000000000001 R11: ffff88811e660000 R12: 0000000000380d8c [ 202.339290] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88811e662308 [ 202.346496] FS: 00007f9ac40d2c40(0000) GS:ffff88885e600000(0000) knlGS:0000000000000000 [ 202.354659] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 202.360463] CR2: 0000000000380d8c CR3: 000000011890a006 CR4: 0000000000770ef0 [ 202.367665] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 202.374870] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400 [ 202.382077] PKRU: 55555554 [ 202.384829] Call Trace: [ 202.387311] <TASK> [ 202.389439] ? __die_body.cold+0x19/0x21 [ 202.393408] ? page_fault_oops+0x9d/0x230 [ 202.397461] ? do_user_addr_fault+0x5f/0x700 [ 202.401777] ? exc_page_fault+0x68/0x210 [ 202.405742] ? asm_exc_page_fault+0x22/0x30 [ 202.409969] ? xe_mmio_write32+0x6d/0x2a0 [xe] [ 202.414526] xe_ggtt_set_pte_and_flush+0x90/0xa0 [xe] [ 202.419701] xe_ggtt_clear+0x74/0x210 [xe] [ 202.423898] ? xe_ggtt_remove_node+0x92/0xf0 [xe] [ 202.428696] ? lockdep_hardirqs_on+0xba/0x130 [ 202.433098] ? _raw_spin_unlock_irqrestore+0x37/0x70 [ 202.438113] xe_ggtt_remove_node+0xa2/0xf0 [xe] [ 202.442734] xe_ttm_bo_destroy+0x117/0x210 [xe] [ 202.447373] drm_managed_release+0x99/0x150 [ 202.451604] devm_drm_dev_init_release+0x45/0x60 [ 202.456268] release_nodes+0x2b/0xf0 [ 202.459886] devres_release_all+0x87/0xc0 [ 202.463941] device_unbind_cleanup+0x9/0x70 [ 202.468166] really_probe+0x20d/0x320 [ 202.471870] ? pm_runtime_barrier+0x4b/0x80 [ 202.476102] ? 
__device_attach_driver+0xf0/0xf0 [ 202.480679] __driver_probe_device+0x73/0x110 [ 202.485087] driver_probe_device+0x1a/0x90 [ 202.489232] __driver_attach+0xaa/0x1b0 [ 202.493116] bus_for_each_dev+0x75/0xc0 [ 202.496993] bus_add_driver+0x108/0x1f0 [ 202.500877] driver_register+0x69/0xb0 [ 202.504672] xe_init+0x11/0x40 [xe] [ 202.508253] ? xe_hw_fence_module_init+0x30/0x30 [xe] [ 202.513399] do_one_initcall+0x56/0x280 [ 202.517279] ? kmalloc_trace_noprof+0x24f/0x300 [ 202.521857] do_init_module+0x5b/0x1f0 [ 202.525653] init_module_from_file+0x81/0xc0 [ 202.529972] idempotent_init_module+0x10c/0x2a0 [ 202.534552] __x64_sys_finit_module+0x55/0xb0 [ 202.538953] do_syscall_64+0x64/0x130 [ 202.542660] entry_SYSCALL_64_after_hwframe+0x4b/0x53 [ 202.547759] RIP: 0033:0x7f9ac391e88d [ 202.551374] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 b5 0f 00 f7 d8 64 89 01 48 [ 202.570283] RSP: 002b:00007fffa5556f48 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 [ 202.577926] RAX: ffffffffffffffda RBX: 0000562ea58d7a30 RCX: 00007f9ac391e88d [ 202.585128] RDX: 0000000000000000 RSI: 0000562e7ae9ccd2 RDI: 000000000000000c [ 202.592332] RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000002 [ 202.599534] R10: 000000000000000c R11: 0000000000000246 R12: 0000562e7ae9ccd2 [ 202.606741] R13: 0000562ea58d7b60 R14: 0000562ea58d7590 R15: 0000562ea58e1470 [ 202.613947] </TASK> [ 202.616163] Modules linked in: xe(+) drm_ttm_helper gpu_sched drm_suballoc_helper drm_gpuvm drm_exec i2c_algo_bit drm_buddy drm_display_helper ttm x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul mei_gsc_proxy ghash_clmulni_intel wmi_bmof snd_hda_intel snd_intel_dspcfg kvm_intel snd_hda_codec snd_hwdep e1000e snd_hda_core mei_me ptp snd_pcm mei pps_core video intel_pmc_core intel_vsec pmt_telemetry wmi pmt_class fuse [ 202.654332] CR2: 0000000000380d8c [ 202.657683] ---[ end 
trace 0000000000000000 ]--- [ 202.706338] RIP: 0010:xe_mmio_write32+0x6d/0x2a0 [xe] [ 202.711539] Code: 05 e8 00 5c e2 0f 82 cb 00 00 00 41 89 ee 41 c1 ee 18 f7 c5 00 00 00 40 0f 84 83 00 00 00 45 84 f6 78 78 49 8b 47 28 4c 01 e0 <44> 89 28 48 83 c4 58 5b 5d 41 5c 41 5d 41 5e 41 5f c3 65 8b 05 b6 [ 202.730448] RSP: 0018:ffffc900018e7900 EFLAGS: 00010202 [ 202.735727] RAX: 0000000000380d8c RBX: ffff88811e588028 RCX: 0000000000000000 [ 202.742931] RDX: 0000000000000000 RSI: ffffffff823f2ebd RDI: ffffffff823f6e4c [ 202.750128] RBP: 0000000000000d8c R08: 00000000000005f3 R09: 00000000000110d6 [ 202.757327] R10: 0000000000000001 R11: ffff88811e660000 R12: 0000000000380d8c [ 202.764534] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88811e662308 [ 202.771731] FS: 00007f9ac40d2c40(0000) GS:ffff88885e600000(0000) knlGS:0000000000000000 [ 202.779893] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 202.785696] CR2: 0000000000380d8c CR3: 000000011890a006 CR4: 0000000000770ef0 [ 202.792898] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 202.800097] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400 [ 202.807301] PKRU: 55555554 [ 202.810042] note: modprobe[1107] exited with irqs disabled [ 202.816168] modprobe (1107) used greatest stack depth: 10304 bytes left
- Developer
It looks like the GGTT cleanup code is trying to access state that was not initialized, or was already torn down, during the early abort. Considering that the stack trace has xe_force_wake_get and that 0xa188 is the GT forcewake register, it looks like the mmio.regs pointer is NULL.
The second log has a different path, but it's still failing on a register access; 0x380d8c is GMD_ID on the media tile, which is accessed from xe_ggtt_set_pte_and_flush() -> ggtt_update_access_counter().
No idea how the regs pointer ends up being NULL before the GGTT cleanup is completed. I wonder if it is because the mmio cleanup is done via a devm callback while the GGTT part uses drmm, so different lists.
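The ordering suspected above can be reproduced in a small user-space model. This is only a sketch under stated assumptions: every name in it (toy_device, toy_failed_probe, the action lists) is hypothetical and not a kernel or xe symbol. It assumes that device-managed actions run in reverse order of registration and that the DRM-managed list is drained from a device-managed action registered when the device was allocated, which matches the devm_drm_dev_init_release -> drm_managed_release frames in the traces. With the buffer cleanup on the drmm list, the MMIO teardown runs first and the cleanup then writes through a NULL regs pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Toy user-space model; none of these names are real kernel or xe symbols. */

#define MAX_ACTIONS 8

typedef void (*release_fn)(void *ctx);

struct action_list {
    release_fn fn[MAX_ACTIONS];
    void *ctx[MAX_ACTIONS];
    size_t count;
};

struct toy_device {
    int *regs;                /* stands in for the mapped MMIO base */
    int regs_storage;         /* fake register backing store */
    struct action_list devm;  /* device-managed release actions */
    struct action_list drmm;  /* DRM-managed release actions */
    int bad_accesses;         /* writes attempted while regs == NULL */
};

static void add_action(struct action_list *l, release_fn fn, void *ctx)
{
    assert(l->count < MAX_ACTIONS);
    l->fn[l->count] = fn;
    l->ctx[l->count] = ctx;
    l->count++;
}

/* Run actions in reverse order of registration (last added, first run). */
static void release_all(struct action_list *l)
{
    while (l->count > 0) {
        l->count--;
        l->fn[l->count](l->ctx[l->count]);
    }
}

static void toy_mmio_write(struct toy_device *dev, int val)
{
    if (!dev->regs) {         /* the real driver would page-fault here */
        dev->bad_accesses++;
        return;
    }
    *dev->regs = val;
}

/* devm action registered at allocation time: drains the drmm list. */
static void toy_drm_dev_release(void *ctx)
{
    struct toy_device *dev = ctx;

    release_all(&dev->drmm);
}

/* devm action: tears down the register mapping. */
static void toy_mmio_fini(void *ctx)
{
    struct toy_device *dev = ctx;

    dev->regs = NULL;
}

/* Buffer cleanup that still needs a register write (like a GGTT flush). */
static void toy_bo_destroy(void *ctx)
{
    toy_mmio_write(ctx, 0);
}

/*
 * Simulate a probe that registers everything and then aborts.
 * Returns how many register writes hit a NULL regs pointer.
 */
int toy_failed_probe(int bo_uses_devm)
{
    struct toy_device dev = {0};

    add_action(&dev.devm, toy_drm_dev_release, &dev); /* at allocation */
    dev.regs = &dev.regs_storage;
    add_action(&dev.devm, toy_mmio_fini, &dev);
    add_action(bo_uses_devm ? &dev.devm : &dev.drmm, toy_bo_destroy, &dev);

    /* Probe aborts here (e.g. a firmware blob is missing): unwind. */
    release_all(&dev.devm);
    return dev.bad_accesses;
}
```

In this model, toy_failed_probe(0) counts one write through the NULL pointer, while toy_failed_probe(1), which puts the buffer cleanup on the same list as the MMIO teardown, counts none because the cleanup then runs before the mapping is dropped.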
- Daniele Ceraolo Spurio mentioned in issue #2410 (closed)
- Daniele Ceraolo Spurio mentioned in issue #2439 (closed)
- Owner
@zehortigoza this originally was about GuC and some people are confused now with the bug reports. I just tried on a DG2 without GuC and it doesn't explode, just fails the probe.
Could you update the issue description to something like "Bug when booting without firmware blob" and mention that GuC was fixed (if it is for you) and that we have HuC and GSC cleanup missing? Another alternative is to have separate issues, but IMO we can keep it simple and track them here.
- Author Developer
Renamed.
Two weeks ago I reopened the issue and added the logs of the crashes that occur when HuC or GSC is not available.
- Owner
Thanks. I also added a checklist to the description so we can track progress there (and so people don't draw the wrong conclusion when they haven't read all the comments).
IMO this should be considered a BUG, too. So I added the label.
- Developer
Note that this is not really a firmware specific bug and never has been. It has always been about the rest of the Xe driver, memory management in particular, not coping with an aborted driver load. It is simply that missing firmware files are the easiest and most obvious way to cause an aborted load.
- José Roberto de Souza changed title from 'BUG: kernel NULL pointer dereference' when loading Xe KMD without GuC firmware to 'BUG: kernel NULL pointer dereference' when loading Xe KMD without firmware blobs
- José Roberto de Souza changed the description
- Lucas De Marchi changed the description
- Lucas De Marchi marked the checklist item Make sure it doesn't crash when loading without GuC as completed
- Lucas De Marchi added BUG label
- Daniele Ceraolo Spurio closed with commit 8d3a2d3d
- Lucas De Marchi marked the checklist item Make sure it doesn't crash when loading without HuC as completed
- Lucas De Marchi marked the checklist item Make sure it doesn't crash when loading without GSC as completed
- Owner
@zehortigoza Just tested on LNL without GuC, without HuC and then without GSC, and confirmed it doesn't explode anymore. The commit is 8d3a2d3d ("drm/xe: use devm instead of drmm for managed bo"). There are more fixes in the pipeline for other failures during probe (https://patchwork.freedesktop.org/series/137114/), but this should be sufficient for now.
- Author Developer
Thank you.
- Daniele Ceraolo Spurio mentioned in commit drm/nouveau@7a6e0b6f