refcount_t: underflow; use-after-free. from xe_sync_entry_cleanup+0xc9/0xf0
On LNL (Lunar Lake), while playing Marvel's Spider-Man Remastered on Steam, I occasionally get the following kernel warning:
[ 485.674115] ------------[ cut here ]------------
[ 485.674120] refcount_t: underflow; use-after-free.
[ 485.674131] WARNING: CPU: 2 PID: 4431 at lib/refcount.c:28 refcount_warn_saturate+0xbe/0x110
[ 485.674139] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device rfkill qrtr binfmt_misc nls_ascii nls_cp437 vfat fat intel_uncore_frequency intel_uncore_frequency_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm snd_sof_pci_intel_lnl snd_sof_pci_intel_mtl snd_sof_intel_hda_generic snd_sof_pci crc32_pclmul snd_sof_xtensa_dsp crc32c_intel snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda snd_sof ghash_clmulni_intel sha512_ssse3 snd_sof_utils sha256_ssse3 snd_soc_acpi_intel_match sha1_ssse3 snd_soc_acpi snd_soc_core snd_compress snd_sof_intel_hda_mlink snd_hda_ext_core snd_hda_intel intel_rapl_msr snd_intel_dspcfg aesni_intel snd_hda_codec snd_hwdep crypto_simd wmi_bmof snd_hda_core cryptd snd_pcm rapl snd_timer ucsi_acpi pcspkr typec_ucsi snd processor_thermal_device_pci mei_me roles processor_thermal_device soundcore processor_thermal_wt_hint mei thunderbolt processor_thermal_rfim typec processor_thermal_rapl intel_rapl_common processor_thermal_wt_req
[ 485.674188] processor_thermal_power_floor processor_thermal_mbox int3403_thermal button battery int340x_thermal_zone intel_pmc_core intel_skl_int3472_tps68470 intel_hid int3400_thermal sparse_keymap acpi_thermal_rel intel_vsec pmt_telemetry intel_skl_int3472_discrete pmt_class joydev acpi_tad acpi_pad evdev msr parport_pc ppdev lp parport efi_pstore configfs nfnetlink efivarfs ip_tables x_tables autofs4 ax88796b asix phylink selftests usbnet mii libphy hid_generic usbhid hid xe nvme drm_ttm_helper ttm nvme_core i2c_algo_bit gpu_sched t10_pi drm_buddy drm_suballoc_helper drm_gpuvm xhci_pci drm_exec drm_display_helper xhci_hcd crc64_rocksoft drm_kms_helper crc64 crc_t10dif crct10dif_generic usbcore intel_ish_ipc crct10dif_pclmul intel_lpss_pci crct10dif_common drm intel_ishtp intel_lpss usb_common idma64 video wmi fan
[ 485.674240] CPU: 2 PID: 4431 Comm: vkd3d_queue Not tainted 6.10.0-rc7pz+ #45
[ 485.674242] Hardware name: Intel Corporation Lunar Lake Client Platform/LNL-M LP5 RVP1, BIOS LNLMFWI1.R00.3093.D87.2403190644 03/19/2024
[ 485.674243] RIP: 0010:refcount_warn_saturate+0xbe/0x110
[ 485.674247] Code: 01 01 e8 55 e6 92 ff 0f 0b c3 cc cc cc cc 80 3d 69 cf 65 01 00 75 85 48 c7 c7 78 60 70 b7 c6 05 59 cf 65 01 01 e8 32 e6 92 ff <0f> 0b c3 cc cc cc cc 80 3d 47 cf 65 01 00 0f 85 5e ff ff ff 48 c7
[ 485.674248] RSP: 0018:ffffad4507e739f0 EFLAGS: 00010282
[ 485.674250] RAX: 0000000000000000 RBX: ffff8c382eaa0540 RCX: 0000000000000027
[ 485.674251] RDX: ffff8c3a6f89c9c8 RSI: 0000000000000001 RDI: ffff8c3a6f89c9c0
[ 485.674252] RBP: ffff8c382eaa0540 R08: 0000000000000000 R09: 0000000000000003
[ 485.674253] R10: ffffad4507e73888 R11: ffffffffb7defd68 R12: 00000000fffffff4
[ 485.674254] R13: 00007f9ee015f700 R14: ffff8c3768f42c20 R15: 0000000000000002
[ 485.674255] FS: 000000012bbff6c0(0000) GS:ffff8c3a6f880000(0000) knlGS:000000007fe90000
[ 485.674256] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 485.674257] CR2: 00007fbbc4a10000 CR3: 000000037617a006 CR4: 0000000000f70ef0
[ 485.674258] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 485.674258] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[ 485.674259] PKRU: 55555554
[ 485.674260] Call Trace:
[ 485.674262] <TASK>
[ 485.674265] ? __warn+0x8c/0x180
[ 485.674268] ? refcount_warn_saturate+0xbe/0x110
[ 485.674270] ? report_bug+0x164/0x190
[ 485.674274] ? prb_read_valid+0x1b/0x30
[ 485.674278] ? handle_bug+0x3a/0x70
[ 485.674280] ? exc_invalid_op+0x17/0x70
[ 485.674281] ? asm_exc_invalid_op+0x1a/0x20
[ 485.674287] ? refcount_warn_saturate+0xbe/0x110
[ 485.674290] xe_sync_entry_cleanup+0xc9/0xf0 [xe]
[ 485.674359] xe_exec_ioctl+0x240/0xa90 [xe]
[ 485.674397] ? __pfx_xe_exec_fn+0x10/0x10 [xe]
[ 485.674431] ? __pfx_xe_exec_ioctl+0x10/0x10 [xe]
[ 485.674466] drm_ioctl_kernel+0xb5/0x110 [drm]
[ 485.674496] drm_ioctl+0x27a/0x4e0 [drm]
[ 485.674516] ? __pfx_xe_exec_ioctl+0x10/0x10 [xe]
[ 485.674556] xe_drm_ioctl+0x56/0x80 [xe]
[ 485.674589] __x64_sys_ioctl+0x94/0xd0
[ 485.674593] do_syscall_64+0x90/0x1a0
[ 485.674595] ? __mutex_unlock_slowpath+0x3a/0x290
[ 485.674599] ? dma_buf_ioctl+0x33b/0x400
[ 485.674603] ? lockdep_hardirqs_on_prepare+0xda/0x190
[ 485.674606] ? syscall_exit_to_user_mode+0xb8/0x290
[ 485.674608] ? do_syscall_64+0x9c/0x1a0
[ 485.674609] ? lockdep_hardirqs_on_prepare+0xda/0x190
[ 485.674611] ? syscall_exit_to_user_mode+0xb8/0x290
[ 485.674613] ? do_syscall_64+0x9c/0x1a0
[ 485.674614] ? xe_drm_ioctl+0x61/0x80 [xe]
[ 485.674646] ? lockdep_hardirqs_on_prepare+0xda/0x190
[ 485.674648] ? syscall_exit_to_user_mode+0xb8/0x290
[ 485.674650] ? do_syscall_64+0x9c/0x1a0
[ 485.674651] ? do_syscall_64+0x9c/0x1a0
[ 485.674653] ? do_syscall_64+0x9c/0x1a0
[ 485.674654] ? lockdep_hardirqs_on_prepare+0xda/0x190
[ 485.674656] entry_SYSCALL_64_after_hwframe+0x71/0x79
[ 485.674658] RIP: 0033:0x7f9f0d50b71b
[ 485.674660] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 485.674661] RSP: 002b:000000012bbfcab0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 485.674663] RAX: ffffffffffffffda RBX: 000055558f8a29c0 RCX: 00007f9f0d50b71b
[ 485.674664] RDX: 000000012bbfcb80 RSI: 0000000040386449 RDI: 000000000000007c
[ 485.674665] RBP: 000000012bbfcc30 R08: 0000000000000000 R09: 0000000000000000
[ 485.674666] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 485.674666] R13: 00007f9f065a05c0 R14: 00007f9ee015f680 R15: 000055558f95a0f0
[ 485.674670] </TASK>
[ 485.674671] irq event stamp: 11301667
[ 485.674672] hardirqs last enabled at (11301673): [<ffffffffb635553b>] console_unlock+0x11b/0x140
[ 485.674674] hardirqs last disabled at (11301678): [<ffffffffb6355520>] console_unlock+0x100/0x140
[ 485.674675] softirqs last enabled at (11298990): [<ffffffffb62a28ea>] __irq_exit_rcu+0x9a/0xc0
[ 485.674677] softirqs last disabled at (11298979): [<ffffffffb62a28ea>] __irq_exit_rcu+0x9a/0xc0
[ 485.674678] ---[ end trace 0000000000000000 ]---
This does not happen on every run, but if you launch the game multiple times it will eventually trigger. Please note that we are also observing GPU hangs in this game, and those hangs may be related to this bug: mesa/mesa#11526
I think this could also be #495 (closed) coming back to haunt us, except that the reproducer from that issue no longer triggers the problem.
(gdb) list *(xe_sync_entry_cleanup+0xc9)
0x6b959 is in xe_sync_entry_cleanup (../include/linux/refcount.h:275).
270 smp_acquire__after_ctrl_dep();
271 return true;
272 }
273
274 if (unlikely(old < 0 || old - i < 0))
275 refcount_warn_saturate(r, REFCOUNT_SUB_UAF);
276
277 return false;
278 }
279
(gdb)
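
For anyone unfamiliar with the check shown in the gdb listing, here is a minimal userspace sketch of the condition at refcount.h:274-275. It is a model, not the kernel code: `model_ref`, `model_sub_and_test` and `model_double_put_demo` are hypothetical names, and C11 atomics stand in for the kernel's atomic helpers. The double put is only an assumed way to reach that branch; which reference xe_sync_entry_cleanup actually over-drops is not established here.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Same value the kernel uses for REFCOUNT_SATURATED (INT_MIN / 2). */
#define MODEL_SATURATED (INT_MIN / 2)

/* Hypothetical stand-in for refcount_t, plus a counter for warnings. */
struct model_ref {
	atomic_int refs;
	int warnings;
};

/* Models the sub-and-test logic from the listing: true means "last ref dropped". */
static bool model_sub_and_test(int i, struct model_ref *r)
{
	int old = atomic_fetch_sub_explicit(&r->refs, i, memory_order_release);

	if (old == i) {
		/* Userspace substitute for smp_acquire__after_ctrl_dep(). */
		atomic_thread_fence(memory_order_acquire);
		return true;
	}

	if (old < 0 || old - i < 0) {
		/* Models refcount_warn_saturate(r, REFCOUNT_SUB_UAF). */
		atomic_store(&r->refs, MODEL_SATURATED);
		r->warnings++;
		fprintf(stderr, "model_ref: underflow; use-after-free.\n");
	}

	return false;
}

/* Assumed scenario: one reference held, but two puts issued. */
static int model_double_put_demo(void)
{
	struct model_ref r;
	bool first, second;

	atomic_init(&r.refs, 1);
	r.warnings = 0;

	first = model_sub_and_test(1, &r);  /* old == 1: legitimate final put */
	second = model_sub_and_test(1, &r); /* old == 0: 0 - 1 < 0, warns */

	assert(first);
	assert(!second);
	return r.warnings;
}
```

Calling `model_double_put_demo()` returns 1: the first put drops the last reference normally, and the second one sees `old == 0`, which is the branch that prints the warning in the trace above.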