[DG2] Frequent hangs/crashes on A380 (small BAR), possibly during OpenGL surface resize (Mesa _iris_batch_flush or kernel refcount_t: underflow; use-after-free.)
This seems to be a regression of some kind, possibly in userspace (maybe Mesa or KWin), as using previously-working kernels doesn't seem to trigger it. It does, however, sometimes lead to a kernel BUG.
Running OpenGL applications (under Mesa 24.1.3, iris) can cause either:
- The application to crash
- The wayland compositor to crash (KWin 6.1.2)
- XWayland (SUSE LINUX Xwayland Version 24.1.0 (12401000)) to crash (if the application is running under XWayland)
- Some combination of the above.
It's been difficult to find an exact repro case (for example, launching Steam used to reproduce it ~90% of the time, but now seems stable), but I can relatively reliably cause a crash by:
- Building SDL, with the test programs enabled (-DSDL_TESTS=ON)
- Running 'testgl', either under XWayland (SDL_VIDEO_DRIVER=x11) or native Wayland (SDL_VIDEO_DRIVER=wayland).
- Repeatedly toggling fullscreen with Alt+Enter
- The system will slow down, and eventually a crash will occur (usually in KWin)
- Sometimes, this will be accompanied by a kernel BUG (refcount_t: underflow; use-after-free. in xe_sync_entry_cleanup).
The 'testvulkan' app can (thus far) resize indefinitely with no slowdown or crash. Other SDL applications will crash if SDL_RENDER_DRIVER=opengl, but not if SDL_RENDER_DRIVER=vulkan. Using MESA_LOADER_DRIVER_OVERRIDE=zink does not crash, but does have severe graphical glitches (looks like issues with EGL_EXT_present_opaque).
None of the above issues have occurred under i915, though i915 is not really usable here due to something like drm/i915/kernel#11055.
I have tried this with several kernels:
- OpenSUSE's 6.9.7-1-vanilla
- 6.10-rc3 (this had previously worked)
- 6.10-rc6
- linux-next-20240709
- linux-next-20240712 with the xe BO shrinker patchset
The kernel version does not appear to affect the issue at all.
The system info (as reported by KDE) is:
Operating System: openSUSE Tumbleweed 20240712
KDE Plasma Version: 6.1.2
KDE Frameworks Version: 6.3.0
Qt Version: 6.7.2
Kernel Version: 6.10.0-rc7-next-20240712-sulix-00012-g7aaba425bacb (64-bit)
Graphics Platform: Wayland
Processors: 8 × Intel® Core™ i7-4770K CPU @ 3.50GHz
Memory: 31.0 GiB of RAM
Graphics Processor: Mesa Intel® Arc
Manufacturer: Gigabyte Technology Co., Ltd.
Product Name: Z87X-UD5H
The Gigabyte Technology Co., Ltd. Z87X-UD5H motherboard does not support ReBAR, and the HSW CPU's integrated GPU (Intel(R) HD Graphics 4600 (HSW GT2)) is also enabled (but both the crashing application and KWin are running on the DG2).
lspci -vnn -d :*:0300 returns:
00:02.0 VGA compatible controller [0300]: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller [8086:0412] (rev 06) (prog-if 00 [VGA controller])
Subsystem: Gigabyte Technology Co., Ltd Device [1458:d000]
Flags: bus master, fast devsel, latency 0, IRQ 33
Memory at f7400000 (64-bit, non-prefetchable) [size=4M]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
I/O ports at f000 [size=64]
Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: i915
Kernel modules: i915
03:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc A380] [8086:56a5] (rev 05) (prog-if 00 [VGA controller])
Subsystem: Device [172f:3941]
Flags: bus master, fast devsel, latency 0, IRQ 36
Memory at f6000000 (64-bit, non-prefetchable) [size=16M]
Memory at e0000000 (64-bit, prefetchable) [size=256M]
Expansion ROM at f7000000 [disabled] [size=2M]
Capabilities: <access denied>
Kernel driver in use: xe
Kernel modules: i915, xe
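To make the small-BAR situation easier to check on other machines, here is a rough sketch that pulls the prefetchable memory region sizes out of lspci -v style text like the output above. The function name prefetchable_bar_sizes is made up for illustration, and the regular expression assumes the "Memory at ... (...) [size=...]" line format shown here.

```python
import re

# Matches lines such as:
#   Memory at e0000000 (64-bit, prefetchable) [size=256M]
BAR_RE = re.compile(
    r"Memory at (?P<addr>[0-9a-f]+) \((?P<attrs>[^)]*)\) "
    r"\[size=(?P<size>\d+)(?P<unit>[KMG]?)\]"
)
UNIT = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def prefetchable_bar_sizes(lspci_text):
    """Return the sizes in bytes of all prefetchable memory regions."""
    sizes = []
    for match in BAR_RE.finditer(lspci_text):
        # Split the attribute list so "non-prefetchable" is not mistaken
        # for "prefetchable".
        attrs = [a.strip() for a in match.group("attrs").split(",")]
        if "prefetchable" in attrs:
            sizes.append(int(match.group("size")) * UNIT[match.group("unit")])
    return sizes
```

Fed the 03:00.0 entry above, this reports a single 256 MiB prefetchable region. A prefetchable region that is smaller than the card's VRAM is what I mean by "small BAR" in this report.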
The boot log of the current session (with several occurrences of the crash in KWin, and one kernel BUG) can be found here: https://davidgow.net/stuff/sparky-xe-bugs.txt
The relevant kernel stacktrace is:
[ 789.180148] [ T12663] refcount_t: underflow; use-after-free.
[ 789.180178] [ T12663] WARNING: CPU: 6 PID: 12663 at lib/refcount.c:28 refcount_warn_saturate+0xbe/0x110
[ 789.180184] [ T12663] Modules linked in: snd_seq_dummy(E) snd_hrtimer(E) af_packet(E) nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) rfkill(E) nf_tables(E) libcrc32c(E) qrtr(E) snd_opl3_synth(E) snd_seq_midi_emul(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_seq(E) binfmt_misc(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) intel_rapl_msr(E) intel_rapl_common(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) spi_nor(E) coretemp(E) mtd(E) kvm_intel(E) snd_hda_codec_realtek(E) iTCO_wdt(E) spi_intel_platform(E) snd_hda_codec_generic(E) intel_pmc_bxt(E) kvm(E) spi_intel(E) iTCO_vendor_support(E) mei_hdcp(E) mei_pxp(E) at24(E) snd_hda_scodec_component(E) snd_hda_codec_hdmi(E) snd_usb_audio(E) ppdev(E) snd_cmipci(E) snd_hda_intel(E) snd_intel_dspcfg(E) igb(E) snd_mpu401_uart(E) snd_usbmidi_lib(E) snd_hda_codec(E)
[ 789.180241] [ T12663] snd_opl3_lib(E) snd_ump(E) i2c_i801(E) mxm_wmi(E) efi_pstore(E) snd_rawmidi(E) mei_gsc(E) i2c_smbus(E) pcspkr(E) lpc_ich(E) dca(E) snd_seq_device(E) snd_hda_core(E) e1000e(E) mei_me(E) parport_serial(E) mc(E) snd_hwdep(E) mei(E) snd_pcm(E) snd_timer(E) snd(E) tiny_power_button(E) parport_pc(E) soundcore(E) joydev(E) thermal(E) parport(E) fan(E) button(E) fuse(E) loop(E) configfs(E) nfnetlink(E) dmi_sysfs(E) ip_tables(E) x_tables(E) ext4(E) mbcache(E) jbd2(E) dm_crypt(E) essiv(E) authenc(E) trusted(E) asn1_encoder(E) tee(E) hid_microsoft(E) hid_steam(E) ff_memless(E) uas(E) usb_storage(E) hid_generic(E) usbhid(E) xe(E) crct10dif_pclmul(E) drm_ttm_helper(E) sr_mod(E) crc32_pclmul(E) cdrom(E) gpu_sched(E) crc32c_intel(E) drm_suballoc_helper(E) polyval_clmulni(E) drm_gpuvm(E) polyval_generic(E) ahci(E) libahci(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) libata(E) sha1_ssse3(E) sd_mod(E) scsi_dh_emc(E) scsi_dh_rdac(E) xhci_pci(E) scsi_dh_alua(E) xhci_pci_renesas(E) ehci_pci(E) sg(E)
[ 789.180278] [ T12663] aesni_intel(E) firewire_ohci(E) gf128mul(E) i915(E) ehci_hcd(E) xhci_hcd(E) crypto_simd(E) firewire_core(E) scsi_mod(E) cryptd(E) crc_itu_t(E) scsi_common(E) i2c_algo_bit(E) usbcore(E) ttm(E) video(E) wmi(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) br_netfilter(E) bridge(E) stp(E) llc(E) msr(E) i2c_dev(E) efivarfs(E)
[ 789.180288] [ T12663] CPU: 6 UID: 1000 PID: 12663 Comm: Xwayland Tainted: G UD W E N 6.10.0-rc7-next-20240712-sulix-00012-g7aaba425bacb #7 fe53d1c8a23ee27f8feb38b2980b99a70b1c9d7e
[ 789.180292] [ T12663] Tainted: [U]=USER, [D]=DIE, [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
[ 789.180292] [ T12663] Hardware name: Gigabyte Technology Co., Ltd. Z87X-UD5H/Z87X-UD5H-CF, BIOS 10c 06/12/2014
[ 789.180293] [ T12663] RIP: 0010:refcount_warn_saturate+0xbe/0x110
[ 789.180296] [ T12663] Code: 01 01 e8 c5 5f a6 ff 0f 0b c3 cc cc cc cc 80 3d 6d 8a a6 01 00 75 85 48 c7 c7 80 85 4d 8c c6 05 5d 8a a6 01 01 e8 a2 5f a6 ff <0f> 0b c3 cc cc cc cc 80 3d 4b 8a a6 01 00 0f 85 5e ff ff ff 48 c7
[ 789.180296] [ T12663] RSP: 0018:ffff9ed30f9ef9c8 EFLAGS: 00010282
[ 789.180298] [ T12663] RAX: 0000000000000000 RBX: ffff899623e11100 RCX: 0000000000000027
[ 789.180299] [ T12663] RDX: ffff899c4f727808 RSI: 0000000000000001 RDI: ffff899c4f727800
[ 789.180299] [ T12663] RBP: ffff9ed30f9efb80 R08: 0000000000000000 R09: ffff9ed30f9ef878
[ 789.180300] [ T12663] R10: ffff9ed30f9ef870 R11: 0000000000000003 R12: ffff899623e11100
[ 789.180300] [ T12663] R13: 00000000fffffe00 R14: 0000000000000001 R15: 00000000fffffe00
[ 789.180301] [ T12663] FS: 00007fbb90feca80(0000) GS:ffff899c4f700000(0000) knlGS:0000000000000000
[ 789.180302] [ T12663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 789.180303] [ T12663] CR2: 00007fbb8ea0c000 CR3: 000000010a608003 CR4: 00000000001706f0
[ 789.180304] [ T12663] DR0: ffffffff8d954558 DR1: ffffffff8d954559 DR2: ffffffff8d95455a
[ 789.180305] [ T12663] DR3: ffffffff8d95455b DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 789.180305] [ T12663] Call Trace:
[ 789.180306] [ T12663] <TASK>
[ 789.180307] [ T12663] ? refcount_warn_saturate+0xbe/0x110
[ 789.180308] [ T12663] ? __warn.cold+0xb0/0x10a
[ 789.180310] [ T12663] ? refcount_warn_saturate+0xbe/0x110
[ 789.180312] [ T12663] ? report_bug+0xd8/0x150
[ 789.180315] [ T12663] ? handle_bug+0x3c/0x80
[ 789.180318] [ T12663] ? exc_invalid_op+0x17/0x70
[ 789.180319] [ T12663] ? asm_exc_invalid_op+0x1a/0x20
[ 789.180322] [ T12663] ? refcount_warn_saturate+0xbe/0x110
[ 789.180323] [ T12663] ? refcount_warn_saturate+0xbe/0x110
[ 789.180324] [ T12663] xe_sync_entry_cleanup+0xc9/0xf0 [xe ddbb1105acb27db1a5f3645534d3eb084780c175]
[ 789.180429] [ T12663] xe_vm_bind_ioctl+0x1745/0x1f00 [xe ddbb1105acb27db1a5f3645534d3eb084780c175]
[ 789.180517] [ T12663] ? update_load_avg+0x7e/0x7e0
[ 789.180520] [ T12663] ? sched_balance_newidle+0x2d7/0x410
[ 789.180524] [ T12663] ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe ddbb1105acb27db1a5f3645534d3eb084780c175]
[ 789.180606] [ T12663] ? drm_ioctl_kernel+0xad/0x100
[ 789.180609] [ T12663] ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe ddbb1105acb27db1a5f3645534d3eb084780c175]
[ 789.180695] [ T12663] drm_ioctl_kernel+0xad/0x100
[ 789.180698] [ T12663] drm_ioctl+0x269/0x500
[ 789.180699] [ T12663] ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe ddbb1105acb27db1a5f3645534d3eb084780c175]
[ 789.180779] [ T12663] ? drm_ioctl_kernel+0xad/0x100
[ 789.180781] [ T12663] ? __check_object_size+0x50/0x220
[ 789.180783] [ T12663] ? _copy_to_user+0x24/0x40
[ 789.180785] [ T12663] xe_drm_ioctl+0x53/0x80 [xe ddbb1105acb27db1a5f3645534d3eb084780c175]
[ 789.180846] [ T12663] __x64_sys_ioctl+0x97/0xd0
[ 789.180849] [ T12663] do_syscall_64+0x82/0x190
[ 789.180851] [ T12663] ? __pfx_read_tsc+0x10/0x10
[ 789.180853] [ T12663] ? ktime_get_mono_fast_ns+0x37/0xb0
[ 789.180855] [ T12663] ? __pm_runtime_idle+0x6f/0xd0
[ 789.180857] [ T12663] ? xe_drm_ioctl+0x5e/0x80 [xe ddbb1105acb27db1a5f3645534d3eb084780c175]
[ 789.180918] [ T12663] ? syscall_exit_to_user_mode+0x10/0x220
[ 789.180920] [ T12663] ? do_syscall_64+0x8e/0x190
[ 789.180921] [ T12663] ? do_writev+0xe4/0x140
[ 789.180923] [ T12663] ? do_writev+0x101/0x140
[ 789.180925] [ T12663] ? __fdget+0xb6/0xe0
[ 789.180927] [ T12663] ? __sys_recvmsg+0x8a/0xa0
[ 789.180929] [ T12663] ? syscall_exit_to_user_mode+0x10/0x220
[ 789.180931] [ T12663] ? do_syscall_64+0x8e/0x190
[ 789.180932] [ T12663] ? do_syscall_64+0x8e/0x190
[ 789.180933] [ T12663] entry_SYSCALL_64_after_hwframe+0x76/0x7e
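Most of the entries in the trace above are prefixed with "?", which the kernel uses for addresses found on the stack that the unwinder could not confirm as part of the call chain. A rough sketch for filtering those out of a log in this format follows. The name reliable_frames is made up for illustration, and the regular expression assumes the "[ timestamp] [ Tpid] symbol+0xoff/0xsize [module ...]" layout seen in this log.

```python
import re

# One call-trace entry: timestamp, task id, optional "? " marker,
# symbol+offset/size, and an optional "[module build-id]" suffix.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+\[\s*T\d+\]\s+"
    r"(?P<unreliable>\? )?"
    r"(?P<sym>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<mod>\w+)[^\]]*\])?"
)


def reliable_frames(log_text):
    """Return (symbol, module) pairs for frames not marked with '?'.

    module is None for symbols in the core kernel image.
    """
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match and not match.group("unreliable"):
            frames.append((match.group("sym"), match.group("mod")))
    return frames
```

On the trace above, the first two frames this keeps are xe_sync_entry_cleanup and xe_vm_bind_ioctl, both in the xe module, followed by the generic DRM ioctl and syscall entry path.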
Let me know if this looks like it should belong on the Mesa tracker instead: as it only occurs with both xe and iris, it seems it could be either.