App freezing with callstack inside syscall and NULL pointer dereference in dmesg
Brief summary of the problem:
In the past week I've seen apps freeze multiple times a day. When this happens I see logs in dmesg like this:
[81280.418131] BUG: kernel NULL pointer dereference, address: 0000000000000078
[81280.418144] #PF: supervisor read access in kernel mode
[81280.418149] #PF: error_code(0x0000) - not-present page
[81280.418152] PGD 0 P4D 0
[81280.418159] Oops: 0000 [#12] PREEMPT SMP NOPTI
[81280.418165] CPU: 15 PID: 83078 Comm: love:cs0 Tainted: G D W 6.2.0-26-generic #26~22.04.1-Ubuntu
[81280.418172] Hardware name: LENOVO 20UF0014US/20UF0014US, BIOS R1CET75W(1.44 ) 06/13/2023
[81280.418176] RIP: 0010:drm_sched_job_cleanup+0x26/0x140 [gpu_sched]
[81280.418200] Code: 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc 53 48 83 ec 10 48 8b 7f 20 65 48 8b 04 25 28 00 00 00 48 89 45 e8 31 c0 <8b> 47 78 85 c0 0f 84 d3 00 00 00 48 83 ff c0 74 1f 4c 8d 47 78 b8
[81280.418206] RSP: 0018:ffffafdc0a0d7a98 EFLAGS: 00010246
[81280.418211] RAX: 0000000000000000 RBX: ffffafdc0a0d7b28 RCX: 0000000000000000
[81280.418215] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[81280.418218] RBP: ffffafdc0a0d7ab8 R08: 0000000000000000 R09: 0000000000000000
[81280.418221] R10: 0000000000000000 R11: 0000000000000000 R12: ffff989991916c00
[81280.418224] R13: 0000000000000000 R14: ffff989991916c00 R15: ffff9896cd680000
[81280.418228] FS: 00007f5bd1fff640(0000) GS:ffff9899bfbc0000(0000) knlGS:0000000000000000
[81280.418232] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[81280.418236] CR2: 0000000000000078 CR3: 0000000275280000 CR4: 0000000000350ee0
[81280.418240] Call Trace:
[81280.418244] <TASK>
[81280.418254] amdgpu_job_free+0x1a/0x100 [amdgpu]
[81280.418984] amdgpu_cs_parser_fini+0x15c/0x200 [amdgpu]
[81280.419605] ? ttm_bo_move_to_lru_tail+0x1a/0x30 [ttm]
[81280.419629] amdgpu_cs_ioctl+0xd9/0x340 [amdgpu]
[81280.420265] ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
[81280.420887] drm_ioctl_kernel+0xc3/0x160 [drm]
[81280.420987] drm_ioctl+0x27b/0x4c0 [drm]
[81280.421073] ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
[81280.421636] ? do_futex+0xd7/0x230
[81280.421648] amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
[81280.422188] __x64_sys_ioctl+0x9d/0xe0
[81280.422197] do_syscall_64+0x5c/0x90
[81280.422205] ? irqentry_exit+0x43/0x50
[81280.422211] ? sysvec_call_function+0x4e/0xb0
[81280.422217] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[81280.422228] RIP: 0033:0x7f5be0f1aaff
[81280.422234] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
[81280.422240] RSP: 002b:00007f5bd1ffea60 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[81280.422246] RAX: ffffffffffffffda RBX: 00007f5bd1ffeb20 RCX: 00007f5be0f1aaff
[81280.422250] RDX: 00007f5bd1ffeb20 RSI: 00000000c0186444 RDI: 000000000000000f
[81280.422253] RBP: 00000000c0186444 R08: 00007f5bd1ffec70 R09: 00007f5bd1ffeb00
[81280.422256] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f5bd1ffec40
[81280.422259] R13: 000000000000000f R14: 00007f5bd1ffec18 R15: 000055db3620f7c0
[81280.422266] </TASK>
[81280.422269] Modules linked in: ses enclosure scsi_transport_sas uas usb_storage nft_masq nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc nf_tables libcrc32c nfnetlink ccm rfcomm cmac algif_hash algif_skcipher af_alg bnep amdgpu snd_acp3x_rn snd_acp3x_pdm_dma snd_soc_dmic snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils iommu_v2 binfmt_misc snd_ctl_led drm_buddy snd_soc_core snd_hda_codec_realtek gpu_sched drm_ttm_helper snd_compress snd_hda_codec_generic snd_hda_codec_hdmi ttm ac97_bus iwlmvm snd_pcm_dmaengine intel_rapl_msr snd_hda_intel drm_display_helper intel_rapl_common snd_pci_ps snd_intel_dspcfg cec snd_rpl_pci_acp6x snd_intel_sdw_acpi edac_mce_amd snd_acp_pci rc_core snd_hda_codec thinkpad_acpi mac80211 snd_pci_acp6x kvm_amd drm_kms_helper snd_pci_acp5x nvram snd_hda_core i2c_algo_bit snd_hwdep snd_rn_pci_acp3x syscopyarea snd_acp_config snd_seq_midi sysfillrect snd_seq_midi_event
[81280.422383] snd_soc_acpi snd_pcm libarc4 ccp nls_iso8859_1 sysimgblt kvm snd_pci_acp3x uvcvideo videobuf2_vmalloc snd_rawmidi irqbypass videobuf2_memops crct10dif_pclmul videobuf2_v4l2 polyval_clmulni snd_seq polyval_generic ghash_clmulni_intel sha512_ssse3 snd_seq_device videodev iwlwifi aesni_intel crypto_simd snd_timer videobuf2_common tps6598x joydev input_leds btusb cryptd rapl serio_raw mc snd wmi_bmof btrtl cfg80211 ucsi_acpi btbcm think_lmi typec_ucsi soundcore firmware_attributes_class btintel ipmi_devintf ledtrig_audio btmtk k10temp ipmi_msghandler typec platform_profile mac_hid serial_multi_instantiate bluetooth ecdh_generic ecc sch_fq_codel msr parport_pc ppdev lp drm parport pstore_blk ramoops pstore_zone reed_solomon efi_pstore ip_tables x_tables autofs4 rtsx_pci_sdmmc nvme crc32_pclmul psmouse nvme_core rtsx_pci xhci_pci r8169 i2c_piix4 xhci_pci_renesas realtek nvme_common video wmi i2c_scmi
[81280.422516] CR2: 0000000000000078
[81280.422521] ---[ end trace 0000000000000000 ]---
[81280.501549] RIP: 0010:drm_sched_job_cleanup+0x26/0x140 [gpu_sched]
[81280.501582] Code: 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc 53 48 83 ec 10 48 8b 7f 20 65 48 8b 04 25 28 00 00 00 48 89 45 e8 31 c0 <8b> 47 78 85 c0 0f 84 d3 00 00 00 48 83 ff c0 74 1f 4c 8d 47 78 b8
[81280.501588] RSP: 0018:ffffafdc021e7a50 EFLAGS: 00010246
[81280.501594] RAX: 0000000000000000 RBX: ffffafdc021e7ae0 RCX: 0000000000000000
[81280.501598] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[81280.501601] RBP: ffffafdc021e7a70 R08: 0000000000000000 R09: 0000000000000000
[81280.501604] R10: 0000000000000000 R11: 0000000000000000 R12: ffff989716c5cc00
[81280.501607] R13: 0000000000000000 R14: ffff989716c5cc00 R15: ffff9896cd680000
[81280.501610] FS: 00007f5bd1fff640(0000) GS:ffff9899bfbc0000(0000) knlGS:0000000000000000
[81280.501615] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[81280.501618] CR2: 0000000000000078 CR3: 0000000275280000 CR4: 0000000000350ee0
[81280.501623] note: love:cs0[83078] exited with irqs disabled
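One possible reading of the faulting instruction, offered as a sketch rather than a confirmed diagnosis: the byte marked `<8b>` in the `Code:` line begins the sequence `8b 47 78`, which on x86-64 encodes a 32-bit load from `0x78(%rdi)`. With RDI at 0, that matches the fault address in CR2 (0x78). The helper below decodes only this one form (opcode `8b` with an 8-bit displacement and no SIB byte); `faulting_bytes`, `decode_mov_disp8` and the register tables are names invented for this illustration.

```python
# Sketch: decode the faulting instruction from an oops "Code:" line.
# Handles only `8b /r` (MOV r32, r/m32) with mod=01 (8-bit displacement)
# and no SIB byte; anything else raises ValueError.

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def faulting_bytes(code_line):
    """Return the bytes from the <..> marker onward as a list of ints."""
    tokens = code_line.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return [int(t.strip("<>"), 16) for t in tokens[start:]]


def decode_mov_disp8(insn):
    """Return (destination register, base register, displacement)."""
    opcode, modrm, disp = insn[0], insn[1], insn[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0x8B or mod != 1 or rm == 4:
        raise ValueError("not a simple 8b mod=01 form")
    if disp >= 0x80:  # the displacement byte is signed
        disp -= 0x100
    return REG32[reg], REG64[rm], disp


# Tail of the Code: line from the oops in this report.
code = "31 c0 <8b> 47 78 85 c0 0f 84 d3 00 00 00"
dst, base, disp = decode_mov_disp8(faulting_bytes(code))
print(f"mov {disp:#x}(%{base}),%{dst}")  # prints: mov 0x78(%rdi),%eax
```

The bytes just before the marker, `48 8b 7f 20`, load RDI from offset 0x20 of the function's argument, so the pointer stored there appears to have been NULL when the job was cleaned up.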
I've managed to run an app inside gdb and grab a call stack when it freezes up:
(gdb) bt
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00007ffff12ba3a6 in ?? () from /usr/lib/x86_64-linux-gnu/dri/radeonsi_dri.so
#2 0x00007ffff12bf8ef in ?? () from /usr/lib/x86_64-linux-gnu/dri/radeonsi_dri.so
#3 0x00007ffff1afb233 in ?? () from /usr/lib/x86_64-linux-gnu/dri/radeonsi_dri.so
#4 0x00007ffff1375e90 in ?? () from /usr/lib/x86_64-linux-gnu/dri/radeonsi_dri.so
#5 0x00007ffff12b4375 in ?? () from /usr/lib/x86_64-linux-gnu/dri/radeonsi_dri.so
#6 0x00007ffff62891e2 in glLabelObjectEXT () from /lib/x86_64-linux-gnu/libGLX_mesa.so.0
#7 0x00007ffff627af95 in ?? () from /lib/x86_64-linux-gnu/libGLX_mesa.so.0
#8 0x00007ffff626a71f in ?? () from /lib/x86_64-linux-gnu/libGLX_mesa.so.0
#9 0x00007ffff6d49ac3 in ?? () from /tmp/.mount_love0nKuxZ/lib/libSDL2-2.0.so.0
#10 0x00007ffff79d27ca in love::graphics::opengl::Graphics::present(void*) () from /tmp/.mount_love0nKuxZ/lib/liblove-11.4.so
#11 0x00007ffff79e26d8 in love::graphics::w_present(lua_State*) () from /tmp/.mount_love0nKuxZ/lib/liblove-11.4.so
#12 0x00007ffff740a946 in ?? () from /tmp/.mount_love0nKuxZ/lib/libluajit-5.1.so.2
#13 0x00005555554010c5 in ?? ()
#14 0x00007ffff7029d90 in __libc_start_call_main (main=main@entry=0x555555400e50, argc=argc@entry=2, argv=argv@entry=0x7fffffffd6a8)
at ../sysdeps/nptl/libc_start_call_main.h:58
#15 0x00007ffff7029e40 in __libc_start_main_impl (main=0x555555400e50, argc=2, argv=0x7fffffffd6a8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
stack_end=0x7fffffffd698) at ../csu/libc-start.c:392
#16 0x000055555540130a in ?? ()
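The Mesa frames in this backtrace have no symbols, but the user-space registers saved in the oops give a hint about what the crashing thread requested: ORIG_RAX is 0x10 (the `ioctl` syscall number on x86-64) and RSI holds the request code 0xc0186444. A minimal sketch that splits that code into the standard Linux `_IOC` fields follows. `decode_ioctl` is a name made up here, and the bit layout is the generic one used on x86 (8-bit number, 8-bit type, 14-bit size, 2-bit direction).

```python
# Sketch: split a Linux ioctl request code into its _IOC fields.
# Assumes the generic asm-generic/ioctl.h layout used on x86.

DIRECTIONS = {0: "none", 1: "write", 2: "read", 3: "read|write"}


def decode_ioctl(request):
    return {
        "nr": request & 0xFF,                 # command number
        "type": chr((request >> 8) & 0xFF),   # subsystem "magic" character
        "size": (request >> 16) & 0x3FFF,     # argument size in bytes
        "dir": DIRECTIONS[(request >> 30) & 0x3],
    }


# RSI value from the user-space register dump in the oops.
info = decode_ioctl(0xC0186444)
print(info)
# prints: {'nr': 68, 'type': 'd', 'size': 24, 'dir': 'read|write'}
```

Type `d` is the DRM ioctl family, and driver-private DRM commands start at 0x40, so number 68 (0x44) is driver command 0x04 with a 24-byte argument. That is consistent with the `amdgpu_cs_ioctl` frame in the kernel call trace.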
I've seen multiple applications affected, but they're all built on LÖVE, which uses SDL2. I haven't yet seen crashes in test apps built using just SDL2. I hope this tracker is the right place to report this.
Once the issue has happened, it is very likely to recur soon afterwards, in multiple LÖVE apps and across restarts. After a while the freezing subsides.
Hardware description:
- CPU: "AMD® Ryzen 7 pro 4750u with radeon graphics × 16"
- GPU:
$ sudo lshw -C display -numeric
*-display
description: VGA compatible controller
product: Renoir [1002:1636]
vendor: Advanced Micro Devices, Inc. [AMD/ATI] [1002]
physical id: 0
bus info: pci@0000:06:00.0
logical name: /dev/fb0
version: d1
width: 64 bits
clock: 33MHz
capabilities: pm pciexpress msi msix vga_controller bus_master cap_list fb
configuration: depth=32 driver=amdgpu latency=0 mode=1920x1080 resolution=1920,1080 visual=truecolor xres=1920 yres=1080
resources: iomemory:40-3f iomemory:40-3f irq:57 memory:460000000-46fffffff memory:470000000-4701fffff ioport:1000(size=256) memory:fd300000-fd37ffff
- System Memory: 16GiB
- Display(s): built-in display of a ThinkPad X13 Gen 1 laptop
System information:
- Distro name and Version: Ubuntu 22.04.3 LTS
- Kernel version:
$ uname -a
Linux X13 6.2.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
How to reproduce the issue:
- Install LÖVE 11.4 (AppImage; a 5MB binary on Linux)
- Clone and run an app:
$ git clone https://github.com/akkartik/lines.love
$ love lines.love
- Leave it running for a few hours.
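Because a freeze can take hours to appear, it may help to watch the kernel log for the oops signature while the app runs. The sketch below is one way to do that. `find_oops` is a hypothetical helper, and the two patterns it matches are taken from the dmesg excerpt in this report.

```python
# Sketch: pick NULL-dereference reports out of kernel log lines.
import re

BUG_RE = re.compile(r"BUG: kernel NULL pointer dereference, address: ([0-9a-f]+)")
RIP_RE = re.compile(r"RIP: 0010:(\S+)")  # 0010 is the kernel code segment


def find_oops(lines):
    """Yield (fault address, faulting symbol) for each report in lines."""
    address = None
    for line in lines:
        match = BUG_RE.search(line)
        if match:
            address = int(match.group(1), 16)
            continue
        match = RIP_RE.search(line)
        if match and address is not None:
            yield address, match.group(1)
            address = None  # ignore the RIP line repeated after the trace


sample = [
    "[81280.418131] BUG: kernel NULL pointer dereference, address: 0000000000000078",
    "[81280.418144] #PF: supervisor read access in kernel mode",
    "[81280.418176] RIP: 0010:drm_sched_job_cleanup+0x26/0x140 [gpu_sched]",
    "[81280.501549] RIP: 0010:drm_sched_job_cleanup+0x26/0x140 [gpu_sched]",
]
hits = list(find_oops(sample))
for address, symbol in hits:
    print(f"{address:#x} {symbol}")  # prints: 0x78 drm_sched_job_cleanup+0x26/0x140
```

To run it live, the lines could come from `sudo dmesg --follow` piped into a script that calls `find_oops(sys.stdin)`.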
Attached files:
dmesg.before, dmesg.after: snapshots of dmesg before and after a freeze