Intermittent crash in ttm_bo_put / ttm_bo_delayed_delete: black screens after logging in, possible kref reference-counting error
Brief summary of the problem:
The laptop is a Lenovo Legion 7 16ACHg6 with hybrid graphics (AMD integrated GPU on amdgpu, NVIDIA discrete GPU on nouveau), connected to a Dell WD19 dock with multiple monitors. On around 15% of boots, the GDM login screen is displayed normally on the laptop panel and the external monitors, but immediately after logging in all screens go black and the kernel log shows an oops or lock warning from ttm_bo_put / ttm_bo_delayed_delete.
[ 404.943404] ------------[ cut here ]------------
[ 404.943411] DEBUG_LOCKS_WARN_ON(mutex_is_locked(lock))
[ 404.943417] WARNING: CPU: 12 PID: 3882 at kernel/locking/mutex-debug.c:114 mutex_destroy+0x57/0x60
[ 404.943428] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables br_netfilter bridge stp llc ccm overlay qrtr rfcomm cmac algif_hash algif_skcipher af_alg bnep binfmt_misc snd_hda_codec_realtek intel_rapl_msr snd_hda_codec_generic intel_rapl_common snd_hda_scodec_component iwlmvm snd_hda_codec_hdmi uvcvideo btusb kvm_amd snd_usb_audio snd_hda_intel btrtl videobuf2_vmalloc snd_intel_dspcfg mac80211 snd_usbmidi_lib btintel videobuf2_memops kvm snd_hda_codec snd_ump btbcm libarc4 uvc snd_hwdep snd_rawmidi rapl btmtk videobuf2_v4l2 ee1004 snd_hda_core snd_seq_device wmi_bmof videodev snd_pcm pcspkr bluetooth iwlwifi videobuf2_common k10temp snd_timer sp5100_tco ucsi_acpi snd typec_ucsi cfg80211 soundcore ccp mc typec joydev input_leds serio_raw mac_hid sch_fq_codel msr parport_pc ppdev lp parport nvme_fabrics efi_pstore nfnetlink dmi_sysfs ip_tables x_tables
[ 404.943537] autofs4 btrfs blake2b_generic xor raid6_pq dm_crypt hid_microsoft ff_memless usbmouse hid_cmedia r8153_ecm cdc_ether usbnet usbkbd r8152 mii usbhid amdgpu drm_panel_backlight_quirks drm_suballoc_helper cec rc_core amdxcp drm_buddy nouveau mxm_wmi drm_gpuvm i2c_algo_bit drm_ttm_helper hid_multitouch ttm hid_generic drm_exec polyval_clmulni polyval_generic nvme ahci i2c_piix4 gpu_sched i2c_hid_acpi r8169 ghash_clmulni_intel nvme_core video libahci i2c_smbus drm_display_helper i2c_hid realtek nvme_auth wmi hid uas usb_storage aesni_intel crypto_simd cryptd
[ 404.943595] CPU: 12 UID: 125 PID: 3882 Comm: gnome-shell Not tainted 6.13.0-09950-g60c828cf80c0 #185
[ 404.943599] Hardware name: LENOVO 82N6/LNVNB161216, BIOS GKCN65WW 01/16/2024
[ 404.943602] RIP: 0010:mutex_destroy+0x57/0x60
[ 404.943608] Code: 84 c0 74 e1 e8 da 54 6f 00 85 c0 74 d8 8b 05 10 75 e7 01 85 c0 75 ce 48 c7 c6 77 de 99 a5 48 c7 c7 7a 29 99 a5 e8 89 7d f5 ff <0f> 0b eb b7 0f 1f 44 00 00 55 48 89 f2 48 c7 c1 aa de 99 a5 48 89
[ 404.943611] RSP: 0018:ffffb55c8d383a78 EFLAGS: 00010246
[ 404.943615] RAX: 0000000000000000 RBX: ffff903394a9c980 RCX: 0000000000000000
[ 404.943618] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 404.943620] RBP: ffffb55c8d383a80 R08: 0000000000000000 R09: 0000000000000000
[ 404.943622] R10: 0000000000000000 R11: 0000000000000000 R12: ffff903394a9c800
[ 404.943624] R13: ffff9033a718ffd8 R14: ffff903394a9c848 R15: ffff903394a9c980
[ 404.943626] FS: 00007e6a38510e80(0000) GS:ffff903992000000(0000) knlGS:0000000000000000
[ 404.943629] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 404.943631] CR2: 00007e6a38140000 CR3: 000000010bf80000 CR4: 0000000000f50ef0
[ 404.943634] PKRU: 55555554
[ 404.943637] Call Trace:
[ 404.943639] <TASK>
[ 404.943643] ? show_regs.cold+0x19/0x24
[ 404.943647] ? mutex_destroy+0x57/0x60
[ 404.943651] ? __warn.cold+0xce/0x188
[ 404.943656] ? mutex_destroy+0x57/0x60
[ 404.943660] ? report_bug+0x110/0x160
[ 404.943665] ? handle_bug+0x6a/0xb0
[ 404.943669] ? exc_invalid_op+0x18/0x80
[ 404.943673] ? asm_exc_invalid_op+0x1b/0x20
[ 404.943681] ? mutex_destroy+0x57/0x60
[ 404.943685] ? mutex_destroy+0x57/0x60
[ 404.943689] dma_resv_fini+0x2b/0x40
[ 404.943693] drm_gem_object_release+0x31/0x60
[ 404.943698] amdgpu_bo_destroy+0x4a/0x70 [amdgpu]
[ 404.943912] amdgpu_bo_user_destroy+0x21/0x30 [amdgpu]
[ 404.944029] ttm_bo_release+0x69/0x320 [ttm]
[ 404.944033] ttm_bo_put+0x38/0x60 [ttm]
[ 404.944035] amdgpu_gem_object_free+0x1e/0x30 [amdgpu]
[ 404.944124] drm_gem_object_free+0x1a/0x30
[ 404.944125] drm_gem_dmabuf_release+0x45/0x70
[ 404.944127] dma_buf_release+0x3b/0xa0
[ 404.944129] __dentry_kill+0x95/0x1b0
[ 404.944131] ? dput.part.0+0x203/0x460
[ 404.944132] dput.part.0+0x23c/0x460
[ 404.944134] dput+0x13/0x20
[ 404.944135] __fput+0x157/0x2e0
[ 404.944137] ____fput+0x15/0x20
[ 404.944139] task_work_run+0x5d/0xb0
[ 404.944141] syscall_exit_to_user_mode+0x20a/0x210
[ 404.944143] do_syscall_64+0x93/0x140
[ 404.944145] ? ktime_get_mono_fast_ns+0x39/0xd0
[ 404.944147] ? __pm_runtime_suspend+0xf6/0x140
[ 404.944149] ? amdgpu_drm_ioctl+0x70/0x90 [amdgpu]
[ 404.944235] ? trace_hardirqs_off+0x59/0xe0
[ 404.944238] ? syscall_exit_to_user_mode+0xcc/0x210
[ 404.944240] ? do_syscall_64+0x93/0x140
[ 404.944242] ? ktime_get_mono_fast_ns+0x39/0xd0
[ 404.944243] ? __pm_runtime_suspend+0xf6/0x140
[ 404.944245] ? amdgpu_drm_ioctl+0x70/0x90 [amdgpu]
[ 404.944331] ? trace_hardirqs_off+0x59/0xe0
[ 404.944333] ? syscall_exit_to_user_mode+0xcc/0x210
[ 404.944335] ? do_syscall_64+0x93/0x140
[ 404.944336] ? trace_hardirqs_off+0x59/0xe0
[ 404.944338] ? syscall_exit_to_user_mode+0xcc/0x210
[ 404.944340] ? do_syscall_64+0x93/0x140
[ 404.944342] ? do_syscall_64+0x93/0x140
[ 404.944343] ? do_syscall_64+0x93/0x140
[ 404.944345] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 404.944347] RIP: 0033:0x7e6a4731637b
[ 404.944349] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 404.944350] RSP: 002b:00007ffdad9a5c40 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 404.944352] RAX: 0000000000000000 RBX: 000056850b5e61c0 RCX: 00007e6a4731637b
[ 404.944353] RDX: 00007ffdad9a5cd0 RSI: 0000000040086409 RDI: 0000000000000012
[ 404.944354] RBP: 00007ffdad9a5cd0 R08: 00000005685090dd R09: 0000000000000000
[ 404.944355] R10: 0000000000000007 R11: 0000000000000246 R12: 0000000040086409
[ 404.944356] R13: 0000000000000012 R14: 00005685091929cc R15: 0000568509676080
[ 404.944358] </TASK>
[ 404.944359] irq event stamp: 21600713
[ 404.944360] hardirqs last enabled at (21600713): [<ffffffffa52d3b3d>] _raw_spin_unlock_irqrestore+0x4d/0x70
[ 404.944362] hardirqs last disabled at (21600712): [<ffffffffa4544408>] kvfree_call_rcu+0x1a8/0x3d0
[ 404.944365] softirqs last enabled at (21600000): [<ffffffffa4211958>] __irq_exit_rcu+0xb8/0xf0
[ 404.944366] softirqs last disabled at (21599681): [<ffffffffa4211958>] __irq_exit_rcu+0xb8/0xf0
[ 404.944368] ---[ end trace 0000000000000000 ]---
Hardware description:
- CPU: Ryzen 7 5800H
- GPU:
  - 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA104M [GeForce RTX 3070 Mobile / Max-Q] [10de:24dd] (rev a1)
  - 05:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] [1002:1638] (rev c5)
- System Memory: 32GB
- Display(s): laptop display and docked monitors
- Type of Display Connection: eDP and USB-C DP
System information:
- Distro name and Version: Debian Trixie
- Kernel version: 6.12.12 and 6.14.0-rc4
How to reproduce the issue:
The issue occurs intermittently, on around 15% of boots. The GDM login screen is displayed normally on the laptop panel and the external monitors; immediately after logging in, all screens go black and the kernel log contains an oops from ttm_bo_delayed_delete.
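
For comparing traces across failing boots, it may help to reduce each call trace to the frames the unwinder considers reliable. In kernel stack dumps, entries prefixed with `?` are addresses found on the stack that could not be verified as part of the call chain. The following is a minimal Python sketch, not an existing tool: the function name `reliable_frames` and the regular expression are my own assumptions, written against the log format shown in the trace above.

```python
import re

# Matches a stack frame line such as
#   "[ 404.943689] dma_resv_fini+0x2b/0x40"
#   "[ 404.943647] ? mutex_destroy+0x57/0x60"
#   "[ 404.944029] ttm_bo_release+0x69/0x320 [ttm]"
# The pattern is an assumption based on the log format in this report.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<guess>\?\s+)?"
    r"(?P<sym>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?\s*$"
)


def reliable_frames(log_text):
    """Return (symbol, module) pairs for frames between <TASK> and </TASK>
    that are not marked with '?'. module is None for built-in symbols."""
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if "</TASK>" in line:
            in_trace = False
            continue
        if "<TASK>" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME_RE.match(line.strip())
        if match and not match.group("guess"):
            frames.append((match.group("sym"), match.group("mod")))
    return frames


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python3 frames.py saved-dmesg.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for symbol, module in reliable_frames(handle.read()):
            print(f"{symbol} [{module}]" if module else symbol)
```

Run against a saved copy of the log above, this would list the chain starting at `dma_resv_fini` and ending at `entry_SYSCALL_64_after_hwframe`, which makes it easier to see whether different failing boots warn from the same path.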