hung tasks in linux-next-20231115+
Since switching to linux-next-20231115 and later, my MSI Alpha 15 laptop has been suffering from frequent hung tasks. After 30 to 150 minutes the GUI locks up, but ssh'ing into the machine is still possible. This has occurred 5 times so far, and in 3 of those cases I collected a backtrace from dmesg (the other two times I was too impatient; by default the backtrace is only printed after 120s of hanging). The events that triggered it were: 2x starting LibreOffice Writer while the system was otherwise mostly idle, 1x starting the Steam client, 1x a steam process reported as the hung task, and 1x Xorg hanging while I was compiling a kernel.
Hardware:
$ lspci
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 51)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev c3)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] (rev c3)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller
04:00.0 Network controller: MEDIATEK Corp. MT7921K (RZ608) Wi-Fi 6E 80MHz
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
06:00.0 Non-Volatile memory controller: Micron/Crucial Technology P1 NVMe PCIe SSD[Frampton] (rev 03)
07:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. OM8PCP Design-In PCIe 3 NVMe SSD (DRAM-less) (rev 01)
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c5)
08:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller
08:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
08:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
08:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
08:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] ACP/ACP3X/ACP6x Audio Coprocessor (rev 01)
08:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h/19h HD Audio Controller
08:00.7 Signal processing controller: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub
Logs:
hungtask_libreoffice.log hungtask_Xorg.log hungtask_steam.log
I'm reporting this here because all the hung task backtraces contain functions related to amdgpu or to the DRM and dma-fence code it uses:
libreoffice:
2023-11-17T13:11:58.451267+01:00 lisa kernel: [ 8971.149025][ T113] INFO: task kworker/u32:0:8969 blocked for more than 122 seconds.
2023-11-17T13:11:58.451291+01:00 lisa kernel: [ 8971.149046][ T113] Not tainted 6.7.0-rc1-next-20231116 #930
2023-11-17T13:11:58.451296+01:00 lisa kernel: [ 8971.149054][ T113] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2023-11-17T13:11:58.451299+01:00 lisa kernel: [ 8971.149059][ T113] task:kworker/u32:0 state:D stack:0 pid:8969 tgid:8969 ppid:2 flags:0x00004000
2023-11-17T13:11:58.451304+01:00 lisa kernel: [ 8971.149073][ T113] Workqueue: events_unbound commit_work
2023-11-17T13:11:58.451307+01:00 lisa kernel: [ 8971.149090][ T113] Call Trace:
2023-11-17T13:11:58.451310+01:00 lisa kernel: [ 8971.149094][ T113] <TASK>
2023-11-17T13:11:58.451314+01:00 lisa kernel: [ 8971.149103][ T113] __schedule+0x2ae/0x800
2023-11-17T13:11:58.451316+01:00 lisa kernel: [ 8971.149120][ T113] schedule+0x22/0xa0
2023-11-17T13:11:58.451320+01:00 lisa kernel: [ 8971.149126][ T113] schedule_timeout+0xe2/0xf0
2023-11-17T13:11:58.451323+01:00 lisa kernel: [ 8971.149136][ T113] ? srso_alias_return_thunk+0x5/0xfbef5
2023-11-17T13:11:58.451326+01:00 lisa kernel: [ 8971.149144][ T113] ? dma_fence_chain_enable_signaling+0xf9/0x240
2023-11-17T13:11:58.451329+01:00 lisa kernel: [ 8971.149154][ T113] dma_fence_default_wait+0x1aa/0x1f0
2023-11-17T13:11:58.451332+01:00 lisa kernel: [ 8971.149162][ T113] ? dma_fence_signal+0x50/0x50
2023-11-17T13:11:58.451335+01:00 lisa kernel: [ 8971.149171][ T113] drm_atomic_helper_wait_for_fences+0x14a/0x1d0
2023-11-17T13:11:58.451338+01:00 lisa kernel: [ 8971.149187][ T113] commit_tail+0x2a/0x120
2023-11-17T13:11:58.451342+01:00 lisa kernel: [ 8971.149196][ T113] process_one_work+0x15e/0x280
2023-11-17T13:11:58.451345+01:00 lisa kernel: [ 8971.149209][ T113] worker_thread+0x2ec/0x400
2023-11-17T13:11:58.451348+01:00 lisa kernel: [ 8971.149219][ T113] ? flush_delayed_work+0x40/0x40
2023-11-17T13:11:58.451351+01:00 lisa kernel: [ 8971.149226][ T113] kthread+0xcd/0x100
2023-11-17T13:11:58.451354+01:00 lisa kernel: [ 8971.149234][ T113] ? kthread_complete_and_exit+0x20/0x20
2023-11-17T13:11:58.451357+01:00 lisa kernel: [ 8971.149242][ T113] ret_from_fork+0x2f/0x50
2023-11-17T13:11:58.451359+01:00 lisa kernel: [ 8971.149251][ T113] ? kthread_complete_and_exit+0x20/0x20
2023-11-17T13:11:58.451362+01:00 lisa kernel: [ 8971.149257][ T113] ret_from_fork_asm+0x11/0x20
2023-11-17T13:11:58.451365+01:00 lisa kernel: [ 8971.149274][ T113] </TASK>
steam:
2023-11-16T21:47:41.069273+01:00 lisa kernel: [ 9708.417451][ T113] INFO: task steamweb:shlo0:16104 blocked for more than 122 seconds.
2023-11-16T21:47:41.069296+01:00 lisa kernel: [ 9708.417480][ T113] Not tainted 6.7.0-rc1-next-20231115 #929
2023-11-16T21:47:41.069303+01:00 lisa kernel: [ 9708.417490][ T113] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2023-11-16T21:47:41.069306+01:00 lisa kernel: [ 9708.417496][ T113] task:steamweb:shlo0 state:D stack:0 pid:16104 tgid:16096 ppid:16081 flags:0x00004002
2023-11-16T21:47:41.069309+01:00 lisa kernel: [ 9708.417515][ T113] Call Trace:
2023-11-16T21:47:41.069312+01:00 lisa kernel: [ 9708.417533][ T113] <TASK>
2023-11-16T21:47:41.069316+01:00 lisa kernel: [ 9708.417539][ T113] ? __schedule+0x299/0x7e0
2023-11-16T21:47:41.069319+01:00 lisa kernel: [ 9708.417562][ T113] ? schedule+0x22/0xa0
2023-11-16T21:47:41.069321+01:00 lisa kernel: [ 9708.417568][ T113] ? schedule_timeout+0xe2/0xf0
2023-11-16T21:47:41.069348+01:00 lisa kernel: [ 9708.417580][ T113] ? srso_alias_return_thunk+0x5/0xfbef5
2023-11-16T21:47:41.069351+01:00 lisa kernel: [ 9708.417588][ T113] ? __xa_erase+0x57/0xa0
2023-11-16T21:47:41.069355+01:00 lisa kernel: [ 9708.417598][ T113] ? dma_fence_default_wait+0x1aa/0x1f0
2023-11-16T21:47:41.069358+01:00 lisa kernel: [ 9708.417607][ T113] ? dma_fence_signal+0x50/0x50
2023-11-16T21:47:41.069361+01:00 lisa kernel: [ 9708.417619][ T113] ? amdgpu_vm_fini+0xee/0x530 [amdgpu]
2023-11-16T21:47:41.069364+01:00 lisa kernel: [ 9708.417932][ T113] ? srso_alias_return_thunk+0x5/0xfbef5
2023-11-16T21:47:41.069367+01:00 lisa kernel: [ 9708.417939][ T113] ? idr_destroy+0x77/0xb0
2023-11-16T21:47:41.069370+01:00 lisa kernel: [ 9708.417952][ T113] ? amdgpu_driver_postclose_kms+0x17b/0x270 [amdgpu]
2023-11-16T21:47:41.069373+01:00 lisa kernel: [ 9708.418279][ T113] ? drm_file_free+0x1f1/0x240
2023-11-16T21:47:41.069376+01:00 lisa kernel: [ 9708.418292][ T113] ? drm_release+0xbb/0x140
2023-11-16T21:47:41.069379+01:00 lisa kernel: [ 9708.418300][ T113] ? __fput+0x8d/0x2b0
2023-11-16T21:47:41.069382+01:00 lisa kernel: [ 9708.418312][ T113] ? task_work_run+0x57/0x80
2023-11-16T21:47:41.069385+01:00 lisa kernel: [ 9708.418324][ T113] ? do_exit+0x2dd/0x9a0
2023-11-16T21:47:41.069388+01:00 lisa kernel: [ 9708.418336][ T113] ? do_group_exit+0x2b/0x80
2023-11-16T21:47:41.069391+01:00 lisa kernel: [ 9708.418344][ T113] ? get_signal+0x7be/0x8a0
2023-11-16T21:47:41.069394+01:00 lisa kernel: [ 9708.418352][ T113] ? futex_wait+0x67/0x110
2023-11-16T21:47:41.069396+01:00 lisa kernel: [ 9708.418363][ T113] ? arch_do_signal_or_restart+0x29/0x230
2023-11-16T21:47:41.069448+01:00 lisa kernel: [ 9708.418377][ T113] ? exit_to_user_mode_prepare+0x11f/0x170
2023-11-16T21:47:41.069453+01:00 lisa kernel: [ 9708.418386][ T113] ? syscall_exit_to_user_mode+0x16/0x40
2023-11-16T21:47:41.069457+01:00 lisa kernel: [ 9708.418394][ T113] ? do_syscall_64+0x52/0xf0
2023-11-16T21:47:41.069460+01:00 lisa kernel: [ 9708.418403][ T113] ? entry_SYSCALL_64_after_hwframe+0x4b/0x53
2023-11-16T21:47:41.069466+01:00 lisa kernel: [ 9708.418424][ T113] </TASK>
Xorg:
2023-11-17T14:07:40.669600+01:00 lisa kernel: [ 3072.988364][ T113] INFO: task Xorg:1523 blocked for more than 122 seconds.
2023-11-17T14:07:40.669623+01:00 lisa kernel: [ 3072.988388][ T113] Not tainted 6.7.0-rc1-next-20231116 #930
2023-11-17T14:07:40.669625+01:00 lisa kernel: [ 3072.988397][ T113] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2023-11-17T14:07:40.669630+01:00 lisa kernel: [ 3072.988403][ T113] task:Xorg state:D stack:0 pid:1523 tgid:1523 ppid:1519 flags:0x00000006
2023-11-17T14:07:40.669633+01:00 lisa kernel: [ 3072.988416][ T113] Call Trace:
2023-11-17T14:07:40.669637+01:00 lisa kernel: [ 3072.988421][ T113] <TASK>
2023-11-17T14:07:40.669641+01:00 lisa kernel: [ 3072.988430][ T113] __schedule+0x2ae/0x800
2023-11-17T14:07:40.669644+01:00 lisa kernel: [ 3072.988444][ T113] ? __schedule+0x2b6/0x800
2023-11-17T14:07:40.669646+01:00 lisa kernel: [ 3072.988454][ T113] schedule+0x22/0xa0
2023-11-17T14:07:40.669649+01:00 lisa kernel: [ 3072.988460][ T113] schedule_preempt_disabled+0x10/0x20
2023-11-17T14:07:40.669653+01:00 lisa kernel: [ 3072.988466][ T113] __mutex_lock.constprop.0+0x2e1/0x420
2023-11-17T14:07:40.669656+01:00 lisa kernel: [ 3072.988478][ T113] ? amdgpu_ctx_get_entity+0x6a/0x350 [amdgpu]
2023-11-17T14:07:40.669658+01:00 lisa kernel: [ 3072.988789][ T113] ? dma_fence_default_wait+0x140/0x1f0
2023-11-17T14:07:40.669661+01:00 lisa kernel: [ 3072.988801][ T113] ? amdgpu_cs_report_moved_bytes+0x70/0x70 [amdgpu]
2023-11-17T14:07:40.670246+01:00 lisa kernel: [ 3072.989138][ T113] amdgpu_ctx_get+0x21/0x90 [amdgpu]
2023-11-17T14:07:40.670259+01:00 lisa kernel: [ 3072.989459][ T113] amdgpu_cs_wait_ioctl+0x41/0x170 [amdgpu]
2023-11-17T14:07:40.671274+01:00 lisa kernel: [ 3072.989828][ T113] ? amdgpu_cs_report_moved_bytes+0x70/0x70 [amdgpu]
2023-11-17T14:07:40.671290+01:00 lisa kernel: [ 3072.990183][ T113] drm_ioctl_kernel+0xc9/0x170
2023-11-17T14:07:40.671294+01:00 lisa kernel: [ 3072.990198][ T113] drm_ioctl+0x258/0x4c0
2023-11-17T14:07:40.671298+01:00 lisa kernel: [ 3072.990209][ T113] ? amdgpu_cs_report_moved_bytes+0x70/0x70 [amdgpu]
2023-11-17T14:07:40.671302+01:00 lisa kernel: [ 3072.990508][ T113] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
2023-11-17T14:07:40.671306+01:00 lisa kernel: [ 3072.990799][ T113] __x64_sys_ioctl+0x94/0xb0
2023-11-17T14:07:40.671311+01:00 lisa kernel: [ 3072.990844][ T113] do_syscall_64+0x46/0xf0
2023-11-17T14:07:40.671319+01:00 lisa kernel: [ 3072.990856][ T113] entry_SYSCALL_64_after_hwframe+0x4b/0x53
2023-11-17T14:07:40.671325+01:00 lisa kernel: [ 3072.990866][ T113] RIP: 0033:0x7f626c31b51b
2023-11-17T14:07:40.671328+01:00 lisa kernel: [ 3072.990873][ T113] RSP: 002b:00007ffcc9d9f030 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
2023-11-17T14:07:40.671334+01:00 lisa kernel: [ 3072.990882][ T113] RAX: ffffffffffffffda RBX: 00007ffcc9d9f114 RCX: 00007f626c31b51b
2023-11-17T14:07:40.671337+01:00 lisa kernel: [ 3072.990886][ T113] RDX: 00007ffcc9d9f0c0 RSI: 00000000c0206449 RDI: 0000000000000010
2023-11-17T14:07:40.671340+01:00 lisa kernel: [ 3072.990891][ T113] RBP: 00007ffcc9d9f0c0 R08: 000056155841d9dc R09: 00007f624803cc20
2023-11-17T14:07:40.671347+01:00 lisa kernel: [ 3072.990895][ T113] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000c0206449
2023-11-17T14:07:40.671381+01:00 lisa kernel: [ 3072.990899][ T113] R13: 0000000000000010 R14: 0000000000000001 R15: 00005615594e1a90
2023-11-17T14:07:40.671389+01:00 lisa kernel: [ 3072.990912][ T113] </TASK>
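To compare the three traces side by side, a small script along these lines could pull out the blocked task and the GPU-related frames from each log. This is only a rough sketch: the regular expressions are my own guess at the syslog format shown above, `summarize` is a hypothetical helper name, and treating the `drm_` and `dma_fence_` prefixes as "GPU-related" is an assumption, not something the kernel reports.

```python
import re

# "INFO: task <comm>:<pid> blocked for more than <n> seconds."
# comm itself may contain ':' (e.g. "kworker/u32:0"), so match greedily.
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

# A stack frame such as "? amdgpu_vm_fini+0xee/0x530 [amdgpu]".
# The leading "? " marks a frame the unwinder is not sure about.
FRAME_RE = re.compile(
    r"\]\s+(?:\? )?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<mod>\w+)\])?\s*$"
)

# Assumption: symbols with these prefixes count as GPU-related.
GPU_PREFIXES = ("drm_", "dma_fence_")


def summarize(lines):
    """Return one dict per hung-task report with its GPU-related frames."""
    reports = []
    for line in lines:
        hung = HUNG_RE.search(line)
        if hung:
            reports.append(
                {"comm": hung["comm"], "pid": int(hung["pid"]), "gpu_frames": []}
            )
            continue
        frame = FRAME_RE.search(line)
        if frame and reports:
            sym = frame["sym"]
            if frame["mod"] == "amdgpu" or sym.startswith(GPU_PREFIXES):
                reports[-1]["gpu_frames"].append(sym)
    return reports


if __name__ == "__main__":
    import sys

    # Usage: python3 summarize_hung.py hungtask_steam.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for report in summarize(fh):
            print(report["comm"], report["pid"], ", ".join(report["gpu_frames"]))
```

Run on each of the attached logs, this would list per report which task hung and which amdgpu, DRM, or dma-fence symbols appear in its backtrace, including the uncertain frames marked with `?`.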
I have not yet confirmed whether linux-6.7-rc1 is affected as well. I'm trying to bisect this, but because the hang takes so long to trigger, the bisection may take some time. I'm also enabling CONFIG_LOCKDEP in the kernel .config; will this help?