GPU temporarily hung (for several minutes) while playing Ratchet & Clank: Rift Apart
Brief summary of the problem:
While I was playing Ratchet & Clank: Rift Apart, the picture froze and gnome-shell stopped responding to the <Super>
key.
I connected over SSH from another computer and dumped the tasks that were in uninterruptible (blocked) state (<SysRq + w>).
Backtrace:
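For anyone trying to reproduce this: over SSH there is no keyboard to press <SysRq + w> on, so the same dump can be requested by writing the command character to `/proc/sysrq-trigger`. The sketch below is only an illustration of that step. The helper name `trigger_sysrq` and its `trigger` parameter are hypothetical (the parameter exists so the function can be tried against an ordinary file), and writing to the real trigger file needs root.

```python
from pathlib import Path

# Standard procfs interface for SysRq commands; writing to it requires root.
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")


def trigger_sysrq(command: str, trigger: Path = SYSRQ_TRIGGER) -> None:
    """Write a single SysRq command character to the trigger file.

    `trigger` is a hypothetical parameter added for illustration so the
    helper can be pointed at a regular file instead of the real interface.
    """
    if len(command) != 1:
        raise ValueError("a SysRq command is a single character")
    trigger.write_text(command)


# Usage (as root): trigger_sysrq("w")
# 'w' asks the kernel to dump tasks in uninterruptible (blocked) state.
# The result lands in the kernel log, readable with `dmesg` or `journalctl -k`.
```
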
[13346.575340] =====================================================
[13346.575348] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
[13346.575355] 6.5.0-0.rc5.36.fc39.x86_64+debug #1 Tainted: G W L ------- ---
[13346.575362] -----------------------------------------------------
[13346.575368] kworker/14:0/32457 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[13346.575376] ffff8882ea58f8c0 (&xa->xa_lock#12){+.+.}-{2:2}, at: xa_erase+0x11/0x40
[13346.575391]
and this task is already holding:
[13346.575398] ffff8884d0e65230 (&fence->lock){-.-.}-{2:2}, at: dma_fence_signal+0x42/0xc0
[13346.575409] which would create a new lock dependency:
[13346.575416] (&fence->lock){-.-.}-{2:2} -> (&xa->xa_lock#12){+.+.}-{2:2}
[13346.575426]
but this new dependency connects a HARDIRQ-irq-safe lock:
[13346.575432] (&fence->lock){-.-.}-{2:2}
[13346.575433]
... which became HARDIRQ-irq-safe at:
[13346.575445] lock_acquire+0x1a6/0x4f0
[13346.575454] _raw_spin_lock_irqsave+0x51/0xa0
[13346.575462] dma_fence_signal+0x42/0xc0
[13346.575468] drm_sched_job_done.isra.0+0x10c/0x260 [gpu_sched]
[13346.575479] dma_fence_signal_timestamp_locked+0x241/0x430
[13346.575486] dma_fence_signal+0x55/0xc0
[13346.575492] amdgpu_fence_process+0x2d8/0x450 [amdgpu]
[13346.575705] sdma_v6_0_process_trap_irq+0x193/0x1e0 [amdgpu]
[13346.575911] amdgpu_irq_dispatch+0x29b/0x580 [amdgpu]
[13346.576114] amdgpu_ih_process+0x1b8/0x390 [amdgpu]
[13346.576311] amdgpu_irq_handler+0x27/0xb0 [amdgpu]
[13346.576507] __handle_irq_event_percpu+0x1c0/0x520
[13346.576514] handle_irq_event+0xa9/0x1c0
[13346.576520] handle_edge_irq+0x214/0xbc0
[13346.576527] __common_interrupt+0x9b/0x1f0
[13346.576535] common_interrupt+0xa9/0xd0
[13346.576543] asm_common_interrupt+0x26/0x40
[13346.576550] cpuidle_enter_state+0x2a3/0x340
[13346.576557] cpuidle_enter+0x4e/0xa0
[13346.576565] do_idle+0x360/0x450
[13346.576573] cpu_startup_entry+0x1d/0x20
[13346.576579] start_secondary+0x215/0x290
[13346.576587] secondary_startup_64_no_verify+0x17e/0x18b
[13346.576595]
to a HARDIRQ-irq-unsafe lock:
[13346.576601] (&xa->xa_lock#12){+.+.}-{2:2}
[13346.576603]
... which became HARDIRQ-irq-unsafe at:
[13346.576614] ...
[13346.576615] lock_acquire+0x1a6/0x4f0
[13346.576630] _raw_spin_lock+0x37/0x80
[13346.576636] drm_sched_job_add_dependency+0x14f/0x3d0 [gpu_sched]
[13346.576645] drm_sched_job_add_resv_dependencies+0xf9/0x210 [gpu_sched]
[13346.576654] amdgpu_fill_buffer+0x6a5/0xf70 [amdgpu]
[13346.576845] amdgpu_bo_release_notify+0x32e/0x4e0 [amdgpu]
[13346.577032] ttm_bo_release+0x265/0x9e0 [ttm]
[13346.577043] amdgpu_bo_unref+0x35/0x70 [amdgpu]
[13346.577236] amdgpu_vm_fini+0x822/0x1040 [amdgpu]
[13346.577432] amdgpu_mes_self_test+0x636/0x9f0 [amdgpu]
[13346.577641] mes_v11_0_late_init+0xb8/0xe0 [amdgpu]
[13346.577844] amdgpu_device_ip_late_init+0x100/0x7b0 [amdgpu]
[13346.578029] amdgpu_device_init+0x785c/0x8850 [amdgpu]
[13346.578217] amdgpu_driver_load_kms+0x1d/0x4b0 [amdgpu]
[13346.578401] amdgpu_pci_probe+0x287/0x9e0 [amdgpu]
[13346.578589] local_pci_probe+0xda/0x190
[13346.578596] pci_device_probe+0x23a/0x770
[13346.578603] really_probe+0x3df/0xb80
[13346.578610] __driver_probe_device+0x18c/0x450
[13346.578618] driver_probe_device+0x4a/0x120
[13346.578626] __driver_attach+0x1e5/0x4a0
[13346.578632] bus_for_each_dev+0x106/0x190
[13346.578639] bus_add_driver+0x2a1/0x570
[13346.578645] driver_register+0x134/0x460
[13346.578652] do_one_initcall+0xd2/0x430
[13346.578659] do_init_module+0x238/0x770
[13346.578666] load_module+0x5581/0x6f10
[13346.578673] __do_sys_init_module+0x1f2/0x220
[13346.578680] do_syscall_64+0x5d/0x90
[13346.578686] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[13346.578693]
other info that might help us debug this:
[13346.578699] Possible interrupt unsafe locking scenario:
[13346.578706] CPU0 CPU1
[13346.578712] ---- ----
[13346.578718] lock(&xa->xa_lock#12);
[13346.578725] local_irq_disable();
[13346.578731] lock(&fence->lock);
[13346.578740] lock(&xa->xa_lock#12);
[13346.578748] <Interrupt>
[13346.578754] lock(&fence->lock);
[13346.578761]
*** DEADLOCK ***
[13346.578767] 4 locks held by kworker/14:0/32457:
[13346.578773] #0: ffff888100063d48 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x81f/0x1540
[13346.578785] #1: ffffc9001e2ffd98 ((work_completion)(&job->work)){+.+.}-{0:0}, at: process_one_work+0x84d/0x1540
[13346.578793] #2: ffffffff96359f80 (dma_fence_map){++++}-{0:0}, at: dma_fence_signal+0x20/0xc0
[13346.578803] #3: ffff8884d0e65230 (&fence->lock){-.-.}-{2:2}, at: dma_fence_signal+0x42/0xc0
[13346.578812]
the dependencies between HARDIRQ-irq-safe lock and the holding lock:
[13346.578819] -> (&fence->lock){-.-.}-{2:2} {
[13346.578827] IN-HARDIRQ-W at:
[13346.578834] lock_acquire+0x1a6/0x4f0
[13346.578842] _raw_spin_lock_irqsave+0x51/0xa0
[13346.578849] dma_fence_signal+0x42/0xc0
[13346.578856] drm_sched_job_done.isra.0+0x10c/0x260 [gpu_sched]
[13346.578865] dma_fence_signal_timestamp_locked+0x241/0x430
[13346.578872] dma_fence_signal+0x55/0xc0
[13346.578878] amdgpu_fence_process+0x2d8/0x450 [amdgpu]
[13346.579070] sdma_v6_0_process_trap_irq+0x193/0x1e0 [amdgpu]
[13346.579271] amdgpu_irq_dispatch+0x29b/0x580 [amdgpu]
[13346.579472] amdgpu_ih_process+0x1b8/0x390 [amdgpu]
[13346.579674] amdgpu_irq_handler+0x27/0xb0 [amdgpu]
[13346.579871] __handle_irq_event_percpu+0x1c0/0x520
[13346.579878] handle_irq_event+0xa9/0x1c0
[13346.579884] handle_edge_irq+0x214/0xbc0
[13346.579892] __common_interrupt+0x9b/0x1f0
[13346.579899] common_interrupt+0xa9/0xd0
[13346.579905] asm_common_interrupt+0x26/0x40
[13346.579912] cpuidle_enter_state+0x2a3/0x340
[13346.579920] cpuidle_enter+0x4e/0xa0
[13346.579930] do_idle+0x360/0x450
[13346.579937] cpu_startup_entry+0x1d/0x20
[13346.579944] start_secondary+0x215/0x290
[13346.579950] secondary_startup_64_no_verify+0x17e/0x18b
[13346.579958] IN-SOFTIRQ-W at:
[13346.579964] lock_acquire+0x1a6/0x4f0
[13346.579971] _raw_spin_lock_irqsave+0x51/0xa0
[13346.579978] dma_fence_signal+0x42/0xc0
[13346.579984] drm_sched_job_done.isra.0+0x10c/0x260 [gpu_sched]
[13346.579994] dma_fence_signal_timestamp_locked+0x241/0x430
[13346.580001] dma_fence_signal+0x55/0xc0
[13346.580008] amdgpu_fence_process+0x2d8/0x450 [amdgpu]
[13346.580196] sdma_v6_0_process_trap_irq+0x193/0x1e0 [amdgpu]
[13346.580396] amdgpu_irq_dispatch+0x29b/0x580 [amdgpu]
[13346.580597] amdgpu_ih_process+0x1b8/0x390 [amdgpu]
[13346.580797] amdgpu_irq_handler+0x27/0xb0 [amdgpu]
[13346.580994] __handle_irq_event_percpu+0x1c0/0x520
[13346.581000] handle_irq_event+0xa9/0x1c0
[13346.581007] handle_edge_irq+0x214/0xbc0
[13346.581014] __common_interrupt+0x9b/0x1f0
[13346.581021] common_interrupt+0x51/0xd0
[13346.581028] asm_common_interrupt+0x26/0x40
[13346.581034] __do_softirq+0x1df/0x8bb
[13346.581041] __irq_exit_rcu+0xc8/0x1d0
[13346.581049] irq_exit_rcu+0xe/0x30
[13346.581056] sysvec_apic_timer_interrupt+0x93/0xc0
[13346.581062] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[13346.581069] cpuidle_enter_state+0x2a3/0x340
[13346.581076] cpuidle_enter+0x4e/0xa0
[13346.581084] do_idle+0x360/0x450
[13346.581090] cpu_startup_entry+0x1d/0x20
[13346.581097] start_secondary+0x215/0x290
[13346.581104] secondary_startup_64_no_verify+0x17e/0x18b
[13346.581110] INITIAL USE at:
[13346.581117] lock_acquire+0x1a6/0x4f0
[13346.581124] _raw_spin_lock_irqsave+0x51/0xa0
[13346.581130] dma_fence_signal+0x42/0xc0
[13346.581138] drm_sched_main+0x3b3/0x920 [gpu_sched]
[13346.581146] kthread+0x2eb/0x3c0
[13346.581154] ret_from_fork+0x31/0x70
[13346.581161] ret_from_fork_asm+0x1b/0x30
[13346.581168] }
[13346.581174] ... key at: [<ffffffffc0caafc0>] __key.24+0x0/0xffffffffffd32040 [gpu_sched]
[13346.581183]
the dependencies between the lock to be acquired
[13346.581184] and HARDIRQ-irq-unsafe lock:
[13346.581197] -> (&xa->xa_lock#12){+.+.}-{2:2} {
[13346.581206] HARDIRQ-ON-W at:
[13346.581213] lock_acquire+0x1a6/0x4f0
[13346.581221] _raw_spin_lock+0x37/0x80
[13346.581227] drm_sched_job_add_dependency+0x14f/0x3d0 [gpu_sched]
[13346.581236] drm_sched_job_add_resv_dependencies+0xf9/0x210 [gpu_sched]
[13346.581245] amdgpu_fill_buffer+0x6a5/0xf70 [amdgpu]
[13346.581442] amdgpu_bo_release_notify+0x32e/0x4e0 [amdgpu]
[13346.581640] ttm_bo_release+0x265/0x9e0 [ttm]
[13346.581655] amdgpu_bo_unref+0x35/0x70 [amdgpu]
[13346.582005] amdgpu_vm_fini+0x822/0x1040 [amdgpu]
[13346.582358] amdgpu_mes_self_test+0x636/0x9f0 [amdgpu]
[13346.582711] mes_v11_0_late_init+0xb8/0xe0 [amdgpu]
[13346.583060] amdgpu_device_ip_late_init+0x100/0x7b0 [amdgpu]
[13346.583392] amdgpu_device_init+0x785c/0x8850 [amdgpu]
[13346.583651] amdgpu_driver_load_kms+0x1d/0x4b0 [amdgpu]
[13346.583841] amdgpu_pci_probe+0x287/0x9e0 [amdgpu]
[13346.584028] local_pci_probe+0xda/0x190
[13346.584035] pci_device_probe+0x23a/0x770
[13346.584043] really_probe+0x3df/0xb80
[13346.584049] __driver_probe_device+0x18c/0x450
[13346.584056] driver_probe_device+0x4a/0x120
[13346.584063] __driver_attach+0x1e5/0x4a0
[13346.584070] bus_for_each_dev+0x106/0x190
[13346.584077] bus_add_driver+0x2a1/0x570
[13346.584083] driver_register+0x134/0x460
[13346.584091] do_one_initcall+0xd2/0x430
[13346.584098] do_init_module+0x238/0x770
[13346.584105] load_module+0x5581/0x6f10
[13346.584112] __do_sys_init_module+0x1f2/0x220
[13346.584119] do_syscall_64+0x5d/0x90
[13346.584126] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[13346.584133] SOFTIRQ-ON-W at:
[13346.584139] lock_acquire+0x1a6/0x4f0
[13346.584146] _raw_spin_lock+0x37/0x80
[13346.584152] drm_sched_job_add_dependency+0x14f/0x3d0 [gpu_sched]
[13346.584162] drm_sched_job_add_resv_dependencies+0xf9/0x210 [gpu_sched]
[13346.584171] amdgpu_fill_buffer+0x6a5/0xf70 [amdgpu]
[13346.584364] amdgpu_bo_release_notify+0x32e/0x4e0 [amdgpu]
[13346.584557] ttm_bo_release+0x265/0x9e0 [ttm]
[13346.584568] amdgpu_bo_unref+0x35/0x70 [amdgpu]
[13346.584759] amdgpu_vm_fini+0x822/0x1040 [amdgpu]
[13346.584954] amdgpu_mes_self_test+0x636/0x9f0 [amdgpu]
[13346.585156] mes_v11_0_late_init+0xb8/0xe0 [amdgpu]
[13346.585357] amdgpu_device_ip_late_init+0x100/0x7b0 [amdgpu]
[13346.585542] amdgpu_device_init+0x785c/0x8850 [amdgpu]
[13346.585736] amdgpu_driver_load_kms+0x1d/0x4b0 [amdgpu]
[13346.585922] amdgpu_pci_probe+0x287/0x9e0 [amdgpu]
[13346.586111] local_pci_probe+0xda/0x190
[13346.586117] pci_device_probe+0x23a/0x770
[13346.586124] really_probe+0x3df/0xb80
[13346.586131] __driver_probe_device+0x18c/0x450
[13346.586137] driver_probe_device+0x4a/0x120
[13346.586144] __driver_attach+0x1e5/0x4a0
[13346.586150] bus_for_each_dev+0x106/0x190
[13346.586157] bus_add_driver+0x2a1/0x570
[13346.586166] driver_register+0x134/0x460
[13346.586172] do_one_initcall+0xd2/0x430
[13346.586179] do_init_module+0x238/0x770
[13346.586186] load_module+0x5581/0x6f10
[13346.586193] __do_sys_init_module+0x1f2/0x220
[13346.586200] do_syscall_64+0x5d/0x90
[13346.586206] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[13346.586213] INITIAL USE at:
[13346.586219] lock_acquire+0x1a6/0x4f0
[13346.586226] _raw_spin_lock_irqsave+0x51/0xa0
[13346.586233] xa_destroy+0xb0/0x2d0
[13346.586239] drm_sched_job_cleanup+0x1e5/0x2a0 [gpu_sched]
[13346.586247] amdgpu_job_free_cb+0x13/0xb0 [amdgpu]
[13346.586469] drm_sched_main+0xfb/0x920 [gpu_sched]
[13346.586479] kthread+0x2eb/0x3c0
[13346.586486] ret_from_fork+0x31/0x70
[13346.586493] ret_from_fork_asm+0x1b/0x30
[13346.586499] }
[13346.586505] ... key at: [<ffffffffc0caaf80>] __key.8+0x0/0xffffffffffd32080 [gpu_sched]
[13346.586515] ... acquired at:
[13346.586522] lock_acquire+0x1a6/0x4f0
[13346.586533] _raw_spin_lock+0x37/0x80
[13346.586539] xa_erase+0x11/0x40
[13346.586546] drm_sched_entity_kill_jobs_cb+0x138/0x490 [gpu_sched]
[13346.586555] dma_fence_signal_timestamp_locked+0x241/0x430
[13346.586561] dma_fence_signal+0x55/0xc0
[13346.586568] drm_sched_entity_kill_jobs_work+0x41/0x130 [gpu_sched]
[13346.586577] process_one_work+0x924/0x1540
[13346.586583] worker_thread+0x104/0x12c0
[13346.586589] kthread+0x2eb/0x3c0
[13346.586595] ret_from_fork+0x31/0x70
[13346.586601] ret_from_fork_asm+0x1b/0x30
[13346.586613]
stack backtrace:
[13346.586620] CPU: 14 PID: 32457 Comm: kworker/14:0 Tainted: G W L ------- --- 6.5.0-0.rc5.36.fc39.x86_64+debug #1
[13346.586627] Hardware name: Micro-Star International Co., Ltd. MS-7D73/MPG B650I EDGE WIFI (MS-7D73), BIOS 1.44 08/02/2023
[13346.586634] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched]
[13346.586644] Call Trace:
[13346.586650] <TASK>
[13346.586657] dump_stack_lvl+0x76/0xd0
[13346.586665] check_irq_usage+0x115f/0x1970
[13346.586672] ? __pfx_mark_lock+0x10/0x10
[13346.586683] ? __pfx_check_irq_usage+0x10/0x10
[13346.586690] ? __pfx___bfs+0x10/0x10
[13346.586700] ? lockdep_lock+0xca/0x1c0
[13346.586707] ? __pfx_lockdep_lock+0x10/0x10
[13346.586713] ? __lock_acquire+0x2e34/0x5b50
[13346.586721] __lock_acquire+0x2e34/0x5b50
[13346.586729] ? __pfx___lock_acquire+0x10/0x10
[13346.586737] ? xas_start+0x199/0x500
[13346.586744] lock_acquire+0x1a6/0x4f0
[13346.586751] ? xa_erase+0x11/0x40
[13346.586758] ? __pfx_lock_acquire+0x10/0x10
[13346.586764] ? xa_find+0x14d/0x2b0
[13346.586771] ? __pfx_xa_find+0x10/0x10
[13346.586778] _raw_spin_lock+0x37/0x80
[13346.586784] ? xa_erase+0x11/0x40
[13346.586791] xa_erase+0x11/0x40
[13346.586797] drm_sched_entity_kill_jobs_cb+0x138/0x490 [gpu_sched]
[13346.586808] ? __pfx_drm_sched_entity_kill_jobs_cb+0x10/0x10 [gpu_sched]
[13346.586817] ? local_clock_noinstr+0xd/0xc0
[13346.586824] ? dma_fence_signal+0x42/0xc0
[13346.586831] dma_fence_signal_timestamp_locked+0x241/0x430
[13346.586838] ? __pfx_dma_fence_signal_timestamp_locked+0x10/0x10
[13346.586846] ? seqcount_lockdep_reader_access.constprop.0+0x4b/0xb0
[13346.586854] dma_fence_signal+0x55/0xc0
[13346.586861] drm_sched_entity_kill_jobs_work+0x41/0x130 [gpu_sched]
[13346.586870] process_one_work+0x924/0x1540
[13346.586878] ? worker_thread+0x2c8/0x12c0
[13346.586884] ? __pfx_process_one_work+0x10/0x10
[13346.586892] worker_thread+0x104/0x12c0
[13346.586900] ? __kthread_parkme+0xc1/0x1f0
[13346.586907] ? __pfx_worker_thread+0x10/0x10
[13346.586913] kthread+0x2eb/0x3c0
[13346.586920] ? __pfx_kthread+0x10/0x10
[13346.586927] ret_from_fork+0x31/0x70
[13346.586934] ? __pfx_kthread+0x10/0x10
[13346.586941] ret_from_fork_asm+0x1b/0x30
[13346.586948] </TASK>
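As a reading aid for the report above: the four characters in braces after each lock name (`{-.-.}` for `&fence->lock`, `{+.+.}` for `&xa->xa_lock#12`) are lockdep usage marks. As far as I understand the lockdep design documentation, the positions are hardirq, hardirq-read, softirq, softirq-read, and each character says how the lock was ever acquired with respect to that context. The decoder below is a small sketch under that assumption; `decode_lock` and the tables are hypothetical names, not part of any kernel tooling.

```python
import re

# Positions of the four usage characters, left to right
# (assumed from the lockdep design documentation).
CONTEXTS = ("hardirq", "hardirq-read", "softirq", "softirq-read")

MEANINGS = {
    ".": "acquired with irqs disabled and not in irq context",
    "-": "acquired in irq context",
    "+": "acquired with irqs enabled",
    "?": "acquired in irq context with irqs enabled",
}

# Matches e.g. "(&fence->lock){-.-.}" and captures the name and usage marks.
LOCK_RE = re.compile(r"\((?P<name>\S+)\)\{(?P<usage>[-+.?]{4})\}")


def decode_lock(text: str) -> tuple[str, dict[str, str]]:
    """Return the lock name and a per-context description of its usage marks."""
    match = LOCK_RE.search(text)
    if match is None:
        raise ValueError("no lockdep lock annotation found")
    usage = match.group("usage")
    described = {ctx: MEANINGS[ch] for ctx, ch in zip(CONTEXTS, usage)}
    return match.group("name"), described


name, usage = decode_lock("(&fence->lock){-.-.}-{2:2}")
print(name, usage["hardirq"])
```

Read this way, `&fence->lock` has been taken from hardirq context, while `&xa->xa_lock#12` has been taken with hardirqs enabled, which is the combination the warning complains about when the second is acquired under the first.
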
Surprisingly, after a few minutes the game resumed.
Hardware description:
- CPU: Ryzen 7950X
- GPU: Radeon 7900 XTX
- System Memory: 64 GB
- Display(s): Philips 436M6VBPAB
- Type of Display Connection: DP
System information:
- Distro name and Version: Fedora Rawhide
- Kernel version: 6.5-rc5
Edited by Mikhail Gavrilov