Changes in commit e4b6a0a8 lead to GPU hang
Today, after updating to a new Mesa snapshot, my computer hung after launching a Vulkan application.
I couldn't believe my eyes, but after a reboot the hang repeated when I launched vkcube.
Bisecting found the first bad commit:
git bisect start
# status: waiting for both good and bad commits
# bad: [87ac5d7d0a14be1457385ccf3e11059aedd95acb] nir: Remove unnecessary assert in nir_before_src
git bisect bad 87ac5d7d0a14be1457385ccf3e11059aedd95acb
# status: waiting for good commit(s), bad commit known
# good: [653a37412617dbd72cc6a89d4d8ed2ee5a1b5aeb] dzn: Fix incremental binding of VBs
git bisect good 653a37412617dbd72cc6a89d4d8ed2ee5a1b5aeb
# good: [ba373a298daa9e8c5812366465dcedefa647197d] iris: Add iris_implicit_sync struct and functions to do implicit synchronization for Xe kmd
git bisect good ba373a298daa9e8c5812366465dcedefa647197d
# bad: [e4b6a0a82457b3ef40c5857412e20bc344ff302c] compiler: Getting shader_prim to be PACKED that consistence with pipe_prim_type
git bisect bad e4b6a0a82457b3ef40c5857412e20bc344ff302c
# good: [39e057028cb7fe2ee827722a5a909cb968aad39a] vulkan/wsi: fix double free on error condition
git bisect good 39e057028cb7fe2ee827722a5a909cb968aad39a
# good: [fcef3f040befff0871dd8d0d331cd8c50c150d62] microsoft/compiler: Getting function impl to be consistence with decl in dxil_enums.*
git bisect good fcef3f040befff0871dd8d0d331cd8c50c150d62
# first bad commit: [e4b6a0a82457b3ef40c5857412e20bc344ff302c] compiler: Getting shader_prim to be PACKED that consistence with pipe_prim_type
e4b6a0a82457b3ef40c5857412e20bc344ff302c is the first bad commit
commit e4b6a0a82457b3ef40c5857412e20bc344ff302c
Author: Yonggang Luo <luoyonggang@gmail.com>
Date: Thu Jun 1 21:44:15 2023 +0800
compiler: Getting shader_prim to be PACKED that consistence with pipe_prim_type
This is a prepare step for replace all usage of pipe_prim_type and shader_prim with mesa_prim
Signed-off-by: Yonggang Luo <luoyonggang@gmail.com>
Acked-by: Marek Olšák <marek.olsak@amd.com>
Acked-by: Jesse Natalie <jenatali@microsoft.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23369>
src/compiler/shader_enums.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
In the kernel log from when the GPU hang occurred I didn't see any page faults, so on the kernel side it looks like a real deadlock.
[ 3444.957955] INFO: task kworker/24:1H:502 blocked for more than 122 seconds.
[ 3444.958091] Tainted: G W L ------- --- 6.4.0-0.rc4.20230601git929ed21dfdb6.38.fc39.x86_64+debug #1
[ 3444.958097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3444.958101] task:kworker/24:1H state:D stack:27664 pid:502 ppid:2 flags:0x00004000
[ 3444.958115] Workqueue: ttm ttm_bo_delayed_delete [ttm]
[ 3444.958135] Call Trace:
[ 3444.958139] <TASK>
[ 3444.958150] __schedule+0x10ac/0x5e80
[ 3444.958161] ? mark_lock+0x101/0x16e0
[ 3444.958170] ? __pfx_mark_lock+0x10/0x10
[ 3444.958177] ? lock_acquire+0x1a6/0x4f0
[ 3444.958185] ? find_held_lock+0x34/0x120
[ 3444.958194] ? __pfx___schedule+0x10/0x10
[ 3444.958207] ? mark_held_locks+0x96/0xe0
[ 3444.958219] schedule+0x137/0x220
[ 3444.958227] schedule_timeout+0x240/0x280
[ 3444.958234] ? __pfx_schedule_timeout+0x10/0x10
[ 3444.958249] ? _raw_spin_unlock_irqrestore+0x66/0x80
[ 3444.958259] dma_fence_default_wait+0x4a6/0x720
[ 3444.958270] ? __pfx_dma_fence_default_wait+0x10/0x10
[ 3444.958277] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 3444.958286] ? _raw_spin_unlock_irqrestore+0x66/0x80
[ 3444.958298] dma_fence_wait_timeout+0x2a4/0x310
[ 3444.958306] dma_resv_wait_timeout+0xcc/0x170
[ 3444.958313] ? __pfx_dma_resv_wait_timeout+0x10/0x10
[ 3444.958318] ? __pfx_lock_acquire+0x10/0x10
[ 3444.958328] ? __pfx___lock_acquire+0x10/0x10
[ 3444.958338] ttm_bo_delayed_delete+0x56/0x130 [ttm]
[ 3444.958352] process_one_work+0x885/0x1460
[ 3444.958364] ? worker_thread+0x2c8/0x12c0
[ 3444.958369] ? __pfx_process_one_work+0x10/0x10
[ 3444.958388] worker_thread+0x104/0x12c0
[ 3444.958400] ? __kthread_parkme+0xc1/0x1f0
[ 3444.958408] ? __pfx_worker_thread+0x10/0x10
[ 3444.958414] kthread+0x2eb/0x3c0
[ 3444.958419] ? __pfx_kthread+0x10/0x10
[ 3444.958427] ret_from_fork+0x29/0x50
[ 3444.958445] </TASK>
[ 3444.958573] INFO: task kworker/u64:24:65819 blocked for more than 122 seconds.
[ 3444.958579] Tainted: G W L ------- --- 6.4.0-0.rc4.20230601git929ed21dfdb6.38.fc39.x86_64+debug #1
[ 3444.958583] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3444.958586] task:kworker/u64:24 state:D stack:25104 pid:65819 ppid:2 flags:0x00004000
[ 3444.958596] Workqueue: events_unbound commit_work
[ 3444.958604] Call Trace:
[ 3444.958608] <TASK>
[ 3444.958615] __schedule+0x10ac/0x5e80
[ 3444.958621] ? __pfx_mark_lock+0x10/0x10
[ 3444.958626] ? seqcount_lockdep_reader_access.constprop.0+0xa5/0xb0
[ 3444.958645] ? __pfx___schedule+0x10/0x10
[ 3444.958652] ? mark_lock+0x101/0x16e0
[ 3444.958658] ? __pfx___lock_acquire+0x10/0x10
[ 3444.958666] ? __pfx_mark_lock+0x10/0x10
[ 3444.958678] schedule+0x137/0x220
[ 3444.958686] schedule_timeout+0x240/0x280
[ 3444.958692] ? __pfx_schedule_timeout+0x10/0x10
[ 3444.958707] ? _raw_spin_unlock_irqrestore+0x66/0x80
[ 3444.958716] dma_fence_default_wait+0x4a6/0x720
[ 3444.958726] ? __pfx_dma_fence_default_wait+0x10/0x10
[ 3444.958733] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 3444.958742] ? _raw_spin_unlock_irqrestore+0x66/0x80
[ 3444.958751] dma_fence_wait_timeout+0x2a4/0x310
[ 3444.958759] drm_atomic_helper_wait_for_fences+0x480/0x710
[ 3444.958769] ? __pfx_drm_atomic_helper_wait_for_fences+0x10/0x10
[ 3444.958775] ? seqcount_lockdep_reader_access.constprop.0+0xa5/0xb0
[ 3444.958781] ? lockdep_hardirqs_on+0x81/0x110
[ 3444.958787] ? seqcount_lockdep_reader_access.constprop.0+0xa5/0xb0
[ 3444.958797] commit_tail+0x79/0x310
[ 3444.958807] process_one_work+0x885/0x1460
[ 3444.958817] ? worker_thread+0x2c8/0x12c0
[ 3444.958823] ? __pfx_process_one_work+0x10/0x10
[ 3444.958840] worker_thread+0x104/0x12c0
[ 3444.958873] ? __kthread_parkme+0xc1/0x1f0
[ 3444.958882] ? __pfx_worker_thread+0x10/0x10
[ 3444.958888] kthread+0x2eb/0x3c0
[ 3444.958893] ? __pfx_kthread+0x10/0x10
[ 3444.958900] ret_from_fork+0x29/0x50
[ 3444.958916] </TASK>
[ 3444.958934]
Showing all locks held in the system:
[ 3444.958940] 1 lock held by rcu_tasks_kthre/12:
[ 3444.958944] #0: ffffffffa0c8cfc0 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x31/0xde0
[ 3444.958962] 1 lock held by rcu_tasks_rude_/13:
[ 3444.958966] #0: ffffffffa0c8cce0 (rcu_tasks_rude.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x31/0xde0
[ 3444.958980] 1 lock held by rcu_tasks_trace/14:
[ 3444.958984] #0: ffffffffa0c8c9a0 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x31/0xde0
[ 3444.959004] 1 lock held by khungtaskd/215:
[ 3444.959008] #0: ffffffffa0c8dc80 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x55/0x340
[ 3444.959023] 2 locks held by kworker/24:1H/502:
[ 3444.959027] #0: ffff88818276d948 ((wq_completion)ttm){+.+.}-{0:0}, at: process_one_work+0x7ab/0x1460
[ 3444.959041] #1: ffffc9000393fdb8 ((work_completion)(&bo->delayed_delete)){+.+.}-{0:0}, at: process_one_work+0x7d8/0x1460
[ 3444.959056] 1 lock held by systemd-journal/891:
[ 3444.959075] 3 locks held by Xwayland:cs0/15892:
[ 3444.959079] #0: ffff888131ac70a8 (&list->bo_list_mutex){+.+.}-{3:3}, at: amdgpu_cs_ioctl+0x1c70/0x55e0 [amdgpu]
[ 3444.959412] #1: ffffc900318df940 (reservation_ww_class_acquire){+.+.}-{0:0}, at: amdgpu_cs_ioctl+0x2050/0x55e0 [amdgpu]
[ 3444.959730] #2: ffff8883c47f4208 (reservation_ww_class_mutex){+.+.}-{3:3}, at: ttm_eu_reserve_buffers+0xb1a/0x1190 [ttm]
[ 3444.959753] 2 locks held by nvtop/36278:
[ 3444.959757] #0: ffff888c3d6357d0 (&p->lock){+.+.}-{3:3}, at: seq_read_iter+0xca/0x11c0
[ 3444.959772] #1: ffff888171548208 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_show_fdinfo+0x2ea/0x900 [amdgpu]
[ 3444.960120] 2 locks held by kworker/u64:24/65819:
[ 3444.960124] #0: ffff8881080a7148 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x7ab/0x1460
[ 3444.960138] #1: ffffc9001a76fdb8 ((work_completion)(&state->commit_work)){+.+.}-{0:0}, at: process_one_work+0x7d8/0x1460
[ 3444.960158] =============================================
Full kernel log:
- Captured from 7900XTX dmesg-gpu-hang-1-7900XTX.txt
- Captured from 7900XTX dmesg-gpu-hang-2-7900XTX.txt
- Captured from 7900XTX dmesg-gpu-hang-3-7900XTX.txt
- Captured from 7900XTX dmesg-gpu-hang-4-7900XTX.txt
- Captured from 6900XT dmesg-gpu-hang-5-6900XT.txt
- Captured from 6900XT dmesg-gpu-hang-6-6900XT.txt
Build log and compilation options:
build.log
GPU: 7900XTX or 6900XT
CPU: 7950X
Interestingly, RADV_DEBUG=llvm vkcube
does not cause the 7900XTX to hang, so it looks like an ACO issue.