GPU hang not detected on compute ring
I have what looks like a hang on a compute ring that the driver does not detect. amdgpu_fence_info shows:
--- ring 5 (comp_1.0.1) ---
Last signaled fence 0x00000004
Last emitted 0x00000005
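For anyone wanting to check this on their own machine, here is a rough Python sketch that parses amdgpu_fence_info text and lists rings whose last emitted fence differs from the last signaled one. The function name `pending_fences` and the regular expressions are my own, and they assume the layout shown above; other kernel versions may print extra lines, which the sketch ignores. The file usually lives under debugfs (something like /sys/kernel/debug/dri/0/amdgpu_fence_info, but the card index is an assumption and reading it needs root). A pending fence in one snapshot is not a hang by itself; it only suggests one if the same values persist across repeated reads.

```python
import re

# Patterns assume the amdgpu_fence_info layout quoted in this report.
RING_RE = re.compile(r"^--- ring (\d+) \((.+?)\) ---$")
VALUE_RE = re.compile(r"^Last (signaled fence|emitted)\s+(0x[0-9a-fA-F]+)$")


def pending_fences(text):
    """Return {ring_name: (signaled, emitted)} for rings where the two differ."""
    rings = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        ring = RING_RE.match(line)
        if ring:
            current = ring.group(2)
            rings[current] = {}
            continue
        value = VALUE_RE.match(line)
        if value and current is not None:
            key = "signaled" if value.group(1).startswith("signaled") else "emitted"
            rings[current][key] = int(value.group(2), 16)
    return {
        name: (v["signaled"], v["emitted"])
        for name, v in rings.items()
        if "signaled" in v and "emitted" in v and v["signaled"] != v["emitted"]
    }


# Sample taken from the output quoted in this report.
sample = """--- ring 5 (comp_1.0.1) ---
Last signaled fence 0x00000004
Last emitted        0x00000005
"""

if __name__ == "__main__":
    print(pending_fences(sample))  # {'comp_1.0.1': (4, 5)}
```

Running it twice a few seconds apart and comparing the results would show whether the fence on a ring is actually stuck or just in flight.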
When I close the hanging process, everything else using the GPU gets stuck, for example:
[ 491.116717] INFO: task Xorg:1734 blocked for more than 122 seconds.
[ 491.116722] Tainted: G OE 5.8.12-arch1-1 #1
[ 491.116725] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 491.116728] Xorg D 0 1734 1733 0x00000084
[ 491.116733] Call Trace:
[ 491.116746] __schedule+0x2a6/0x810
[ 491.116750] schedule+0x46/0xf0
[ 491.116753] schedule_timeout+0x12d/0x170
[ 491.116759] dma_fence_wait_any_timeout+0x248/0x2b0
[ 491.116765] ? kmem_cache_alloc_trace+0x17c/0x220
[ 491.116860] amdgpu_sa_bo_new+0x48b/0x550 [amdgpu]
[ 491.116952] amdgpu_ib_get+0x3b/0x80 [amdgpu]
[ 491.117056] amdgpu_job_alloc_with_ib+0x53/0x80 [amdgpu]
[ 491.117144] amdgpu_vm_sdma_prepare+0x28/0x60 [amdgpu]
[ 491.117229] amdgpu_vm_bo_update_mapping.constprop.0+0x18c/0xa70 [amdgpu]
[ 491.117316] ? amdgpu_vm_del_from_lru_notify+0xe/0x70 [amdgpu]
[ 491.117318] ? _raw_spin_unlock+0x16/0x30
[ 491.117324] ? ttm_bo_init_reserved+0x1f0/0x330 [ttm]
[ 491.117407] ? amdgpu_vm_del_from_lru_notify+0xe/0x70 [amdgpu]
[ 491.117412] ? ttm_bo_move_to_lru_tail+0x21/0xc0 [ttm]
[ 491.117414] ? _raw_spin_unlock+0x16/0x30
[ 491.117494] ? amdgpu_bo_do_create+0x3aa/0x520 [amdgpu]
[ 491.117578] amdgpu_vm_bo_update+0x30e/0x6b0 [amdgpu]
[ 491.117662] amdgpu_gem_va_ioctl+0x4d6/0x500 [amdgpu]
[ 491.117746] ? amdgpu_gem_va_map_flags+0x60/0x60 [amdgpu]
[ 491.117765] drm_ioctl_kernel+0xb2/0x100 [drm]
[ 491.117783] drm_ioctl+0x208/0x360 [drm]
[ 491.117865] ? amdgpu_gem_va_map_flags+0x60/0x60 [amdgpu]
[ 491.117946] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[ 491.117952] ksys_ioctl+0x82/0xc0
[ 491.117955] __x64_sys_ioctl+0x16/0x20
[ 491.117959] do_syscall_64+0x44/0x70
[ 491.117961] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 491.117964] RIP: 0033:0x7f01dfdc7f6b
[ 491.117966] Code: Bad RIP value.
[ 491.117967] RSP: 002b:00007fff93d014f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 491.117969] RAX: ffffffffffffffda RBX: 00007fff93d01540 RCX: 00007f01dfdc7f6b
[ 491.117970] RDX: 00007fff93d01540 RSI: 00000000c0286448 RDI: 0000000000000010
[ 491.117971] RBP: 00000000c0286448 R08: ffff800105600000 R09: 000000000000008e
[ 491.117972] R10: 0000000000000023 R11: 0000000000000246 R12: 000055dcc9c2c240
[ 491.117972] R13: 0000000000000010 R14: 00000000003d1000 R15: 000055dcc94f5310
[ 491.118035] INFO: task steamwebhelper:2542 blocked for more than 122 seconds.
[ 491.118036] Tainted: G OE 5.8.12-arch1-1 #1
[ 491.118037] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
(Edit: this is all on kernel 5.8.12 with a Navi 10 GPU.)