Deadlock occurs during GPU reset
Submitted by: Mikhail Gavrilov
Assigned to: Default DRI bug account
Link to original bug (#109692)
Description
Created attachment 143419
dmesg
Steps to reproduce:
- $ git clone git://people.freedesktop.org/~agd5f/linux -b amd-staging-drm-next
- $ make bzImage && make modules
-
- Launch "Shadow of the Tomb Raider"
--- GPU hang occurs here ---
and after some time
--- GPU reset starts here ---
--- Deadlock occurs here ---
[ 291.746741] amdgpu 0000:0b:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:7 pasid:32774, for process SOTTR.exe pid 5250 thread SOTTR.exe pid 5250)
[ 291.746750] amdgpu 0000:0b:00.0: in page starting at address 0x0000000000002000 from 27
[ 291.746754] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0070113C
[ 297.135183] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 302.255032] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 302.265813] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=13292, emitted seq=13293
[ 302.265950] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process SOTTR.exe pid 5250 thread SOTTR.exe pid 5250
[ 302.265974] amdgpu 0000:0b:00.0: GPU reset begin!
[ 302.266337] ======================================================
[ 302.266338] WARNING: possible circular locking dependency detected
[ 302.266340] 5.0.0-rc1-drm-next-kernel+ #1 Tainted: G C
[ 302.266341] ------------------------------------------------------
[ 302.266343] kworker/5:2/871 is trying to acquire lock:
[ 302.266345] 000000000abbb16a (&(&ring->fence_drv.lock)->rlock){-.-.}, at: dma_fence_remove_callback+0x1a/0x60
[ 302.266352]
but task is already holding lock:
[ 302.266353] 000000006e32ba38 (&(&sched->job_list_lock)->rlock){-.-.}, at: drm_sched_stop+0x34/0x140 [gpu_sched]
[ 302.266358]
which lock already depends on the new lock.
[ 302.266360]
the existing dependency chain (in reverse order) is:
[ 302.266361]
-> #1 (&(&sched->job_list_lock)->rlock){-.-.}:
[ 302.266366] drm_sched_process_job+0x4d/0x180 [gpu_sched]
[ 302.266368] dma_fence_signal+0x111/0x1a0
[ 302.266414] amdgpu_fence_process+0xa3/0x100 [amdgpu]
[ 302.266470] sdma_v4_0_process_trap_irq+0x6e/0xa0 [amdgpu]
[ 302.266523] amdgpu_irq_dispatch+0xc0/0x250 [amdgpu]
[ 302.266576] amdgpu_ih_process+0x84/0xf0 [amdgpu]
[ 302.266628] amdgpu_irq_handler+0x1b/0x50 [amdgpu]
[ 302.266632] __handle_irq_event_percpu+0x3f/0x290
[ 302.266635] handle_irq_event_percpu+0x31/0x80
[ 302.266637] handle_irq_event+0x34/0x51
[ 302.266639] handle_edge_irq+0x7c/0x1a0
[ 302.266643] handle_irq+0xbf/0x100
[ 302.266646] do_IRQ+0x61/0x120
[ 302.266648] ret_from_intr+0x0/0x22
[ 302.266651] cpuidle_enter_state+0xbf/0x470
[ 302.266654] do_idle+0x1ec/0x280
[ 302.266657] cpu_startup_entry+0x19/0x20
[ 302.266660] start_secondary+0x1b3/0x200
[ 302.266663] secondary_startup_64+0xa4/0xb0
[ 302.266664]
-> #0 (&(&ring->fence_drv.lock)->rlock){-.-.}:
[ 302.266668] _raw_spin_lock_irqsave+0x49/0x83
[ 302.266670] dma_fence_remove_callback+0x1a/0x60
[ 302.266673] drm_sched_stop+0x59/0x140 [gpu_sched]
[ 302.266717] amdgpu_device_pre_asic_reset+0x4f/0x240 [amdgpu]
[ 302.266761] amdgpu_device_gpu_recover+0x88/0x7d0 [amdgpu]
[ 302.266822] amdgpu_job_timedout+0x109/0x130 [amdgpu]
[ 302.266827] drm_sched_job_timedout+0x40/0x70 [gpu_sched]
[ 302.266831] process_one_work+0x272/0x5d0
[ 302.266834] worker_thread+0x50/0x3b0
[ 302.266836] kthread+0x108/0x140
[ 302.266839] ret_from_fork+0x27/0x50
[ 302.266840]
other info that might help us debug this:
[ 302.266841] Possible unsafe locking scenario:
[ 302.266842] CPU0 CPU1
[ 302.266843] ---- ----
[ 302.266844] lock(&(&sched->job_list_lock)->rlock);
[ 302.266846] lock(&(&ring->fence_drv.lock)->rlock);
[ 302.266847] lock(&(&sched->job_list_lock)->rlock);
[ 302.266849] lock(&(&ring->fence_drv.lock)->rlock);
[ 302.266850]
*** DEADLOCK ***
[ 302.266852] 5 locks held by kworker/5:2/871:
[ 302.266853] #0: 00000000d133fb6e ((wq_completion)"events"){+.+.}, at: process_one_work+0x1e9/0x5d0
[ 302.266857] #1: 000000008a5c3f7e ((work_completion)(&(&sched->work_tdr)->work)){+.+.}, at: process_one_work+0x1e9/0x5d0
[ 302.266862] #2: 00000000b9b2c76f (&adev->lock_reset){+.+.}, at: amdgpu_device_lock_adev+0x17/0x40 [amdgpu]
[ 302.266908] #3: 00000000ac637728 (&dqm->lock_hidden){+.+.}, at: kgd2kfd_pre_reset+0x30/0x60 [amdgpu]
[ 302.266965] #4: 000000006e32ba38 (&(&sched->job_list_lock)->rlock){-.-.}, at: drm_sched_stop+0x34/0x140 [gpu_sched]
[ 302.266971]
stack backtrace:
[ 302.266975] CPU: 5 PID: 871 Comm: kworker/5:2 Tainted: G C 5.0.0-rc1-drm-next-kernel+ #1
[ 302.266976] Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 1103 11/16/2018
[ 302.266980] Workqueue: events drm_sched_job_timedout [gpu_sched]
[ 302.266982] Call Trace:
[ 302.266987] dump_stack+0x85/0xc0
[ 302.266991] print_circular_bug.isra.0.cold+0x15c/0x195
[ 302.266994] __lock_acquire+0x134c/0x1660
[ 302.266998] ? add_lock_to_list.isra.0+0x67/0xb0
[ 302.267003] lock_acquire+0xa2/0x1b0
[ 302.267006] ? dma_fence_remove_callback+0x1a/0x60
[ 302.267011] _raw_spin_lock_irqsave+0x49/0x83
[ 302.267013] ? dma_fence_remove_callback+0x1a/0x60
[ 302.267016] dma_fence_remove_callback+0x1a/0x60
[ 302.267020] drm_sched_stop+0x59/0x140 [gpu_sched]
[ 302.267065] amdgpu_device_pre_asic_reset+0x4f/0x240 [amdgpu]
[ 302.267110] amdgpu_device_gpu_recover+0x88/0x7d0 [amdgpu]
[ 302.267173] amdgpu_job_timedout+0x109/0x130 [amdgpu]
[ 302.267178] drm_sched_job_timedout+0x40/0x70 [gpu_sched]
[ 302.267183] process_one_work+0x272/0x5d0
[ 302.267188] worker_thread+0x50/0x3b0
[ 302.267191] kthread+0x108/0x140
[ 302.267194] ? process_one_work+0x5d0/0x5d0
[ 302.267196] ? kthread_park+0x90/0x90
[ 302.267199] ret_from_fork+0x27/0x50
[ 302.692194] amdgpu 0000:0b:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[ 302.692234] [drm:gfx_v9_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[ 302.768931] amdgpu 0000:0b:00.0: GPU BACO reset
[ 303.278874] amdgpu 0000:0b:00.0: GPU reset succeeded, trying to resume
[ 303.279006] [drm] PCIE GART of 512M enabled (table at 0x000000F400900000).
[ 303.279072] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
[ 303.279234] [drm] PSP is resuming...
[ 303.426601] [drm] reserve 0x400000 from 0xf400d00000 for PSP TMR SIZE
[ 303.572227] [drm] UVD and UVD ENC initialized successfully.
[ 303.687727] [drm] VCE initialized successfully.
[ 303.689585] [drm] recover vram bo from shadow start
[ 303.722757] [drm] recover vram bo from shadow done
[ 303.722761] [drm] Skip scheduling IBs!
[ 303.722791] amdgpu 0000:0b:00.0: GPU reset(2) succeeded!
[ 303.722811] [drm] Skip scheduling IBs!
[ 303.722838] [drm] Skip scheduling IBs!
[ 303.722846] [drm] Skip scheduling IBs!
[ 303.722854] [drm] Skip scheduling IBs!
[ 303.722863] [drm] Skip scheduling IBs!
[ 303.722871] [drm] Skip scheduling IBs!
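
For reference, the circular dependency lockdep flags above is a plain ABBA inversion between the scheduler's job_list_lock and the ring's fence_drv.lock: the interrupt path (amdgpu_fence_process -> dma_fence_signal -> drm_sched_process_job) takes fence_drv.lock first and job_list_lock second, while the reset path (drm_sched_stop -> dma_fence_remove_callback) already holds job_list_lock and then tries to take fence_drv.lock. The sketch below is a minimal userspace illustration of that ordering only, not kernel code; the mutexes and thread names are stand-ins for the locks and call chains named in the trace.

```c
/*
 * Minimal userspace sketch of the lock inversion lockdep reports above.
 * The mutex names mirror the kernel locks; the kernel itself uses
 * spinlocks, so this only illustrates the ABBA ordering and will
 * usually hang when run.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t fence_drv_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t job_list_lock  = PTHREAD_MUTEX_INITIALIZER;

/* Mirrors chain #1: dma_fence_signal() holds ring->fence_drv.lock,
 * then drm_sched_process_job() takes sched->job_list_lock. */
static void *irq_path(void *arg)
{
    pthread_mutex_lock(&fence_drv_lock);
    usleep(1000);                      /* widen the race window */
    pthread_mutex_lock(&job_list_lock);
    puts("irq_path: got both locks");
    pthread_mutex_unlock(&job_list_lock);
    pthread_mutex_unlock(&fence_drv_lock);
    return NULL;
}

/* Mirrors chain #0: drm_sched_stop() holds sched->job_list_lock,
 * then dma_fence_remove_callback() takes ring->fence_drv.lock. */
static void *reset_path(void *arg)
{
    pthread_mutex_lock(&job_list_lock);
    usleep(1000);
    pthread_mutex_lock(&fence_drv_lock);
    puts("reset_path: got both locks");
    pthread_mutex_unlock(&fence_drv_lock);
    pthread_mutex_unlock(&job_list_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, irq_path, NULL);
    pthread_create(&b, NULL, reset_path, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```

Built with `gcc -pthread`, the two threads typically block on each other's second lock, which is exactly the "Possible unsafe locking scenario" table in the trace.
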
**Attachment 143419**, "dmesg": dmesg2.txt