Possible circular locking dependency detected
[ 204.534310] xe 0000:00:02.0: [drm] Timedout job: seqno=34, guc_id=6, flags=0x4
[ 209.653856] ------------[ cut here ]------------
[ 209.653859] WARNING: CPU: 6 PID: 89 at drivers/gpu/drm/xe/xe_guc_submit.c:846 guc_engine_timedout_job+0x471/0x520 [xe]
[ 209.653887] Modules linked in: xe drm_ttm_helper drm_suballoc_helper gpu_sched ttm snd_hda_codec_hdmi mei_pxp mei_hdcp pmt_telemetry pmt_class x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel kvm_intel snd_intel_dspcfg snd_hda_codec wmi_bmof e1000e snd_hwdep i2c_i801 ptp snd_hda_core i2c_smbus pps_core mei_me snd_pcm mei intel_vsec wmi fuse [last unloaded: ttm]
[ 209.653932] CPU: 6 PID: 89 Comm: kworker/u16:2 Not tainted 6.1.0-rc1+zeh-xe+ #857
[ 209.653935] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U DDR4 SODIMM RVP, BIOS TGLSFWI1.R00.4183.A01.2104282326 04/28/2021
[ 209.653937] Workqueue: gt-ordered-wq drm_sched_job_timedout [gpu_sched]
[ 209.653945] RIP: 0010:guc_engine_timedout_job+0x471/0x520 [xe]
[ 209.653959] Code: 24 18 48 89 4c 24 08 49 8d ae 58 03 00 00 e8 56 d5 e7 e0 48 8b 4c 24 08 4d 8d 7d 08 48 89 ee 4c 89 ff 85 c9 0f 85 ba fc ff ff <0f> 0b 49 8b ae 58 03 00 00 49 c7 46 18 01 00 00 00 48 89 74 24 10
[ 209.653961] RSP: 0018:ffffc900014f7db8 EFLAGS: 00010286
[ 209.653965] RAX: 0000000080000000 RBX: ffff888113ce8800 RCX: 0000000000000000
[ 209.653967] RDX: 0000000000000001 RSI: ffff888132116b58 RDI: ffff888112ec7c48
[ 209.653968] RBP: ffff88811a3a3da8 R08: 0000000000000001 R09: 0000000000000001
[ 209.653970] R10: 00000000e07ca948 R11: 000000000d2fbeb7 R12: 00000000ffffffc2
[ 209.653971] R13: ffff888112ec7c40 R14: ffff888132116800 R15: ffff888112ec7c48
[ 209.653973] FS: 0000000000000000(0000) GS:ffff88849e900000(0000) knlGS:0000000000000000
[ 209.653975] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 209.653977] CR2: 00007fe0ccb26a20 CR3: 0000000005812006 CR4: 0000000000770ee0
[ 209.653978] PKRU: 55555554
[ 209.653980] Call Trace:
[ 209.653982] <TASK>
[ 209.653986] ? prepare_to_wait_exclusive+0xd0/0xd0
[ 209.653994] drm_sched_job_timedout+0x77/0x110 [gpu_sched]
[ 209.654002] process_one_work+0x282/0x580
[ 209.654012] worker_thread+0x4a/0x390
[ 209.654017] ? process_one_work+0x580/0x580
[ 209.654020] kthread+0xed/0x120
[ 209.654023] ? kthread_complete_and_exit+0x20/0x20
[ 209.654027] ret_from_fork+0x1f/0x30
[ 209.654039] </TASK>
[ 209.654040] irq event stamp: 326627
[ 209.654042] hardirqs last enabled at (326633): [<ffffffff811a8c89>] __up_console_sem+0x59/0x80
[ 209.654045] hardirqs last disabled at (326638): [<ffffffff811a8c6e>] __up_console_sem+0x3e/0x80
[ 209.654047] softirqs last enabled at (325596): [<ffffffff8112f1e1>] __irq_exit_rcu+0xb1/0x120
[ 209.654051] softirqs last disabled at (325571): [<ffffffff8112f1e1>] __irq_exit_rcu+0xb1/0x120
[ 209.654053] ---[ end trace 0000000000000000 ]---
[ 209.654244] xe 0000:00:02.0: [drm] Try GT reset
[ 209.654256] xe 0000:00:02.0: [drm] Doing GT reset
[ 209.654411] xe 0000:00:02.0: [drm] GT reset started
[ 209.656224] ======================================================
[ 209.662417] WARNING: possible circular locking dependency detected
[ 209.668610] 6.1.0-rc1+zeh-xe+ #857 Tainted: G W
[ 209.674545] ------------------------------------------------------
[ 209.680743] kworker/u16:2/89 is trying to acquire lock:
[ 209.685984] ffff888113d53a88 ((work_completion)(&sched->work_run)){+.+.}-{0:0}, at: __flush_work+0x4d/0x4c0
[ 209.695750]
but task is already holding lock:
[ 209.701596] ffff88811a3a47d0 (&guc->submission_state.lock){+.+.}-{3:3}, at: xe_guc_submit_stop+0x51/0x3c0 [xe]
[ 209.711632]
which lock already depends on the new lock.
[ 209.719827]
the existing dependency chain (in reverse order) is:
[ 209.727324]
-> #2 (&guc->submission_state.lock){+.+.}-{3:3}:
[ 209.734480] __mutex_lock+0x9d/0x890
[ 209.738592] xe_guc_submit_init+0x109/0x180 [xe]
[ 209.743760] xe_uc_init+0x60/0x70 [xe]
[ 209.748062] xe_gt_init+0x276/0x450 [xe]
[ 209.752532] xe_device_probe+0x177/0x1d0 [xe]
[ 209.757441] xe_pci_probe+0x22c/0x3d0 [xe]
[ 209.762087] pci_device_probe+0x9f/0x150
[ 209.766545] really_probe+0xd6/0x380
[ 209.770658] __driver_probe_device+0x73/0x170
[ 209.775553] driver_probe_device+0x1a/0x90
[ 209.780184] __driver_attach+0xd0/0x1d0
[ 209.784556] bus_for_each_dev+0x71/0xc0
[ 209.788925] bus_add_driver+0x1ac/0x200
[ 209.793297] driver_register+0x84/0xe0
[ 209.797580] do_one_initcall+0x55/0x2f0
[ 209.801953] do_init_module+0x45/0x1c0
[ 209.806238] __do_sys_finit_module+0xac/0x120
[ 209.811129] do_syscall_64+0x37/0x90
[ 209.815240] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 209.820827]
-> #1 (dma_fence_map){++++}-{0:0}:
[ 209.826763] dma_fence_begin_signalling+0x51/0x60
[ 209.832003] xe_guc_ct_send+0x29/0x90 [xe]
[ 209.836652] __register_engine+0x64/0x90 [xe]
[ 209.841555] guc_engine_run_job+0x7f2/0xae0 [xe]
[ 209.846726] drm_sched_main+0x223/0x6a0 [gpu_sched]
[ 209.852142] process_one_work+0x282/0x580
[ 209.856689] worker_thread+0x4a/0x390
[ 209.860886] kthread+0xed/0x120
[ 209.864559] ret_from_fork+0x1f/0x30
[ 209.868667]
-> #0 ((work_completion)(&sched->work_run)){+.+.}-{0:0}:
[ 209.876515] __lock_acquire+0x16b5/0x28e0
[ 209.881059] lock_acquire+0xd1/0x2f0
[ 209.885166] __flush_work+0x77/0x4c0
[ 209.889274] __cancel_work_timer+0x12d/0x1d0
[ 209.894078] xe_guc_submit_stop+0x8c/0x3c0 [xe]
[ 209.899160] gt_reset_worker.cold+0xd2/0x193 [xe]
[ 209.904417] process_one_work+0x282/0x580
[ 209.908962] worker_thread+0x4a/0x390
[ 209.913156] kthread+0xed/0x120
[ 209.916835] ret_from_fork+0x1f/0x30
[ 209.920941]
other info that might help us debug this:
[ 209.928964] Chain exists of:
(work_completion)(&sched->work_run) --> dma_fence_map --> &guc->submission_state.lock
[ 209.942377] Possible unsafe locking scenario:
[ 209.948312] CPU0 CPU1
[ 209.952854] ---- ----
[ 209.957400] lock(&guc->submission_state.lock);
[ 209.962029] lock(dma_fence_map);
[ 209.967969] lock(&guc->submission_state.lock);
[ 209.975118] lock((work_completion)(&sched->work_run));
[ 209.980449]
*** DEADLOCK ***
[ 209.986386] 4 locks held by kworker/u16:2/89:
[ 209.990754] #0: ffff888113d11138 ((wq_completion)gt-ordered-wq){+.+.}-{0:0}, at: process_one_work+0x1fd/0x580
[ 210.000777] #1: ffffc900014f7e78 ((work_completion)(&gt->reset.worker)){+.+.}-{0:0}, at: process_one_work+0x1fd/0x580
[ 210.011496] #2: ffffffff829ee800 (dma_fence_map){++++}-{0:0}, at: xe_guc_submit_stop+0x26/0x3c0 [xe]
[ 210.020749] #3: ffff88811a3a47d0 (&guc->submission_state.lock){+.+.}-{3:3}, at: xe_guc_submit_stop+0x51/0x3c0 [xe]
[ 210.031219]
stack backtrace:
[ 210.035589] CPU: 6 PID: 89 Comm: kworker/u16:2 Tainted: G W 6.1.0-rc1+zeh-xe+ #857
[ 210.044567] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U DDR4 SODIMM RVP, BIOS TGLSFWI1.R00.4183.A01.2104282326 04/28/2021
[ 210.057980] Workqueue: gt-ordered-wq gt_reset_worker [xe]
[ 210.063410] Call Trace:
[ 210.065868] <TASK>
[ 210.067983] dump_stack_lvl+0x6a/0x9f
[ 210.071657] check_noncircular+0x132/0x150
[ 210.075770] __lock_acquire+0x16b5/0x28e0
[ 210.079796] lock_acquire+0xd1/0x2f0
[ 210.083384] ? __flush_work+0x4d/0x4c0
[ 210.087146] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 210.092474] ? lockdep_hardirqs_on+0xbf/0x140
[ 210.096846] __flush_work+0x77/0x4c0
[ 210.100438] ? __flush_work+0x4d/0x4c0
[ 210.104203] __cancel_work_timer+0x12d/0x1d0
[ 210.108491] xe_guc_submit_stop+0x8c/0x3c0 [xe]
[ 210.113049] gt_reset_worker.cold+0xd2/0x193 [xe]
[ 210.117785] process_one_work+0x282/0x580
[ 210.121809] worker_thread+0x4a/0x390
[ 210.125486] ? process_one_work+0x580/0x580
[ 210.129687] kthread+0xed/0x120
[ 210.132841] ? kthread_complete_and_exit+0x20/0x20
[ 210.137645] ret_from_fork+0x1f/0x30
[ 210.141234] </TASK>
[ 210.152786] xe 0000:00:02.0: [drm] GT reset done
[ 210.157440] xe 0000:00:02.0: [drm] Timedout job: seqno=34, guc_id=6, flags=0x5
[ 210.166717] xe 0000:00:02.0: [drm] Timedout job: seqno=119, guc_id=6, flags=0x5
[ 210.175710] xe 0000:00:02.0: [drm] Timedout job: seqno=204, guc_id=6, flags=0x5
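To make the reported cycle easier to follow, here is a rough userspace analogy (my own sketch, not the actual xe code paths): three pthread mutexes stand in for the three lockdep classes, with the dma_fence_map annotation and the work_run flush both approximated as plain mutexes, and the three thread functions (names are mine) loosely mirroring xe_guc_submit_init, guc_engine_run_job and xe_guc_submit_stop.

/*
 * Userspace sketch of the three-way cycle lockdep reports above.
 * Each pthread mutex stands in for one lockdep class:
 *   lock_a ~ &guc->submission_state.lock
 *   lock_b ~ dma_fence_map      (really an annotation, modelled as a mutex)
 *   lock_c ~ (work_completion)(&sched->work_run)  (a flush, modelled as a mutex)
 *
 * init_path:        holds B, takes A   (dma_fence_map --> submission_state.lock)
 * run_job_path:     holds C, takes B   (work_run --> dma_fence_map)
 * submit_stop_path: holds A, takes C   (submission_state.lock --> work_run)
 * Together that is the A -> C -> B -> A cycle in the splat.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* submission_state.lock */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* dma_fence_map */
static pthread_mutex_t lock_c = PTHREAD_MUTEX_INITIALIZER; /* work_run flush */

static void *init_path(void *arg)        /* B then A */
{
	(void)arg;
	pthread_mutex_lock(&lock_b);
	sleep(1);                        /* widen the race window */
	pthread_mutex_lock(&lock_a);
	pthread_mutex_unlock(&lock_a);
	pthread_mutex_unlock(&lock_b);
	return NULL;
}

static void *run_job_path(void *arg)     /* C then B */
{
	(void)arg;
	pthread_mutex_lock(&lock_c);
	sleep(1);
	pthread_mutex_lock(&lock_b);
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_c);
	return NULL;
}

static void *submit_stop_path(void *arg) /* A then C -- closes the cycle */
{
	(void)arg;
	pthread_mutex_lock(&lock_a);
	sleep(1);
	pthread_mutex_lock(&lock_c);
	pthread_mutex_unlock(&lock_c);
	pthread_mutex_unlock(&lock_a);
	return NULL;
}

int main(void)
{
	pthread_t t[3];

	pthread_create(&t[0], NULL, init_path, NULL);
	pthread_create(&t[1], NULL, run_job_path, NULL);
	pthread_create(&t[2], NULL, submit_stop_path, NULL);
	for (int i = 0; i < 3; i++)
		pthread_join(t[i], NULL); /* never returns once all three block */
	printf("no deadlock this run\n");
	return 0;
}

Run it and all three threads block after about a second on the same submission_state.lock -> work_run -> dma_fence_map -> submission_state.lock cycle lockdep complains about above.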
I had never seen this before, but hit it once after rebasing on the latest Xe branch, so it could be something in these new patches:
2022-10-28 12:37 +0000 Philippe Lecluse M─┐ [xe] {gitlab-xe/xe} Merge branch 'more_guc_options' into 'xe'
2022-10-28 12:19 +0000 Matthew Brost M─│─┐ Merge branch 'pvc_fixes' into 'xe'
2022-10-28 13:24 +0200 Philippe Lecluse │ o │ drm/xe: add more guc options
2022-10-27 15:17 -0700 Matthew Brost │ │ o drm/xe: Don't define MEDIA_GUC_HOST_INTERRUPT twice
2022-10-27 15:16 -0700 Matthew Brost │ │ o drm/xe: Don't shadow variables
2022-10-27 15:09 -0700 Matthew Brost │ │ o drm/xe: Initialize master_ctl in IRQ handler
2022-10-27 13:04 -0700 Matthew Brost │ │ o drm/xe: Fixup xe_migrate dma-resv usage
2022-10-27 12:54 -0700 Matthew Brost │ │ o drm/xe: Wait on VM's kernel dma-resv slots
2022-10-27 10:28 -0700 Matthew Brost │ │ o drm/xe: Fix munmap style unbinds
2022-10-26 19:35 -0700 Matthew Brost │ │ o drm/xe: Start using vfuncs in xe_ring_ops
2022-10-26 18:05 -0700 Matthew Brost │ │ o drm/xe: Validate BO on CPU fault
2022-10-26 17:07 -0700 Matthew Brost │ │ o drm/xe: Only handle faults on VMs in fault mode
2022-10-26 17:04 -0700 Matthew Brost │ │ o drm/xe: Optimize bind path to use CPU if in-fences signaled
2022-10-26 16:48 -0700 Matthew Brost │ │ o drm/xe: Update xe_sync to work with multiple GTs and binding
2022-10-26 15:25 -0700 Matthew Brost │ │ o drm/xe: Sync up ring ops with i915 implementation
I will share more information if I can find a better way to reproduce it.