Possible circular locking dependency detected
[ 204.534310] xe 0000:00:02.0: [drm] Timedout job: seqno=34, guc_id=6, flags=0x4
[ 209.653856] ------------[ cut here ]------------
[ 209.653859] WARNING: CPU: 6 PID: 89 at drivers/gpu/drm/xe/xe_guc_submit.c:846 guc_engine_timedout_job+0x471/0x520 [xe]
[ 209.653887] Modules linked in: xe drm_ttm_helper drm_suballoc_helper gpu_sched ttm snd_hda_codec_hdmi mei_pxp mei_hdcp pmt_telemetry pmt_class x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel kvm_intel snd_intel_dspcfg snd_hda_codec wmi_bmof e1000e snd_hwdep i2c_i801 ptp snd_hda_core i2c_smbus pps_core mei_me snd_pcm mei intel_vsec wmi fuse [last unloaded: ttm]
[ 209.653932] CPU: 6 PID: 89 Comm: kworker/u16:2 Not tainted 6.1.0-rc1+zeh-xe+ #857
[ 209.653935] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U DDR4 SODIMM RVP, BIOS TGLSFWI1.R00.4183.A01.2104282326 04/28/2021
[ 209.653937] Workqueue: gt-ordered-wq drm_sched_job_timedout [gpu_sched]
[ 209.653945] RIP: 0010:guc_engine_timedout_job+0x471/0x520 [xe]
[ 209.653959] Code: 24 18 48 89 4c 24 08 49 8d ae 58 03 00 00 e8 56 d5 e7 e0 48 8b 4c 24 08 4d 8d 7d 08 48 89 ee 4c 89 ff 85 c9 0f 85 ba fc ff ff <0f> 0b 49 8b ae 58 03 00 00 49 c7 46 18 01 00 00 00 48 89 74 24 10
[ 209.653961] RSP: 0018:ffffc900014f7db8 EFLAGS: 00010286
[ 209.653965] RAX: 0000000080000000 RBX: ffff888113ce8800 RCX: 0000000000000000
[ 209.653967] RDX: 0000000000000001 RSI: ffff888132116b58 RDI: ffff888112ec7c48
[ 209.653968] RBP: ffff88811a3a3da8 R08: 0000000000000001 R09: 0000000000000001
[ 209.653970] R10: 00000000e07ca948 R11: 000000000d2fbeb7 R12: 00000000ffffffc2
[ 209.653971] R13: ffff888112ec7c40 R14: ffff888132116800 R15: ffff888112ec7c48
[ 209.653973] FS: 0000000000000000(0000) GS:ffff88849e900000(0000) knlGS:0000000000000000
[ 209.653975] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 209.653977] CR2: 00007fe0ccb26a20 CR3: 0000000005812006 CR4: 0000000000770ee0
[ 209.653978] PKRU: 55555554
[ 209.653980] Call Trace:
[ 209.653982] <TASK>
[ 209.653986] ? prepare_to_wait_exclusive+0xd0/0xd0
[ 209.653994] drm_sched_job_timedout+0x77/0x110 [gpu_sched]
[ 209.654002] process_one_work+0x282/0x580
[ 209.654012] worker_thread+0x4a/0x390
[ 209.654017] ? process_one_work+0x580/0x580
[ 209.654020] kthread+0xed/0x120
[ 209.654023] ? kthread_complete_and_exit+0x20/0x20
[ 209.654027] ret_from_fork+0x1f/0x30
[ 209.654039] </TASK>
[ 209.654040] irq event stamp: 326627
[ 209.654042] hardirqs last enabled at (326633): [<ffffffff811a8c89>] __up_console_sem+0x59/0x80
[ 209.654045] hardirqs last disabled at (326638): [<ffffffff811a8c6e>] __up_console_sem+0x3e/0x80
[ 209.654047] softirqs last enabled at (325596): [<ffffffff8112f1e1>] __irq_exit_rcu+0xb1/0x120
[ 209.654051] softirqs last disabled at (325571): [<ffffffff8112f1e1>] __irq_exit_rcu+0xb1/0x120
[ 209.654053] ---[ end trace 0000000000000000 ]---
[ 209.654244] xe 0000:00:02.0: [drm] Try GT reset
[ 209.654256] xe 0000:00:02.0: [drm] Doing GT reset
[ 209.654411] xe 0000:00:02.0: [drm] GT reset started
[ 209.656224] ======================================================
[ 209.662417] WARNING: possible circular locking dependency detected
[ 209.668610] 6.1.0-rc1+zeh-xe+ #857 Tainted: G W
[ 209.674545] ------------------------------------------------------
[ 209.680743] kworker/u16:2/89 is trying to acquire lock:
[ 209.685984] ffff888113d53a88 ((work_completion)(&sched->work_run)){+.+.}-{0:0}, at: __flush_work+0x4d/0x4c0
[ 209.695750]
but task is already holding lock:
[ 209.701596] ffff88811a3a47d0 (&guc->submission_state.lock){+.+.}-{3:3}, at: xe_guc_submit_stop+0x51/0x3c0 [xe]
[ 209.711632]
which lock already depends on the new lock.
[ 209.719827]
the existing dependency chain (in reverse order) is:
[ 209.727324]
-> #2 (&guc->submission_state.lock){+.+.}-{3:3}:
[ 209.734480] __mutex_lock+0x9d/0x890
[ 209.738592] xe_guc_submit_init+0x109/0x180 [xe]
[ 209.743760] xe_uc_init+0x60/0x70 [xe]
[ 209.748062] xe_gt_init+0x276/0x450 [xe]
[ 209.752532] xe_device_probe+0x177/0x1d0 [xe]
[ 209.757441] xe_pci_probe+0x22c/0x3d0 [xe]
[ 209.762087] pci_device_probe+0x9f/0x150
[ 209.766545] really_probe+0xd6/0x380
[ 209.770658] __driver_probe_device+0x73/0x170
[ 209.775553] driver_probe_device+0x1a/0x90
[ 209.780184] __driver_attach+0xd0/0x1d0
[ 209.784556] bus_for_each_dev+0x71/0xc0
[ 209.788925] bus_add_driver+0x1ac/0x200
[ 209.793297] driver_register+0x84/0xe0
[ 209.797580] do_one_initcall+0x55/0x2f0
[ 209.801953] do_init_module+0x45/0x1c0
[ 209.806238] __do_sys_finit_module+0xac/0x120
[ 209.811129] do_syscall_64+0x37/0x90
[ 209.815240] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 209.820827]
-> #1 (dma_fence_map){++++}-{0:0}:
[ 209.826763] dma_fence_begin_signalling+0x51/0x60
[ 209.832003] xe_guc_ct_send+0x29/0x90 [xe]
[ 209.836652] __register_engine+0x64/0x90 [xe]
[ 209.841555] guc_engine_run_job+0x7f2/0xae0 [xe]
[ 209.846726] drm_sched_main+0x223/0x6a0 [gpu_sched]
[ 209.852142] process_one_work+0x282/0x580
[ 209.856689] worker_thread+0x4a/0x390
[ 209.860886] kthread+0xed/0x120
[ 209.864559] ret_from_fork+0x1f/0x30
[ 209.868667]
-> #0 ((work_completion)(&sched->work_run)){+.+.}-{0:0}:
[ 209.876515] __lock_acquire+0x16b5/0x28e0
[ 209.881059] lock_acquire+0xd1/0x2f0
[ 209.885166] __flush_work+0x77/0x4c0
[ 209.889274] __cancel_work_timer+0x12d/0x1d0
[ 209.894078] xe_guc_submit_stop+0x8c/0x3c0 [xe]
[ 209.899160] gt_reset_worker.cold+0xd2/0x193 [xe]
[ 209.904417] process_one_work+0x282/0x580
[ 209.908962] worker_thread+0x4a/0x390
[ 209.913156] kthread+0xed/0x120
[ 209.916835] ret_from_fork+0x1f/0x30
[ 209.920941]
other info that might help us debug this:
[ 209.928964] Chain exists of:
(work_completion)(&sched->work_run) --> dma_fence_map --> &guc->submission_state.lock
[ 209.942377] Possible unsafe locking scenario:
[ 209.948312] CPU0 CPU1
[ 209.952854] ---- ----
[ 209.957400] lock(&guc->submission_state.lock);
[ 209.962029] lock(dma_fence_map);
[ 209.967969] lock(&guc->submission_state.lock);
[ 209.975118] lock((work_completion)(&sched->work_run));
[ 209.980449]
*** DEADLOCK ***
[ 209.986386] 4 locks held by kworker/u16:2/89:
[ 209.990754] #0: ffff888113d11138 ((wq_completion)gt-ordered-wq){+.+.}-{0:0}, at: process_one_work+0x1fd/0x580
[ 210.000777] #1: ffffc900014f7e78 ((work_completion)(&gt->reset.worker)){+.+.}-{0:0}, at: process_one_work+0x1fd/0x580
[ 210.011496] #2: ffffffff829ee800 (dma_fence_map){++++}-{0:0}, at: xe_guc_submit_stop+0x26/0x3c0 [xe]
[ 210.020749] #3: ffff88811a3a47d0 (&guc->submission_state.lock){+.+.}-{3:3}, at: xe_guc_submit_stop+0x51/0x3c0 [xe]
[ 210.031219]
stack backtrace:
[ 210.035589] CPU: 6 PID: 89 Comm: kworker/u16:2 Tainted: G W 6.1.0-rc1+zeh-xe+ #857
[ 210.044567] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U DDR4 SODIMM RVP, BIOS TGLSFWI1.R00.4183.A01.2104282326 04/28/2021
[ 210.057980] Workqueue: gt-ordered-wq gt_reset_worker [xe]
[ 210.063410] Call Trace:
[ 210.065868] <TASK>
[ 210.067983] dump_stack_lvl+0x6a/0x9f
[ 210.071657] check_noncircular+0x132/0x150
[ 210.075770] __lock_acquire+0x16b5/0x28e0
[ 210.079796] lock_acquire+0xd1/0x2f0
[ 210.083384] ? __flush_work+0x4d/0x4c0
[ 210.087146] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 210.092474] ? lockdep_hardirqs_on+0xbf/0x140
[ 210.096846] __flush_work+0x77/0x4c0
[ 210.100438] ? __flush_work+0x4d/0x4c0
[ 210.104203] __cancel_work_timer+0x12d/0x1d0
[ 210.108491] xe_guc_submit_stop+0x8c/0x3c0 [xe]
[ 210.113049] gt_reset_worker.cold+0xd2/0x193 [xe]
[ 210.117785] process_one_work+0x282/0x580
[ 210.121809] worker_thread+0x4a/0x390
[ 210.125486] ? process_one_work+0x580/0x580
[ 210.129687] kthread+0xed/0x120
[ 210.132841] ? kthread_complete_and_exit+0x20/0x20
[ 210.137645] ret_from_fork+0x1f/0x30
[ 210.141234] </TASK>
[ 210.152786] xe 0000:00:02.0: [drm] GT reset done
[ 210.157440] xe 0000:00:02.0: [drm] Timedout job: seqno=34, guc_id=6, flags=0x5
[ 210.166717] xe 0000:00:02.0: [drm] Timedout job: seqno=119, guc_id=6, flags=0x5
[ 210.175710] xe 0000:00:02.0: [drm] Timedout job: seqno=204, guc_id=6, flags=0x5
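To make the reported cycle easier to follow, here is a rough userspace analogy (my own sketch, not the actual xe code paths): three pthread mutexes stand in for the three lockdep classes, with the dma_fence_map annotation and the work_run flush both approximated as plain mutexes, and the three thread functions (names are mine) loosely mirroring xe_guc_submit_init, guc_engine_run_job and xe_guc_submit_stop.

/*
 * Userspace sketch of the three-way cycle lockdep reports above.
 * Each pthread mutex stands in for one lockdep class:
 *   lock_a ~ &guc->submission_state.lock
 *   lock_b ~ dma_fence_map      (really an annotation, modelled as a mutex)
 *   lock_c ~ (work_completion)(&sched->work_run)  (a flush, modelled as a mutex)
 *
 * init_path:        holds B, takes A   (dma_fence_map --> submission_state.lock)
 * run_job_path:     holds C, takes B   (work_run --> dma_fence_map)
 * submit_stop_path: holds A, takes C   (submission_state.lock --> work_run)
 * Together that is the A -> C -> B -> A cycle in the splat.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* submission_state.lock */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* dma_fence_map */
static pthread_mutex_t lock_c = PTHREAD_MUTEX_INITIALIZER; /* work_run flush */

static void *init_path(void *arg)        /* B then A */
{
	(void)arg;
	pthread_mutex_lock(&lock_b);
	sleep(1);                        /* widen the race window */
	pthread_mutex_lock(&lock_a);
	pthread_mutex_unlock(&lock_a);
	pthread_mutex_unlock(&lock_b);
	return NULL;
}

static void *run_job_path(void *arg)     /* C then B */
{
	(void)arg;
	pthread_mutex_lock(&lock_c);
	sleep(1);
	pthread_mutex_lock(&lock_b);
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_c);
	return NULL;
}

static void *submit_stop_path(void *arg) /* A then C -- closes the cycle */
{
	(void)arg;
	pthread_mutex_lock(&lock_a);
	sleep(1);
	pthread_mutex_lock(&lock_c);
	pthread_mutex_unlock(&lock_c);
	pthread_mutex_unlock(&lock_a);
	return NULL;
}

int main(void)
{
	pthread_t t[3];

	pthread_create(&t[0], NULL, init_path, NULL);
	pthread_create(&t[1], NULL, run_job_path, NULL);
	pthread_create(&t[2], NULL, submit_stop_path, NULL);
	for (int i = 0; i < 3; i++)
		pthread_join(t[i], NULL); /* never returns once all three block */
	printf("no deadlock this run\n");
	return 0;
}

Run it and all three threads block after about a second on the same submission_state.lock -> work_run -> dma_fence_map -> submission_state.lock cycle lockdep complains about above.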
I had never seen this before, but hit it once after rebasing on the latest Xe branch, so it could be something in these new patches:
2022-10-28 12:37 +0000 Philippe Lecluse M─┐ [xe] {gitlab-xe/xe} Merge branch 'more_guc_options' into 'xe'
2022-10-28 12:19 +0000 Matthew Brost M─│─┐ Merge branch 'pvc_fixes' into 'xe'
2022-10-28 13:24 +0200 Philippe Lecluse │ o │ drm/xe: add more guc options
2022-10-27 15:17 -0700 Matthew Brost │ │ o drm/xe: Don't define MEDIA_GUC_HOST_INTERRUPT twice
2022-10-27 15:16 -0700 Matthew Brost │ │ o drm/xe: Don't shadow variables
2022-10-27 15:09 -0700 Matthew Brost │ │ o drm/xe: Initialize master_ctl in IRQ handler
2022-10-27 13:04 -0700 Matthew Brost │ │ o drm/xe: Fixup xe_migrate dma-resv usage
2022-10-27 12:54 -0700 Matthew Brost │ │ o drm/xe: Wait on VM's kernel dma-resv slots
2022-10-27 10:28 -0700 Matthew Brost │ │ o drm/xe: Fix munmap style unbinds
2022-10-26 19:35 -0700 Matthew Brost │ │ o drm/xe: Start using vfuncs in xe_ring_ops
2022-10-26 18:05 -0700 Matthew Brost │ │ o drm/xe: Validate BO on CPU fault
2022-10-26 17:07 -0700 Matthew Brost │ │ o drm/xe: Only handle faults on VMs in fault mode
2022-10-26 17:04 -0700 Matthew Brost │ │ o drm/xe: Optimize bind path to use CPU if in-fences signaled
2022-10-26 16:48 -0700 Matthew Brost │ │ o drm/xe: Update xe_sync to work with multiple GTs and binding
2022-10-26 15:25 -0700 Matthew Brost │ │ o drm/xe: Sync up ring ops with i915 implementation
I will share more information if I can find a better way to reproduce it.