igt@xe_pm@d3[cold|hot]-mocs - abort - kworker.* is trying to acquire lock:, at: drm_modeset_lock_all, but task is already holding lock:, at: xe_pm_runtime_suspend

added feature: power/runtime PM platform: DG2 labels

The CI Bug Log issue associated to this bug has been updated by rveesamx.

New filters associated

DG2 : igt@xe_pm@d3[cold|hot]-mocs - abort - kworker.* is trying to acquire lock:, at: drm_modeset_lock_all, but task is already holding lock:, at: xe_pm_runtime_suspend

@ideak I need your help on this case here. I had seen since 6.10-rc3 on my DG2 #2255 (closed) but only now I stopped to "bisect" and end up in your commit b1d90a86 ("drm/xe: Use the encoder suspend helper also used by the i915 driver").

I know, it makes absolutely no sense to me as well. but I run multiple experiments and the attached revert is what makes this lockdep warning to go away.

Thoughts?

@rodrigovivi , yes, it's odd why there is a lockdep issue with b1d90a86 and not without it. It's about the dependency between drm_modeset_lock_all -> mode_config.mutex and a runtime PM reference, which is a problem with and without that commit. In any case the root cause is that intel_encoder_suspend_all() / drm_mode_config_reset() should be only called during system suspend/resume not during runtime s/r.

@ideak yeap, at least I'm not crazy alone! :)
I had also tried that, but then not all references are returned and I end up with this error:

  155.333358] xe 0000:03:00.0: [drm] drm_WARN_ON(power_domains->init_wakeref)
[  155.333535] WARNING: CPU: 10 PID: 151 at drivers/gpu/drm/i915/display/intel_display_power.c:2049 intel_power_domains_disable+0x111/0x170 [xe]
[  155.354223] Modules linked in: rfcomm snd_seq_dummy snd_hrtimer nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc binfmt_misc vfat fat snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl iwlmvm snd_sof_intel_hda_generic snd_sof_pci snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda snd_sof intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common x86_pkg_temp_thermal snd_sof_utils intel_powerclamp mac80211 snd_soc_acpi_intel_match snd_soc_acpi snd_soc_core snd_hda_codec_hdmi coretemp snd_compress snd_sof_intel_hda_mlink snd_hda_ext_core kvm_intel snd_hda_intel snd_intel_dspcfg libarc4 snd_hda_codec kvm snd_hwdep snd_hda_core iwlwifi snd_seq btusb btrtl snd_seq_device btintel iTCO_wdt rapl intel_pmc_bxt snd_pcm btbcm pmt_telemetry mei_hdcp mei_pxp ee1004
[  155.355051]  iTCO_vendor_support intel_cstate btmtk pmt_class cfg80211 bluetooth gigabyte_wmi intel_uncore wmi_bmof i2c_i801 snd_timer pcspkr mei_me i2c_smbus snd mei rfkill soundcore idma64 intel_vsec joydev intel_hid sparse_keymap acpi_pad acpi_tad loop xe drm_gpuvm crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni hid_logitech_hidpp polyval_generic i915 nvme r8169 ghash_clmulni_intel nvme_core sha512_ssse3 pinctrl_alderlake hid_logitech_dj fuse
[  155.486252] CPU: 10 UID: 0 PID: 151 Comm: kworker/10:1 Not tainted 6.11.0-rc5+ #38
[  155.493889] Hardware name: iBUYPOWER INTEL/B660 DS3H AC DDR4-Y1, BIOS F5 12/17/2021
[  155.501599] Workqueue: pm pm_runtime_work
[  155.505653] RIP: 0010:intel_power_domains_disable+0x111/0x170 [xe]
[  155.512001] Code: 4c 8b 6d 50 4d 85 ed 74 28 48 89 ef e8 58 b7 86 e8 48 c7 c1 60 6f 70 c1 4c 89 ea 48 c7 c7 60 64 70 c1 48 89 c6 e8 2f a7 eb e4 <0f> 0b e9 22 ff ff ff 48 b8 00 00 00 00 00 fc ff df 48 89 ea 48 c1
[  155.530870] RSP: 0018:ffffc90000c27a40 EFLAGS: 00010282
[  155.536137] RAX: dffffc0000000000 RBX: ffff888179ec8000 RCX: 0000000000000000
[  155.543320] RDX: 0000000000000002 RSI: 0000000000000004 RDI: 0000000000000001
[  155.550514] RBP: ffff88810de0a0c8 R08: 0000000000000001 R09: ffffed11cec00359
[  155.557700] R10: ffff888e76001acb R11: ffffffffac799650 R12: ffff888179ecb240
[  155.564895] R13: ffff88810ac03860 R14: ffff888179eca7e8 R15: ffff888179eca748
[  155.572081] FS:  0000000000000000(0000) GS:ffff888e75e00000(0000) knlGS:0000000000000000
[  155.580226] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  155.586015] CR2: 00007f18b4fffc38 CR3: 0000000b89670000 CR4: 0000000000750ef0
[  155.593200] PKRU: 55555554
[  155.595937] Call Trace:
[  155.598414]  <TASK>
[  155.600543]  ? __warn+0xc8/0x2c0
[  155.603804]  ? intel_power_domains_disable+0x111/0x170 [xe]
[  155.609530]  ? report_bug+0x2e6/0x390
[  155.613235]  ? handle_bug+0x79/0xa0
[  155.616762]  ? exc_invalid_op+0x13/0x40
[  155.620634]  ? asm_exc_invalid_op+0x16/0x20
[  155.624861]  ? intel_power_domains_disable+0x111/0x170 [xe]
[  155.630592]  xe_display_pm_suspend+0x69/0x1a0 [xe]
[  155.635520]  xe_pm_runtime_suspend+0x893/0xb60 [xe]
[  155.640531]  xe_pci_runtime_suspend+0x3b/0x1e0 [xe]
[  155.645553]  pci_pm_runtime_suspend+0x168/0x540
[  155.650126]  ? _raw_spin_unlock_irq+0x24/0x50
[  155.654528]  ? __pfx_pci_pm_runtime_suspend+0x10/0x10
[  155.659625]  __rpm_callback+0xa9/0x390
[  155.663421]  ? __pfx_pci_pm_runtime_suspend+0x10/0x10
[  155.668524]  rpm_callback+0x168/0x1b0
[  155.672227]  ? __pfx_pci_pm_runtime_suspend+0x10/0x10
[  155.677328]  rpm_suspend+0x227/0xdf0
[  155.680949]  ? __pfx_rpm_suspend+0x10/0x10
[  155.685092]  ? __pfx_lock_acquired+0x10/0x10
[  155.689410]  pm_runtime_work+0x100/0x120
[  155.693375]  process_one_work+0x83e/0x1740
[  155.697522]  ? worker_thread+0x299/0x1250
[  155.701577]  ? __pfx_process_one_work+0x10/0x10
[  155.706171]  ? assign_work+0x16c/0x240
[  155.709971]  worker_thread+0x717/0x1250
[  155.713845]  ? lockdep_hardirqs_on+0xc6/0x140
[  155.718252]  ? __kthread_parkme+0xba/0x1f0
[  155.722395]  ? __pfx_worker_thread+0x10/0x10
[  155.726708]  kthread+0x2e9/0x3d0
[  155.729979]  ? _raw_spin_unlock_irq+0x24/0x50
[  155.734384]  ? __pfx_kthread+0x10/0x10
[  155.738177]  ret_from_fork+0x2d/0x70
[  155.741787]  ? __pfx_kthread+0x10/0x10
[  155.745579]  ret_from_fork_asm+0x1a/0x30
[  155.749559]  </TASK>
[  155.751776] irq event stamp: 36163
[  155.755213] hardirqs last  enabled at (36197): [<ffffffffa63e99e9>] console_unlock+0x1e9/0x260
[  155.763890] hardirqs last disabled at (36230): [<ffffffffa63e99ce>] console_unlock+0x1ce/0x260
[  155.772572] softirqs last  enabled at (36264): [<ffffffffa621c7f8>] __irq_exit_rcu+0xc8/0x1d0
[  155.781156] softirqs last disabled at (36275): [<ffffffffa621c7f8>] __irq_exit_rcu+0xc8/0x1d0
[  155.789746] ---[ end trace 0000000000000000 ]---

what is totally strange given that intel_power_domains_disable() call during suspend comes before the encoder suspend anyway, :( so, something is really off

Yes, looks off. That wakeref is taken / released in intel_power_domains_disable() / enable() which are called from xe_display_pm_suspend()/resume(). So I can't see how the reference could be leaked if these calls are properly paired. intel_power_domains_disable() / enable() shouldn't be called either during runtime suspend / resume. Both of these issues are unrelated to the original lockdep issue though.

well, these are only called on d3cold runtime suspend/resume... we need them for when the power is lost. But right, the thing isn't related to the the lockdep indeed... The most obscure thing here is why the local old version of encoder_off didn't caused the lockdep to complain. And how to get to disable the encoders without locking issues.

oh! I just noticed that the old code was really entirely skipping this encoder suspend... I had a but on:

if (has_display(xe))
    return;

instead of

if (!has_display(xe))
   return;

at least it makes some sense on the behavior change...

Ok, that explains.

Ftr, again unrelated to the issue on this ticket, but I still think neither intel_display_power_disable()/enable() should be called, nor encoders should be suspended / resumed during runtime suspend/resume, whether d3cold is enabled or not. Both of these are relevant only for system suspend / resume.

okay, for now let's just remove the encoder supend/resume, and then next we look forward to reduce the sequence even further. As for the strange side effect that I notice when attempting that, the culprit was a bad conflict resolution in drm-tip rebuild. There the suspend sequence was getting called twice for d3cold case:

Refs: drm-intel-next-2024-08-29-185-gc1445d62a761
Merge: 947a38a1578f 87d8ecf01544
Author:     Rodrigo Vivi <rodrigo.vivi@intel.com>
AuthorDate: Thu Aug 29 09:52:19 2024 -0400
Commit:     Rodrigo Vivi <rodrigo.vivi@intel.com>
CommitDate: Thu Aug 29 09:52:19 2024 -0400

    Merge remote-tracking branch 'drm-xe/drm-xe-next' into drm-tip

    # Conflicts:
    #       drivers/gpu/drm/xe/display/xe_display.c
    #       drivers/gpu/drm/xe/xe_gt_pagefault.c
    #       drivers/gpu/drm/xe/xe_lrc.c
    #       drivers/gpu/drm/xe/xe_mmio.c
    #       drivers/gpu/drm/xe/xe_sync.c
    #       drivers/gpu/drm/xe/xe_wa.c

 drivers/gpu/drm/xe/xe_pm.c | 90 ++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 75 insertions(+), 15 deletions(-)

diff --cc drivers/gpu/drm/xe/xe_pm.c
index fcfb49af8c89,2600c936527e..2e2accd76fb2
--- a/drivers/gpu/drm/xe/xe_pm.c
+++ b/drivers/gpu/drm/xe/xe_pm.c
@@@ -366,9 -389,9 +389,11 @@@ int xe_pm_runtime_suspend(struct xe_dev
                xe_bo_runtime_pm_release_mmap_offset(bo);
        mutex_unlock(&xe->mem_access.vram_userfault.lock);

+       xe_display_pm_runtime_suspend(xe);
+
        if (xe->d3cold.allowed) {
 +              xe_display_pm_suspend(xe, true);
 +

I'm going to fix those now...

closed with commit 8da19441

The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.

mentioned in commit 4bfc9c55

igt@xe_pm@d3[cold|hot]-mocs - abort - kworker.* is trying to acquire lock:, at: drm_modeset_lock_all, but task is already holding lock:, at: xe_pm_runtime_suspend

Designs

Child items 0

Activity

New filters associated

Admin message

Admin message

igt@xe_pm@d3[cold|hot]-mocs - abort - kworker.* is trying to acquire lock:, at: drm_modeset_lock_all, but task is already holding lock:, at: xe_pm_runtime_suspend

Activity

New filters associated