[CI][SHARDS] All tests - dmesg-warn/dmesg-fail - ERROR LSPCON mode hasn't settled

added CI feature: display/LSPCON platform: BXT priority::low severity::normal + 1 deleted label

Martin Peres @mupuf said:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5032/shard-apl7/igt@pm_rpm@system-suspend-execbuf.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5029/shard-apl7/igt@pm_rpm@system-suspend-execbuf.html

<3> [633.225687] [drm:lspcon_wait_mode [i915]] ERROR LSPCON mode hasn't settled
<3> [633.362409] [drm:lspcon_change_mode.constprop.4 [i915]] ERROR Error reading LSPCON mode
<3> [633.362506] [drm:intel_dp_detect [i915]] ERROR LSPCON resume failed

Swati2 Sharma @swati2.sharma said:

Updated CI results?

LAKSHMINARAYANA VUDUM @l4kshmi said:

(In reply to Swati Sharma from comment 2)

Updated CI results?

Last seen on IGT_4777_full (2 months / 1284 runs ago), this issue used to occur once in 1-3 weeks or 3-974 runs.
Dropping the priority to Medium.

CI Bug Log said:

A CI Bug Log filter associated to this bug has been updated:

APL: random tests - dmesg-warn - LSPCON mode hasn't settled

New failures caught by the filter:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6361/shard-apl5/igt@kms_flip@flip-vs-suspend-interruptible.html

Matt Roper @mattrope said:

LSPCON refers to a DP -> HDMI adapter used on these systems ("Level Shifter and Protocol CONverter"); it's a separate downstream device and when we perform a suspend/resume cycle, we need to settle into its PCON mode before using it. The messages here indicate that although the LSPCON is responding to DPCD reads on the aux channel following resume, when we try to check the mode (LS or PCON) by doing DPCD reads of offset 41, all of those reads return "defer" until we eventually give up and declare a timeout.

Higher-level logic does itself retry probing the LSPCON mode, and the LSPCON finally starts responding again after more than a second has passed (658.672242 -> 659.860423).

It's hard to say why the LSPCON flakes out for over a second and fails to respond to us, but there have been a few upstream changes to extend the timeouts in places (e.g., "drm/i915: Increase LSPCON timeout"). From the CI database, it looks like the issue became significantly less common once those timeouts were extended (last seen two months ago, and the previous occurrence was five months before that). We could probably eliminate this completely if we kept extending timeouts far enough, but that would likely lead to a poor user experience in situations where we legitimately do need to time out on an operation (the commit message for the commit above does indicate they chose 400ms rather than the original 1000ms for this reason).

Due to the rarity of this problem and the lack of user-visible impact (the higher-level code does retry further and gets a response, as we can see in the logs), I think it's safe to downgrade this bug to 'low' exposure.
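The wait-and-retry behaviour described in this comment can be illustrated with a small simulation. This is only a sketch, not the i915 code: `FakeLspcon`, `wait_mode`, `resume`, the 10 ms poll interval, and the 1200 ms "unresponsive" window are all hypothetical, chosen to mirror the "defers for over a second" observation and the 400 ms timeout mentioned above.

```python
from dataclasses import dataclass

LSPCON_MODE_PCON = "PCON"


class AuxDefer(Exception):
    """Raised by the simulated adapter when a mode read is answered with 'defer'."""


@dataclass
class FakeLspcon:
    """Hypothetical adapter that defers every mode read until ready_at_ms."""
    ready_at_ms: int
    now_ms: int = 0  # simulated clock, advanced only by sleep()

    def read_mode(self):
        if self.now_ms < self.ready_at_ms:
            raise AuxDefer()
        return LSPCON_MODE_PCON

    def sleep(self, ms):
        self.now_ms += ms


def wait_mode(dev, wanted, timeout_ms, poll_ms=10):
    """Poll the mode until it equals `wanted` or `timeout_ms` has elapsed."""
    deadline = dev.now_ms + timeout_ms
    while True:
        try:
            if dev.read_mode() == wanted:
                return True
        except AuxDefer:
            pass  # adapter is not answering yet; keep polling
        if dev.now_ms >= deadline:
            return False
        dev.sleep(poll_ms)


def resume(dev, timeout_ms, attempts):
    """Higher-level retry loop; returns (succeeded, timeouts_logged)."""
    timeouts = 0
    for _ in range(attempts):
        if wait_mode(dev, LSPCON_MODE_PCON, timeout_ms):
            return True, timeouts
        timeouts += 1  # this is where "LSPCON mode hasn't settled" would be logged
    return False, timeouts


if __name__ == "__main__":
    # 400 ms timeout: two timeouts are logged before the third attempt succeeds.
    print(resume(FakeLspcon(ready_at_ms=1200), timeout_ms=400, attempts=5))   # (True, 2)
    # 1500 ms timeout: no error is logged, but a dead adapter would block far longer.
    print(resume(FakeLspcon(ready_at_ms=1200), timeout_ms=1500, attempts=1))  # (True, 0)
```

In this model the short timeout produces error messages but still recovers through the outer retry, while the long timeout hides the messages at the cost of a longer wait whenever the adapter is genuinely absent, which is the trade-off discussed above.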

assigned to @swati2.sharma

A CI Bug Log filter associated to this bug has been updated by Lakshmi Vudum:

Description: BXT APL: random tests - dmesg-warn - LSPCON mode hasn't settled

Equivalent query: runconfig_tag IS IN ["DRM-TIP"] AND (machine_name IS IN ["shard-apl5", "shard-apl7", "fi-bxt-dsi", "shard-apl8", "shard-apl6", "shard-apl1", "fi-apl-guc", "shard-apl2", "shard-apl4", "fi-bxt-j4205", "shard-apl", "shard-apl3", "fi-apl-nasher"] OR machine_tag IS IN ["APL"]) AND ((testsuite_name = "IGT" AND test_name IS IN ["igt@kms_flip@flip-vs-suspend-interruptible", "igt@pm_rpm@system-suspend-execbuf", "igt@i915_pm_rpm@system-suspend-execbuf", "igt@kms_frontbuffer_tracking@fbc-1p-rteBXT"])) AND ((testsuite_name = "IGT" AND status_name IS IN ["dmesg-warn"])) AND dmesg ~= '\*ERROR\* LSPCON mode hasn't settled'
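To make the dmesg part of the query concrete, here is a minimal sketch of how such a pattern selects log lines. It assumes that `~=` behaves like an unanchored regular-expression search, which may not match the CI Bug Log implementation exactly. The helper name `matching_lines` is hypothetical, and the sample lines are modeled on the dmesg excerpt in this thread, assuming the raw log carries the `*ERROR*` marker the filter looks for.

```python
import re

# Pattern taken from the filter's dmesg clause. The asterisks are escaped
# because "*" is a regex metacharacter; the apostrophe needs no escaping.
PATTERN = re.compile(r"\*ERROR\* LSPCON mode hasn't settled")


def matching_lines(dmesg_text):
    """Return the dmesg lines that the pattern would catch."""
    return [line for line in dmesg_text.splitlines() if PATTERN.search(line)]


SAMPLE = """\
<3> [633.225687] [drm:lspcon_wait_mode [i915]] *ERROR* LSPCON mode hasn't settled
<3> [633.362409] [drm:lspcon_change_mode.constprop.4 [i915]] *ERROR* Error reading LSPCON mode
<3> [633.362506] [drm:intel_dp_detect [i915]] *ERROR* LSPCON resume failed
"""

if __name__ == "__main__":
    for line in matching_lines(SAMPLE):
        print(line)
```

Only the first sample line matches. The follow-up "Error reading LSPCON mode" and "LSPCON resume failed" messages are not covered by this pattern, so a filter that should also catch them would need a broader expression.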

New failures caught by the filter:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_432/fi-bxt-dsi/igt@kms_flip@flip-vs-suspend.html

@l4kshmi, can we please close issue#188, issue#92 and issue#56 as duplicates and keep only issue#165? All of these issues have the same signature, "LSPCON mode hasn't settled", though they occur on different platforms.

The issue#188 filter can be updated to cover only link training.

With this, we will be able to concentrate on a single GitLab issue.

marked #188 (closed) as a duplicate of this issue

marked #56 (closed) as a duplicate of this issue

The CI Bug Log issue associated to this bug has been updated by Lakshmi Vudum.

New filters associated

SKL: all tests - dmesg-warn / dmesg-fail - ERROR LSPCON mode hasn't settled (No new failures associated)
KBL: All tests - LSPCON mode hasn't settled (No new failures associated)

A CI Bug Log filter associated to this bug has been updated by Lakshmi Vudum:

Description: SKL CFL: all tests - dmesg-warn / dmesg-fail - *ERROR* LSPCON mode hasn't settled

Equivalent query: runconfig_tag IS IN ["DRM-TIP"] AND (machine_name IS IN ["shard-skl", "shard-skl1", "shard-skl2", "shard-skl3", "shard-skl4", "shard-skl5", "shard-skl6", "fi-skl-6770hq", "shard-skl7", "shard-skl8", "shard-skl9", "fi-skl-6260u", "shard-skl10", "fi-skl-lmem", "fi-skl-6700hq", "fi-skl-6600u", "fi-skl-6700k2", "fi-skl-gvtdvm", "fi-skl-guc", "fi-cfl-8700k", "fi-cfl-u", "fi-cfl-s3", "fi-cfl-guc", "fi-cfl-u2", "fi-cfl-8109u", "fi-skl-iommu", "fi-skl-caroline"] OR machine_tag IS IN ["CFL", "SKL"]) AND ((testsuite_name = "IGT" AND status_name IS IN ["dmesg-warn", "dmesg-fail"])) AND dmesg ~= '\*ERROR\* LSPCON mode hasn't settled'

New failures caught by the filter: