LSPCON ("Level Shifter and Protocol CONverter") refers to the DP -> HDMI adapter used on these systems. It is a separate downstream device, and after a suspend/resume cycle we need it to settle into its PCON mode before using it. The messages here indicate that although the LSPCON responds to DPCD reads on the aux channel following resume, when we try to check the mode (LS or PCON) by doing DPCD reads of offset 41, all of those reads return "defer" until we eventually give up and declare a timeout.
The higher-level logic does retry probing the LSPCON mode, and the LSPCON finally starts responding again after more than a second has passed (658.672242 -> 659.860423).
It's hard to say why the LSPCON flakes out for over a second and fails to respond to us, but there have been a few upstream changes to extend the timeouts in places (e.g., "drm/i915: Increase LSPCON timeout"). From the CI database, it looks like the issue became significantly less common once those timeouts were extended (last seen two months ago, and the previous occurrence was five months before that). We could probably eliminate this completely if we kept extending the timeouts far enough, but that would likely lead to a poor user experience in situations where we legitimately do need to time out on an operation (the commit message for the commit above does indicate they chose 400ms rather than the original 1000ms for this reason).
Given the rarity of this problem and the lack of user-visible impact (the higher-level code does retry further and gets a response, as we can see in the logs), I think it's safe to downgrade this bug to 'low' exposure.
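To illustrate the interaction between the per-attempt timeout and the higher-level retry, here is a rough simulation. It is only a sketch and not the i915 code: `FakeLspcon`, `wait_for_mode`, `probe_with_retries`, and the 10 ms cost per read are all made up for illustration. The only values taken from this report are the 400 ms timeout and the roughly 1188 ms gap seen in the log (659.860423 - 658.672242).

```python
from dataclasses import dataclass


@dataclass
class FakeLspcon:
    """Simulated adapter that answers "defer" until ready_at_ms (hypothetical)."""
    ready_at_ms: float
    now_ms: float = 0.0

    def read_mode(self):
        # Assumption: every simulated aux read costs 10 ms of fake time.
        self.now_ms += 10
        # None stands in for a "defer" reply.
        return "PCON" if self.now_ms >= self.ready_at_ms else None


def wait_for_mode(dev, wanted, timeout_ms):
    """Inner poll: re-read the mode until it matches or the timeout expires."""
    deadline = dev.now_ms + timeout_ms
    while dev.now_ms < deadline:
        if dev.read_mode() == wanted:
            return True
    return False  # this is where "mode hasn't settled" would be logged


def probe_with_retries(dev, wanted, timeout_ms, max_attempts):
    """Outer retry: return the 1-based attempt that succeeded, or None."""
    for attempt in range(1, max_attempts + 1):
        if wait_for_mode(dev, wanted, timeout_ms):
            return attempt
    return None


if __name__ == "__main__":
    dev = FakeLspcon(ready_at_ms=1188)
    print(probe_with_retries(dev, "PCON", timeout_ms=400, max_attempts=5))
```

In this model a 400 ms timeout fails twice and succeeds on the third attempt, which matches the pattern in the logs: an error from the inner wait, then a later success from the retry. A timeout longer than the stall would succeed on the first attempt, at the cost of waiting that long whenever the device is genuinely unresponsive.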
Equivalent query: runconfig_tag IS IN ["DRM-TIP"] AND (machine_name IS IN ["shard-apl5", "shard-apl7", "fi-bxt-dsi", "shard-apl8", "shard-apl6", "shard-apl1", "fi-apl-guc", "shard-apl2", "shard-apl4", "fi-bxt-j4205", "shard-apl", "shard-apl3", "fi-apl-nasher"] OR machine_tag IS IN ["APL"]) AND ((testsuite_name = "IGT" AND test_name IS IN ["igt@kms_flip@flip-vs-suspend-interruptible", "igt@pm_rpm@system-suspend-execbuf", "igt@i915_pm_rpm@system-suspend-execbuf", "igt@kms_frontbuffer_tracking@fbc-1p-rteBXT"])) AND ((testsuite_name = "IGT" AND status_name IS IN ["dmesg-warn"])) AND dmesg ~= '\*ERROR\* LSPCON mode hasn't settled'
@l4kshmi can we please close issue#188, issue#92, and issue#56 as duplicates and keep only issue#165?
All of these issues have the same signature, "LSPCON mode hasn't settled", though they occur on different platforms.
The issue#188 filter can be updated to cover only link training.
With this we will be able to concentrate on a single GitLab issue.
Equivalent query: runconfig_tag IS IN ["DRM-TIP"] AND (machine_name IS IN ["shard-skl", "shard-skl1", "shard-skl2", "shard-skl3", "shard-skl4", "shard-skl5", "shard-skl6", "fi-skl-6770hq", "shard-skl7", "shard-skl8", "shard-skl9", "fi-skl-6260u", "shard-skl10", "fi-skl-lmem", "fi-skl-6700hq", "fi-skl-6600u", "fi-skl-6700k2", "fi-skl-gvtdvm", "fi-skl-guc", "fi-cfl-8700k", "fi-cfl-u", "fi-cfl-s3", "fi-cfl-guc", "fi-cfl-u2", "fi-cfl-8109u", "fi-skl-iommu", "fi-skl-caroline"] OR machine_tag IS IN ["CFL", "SKL"]) AND ((testsuite_name = "IGT" AND status_name IS IN ["dmesg-warn", "dmesg-fail"])) AND dmesg ~= '\*ERROR\* LSPCON mode hasn't settled'
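For anyone checking whether a given result would be caught by the dmesg part of these filters, here is a minimal sketch of that clause in Python. It assumes `~=` means an unanchored regular-expression search, which is my reading of the query and is not confirmed here. `matches_filter` is a hypothetical helper, and the machine and test-name clauses are left out.

```python
import re

# Pattern copied from the `dmesg ~=` clause of the queries above.
PATTERN = re.compile(r"\*ERROR\* LSPCON mode hasn't settled")


def matches_filter(dmesg, status, statuses=("dmesg-warn", "dmesg-fail")):
    """Return True if the status is covered and the dmesg text matches.

    Assumption: the filter does an unanchored search anywhere in dmesg.
    """
    return status in statuses and PATTERN.search(dmesg) is not None


if __name__ == "__main__":
    # Made-up log line for illustration only; not copied from a CI run.
    sample = "[drm] *ERROR* LSPCON mode hasn't settled"
    print(matches_filter(sample, "dmesg-warn"))
```

The asterisks are escaped because they are literal characters in the kernel message, so a line without them would not match.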