(In reply to Daniele Ceraolo Spurio from comment 3)
This is not happening with the new FW, so closing.
It has happened 9 times in the last 14 drmtip runs, and has never gone unseen for more than 4 runs in a row. This means you did not follow the 10x rule shown in the bug assessment process. Please follow all the steps carefully and do not skip directly to closing the issue.
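The arithmetic behind that objection can be sketched as follows. This is only an illustration: it assumes the "10x" rule means the issue must go unseen for at least ten times its average reproduction interval before closing, which is an interpretation and not a quote from the bug assessment process. The function names are hypothetical.

```python
def required_clean_runs(hits, runs, factor=10):
    """Clean runs needed before closing, assuming (hypothetically) that the
    rule is `factor` times the average number of runs per occurrence."""
    if hits <= 0:
        raise ValueError("hits must be positive")
    # Ceiling division: round up to a whole number of runs.
    return -(-factor * runs // hits)


def can_close(clean_runs_since_last_hit, hits, runs, factor=10):
    """True if the issue has stayed unseen long enough under that assumption."""
    return clean_runs_since_last_hit >= required_clean_runs(hits, runs, factor)


# Numbers from the comment above: 9 hits in 14 runs, longest quiet streak 4.
print(required_clean_runs(9, 14))   # 16
print(can_close(4, 9, 14))          # False
```

Under this reading, a quiet streak of 4 runs is well short of the roughly 16 clean runs that would be needed.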
I'm currently following this bug. In the last look through the CI results I can see that this is still occurring but I haven't been able to identify the exact issue yet.
I have been able to reproduce this bug on an ICL with the gem_ctx_isolation@vcs0-s3 and i915_suspend@forcewake tests. On these runs I completely lose the DUT after the failed test run. The next step will be to get some serial logs for this.
This issue is recurrently seen on the following five tests: kms_vblank@pipe-c/b-continuation-suspend, gem_workarounds@suspend-resume-context, gem_ctx_isolation@vcs0-s3, i915_suspend@forcewake. For all these tests, I can see this issue happening locally without guc as well.
Local test result confirmed, but the CI evidence of being seen only on our -guc machines is compelling. Issue on kms tests might be a new regression. Re-adding firmware/guc to i915/feature.
Suja and I have been working on trying to duplicate this.
On ICL, the i915_suspend test just appears to hang (see below):
gta@ubt-18:~/ril-src/igt-gpu-tools$ sudo ./build/tests/i915_suspend
IGT-Version: 1.24-g5a6c6856 (x86_64) (Linux: 5.3.0+ x86_64)
Starting subtest: fence-restore-tiled2untiled
[cmd] rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Fri Sep 27 22:42:17 2019
checking the first canary object
checking the second canary object
Subtest fence-restore-tiled2untiled: SUCCESS (7.957s)
Starting subtest: fence-restore-untiled
[cmd] rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Fri Sep 27 22:42:39 2019
checking the first canary object
checking the second canary object
Subtest fence-restore-untiled: SUCCESS (6.978s)
Starting subtest: debugfs-reader
[cmd] rtcwake: assuming RTC uses UTC ... <
rtcwake: wakeup from "mem" using /dev/rtc0 at Fri Sep 27 22:43:02 2019 <------- seems to hang here?
However, with a serial port connected it turns out that the dut does not die after all, as we still have an interactive console and can see kernel messages.
It seems that the netdev isn't waking up, and that is why the test appears to hang and you can't ssh into it again.
Also, looking at the running processes, the test appears to still be running.
Lastly, we're seeing "PM: Cannot get swap device, try swapon -a" and
"PM: Cannot get swap writer" on the console. I wonder if the test is trying to hibernate and is expecting swap space?
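One way to confirm whether the DUT has any swap enabled is to read /proc/swaps, which lists active swap areas below a header line. The following is a minimal sketch; the helper names are made up for illustration and it assumes a Linux system with the usual /proc/swaps layout.

```python
from pathlib import Path


def active_swap_devices(proc_swaps_text):
    """Return the swap area names listed in /proc/swaps-style text.

    The first line is a column header; each later non-empty line describes
    one active swap area, with the file or device name in the first column.
    """
    devices = []
    for line in proc_swaps_text.splitlines()[1:]:
        fields = line.split()
        if fields:
            devices.append(fields[0])
    return devices


def swap_is_enabled(path="/proc/swaps"):
    """True if at least one swap area is active; False if none or unreadable."""
    try:
        return bool(active_swap_devices(Path(path).read_text()))
    except OSError:
        return False


if __name__ == "__main__":
    print("swap enabled:", swap_is_enabled())
```

If this reports no swap, that would be consistent with the "Cannot get swap device" message above, although it does not by itself prove that the test is attempting hibernation.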
I have the console going and it looks like the machine is not really dead. The serial port is still interactive but the network appears dead, which is why you don’t see any output on your terminal, nor can you ssh into the dut.
From the serial console, the test is still running.
The error on the serial console seems to imply it is expecting the machine to have swap space enabled. Perhaps that is the reason the test just appears to hang. We now know the device does come out of suspend, only that the network isn’t restarted.
Sorry, this was a cut and paste repeat of what I was saying.
This bug has not been seen for about a week now on any of the platforms it was previously seen on. I will continue to track this bug and update if there are any changes.
This issue was recently seen again on the gem_eio@in-flight-suspend and kms_pipe_crc_basic@suspend-read-crc-pipe-b tests. Initially, the previously incomplete tests passed after enabling swap on guc devices. After assessing the new logs, it looks like neither of these issues is guc specific: the same issues are seen on non-guc systems as well. This particular bug appears to be capturing general issues that happen to be seen on guc systems. I do not think they are specific to guc.