[CI][SHARDS] igt@gem_eio@unwedge-stress - fail - Failed assertion: med < limit && max < 5 * limit
Submitted by Martin Peres @mupuf
Assigned to Chris Wilson @ickle
Link to original bug (#109661)
Description
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5614/shard-snb7/igt@gem_eio@unwedge-stress.html
Starting subtest: unwedge-stress
(gem_eio:1403) CRITICAL: Test assertion failure function check_wait_elapsed, file ../tests/i915/gem_eio.c:292:
(gem_eio:1403) CRITICAL: Failed assertion: med < limit && max < 5 * limit
(gem_eio:1403) CRITICAL: Wake up following reset+wedge took 187.662+-491.413ms (min:8.917ms, median:22.893ms, max:1810.883ms); limit set to 250ms on average and 1250ms maximum
Subtest unwedge-stress failed.
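For context, the check that trips here is a bound on the median and worst-case wake-up latency. Below is a minimal standalone C sketch of that logic; it is not the actual tests/i915/gem_eio.c code, and the sample values in main() are made up to mirror the numbers in the failure message (limit of 250 ms, one slow outlier):

/*
 * Minimal sketch of the failing check (not the real gem_eio.c code):
 * sort the per-iteration wake-up times, take the median and maximum,
 * and require med < limit and max < 5 * limit.  With limit = 250 ms,
 * an outlier like the observed 1810.883 ms blows through the 1250 ms
 * ceiling and the assertion fires.
 */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
	const double *x = a, *y = b;

	return (*x > *y) - (*x < *y);
}

static void check_wait_elapsed(double *elapsed_ms, size_t n, double limit_ms)
{
	double med, max;

	qsort(elapsed_ms, n, sizeof(*elapsed_ms), cmp_double);
	med = elapsed_ms[n / 2];
	max = elapsed_ms[n - 1];

	printf("median %.3f ms, max %.3f ms (limits %.0f ms / %.0f ms)\n",
	       med, max, limit_ms, 5 * limit_ms);
	assert(med < limit_ms && max < 5 * limit_ms);
}

int main(void)
{
	/* Hypothetical samples in ms, not CI data: one slow outlier. */
	double samples[] = { 9.0, 15.0, 22.0, 25.0, 1810.0 };

	check_wait_elapsed(samples, sizeof(samples) / sizeof(samples[0]), 250.0);
	return 0;
}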
Activity
Bugzilla Migration User added labels: CI, feature: GEM, platform: SNB, priority::high, severity::normal (+ 1 deleted label)
CI Bug Log said: The CI Bug Log issue associated to this bug has been updated.
New filters associated:
* SNB: igt@gem_eio@unwedge-stress - fail - Failed assertion: med < limit && max < 5 * limit
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5614/shard-snb7/igt@gem_eio@unwedge-stress.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5615/shard-snb7/igt@gem_eio@unwedge-stress.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5622/shard-snb1/igt@gem_eio@unwedge-stress.html

Chris Wilson @ickle said: It exceeded 3s in some runs. Gah.
https://patchwork.freedesktop.org/patch/286706/ is my hope.

Chris Wilson @ickle said: Fingers crossed once again,
commit 8f54b3c6c921275d10e33746553c40294ffa0d58
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Feb 19 12:21:57 2019 +0000
drm/i915: Trim delays for wedging
CI still reports the occasional multi-second delay for resets, in
particular along the wedge+recovery paths. As the likely, and unbounded,
delay here is from sync_rcu, use the expedited variant instead.
Testcase: igt/gem_eio/unwedge-stress
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190219122215.8941-7-chris@chris-wilson.co.uk

CI Bug Log said: A CI Bug Log filter associated to this bug has been updated:
{- SNB: igt@gem_eio@unwedge-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
New failures caught by the filter:
Chris Wilson @ickle said: Now that's just cruel, having supplied a patch specifically for the unwedge-stress subtest, you cross-pollute it with reset-stress!
Not that it'll make much difference, but there is quite a difference in driver paths between the two subtests.

Martin Peres @mupuf said: (In reply to Chris Wilson from comment 5)
> Now that's just cruel, having supplied a patch specifically for the unwedge-stress subtest, you cross-pollute it with reset-stress!
> Not that it'll make much difference, but there is quite a difference in driver paths between the two subtests.
Sorry about that! However, unwedge-stress is still failing:
- https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4855/shard-snb5/igt@gem_eio@unwedge-stress.html
- https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4858/shard-snb4/igt@gem_eio@unwedge-stress.html
- https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5671/shard-snb4/igt@gem_eio@unwedge-stress.html
- https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5672/shard-snb2/igt@gem_eio@unwedge-stress.html
If the fix for these issues does not also fix the reset-stress issues, we'll create a new bug!

Chris Wilson @ickle said: We're just at the mercy of an unbounded wait. We're using sync_rcu_expedited everywhere we can here and still we get delayed. I'm tempted to remove the failure for the max timeout being several seconds, so long as the median is reasonable (all the limits are arbitrary anyway).
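For illustration, the fix referenced above swaps the ordinary RCU grace-period wait for the expedited variant on the wedge path. A minimal kernel-style sketch of the idea follows; the helper is hypothetical and this is not the actual i915 code:

/*
 * Hypothetical helper, not the actual i915 wedge path: quiesce RCU
 * readers before tearing down in-flight requests.
 */
#include <linux/rcupdate.h>

static void example_wedge_quiesce(void)
{
	/*
	 * Previously: synchronize_rcu();
	 * A normal grace period is unbounded in practice and was the
	 * suspected source of the multi-second stalls seen in CI.
	 */
	synchronize_rcu_expedited();

	/* ...then mark the device wedged and cancel outstanding requests... */
}

The trade-off is that the expedited variant completes much sooner but pokes every CPU with IPIs, which is tolerable on an error path that is already disruptive.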
CI Bug Log said: A CI Bug Log filter associated to this bug has been updated:
{- SNB: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
New failures caught by the filter:
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5949/shard-glk8/igt@gem_eio@unwedge-stress.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5950/shard-glk8/igt@gem_eio@unwedge-stress.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5951/shard-glk5/igt@gem_eio@unwedge-stress.html

LAKSHMINARAYANA VUDUM @l4kshmi said: (In reply to CI Bug Log from comment 8)
> A CI Bug Log filter associated to this bug has been updated:
> {- SNB: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
> {+ SNB GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
> New failures caught by the filter:
> * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5947/shard-glk1/igt@gem_eio@unwedge-stress.html
> * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5949/shard-glk8/igt@gem_eio@unwedge-stress.html
> * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5950/shard-glk8/igt@gem_eio@unwedge-stress.html
> * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5951/shard-glk5/igt@gem_eio@unwedge-stress.html

Also seen on GLK.

CI Bug Log said: A CI Bug Log filter associated to this bug has been updated:
{- SNB GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
New failures caught by the filter:
Chris Wilson @ickle said: It looks like it was the reset worker feeding in the restart request that dragged us down.
commit 79ffac8599c4d8aa84d313920d3d86d7361c252b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Apr 24 21:07:17 2019 +0100
drm/i915: Invert the GEM wakeref hierarchy

CI Bug Log said: A CI Bug Log filter associated to this bug has been updated:
{- SNB BYT GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
New failures caught by the filter:
CI Bug Log said: A CI Bug Log filter associated to this bug has been updated:
{- SNB BYT SKL GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
New failures caught by the filter:
LAKSHMINARAYANA VUDUM @l4kshmi said: (In reply to CI Bug Log from comment 13)
> A CI Bug Log filter associated to this bug has been updated:
> {- SNB BYT SKL GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
> {+ SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
> New failures caught by the filter:
> * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_293/fi-icl-u3/igt@gem_eio@unwedge-stress.html
Reopened this bug as this failure happened on ICL.

CI Bug Log said: A CI Bug Log filter associated to this bug has been updated:
{- SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
No new failures caught with the new filter.

Shuang He @shuang uploaded an attachment (144631, "attachment-13473-0.html"):
> Dear sender, I am on leave during ww26.2. Please call my cell phone if urgent; sorry for the inconvenience this might bring to you.

Chris Wilson @ickle said: For reference,
commit f0e39642f6f8da5406627bfa79c6600df949e203 (upstream/master, origin/master, origin/HEAD)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Jul 2 12:40:45 2019 +0100
i915/gem_eio: Assert the hanging request is correctly identified
When forcing a reset, it is crucial that the kernel correctly identifies
the injected hang. Verify this is the case for reset-stress.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
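As an illustrative aside, userspace can observe whether a hang was blamed on a particular context through the DRM_IOCTL_I915_GET_RESET_STATS ioctl. The fragment below assumes an already-open DRM fd and the default context; the helper names are made up and this is not the gem_eio code:

/*
 * Illustrative fragment, not IGT code: after injecting a hang, the
 * guilty context's batch_active counter should increase.  If only the
 * global reset_count moves, the injected request was likely not blamed.
 */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static int query_reset_stats(int drm_fd, unsigned int ctx_id,
			     struct drm_i915_reset_stats *stats)
{
	memset(stats, 0, sizeof(*stats));
	stats->ctx_id = ctx_id;	/* 0 = default context */

	return ioctl(drm_fd, DRM_IOCTL_I915_GET_RESET_STATS, stats);
}

static void check_blame(int drm_fd, const struct drm_i915_reset_stats *before)
{
	struct drm_i915_reset_stats after;

	if (query_reset_stats(drm_fd, 0, &after))
		return;

	if (after.reset_count > before->reset_count &&
	    after.batch_active == before->batch_active)
		fprintf(stderr,
			"reset recorded but no active batch was blamed; "
			"possibly hangcheck rather than a targeted reset\n");
}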
One hypothesis is that we are not resetting the guilty request and so are hitting a hangcheck instead.

CI Bug Log said: A CI Bug Log filter associated to this bug has been updated:
{- SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL APL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
New failures caught by the filter:
Chris Wilson @ickle said:
<7> [944.138584] [IGT] Forcing GPU reset
<7> [944.138848] [drm:i915_reset_device [i915]] resetting chip
<5> [944.138957] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<7> [944.139197] [IGT] Checking that the GPU recovered
<5> [944.162438] Setting dangerous option reset - tainting kernel
<7> [944.275166] [drm:i915_reset_device [i915]] resetting chip
<5> [944.276899] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<5> [944.277178] Setting dangerous option reset - tainting kernel
<7> [944.277284] [IGT] Forcing GPU reset
<7> [944.277557] [drm:i915_reset_device [i915]] resetting chip
<5> [944.278273] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<7> [944.278579] [IGT] Checking that the GPU recovered
<5> [944.302432] Setting dangerous option reset - tainting kernel
<7> [946.381889] [drm:i915_reset_device [i915]] resetting chip
<5> [946.382011] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<5> [946.382270] Setting dangerous option reset - tainting kernel
<7> [946.382345] [IGT] Forcing GPU reset
<7> [946.382557] [drm:i915_reset_device [i915]] resetting chip
<5> [946.383318] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<7> [946.383621] [IGT] Checking that the GPU recovered
<6> [946.475026] [IGT] gem_eio: exiting, ret=98
Which confirms that normally we expect quick reset+recovery cycles (with a reset period of 100ms between iterations). It also tells us that the delay (here about 2s, between the 944.302432 and 946.381889 entries) is before i915_reset_device (although we could do with drm.debug=7 to be sure), i.e. in the preamble of i915_handle_error(). Of note, the only thing there is synchronize_rcu_expedited(). :|
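As a side note on the drm.debug=7 suggestion: on most kernels the drm debug mask can also be raised at runtime, e.g. by writing 7 to /sys/module/drm/parameters/debug, so the extra logging can be enabled without a reboot when chasing an intermittent stall like this.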