[CI][SHARDS] igt@gem_eio@unwedge-stress - fail - Failed assertion: med < limit && max < 5 * limit
Submitted by Martin Peres @mupuf
Assigned to Chris Wilson @ickle
Link to original bug (#109661)
Description
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5614/shard-snb7/igt@gem_eio@unwedge-stress.html
Starting subtest: unwedge-stress
(gem_eio:1403) CRITICAL: Test assertion failure function check_wait_elapsed, file ../tests/i915/gem_eio.c:292:
(gem_eio:1403) CRITICAL: Failed assertion: med < limit && max < 5 * limit
(gem_eio:1403) CRITICAL: Wake up following reset+wedge took 187.662+-491.413ms (min:8.917ms, median:22.893ms, max:1810.883ms); limit set to 250ms on average and 1250ms maximum
Subtest unwedge-stress failed.
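For context, the check that trips here is a bound on the median and worst-case wake-up latency. Below is a minimal standalone C sketch of that logic; it is not the actual tests/i915/gem_eio.c code, and the sample values in main() are made up to mirror the numbers in the failure message (limit of 250 ms, one slow outlier):

/*
 * Minimal sketch of the failing check (not the real gem_eio.c code):
 * sort the per-iteration wake-up times, take the median and maximum,
 * and require med < limit and max < 5 * limit.  With limit = 250 ms,
 * an outlier like the observed 1810.883 ms blows through the 1250 ms
 * ceiling and the assertion fires.
 */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
	const double *x = a, *y = b;

	return (*x > *y) - (*x < *y);
}

static void check_wait_elapsed(double *elapsed_ms, size_t n, double limit_ms)
{
	double med, max;

	qsort(elapsed_ms, n, sizeof(*elapsed_ms), cmp_double);
	med = elapsed_ms[n / 2];
	max = elapsed_ms[n - 1];

	printf("median %.3f ms, max %.3f ms (limits %.0f ms / %.0f ms)\n",
	       med, max, limit_ms, 5 * limit_ms);
	assert(med < limit_ms && max < 5 * limit_ms);
}

int main(void)
{
	/* Hypothetical samples in ms, not CI data: one slow outlier. */
	double samples[] = { 9.0, 15.0, 22.0, 25.0, 1810.0 };

	check_wait_elapsed(samples, sizeof(samples) / sizeof(samples[0]), 250.0);
	return 0;
}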
Activity
Bugzilla Migration User added labels: CI, feature: GEM, platform: SNB, priority::high, severity::normal (+ 1 deleted label)
CI Bug Log said: The CI Bug Log issue associated to this bug has been updated.
New filters associated:
* SNB: igt@gem_eio@unwedge-stress - fail - Failed assertion: med < limit && max < 5 * limit
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5614/shard-snb7/igt@gem_eio@unwedge-stress.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5615/shard-snb7/igt@gem_eio@unwedge-stress.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5622/shard-snb1/igt@gem_eio@unwedge-stress.html

Chris Wilson @ickle said: It exceeded 3s in some runs. Gah.
https://patchwork.freedesktop.org/patch/286706/ is my hope.

Chris Wilson @ickle said: Fingers crossed once again,
commit 8f54b3c6c921275d10e33746553c40294ffa0d58
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Feb 19 12:21:57 2019 +0000
drm/i915: Trim delays for wedging
CI still reports the occasional multi-second delay for resets, in
particular along the wedge+recovery paths. As the likely, and unbounded,
delay here is from sync_rcu, use the expedited variant instead.
Testcase: igt/gem_eio/unwedge-stress
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190219122215.8941-7-chris@chris-wilson.co.uk

CI Bug Log said: A CI Bug Log filter associated to this bug has been updated:
{- SNB: igt@gem_eio@unwedge-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
New failures caught by the filter:
Chris Wilson @ickle said: Now that's just cruel, having supplied a patch specifically for the unwedge-stress subtest, you cross-pollute it with reset-stress!
Not that it'll make much difference, but there is quite a difference in driver paths between the two subtests.

Martin Peres @mupuf said: (In reply to Chris Wilson from comment 5)
> Now that's just cruel, having supplied a patch specifically for the unwedge-stress subtest, you cross-pollute it with reset-stress!
> Not that it'll make much difference, but there is quite a difference in driver paths between the two subtests.
Sorry about that! However, unwedge-stress is still failing:
- https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4855/shard-snb5/igt@gem_eio@unwedge-stress.html
- https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4858/shard-snb4/igt@gem_eio@unwedge-stress.html
- https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5671/shard-snb4/igt@gem_eio@unwedge-stress.html
- https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5672/shard-snb2/igt@gem_eio@unwedge-stress.html
If the fix for these issues does not also fix the reset-stress issues, we'll create a new bug!

Chris Wilson @ickle said: We're just at the mercy of an unbounded wait. We're using sync_rcu_expedited everywhere we can here and still we get delayed. I'm tempted to remove the failure for the max timeout being several seconds, so long as the median is reasonable (all the limits are arbitrary anyway).
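For illustration, the fix referenced above swaps the ordinary RCU grace-period wait for the expedited variant on the wedge path. A minimal kernel-style sketch of the idea follows; the helper is hypothetical and this is not the actual i915 code:

/*
 * Hypothetical helper, not the actual i915 wedge path: quiesce RCU
 * readers before tearing down in-flight requests.
 */
#include <linux/rcupdate.h>

static void example_wedge_quiesce(void)
{
	/*
	 * Previously: synchronize_rcu();
	 * A normal grace period is unbounded in practice and was the
	 * suspected source of the multi-second stalls seen in CI.
	 */
	synchronize_rcu_expedited();

	/* ...then mark the device wedged and cancel outstanding requests... */
}

The trade-off is that the expedited variant completes much sooner but pokes every CPU with IPIs, which is tolerable on an error path that is already disruptive.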
CI Bug Log said: A CI Bug Log filter associated to this bug has been updated:
{- SNB: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
New failures caught by the filter:
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5949/shard-glk8/igt@gem_eio@unwedge-stress.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5950/shard-glk8/igt@gem_eio@unwedge-stress.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5951/shard-glk5/igt@gem_eio@unwedge-stress.html

LAKSHMINARAYANA VUDUM @l4kshmi said: (In reply to CI Bug Log from comment 8)
> A CI Bug Log filter associated to this bug has been updated:
> {- SNB: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
> {+ SNB GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
> New failures caught by the filter:
> * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5947/shard-glk1/igt@gem_eio@unwedge-stress.html
> * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5949/shard-glk8/igt@gem_eio@unwedge-stress.html
> * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5950/shard-glk8/igt@gem_eio@unwedge-stress.html
> * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5951/shard-glk5/igt@gem_eio@unwedge-stress.html

Also seen on GLK.

CI Bug Log said: A CI Bug Log filter associated to this bug has been updated:
{- SNB GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
New failures caught by the filter:
Chris Wilson @ickle said: It looks like it was the reset worker feeding in the restart request that dragged us down.
commit 79ffac8599c4d8aa84d313920d3d86d7361c252b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Apr 24 21:07:17 2019 +0100
drm/i915: Invert the GEM wakeref hierarchy

CI Bug Log said: A CI Bug Log filter associated to this bug has been updated:
{- SNB BYT GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
New failures caught by the filter:
CI Bug Log said: A CI Bug Log filter associated to this bug has been updated:
{- SNB BYT SKL GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
New failures caught by the filter:
LAKSHMINARAYANA VUDUM @l4kshmi said: (In reply to CI Bug Log from comment 13)
> A CI Bug Log filter associated to this bug has been updated:
> {- SNB BYT SKL GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
> {+ SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
> New failures caught by the filter:
> * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_293/fi-icl-u3/igt@gem_eio@unwedge-stress.html
Reopened this bug as this failure happened on ICL.

CI Bug Log said: A CI Bug Log filter associated to this bug has been updated:
{- SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
No new failures caught with the new filter.

Shuang He @shuang uploaded an attachment (144631, "attachment-13473-0.html"):
> Dear sender, I am on leave during ww26.2. Please call my cell phone if urgent; sorry for the inconvenience this might bring to you.

Chris Wilson @ickle said: For reference,
commit f0e39642f6f8da5406627bfa79c6600df949e203 (upstream/master, origin/master, origin/HEAD)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Jul 2 12:40:45 2019 +0100
i915/gem_eio: Assert the hanging request is correctly identified
When forcing a reset, it is crucial that the kernel correctly identifies
the injected hang. Verify this is the case for reset-stress.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
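As an illustrative aside, userspace can observe whether a hang was blamed on a particular context through the DRM_IOCTL_I915_GET_RESET_STATS ioctl. The fragment below assumes an already-open DRM fd and the default context; the helper names are made up and this is not the gem_eio code:

/*
 * Illustrative fragment, not IGT code: after injecting a hang, the
 * guilty context's batch_active counter should increase.  If only the
 * global reset_count moves, the injected request was likely not blamed.
 */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static int query_reset_stats(int drm_fd, unsigned int ctx_id,
			     struct drm_i915_reset_stats *stats)
{
	memset(stats, 0, sizeof(*stats));
	stats->ctx_id = ctx_id;	/* 0 = default context */

	return ioctl(drm_fd, DRM_IOCTL_I915_GET_RESET_STATS, stats);
}

static void check_blame(int drm_fd, const struct drm_i915_reset_stats *before)
{
	struct drm_i915_reset_stats after;

	if (query_reset_stats(drm_fd, 0, &after))
		return;

	if (after.reset_count > before->reset_count &&
	    after.batch_active == before->batch_active)
		fprintf(stderr,
			"reset recorded but no active batch was blamed; "
			"possibly hangcheck rather than a targeted reset\n");
}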
One hypothesis is that we are not resetting the guilty request and so are hitting a hangcheck instead.

CI Bug Log said: A CI Bug Log filter associated to this bug has been updated:
{- SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL APL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
New failures caught by the filter:
Chris Wilson @ickle said:
<7> [944.138584] [IGT] Forcing GPU reset
<7> [944.138848] [drm:i915_reset_device [i915]] resetting chip
<5> [944.138957] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<7> [944.139197] [IGT] Checking that the GPU recovered
<5> [944.162438] Setting dangerous option reset - tainting kernel
<7> [944.275166] [drm:i915_reset_device [i915]] resetting chip
<5> [944.276899] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<5> [944.277178] Setting dangerous option reset - tainting kernel
<7> [944.277284] [IGT] Forcing GPU reset
<7> [944.277557] [drm:i915_reset_device [i915]] resetting chip
<5> [944.278273] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<7> [944.278579] [IGT] Checking that the GPU recovered
<5> [944.302432] Setting dangerous option reset - tainting kernel
<7> [946.381889] [drm:i915_reset_device [i915]] resetting chip
<5> [946.382011] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<5> [946.382270] Setting dangerous option reset - tainting kernel
<7> [946.382345] [IGT] Forcing GPU reset
<7> [946.382557] [drm:i915_reset_device [i915]] resetting chip
<5> [946.383318] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<7> [946.383621] [IGT] Checking that the GPU recovered
<6> [946.475026] [IGT] gem_eio: exiting, ret=98
Which confirms that normally we expect quick reset+recovery cycles (with a reset period of 100ms between iterations). It also tells us that the delay (here about 2s, between the 944.302432 and 946.381889 entries) is before i915_reset_device (although we could do with drm.debug=7 to be sure), i.e. in the preamble of i915_handle_error(). Of note, the only thing there is synchronize_rcu_expedited(). :|
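As a side note on the drm.debug=7 suggestion: on most kernels the drm debug mask can also be raised at runtime, e.g. by writing 7 to /sys/module/drm/parameters/debug, so the extra logging can be enabled without a reboot when chasing an intermittent stall like this.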