X crashes after 'Resetting rcs0 for preemption time out'

It took over 10s, so it looks to be a genuine hang. E.g. the final death throes are:

[   77.074520] Asynchronous wait on fence i915:gnome-shell[1732]:5de timed out (hint:intel_atomic_commit_ready+0x0/0x54 [i915])
[   79.539492] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[   79.539534] i915 0000:00:02.0: Xorg[1600] context reset due to GPU hang
[   79.539612] [drm:__i915_request_reset.cold [i915]] context Xorg[1600]: guilty 3, banned
[   79.539687] [drm:__i915_request_reset.cold [i915]] client Xorg[1600]: gained 4 ban score, now 6

I was going to suggest trying the 640ms timeout, but it's dead, Jim.

Usual question: confirmed present in drm-tip?

It was built from cod/tip/drm-tip/2019-11-27 1815badcd921b2cc41ee5ef2fc41c26b3d3ebbf4.

mentioned in issue #166

mentioned in issue #201 (closed)

mentioned in issue #97 (closed)

mentioned in issue #219 (closed)

mentioned in issue #122 (closed)

mentioned in issue #125 (closed)

mentioned in issue #203 (closed)

mentioned in issue #264 (closed)

mentioned in issue #311 (closed)

mentioned in issue #353 (closed)

mentioned in issue #491 (closed)

mentioned in issue #515 (closed)

mentioned in issue #520 (closed)

mentioned in issue #526 (closed)

mentioned in issue #561 (closed)

mentioned in issue #610 (closed)

By the way, we later found that the issue could not be reproduced when pairing with a 4K monitor.

added GPU hang, feature: display/Other, platform: ICL labels

changed title from [ICL] X crashes after 'Resetting rcs0 for preemption time out' to X crashes after 'Resetting rcs0 for preemption time out'

added kernel:drm-tip label

mentioned in issue mesa/mesa#2183 (closed)

Let's assume it was fixed by

    drm/i915/execlists: Always force a context reload when rewinding RING_TAIL
    
    If we rewind the RING_TAIL on a context, due to a preemption event, we
    must force the context restore for the RING_TAIL update to be properly
    handled. Rather than note which preemption events may cause us to rewind
    the tail, compare the new request's tail with the previously submitted
    RING_TAIL, as it turns out that timeslicing was causing unexpected
    rewinds.
    
       <idle>-0       0d.s2 1280851190us : __execlists_submission_tasklet: 0000:00:02.0 rcs0: expired last=130:4698, prio=3, hint=3
       <idle>-0       0d.s2 1280851192us : __i915_request_unsubmit: 0000:00:02.0 rcs0: fence 66:119966, current 119964
       <idle>-0       0d.s2 1280851195us : __i915_request_unsubmit: 0000:00:02.0 rcs0: fence 130:4698, current 4695
       <idle>-0       0d.s2 1280851198us : __i915_request_unsubmit: 0000:00:02.0 rcs0: fence 130:4696, current 4695
    ^----  Note we unwind 2 requests from the same context
    
       <idle>-0       0d.s2 1280851208us : __i915_request_submit: 0000:00:02.0 rcs0: fence 130:4696, current 4695
       <idle>-0       0d.s2 1280851213us : __i915_request_submit: 0000:00:02.0 rcs0: fence 134:1508, current 1506
    ^---- But to apply the new timeslice, we have to replay the first request
          before the new client can start -- the unexpected RING_TAIL rewind
    
       <idle>-0       0d.s2 1280851219us : trace_ports: 0000:00:02.0 rcs0: submit { 130:4696*, 134:1508 }
     synmark2-5425    2..s. 1280851239us : process_csb: 0000:00:02.0 rcs0: cs-irq head=5, tail=0
     synmark2-5425    2..s. 1280851240us : process_csb: 0000:00:02.0 rcs0: csb[0]: status=0x00008002:0x00000000
    ^---- Preemption event for the ELSP update; note the lite-restore
    
     synmark2-5425    2..s. 1280851243us : trace_ports: 0000:00:02.0 rcs0: preempted { 130:4698, 66:119966 }
     synmark2-5425    2..s. 1280851246us : trace_ports: 0000:00:02.0 rcs0: promote { 130:4696*, 134:1508 }
     synmark2-5425    2.... 1280851462us : __i915_request_commit: 0000:00:02.0 rcs0: fence 130:4700, current 4695
     synmark2-5425    2.... 1280852111us : __i915_request_commit: 0000:00:02.0 rcs0: fence 130:4702, current 4695
     synmark2-5425    2.Ns1 1280852296us : process_csb: 0000:00:02.0 rcs0: cs-irq head=0, tail=2
     synmark2-5425    2.Ns1 1280852297us : process_csb: 0000:00:02.0 rcs0: csb[1]: status=0x00000814:0x00000000
     synmark2-5425    2.Ns1 1280852299us : trace_ports: 0000:00:02.0 rcs0: completed { 130:4696!, 134:1508 }
     synmark2-5425    2.Ns1 1280852301us : process_csb: 0000:00:02.0 rcs0: csb[2]: status=0x00000818:0x00000040
     synmark2-5425    2.Ns1 1280852302us : trace_ports: 0000:00:02.0 rcs0: completed { 134:1508, 0:0 }
     synmark2-5425    2.Ns1 1280852313us : process_csb: process_csb:2336 GEM_BUG_ON(!i915_request_completed(*execlists->active) && !reset_in_progress(execlists))
    
    Fixes: 8ee36e048c98 ("drm/i915/execlists: Minimalistic timeslicing")
    Referenecs: 82c69bf58650 ("drm/i915/gt: Detect if we miss WaIdleLiteRestore")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: <stable@vger.kernel.org> # v5.4+
    Link: https://patchwork.freedesktop.org/patch/msgid/20200207211452.2860634-1-chris@chris-wilson.co.uk

or the earlier

commit 82c69bf58650e644c61aa2bf5100b63a1070fd2f (intel/for-linux-next, intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Dec 9 02:32:15 2019 +0000

    drm/i915/gt: Detect if we miss WaIdleLiteRestore
    
    In order to avoid confusing the HW, we must never submit an empty ring
    during lite-restore, that is we should always advance the RING_TAIL
    before submitting to stay ahead of the RING_HEAD.
    
    Normally this is prevented by keeping a couple of spare NOPs in the
    request->wa_tail so that on resubmission we can advance the tail. This
    relies on the request only being resubmitted once, which is the normal
    condition as it is seen once for ELSP[1] and then later in ELSP[0]. On
    preemption, the requests are unwound and the tail reset back to the
    normal end point (as we know the request is incomplete and therefore its
    RING_HEAD is even earlier).
    
    However, if this w/a should fail we would try and resubmit the request
    with the RING_TAIL already set to the location of this request's wa_tail
    potentially causing a GPU hang. We can spot when we do try and
    incorrectly resubmit without advancing the RING_TAIL and spare any
    embarrassment by forcing the context restore.
    
    In the case of preempt-to-busy, we leave the requests running on the HW
    while we unwind. As the ring is still live, we cannot rewind our
    rq->tail without forcing a reload so leave it set to rq->wa_tail and
    only force a reload if we resubmit after a lite-restore. (Normally, the
    forced reload will be a part of the preemption event.)
    
    Fixes: 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy")
    Closes: https://gitlab.freedesktop.org/drm/intel/issues/673
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: stable@kernel.vger.org
    Link: https://patchwork.freedesktop.org/patch/msgid/20191209023215.3519970-1-chris@chris-wilson.co.uk

If not, please refresh the logs.
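
For anyone following along, the check described in the first commit message (compare the new request's tail with the previously submitted RING_TAIL, and force a context restore if the tail did not advance) can be modelled roughly as below. This is only a sketch in Python, not the driver code: the function names `ring_direction` and `needs_forced_restore` are made up for illustration, and it assumes a power-of-two ring size and that any step of less than half the ring counts as "forwards".

```python
def ring_direction(ring_size: int, next_tail: int, prev_tail: int) -> int:
    """Signed distance from prev_tail to next_tail on a circular ring.

    Hypothetical helper: positive means the tail advanced, zero means it
    did not move, negative means it was rewound. Assumes ring_size is a
    power of two and that real moves are shorter than half the ring.
    """
    if ring_size <= 0 or ring_size & (ring_size - 1):
        raise ValueError("ring_size must be a power of two")
    delta = (next_tail - prev_tail) % ring_size
    if delta >= ring_size // 2:
        # More than half the ring "ahead" is treated as a step backwards.
        delta -= ring_size
    return delta


def needs_forced_restore(ring_size: int, next_tail: int, prev_tail: int) -> bool:
    """True if resubmitting would not move RING_TAIL forwards.

    Models the rule from the commit message: a rewound (or unchanged)
    tail cannot be handled by a lite-restore, so the context must be
    reloaded instead.
    """
    return ring_direction(ring_size, next_tail, prev_tail) <= 0


if __name__ == "__main__":
    size = 4096
    print(needs_forced_restore(size, 0x100, 0x080))  # tail advanced
    print(needs_forced_restore(size, 0x080, 0x100))  # tail rewound
    print(needs_forced_restore(size, 0x010, 0xFF0))  # advanced across the wrap
```

The point of comparing against the last submitted tail, rather than tracking which preemption events might rewind it, is that unexpected rewinds (such as the timeslicing case in the trace) are caught by the same test.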

closed

mentioned in issue #1323 (closed)

mentioned in issue #2138 (closed)

mentioned in issue #2438 (closed)

mentioned in issue #2465 (closed)

mentioned in issue #2590 (closed)

mentioned in issue #2607 (moved)

mentioned in issue #2634 (closed)

mentioned in issue #2743 (closed)

mentioned in issue #2787

mentioned in issue #2823 (closed)

mentioned in issue #3353 (closed)

mentioned in issue #3403 (closed)

mentioned in issue #3960 (closed)

mentioned in issue #5175 (closed)

mentioned in issue #5401 (closed)

mentioned in issue #5853 (closed)

mentioned in issue #6138 (closed)

mentioned in issue #7613 (closed)

mentioned in issue #10551

mentioned in commit agd5f/linux@d659b715

mentioned in commit 6aaced5a
