[Intel-gfx] [PATCH 3/5] drm/i915/execlists: Process interrupted context on reset

Wed Jul 17 13:43:34 UTC 2019

Quoting Chris Wilson (2019-07-17 14:40:26)
> Quoting Tvrtko Ursulin (2019-07-17 14:31:00)
> > 
> > On 16/07/2019 13:49, Chris Wilson wrote:
> > > By stopping the rings, we may trigger an arbitration point resulting in
> > > a premature context-switch (i.e. a completion event before the request
> > > is actually complete). This clears the active context before the reset,
> > > but we must remember to rewind the incomplete context for replay upon
> > > resume.
> > > 
> > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > > ---
> > >   drivers/gpu/drm/i915/gt/intel_lrc.c | 6 ++++--
> > >   1 file changed, 4 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > index 9b87a2fc186c..7570a9256001 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > @@ -1419,7 +1419,8 @@ static void process_csb(struct intel_engine_cs *engine)
> > >                        * coherent (visible from the CPU) before the
> > >                        * user interrupt and CSB is processed.
> > >                        */
> > > -                     GEM_BUG_ON(!i915_request_completed(*execlists->active));
> > > +                     GEM_BUG_ON(!i915_request_completed(*execlists->active) &&
> > > +                                !reset_in_progress(execlists));
> > >                       execlists_schedule_out(*execlists->active++);
> > >   
> > >                       GEM_BUG_ON(execlists->active - execlists->inflight >
> > > @@ -2254,7 +2255,7 @@ static void __execlists_reset(struct intel_engine_cs *engine, bool stalled)
> > >        */
> > >       rq = execlists_active(execlists);
> > >       if (!rq)
> > > -             return;
> > > +             goto unwind;
> > >   
> > >       ce = rq->hw_context;
> > >       GEM_BUG_ON(i915_active_is_idle(&ce->active));
> > > @@ -2331,6 +2332,7 @@ static void __execlists_reset(struct intel_engine_cs *engine, bool stalled)
> > >       intel_ring_update_space(ce->ring);
> > >       __execlists_update_reg_state(ce, engine);
> > >   
> > > +unwind:
> > >       /* Push back any incomplete requests for replay after the reset. */
> > >       __unwind_incomplete_requests(engine);
> > >   }
> > > 
> > 
> > Sounds plausible.
> > 
> > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > 
> > Shouldn't there be a Fixes: tag to go with it?
> 
> Yeah, it's rare even by our standards; I think there's a live_hangcheck
> failure about once a month that could be the result of this. However,
> the result would be an unrecoverable GPU hang as each attempt at
> resetting would not see the missing request and so it would remain
> perpetually in the engine->active.list until a set-wedged (i.e. suspend
> in the user case).

Heh, the commit responsible was one that was itself trying to work around
the effect of stop_engines() setting RING_HEAD=0 :)

commit 1863e3020ab50bd5f68d85719ba26356cc282643
Author: Chris Wilson <chris at chris-wilson.co.uk>
Date:   Thu Apr 11 14:05:15 2019 +0100

    drm/i915/execlists: Always reset the context's RING registers

    During reset, we try and stop the active ring. This has the consequence
    that we often clobber the RING registers within the context image. When
    we find an active request, we update the context image to rerun that
    request (if it was guilty, we replace the hanging user payload with
    NOPs). However, we were ignoring an active context if the request had
    completed, with the consequence that the next submission on that request
    would start with RING_HEAD==0 and not the tail of the previous request,
    causing all requests still in the ring to be rerun. Rare, but
    occasionally seen within CI where we would spot that the context seqno
    would reverse and complain that we were retiring an incomplete request.

-Chris
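
For anyone following along, here is a rough userspace sketch of the failure
mode discussed above. It is a toy model, not the driver code: every toy_*
name is hypothetical, and the `patched` flag only exists so both behaviours
can be compared side by side. It assumes that the premature context-switch
event clears the active pointer before the reset runs, so the old early
return never reaches the unwind step and the incomplete request stays
marked as being on the hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these types are hypothetical, not the i915 structures. */
struct toy_request {
	bool completed;	/* has the request actually finished? */
	bool on_hw;	/* still tracked as submitted to the hardware */
};

struct toy_engine {
	struct toy_request *active;	/* request believed to be running */
	struct toy_request *submitted;	/* stand-in for the engine's active list */
	size_t nr_submitted;
	bool reset_in_progress;
};

/* Models a context-switch (completion) event seen while processing the CSB. */
void toy_process_csb_completion(struct toy_engine *e)
{
	/*
	 * Relaxed check in the spirit of the patch: an incomplete request
	 * may be switched out only while a reset is in progress.
	 */
	assert(e->active->completed || e->reset_in_progress);
	e->active = NULL;
}

/* Push every incomplete request back for replay; returns how many. */
size_t toy_unwind_incomplete(struct toy_engine *e)
{
	size_t i, count = 0;

	for (i = 0; i < e->nr_submitted; i++) {
		if (!e->submitted[i].completed && e->submitted[i].on_hw) {
			e->submitted[i].on_hw = false;
			count++;
		}
	}
	return count;
}

/* Models the reset; patched=false keeps the old early return. */
size_t toy_reset(struct toy_engine *e, bool patched)
{
	if (!e->active) {
		if (!patched)
			return 0;	/* old: incomplete requests are never unwound */
		goto unwind;		/* new: skip the context fixup, still unwind */
	}

	/* The real code rewinds the context image here; the toy just clears it. */
	e->active = NULL;

unwind:
	return toy_unwind_incomplete(e);
}
```

In this model, a request that is switched out early during a reset is left
with on_hw set after toy_reset(&e, false), and every later unpatched reset
behaves the same way. With patched=true the request is pushed back for
replay on the first attempt.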