[Intel-gfx] [PATCH] drm/i915/execlists: Reclaim hanging virtual request
Chris Wilson
chris at chris-wilson.co.uk
Tue Jan 21 11:44:10 UTC 2020
Quoting Chris Wilson (2020-01-21 11:33:36)
> Quoting Tvrtko Ursulin (2020-01-21 11:20:36)
> >
> > On 21/01/2020 10:09, Chris Wilson wrote:
> > > If we encounter a hang on a virtual engine, as we process the hang the
> > > request may already have been moved back to the virtual engine (we are
> > > processing the hang on the physical engine). We need to reclaim the
> > > request from the virtual engine so that the locking is consistent and
> > > local to the real engine on which we will hold the request for error
> > > state capturing.
> > >
> > > Fixes: 748317386afb ("drm/i915/execlists: Offline error capture")
> > > Testcase: igt/gem_exec_balancer/hang
> > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > > Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> > > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > > ---
> > > drivers/gpu/drm/i915/gt/intel_lrc.c | 20 ++++++++++++++++++++
> > > 1 file changed, 20 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > index 3a30767ff0c4..a0acf1898c1e 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > @@ -2582,6 +2582,26 @@ static void execlists_capture(struct intel_engine_cs *engine)
> > > cap->rq = active_request(cap->rq->context->timeline, cap->rq);
> > > GEM_BUG_ON(!cap->rq);
> > >
> > > + if (cap->rq->engine != engine) { /* preempted virtual engine */
> > > + struct virtual_engine *ve = to_virtual_engine(cap->rq->engine);
> > > + unsigned long flags;
> > > +
> > > + /*
> > > + * An unsubmitted request along a virtual engine will
> > > + * remain on the active (this) engine until we are able
> > > + * to process the context switch away (and so mark the
> > > + * context as no longer in flight). That cannot have happened
> > > + * yet, otherwise we would not be hanging!
> > > + */
> > > + spin_lock_irqsave(&ve->base.active.lock, flags);
> > > + GEM_BUG_ON(intel_context_inflight(cap->rq->context) != engine);
> > > + GEM_BUG_ON(ve->request != cap->rq);
> > > + ve->request = NULL;
> > > + spin_unlock_irqrestore(&ve->base.active.lock, flags);
> > > +
> > > + cap->rq->engine = engine;
> > > + }
> >
> > Conceptually this I think belongs in execlists_hold, not capture. Since
> > hold has the responsibility to hold correctly.
>
> Yeah, I didn't put it into execlists_hold() as thought that would be
> clearer without it, and that this was more a side-effect of
> process_csb() vs __unwind_incomplete_requests(), and didn't want to mix
> up with the question of engine->active.lock
>
> Agreed that execlists_capture() does not describe it well, so probably
> shouldn't be called directly from here.
>
> > Then also, this is all about a race with __unwind_incomplete_requests,
> > yes? If so would need to be done under the engine->active.lock or this
> > can still happen between now and execlists_hold.
>
> The serialisation here is on process_csb(), since we have prevented the
> tasklet being called and own the reset on the engine, we know that
> execlists_schedule_out() can't be called on this engine, and so we own
> the virtual request. It's not protected by the engine->active.lock.
>
> So happy with moving it to before the lock in execlists_hold()?
Bleugh. Put it in execlists_hold() suggests adding something like
static void execlists_hold(struct intel_engine_cs *engine,
struct i915_request *rq)
{
if (cap->rq->engine != engine) { /* preempted virtual engine */
struct virtual_engine *ve = to_virtual_engine(cap->rq->engine);
unsigned long flags;
/* foo */
GEM_BUG_ON(!test_bit(I915_RESET_ENGINE + engine->id,
&engine->gt->reset.flags);
/*
* An unsubmitted request along a virtual engine will
* remain on the active (this) engine until we are able
* to process the context switch away (and so mark the
* context as no longer in flight). That cannot have happened
* yet, otherwise we would not be hanging!
*/
spin_lock_irqsave(&ve->base.active.lock, flags);
GEM_BUG_ON(intel_context_inflight(cap->rq->context) != engine);
GEM_BUG_ON(ve->request != cap->rq);
ve->request = NULL;
spin_unlock_irqrestore(&ve->base.active.lock, flags);
cap->rq->engine = engine;
}
spin_lock_irq(&engine->active.lock);
/*
* Transfer this request onto the hold queue to prevent it
* being resumbitted to HW (and potentially completed) before we have
* released it. Since we may have already submitted following
* requests, we need to remove those as well.
*/
GEM_BUG_ON(i915_request_on_hold(rq));
GEM_BUG_ON(rq->engine != engine);
__execlists_hold(rq);
spin_unlock_irq(&engine->active.lock);
}
Writing down the serialisation rules does suggest that execlists_hold()
is not as generic as before. (On the other hand, doing so will help with
writing the selftest for this bug, and engine->suspend() is a completely
different kettle of fish that plays games with engine->active.hold
directly.)
-Chris
More information about the Intel-gfx
mailing list