[Intel-gfx] [CI] drm/i915/execlists: Workaround switching back to a complete context

Fri Mar 27 20:42:14 UTC 2020

Quoting Mika Kuoppala (2020-03-27 20:33:29)
> Chris Wilson <chris at chris-wilson.co.uk> writes:
> 
> > In what seems remarkably similar to the w/a required to not reload an
> > idle context with HEAD==TAIL, it appears we must prevent the HW from
> > switching to an idle context in ELSP[1], while simultaneously trying to
> > preempt the HW to run another context and a continuation of the idle
> > context (which is no longer idle).
> >
> > We can achieve this by preventing the context from completing while we
> > reload a new ELSP (by applying ring_set_paused(1) across the whole of
> > dequeue), except this eventually fails due to a lite-restore into a
> > waiting semaphore does not generate an ACK. Instead, we try to avoid
> > making the GPU do anything too challenging and not submit a new ELSP
> > while the interrupts + CSB events appear to have fallen behind the
> > completed contexts. We expect it to catch up shortly so we queue another
> > tasklet execution and hope for the best.
> >
> > Closes: https://gitlab.freedesktop.org/drm/intel/issues/1501
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> > ---
> >  drivers/gpu/drm/i915/gt/intel_lrc.c | 26 +++++++++++++++++++++++---
> >  1 file changed, 23 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > index b12355048501..5f17ece07858 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > @@ -1915,11 +1915,26 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
> >        * of trouble.
> >        */
> >       active = READ_ONCE(execlists->active);
> > -     while ((last = *active) && i915_request_completed(last))
> > -             active++;
> >  
> > -     if (last) {
> > +     /*
> > +      * In theory we can skip over completed contexts that have not
> > +      * yet been processed by events (as those events are in flight):
> > +      *
> > +      * while ((last = *active) && i915_request_completed(last))
> > +      *      active++;
> > +      *
> > +      * However, the GPU is cannot handle this as it will ultimately
> 
> s/is//
> 
> I applaud the straightforward nature of this compared to the pausing.
> Albeit this seems to have a cost. 
> 
> But this should be quite rare event comparatively?

In the grand scheme of things, yes, it should only be short-circuiting
the interrupt delivery prior to direct submission or a preemption bump.
But since the request is complete, there should be nothing to stop the
context from completing, triggering the CSB event and sending the
interrupt. So it should be, one hopes, about 100ns at most behind.

> > +      * find itself trying to jump back into a context it has just
> > +      * completed and barf.
> > +      */
> > +
> > +     if ((last = *active)) {
> >               if (need_preempt(engine, last, rb)) {
> > +                     if (i915_request_completed(last)) {
> > +                             tasklet_hi_schedule(&execlists->tasklet);
> > +                             return;
> > +                     }
> > +
> 
> I was pondering of the lost tracing and if you can
> work it backwards to this condition.

We back out, but we will back again and will likely need to preempt
after we handle the next CSB event. So it's just "gpu is not ready yet,
no decision made" noise.
-Chris