[Intel-gfx] [PATCH] drm/i915/execlists: Pull tasklet interrupt-bh local to direct submission

Mon Mar 23 09:45:38 UTC 2020

Quoting Francisco Jerez (2020-03-20 22:14:51)
> Francisco Jerez <currojerez at riseup.net> writes:
> 
> > Chris Wilson <chris at chris-wilson.co.uk> writes:
> >
> >> We dropped calling process_csb prior to handling direct submission in
> >> order to avoid the nesting of spinlocks and lift process_csb() and the
> >> majority of the tasklet out of irq-off. However, we do want to avoid
> >> ksoftirqd latency in the fast path, so try and pull the interrupt-bh
> >> local to direct submission if we can acquire the tasklet's lock.
> >>
> >> v2: Tweak the balance to avoid over submitting lite-restores
> >>
> >> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >> Cc: Francisco Jerez <currojerez at riseup.net>
> >> Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
> >> ---
> >>  drivers/gpu/drm/i915/gt/intel_lrc.c    | 44 ++++++++++++++++++++------
> >>  drivers/gpu/drm/i915/gt/selftest_lrc.c |  2 +-
> >>  2 files changed, 36 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> >> index f09dd87324b9..dceb65a0088f 100644
> >> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> >> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> >> @@ -2884,17 +2884,17 @@ static void queue_request(struct intel_engine_cs *engine,
> >>      set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> >>  }
> >>  
> >> -static void __submit_queue_imm(struct intel_engine_cs *engine)
> >> +static bool pending_csb(const struct intel_engine_execlists *el)
> >>  {
> >> -    struct intel_engine_execlists * const execlists = &engine->execlists;
> >> +    return READ_ONCE(*el->csb_write) != READ_ONCE(el->csb_head);
> >> +}
> >>  
> >> -    if (reset_in_progress(execlists))
> >> -            return; /* defer until we restart the engine following reset */
> >> +static bool skip_lite_restore(struct intel_engine_execlists *el,
> >> +                          const struct i915_request *rq)
> >> +{
> >> +    struct i915_request *inflight = execlists_active(el);
> >>  
> >> -    if (execlists->tasklet.func == execlists_submission_tasklet)
> >> -            __execlists_submission_tasklet(engine);
> >> -    else
> >> -            tasklet_hi_schedule(&execlists->tasklet);
> >> +    return inflight && inflight->context == rq->context;
> >>  }
> >>  
> >>  static void submit_queue(struct intel_engine_cs *engine,
> >> @@ -2905,8 +2905,34 @@ static void submit_queue(struct intel_engine_cs *engine,
> >>      if (rq_prio(rq) <= execlists->queue_priority_hint)
> >>              return;
> >>  
> >> +    if (reset_in_progress(execlists))
> >> +            return; /* defer until we restart the engine following reset */
> >> +
> >> +    /*
> >> +     * Suppress immediate lite-restores, leave that to the tasklet.
> >> +     *
> >> +     * However, we leave the queue_priority_hint unset so that if we do
> >> +     * submit a second context, we push that into ELSP[1] immediately.
> >> +     */
> >> +    if (skip_lite_restore(execlists, rq))
> >> +            return;
> >> +
> > Why do you need to treat lite-restore specially here?

Lite-restores have a noticeable impact on no-op loads. Part of that is
that a lite-restore takes about 1us, and the other part is that the
driver has a lot more work to do. There's a balance point around here
between not needlessly interrupting ourselves and ensuring that there is
no bubble.
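
To make the ordering of the checks in submit_queue() concrete, here is a
rough standalone sketch of the decision as it appears in the quoted hunk.
All of the sketch_* types and names are hypothetical simplifications for
illustration, not the i915 structures, and the return value only records
whether the direct-submission path would be attempted; what the real code
does after these checks is not shown here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the driver structures. */
struct sketch_context { int id; };

struct sketch_request {
	const struct sketch_context *context;
	int prio;
};

struct sketch_execlists {
	const struct sketch_request *inflight; /* request on HW, or NULL */
	int queue_priority_hint;
	bool reset_in_progress;
};

enum sketch_action { SKETCH_DEFER, SKETCH_SUBMIT_NOW };

/* Same context already running: submitting now would be a lite-restore. */
static bool sketch_skip_lite_restore(const struct sketch_execlists *el,
				     const struct sketch_request *rq)
{
	return el->inflight && el->inflight->context == rq->context;
}

/* Mirrors the order of the early returns in the quoted hunk. */
static enum sketch_action
sketch_submit_queue(const struct sketch_execlists *el,
		    const struct sketch_request *rq)
{
	if (rq->prio <= el->queue_priority_hint)
		return SKETCH_DEFER; /* nothing more urgent than what is queued */

	if (el->reset_in_progress)
		return SKETCH_DEFER; /* wait for the engine restart */

	if (sketch_skip_lite_restore(el, rq))
		return SKETCH_DEFER; /* leave lite-restores to the tasklet */

	return SKETCH_SUBMIT_NOW;
}
```

In this sketch a request for a different context than the one in flight
falls through to the immediate path, while a request for the same context
is left for the tasklet.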

> >
> > Anyway, trying this out in combination with my patches now.
> >
> 
> This didn't seem to help (together with your other suggestion to move
> the overload accounting to __execlists_schedule_in/out).  And it makes
> the current -5% SynMark OglMultithread regression with my series go down
> to -10%.  My previous suggestion of moving the
> intel_gt_pm_active_begin() call to process_csb() when the submission is
> ACK'ed by the hardware does seem to help (and it roughly halves the
> OglMultithread regression), possibly because that way we're able to
> determine whether the first context was actually overlapping at the
> point that the second was received by the hardware -- I haven't tested
> it extensively yet though.

Grumble, it just seems like we are setting and clearing the flag on
completely unrelated events -- which I still think boils down to working
around latency in the driver. Or at least I hope there's an explanation
and a bug to fix that improves responsiveness for all.
-Chris