[Intel-gfx] [PATCH 07/31] drm/i915/execlists: Direct submission of new requests (avoid tasklet/ksoftirqd)
Chris Wilson
chris at chris-wilson.co.uk
Wed Jun 27 10:58:24 UTC 2018
Quoting Tvrtko Ursulin (2018-06-27 11:40:32)
>
> On 25/06/2018 10:48, Chris Wilson wrote:
> > Back in commit 27af5eea54d1 ("drm/i915: Move execlists irq handler to a
> > bottom half"), we came to the conclusion that running our CSB processing
> > and ELSP submission from inside the irq handler was a bad idea. A really
> > bad idea as we could impose nearly 1s latency on other users of the
> > system, on average! Deferring our work to a tasklet allowed us to do the
> > processing with irqs enabled, reducing the impact to an average of about
> > 50us.
> >
> > We have since eradicated the use of forcewaked mmio from inside the CSB
> > processing and ELSP submission, bringing the impact down to around 5us
> > (on Kabylake); an order of magnitude better than our measurements 2
> > years ago on Broadwell and only about 2x worse on average than the
> > gem_syslatency on an unladen system.
> >
> > In this iteration of the tasklet-vs-direct submission debate, we seek a
> > compromise whereby we submit new requests immediately to the HW but
> > defer processing the CS interrupt onto a tasklet. We gain the advantage
> > of low-latency and ksoftirqd avoidance when waking up the HW, while
> > avoiding the system-wide starvation of our CS irq-storms.
> >
> > Comparing the impact on the maximum latency observed (that is the time
> > stolen from an RT process) over a 120s interval, repeated several times
> > (using gem_syslatency, similar to RT's cyclictest) while the system is
> > fully laden with i915 nops, we see that direct submission can actually
> > improve the worst case.
> >
> > Maximum latency in microseconds of a third party RT thread
> > (gem_syslatency -t 120 -f 2)
> > x Always using tasklets (a couple of >1000us outliers removed)
> > + Only using tasklets from CS irq, direct submission of requests
> > +------------------------------------------------------------------------+
> > | + |
> > | + |
> > | + |
> > | + + |
> > | + + + |
> > | + + + + x x x |
> > | +++ + + + x x x x x x |
> > | +++ + ++ + + *x x x x x x |
> > | +++ + ++ + * *x x * x x x |
> > | + +++ + ++ * * +*xxx * x x xx |
> > | * +++ + ++++* *x+**xx+ * x x xxxx x |
> > | **x++++*++**+*x*x****x+ * +x xx xxxx x x |
> > |x* ******+***************++*+***xxxxxx* xx*x xxx + x+|
> > | |__________MA___________| |
> > | |______M__A________| |
> > +------------------------------------------------------------------------+
> > N Min Max Median Avg Stddev
> > x 118 91 186 124 125.28814 16.279137
> > + 120 92 187 109 112.00833 13.458617
> > Difference at 95.0% confidence
> > -13.2798 +/- 3.79219
> > -10.5994% +/- 3.02677%
> > (Student's t, pooled s = 14.9237)
> >
> > However the mean latency is adversely affected:
> >
> > Mean latency in microseconds of a third party RT thread
> > (gem_syslatency -t 120 -f 1)
> > x Always using tasklets
> > + Only using tasklets from CS irq, direct submission of requests
> > +------------------------------------------------------------------------+
> > | xxxxxx + ++ |
> > | xxxxxx + ++ |
> > | xxxxxx + +++ ++ |
> > | xxxxxxx +++++ ++ |
> > | xxxxxxx +++++ ++ |
> > | xxxxxxx +++++ +++ |
> > | xxxxxxx + ++++++++++ |
> > | xxxxxxxx ++ ++++++++++ |
> > | xxxxxxxx ++ ++++++++++ |
> > | xxxxxxxxxx +++++++++++++++ |
> > | xxxxxxxxxxx x +++++++++++++++ |
> > |x xxxxxxxxxxxxx x + + ++++++++++++++++++ +|
> > | |__A__| |
> > | |____A___| |
> > +------------------------------------------------------------------------+
> > N Min Max Median Avg Stddev
> > x 120 3.506 3.727 3.631 3.6321417 0.02773109
> > + 120 3.834 4.149 4.039 4.0375167 0.041221676
> > Difference at 95.0% confidence
> > 0.405375 +/- 0.00888913
> > 11.1608% +/- 0.244735%
> > (Student's t, pooled s = 0.03513)
> >
> > However, since the mean latency corresponds to the amount of irqsoff
> > processing we have to do for a CS interrupt, we only need to speed that
> > up to benefit not just system latency but our own throughput.
> >
> > v2: Remember to defer submissions when under reset.
> > v4: Only use direct submission for new requests
> > v5: Be aware that with mixing direct tasklet evaluation and deferred
> > tasklets, we may end up idling before running the deferred tasklet.
> >
> > Testcase: igt/gem_exec_latency/*rthog*
> > References: 27af5eea54d1 ("drm/i915: Move execlists irq handler to a bottom half")
> > Suggested-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > ---
> > drivers/gpu/drm/i915/i915_gem.h | 5 +
> > drivers/gpu/drm/i915/i915_irq.c | 11 +-
> > drivers/gpu/drm/i915/intel_engine_cs.c | 8 +-
> > drivers/gpu/drm/i915/intel_lrc.c | 147 ++++++++++++++----------
> > drivers/gpu/drm/i915/intel_ringbuffer.h | 1 -
> > 5 files changed, 98 insertions(+), 74 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
> > index 261da577829a..7892ac773916 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.h
> > +++ b/drivers/gpu/drm/i915/i915_gem.h
> > @@ -88,4 +88,9 @@ static inline void __tasklet_enable_sync_once(struct tasklet_struct *t)
> > tasklet_kill(t);
> > }
> >
> > +static inline bool __tasklet_is_enabled(const struct tasklet_struct *t)
> > +{
> > + return likely(!atomic_read(&t->count));
> > +}
> > +
>
> For the unlikely-likely chain from
> __submit_queue->reset_in_progress->__tasklet_is_enabled I think it would
> be better to drop the likely/unlikely from low-level helpers and put the
> one unlikely into the __submit_queue.
Tasklets are rarely disabled; I think that's quite important to stress.
Tasklets do not function very well (heavy spinning) while disabled.
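To illustrate the suggested shape in a standalone sketch (the struct names,
the reset_in_progress() body and the helper names here are stand-ins, not the
driver's actual code): the low-level predicate stays hint-free, and the single
unlikely() moves up into the submission path, as proposed above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel types under discussion. */
struct tasklet { atomic_int count; };		/* 0 => enabled, >0 => disabled */
struct execlists { struct tasklet tasklet; };

/* Helper kept free of branch hints, per the review suggestion. */
static bool tasklet_is_enabled(const struct tasklet *t)
{
	return atomic_load(&t->count) == 0;
}

static bool reset_in_progress(const struct execlists *e)
{
	/* Modelled directly on the disabled-tasklet state for this sketch. */
	return !tasklet_is_enabled(&e->tasklet);
}

/* The one unlikely() lives at the top-level caller instead. */
#define unlikely(x) __builtin_expect(!!(x), 0)

static bool submit_queue(struct execlists *e)
{
	if (unlikely(reset_in_progress(e)))
		return false;	/* defer until the engine restarts */
	return true;		/* proceed with direct submission */
}
```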
> > #endif /* __I915_GEM_H__ */
> > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> > index 46aaef5c1851..316d0b08d40f 100644
> > --- a/drivers/gpu/drm/i915/i915_irq.c
> > +++ b/drivers/gpu/drm/i915/i915_irq.c
> > @@ -1469,14 +1469,10 @@ static void snb_gt_irq_handler(struct drm_i915_private *dev_priv,
> > static void
> > gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
> > {
> > - struct intel_engine_execlists * const execlists = &engine->execlists;
> > bool tasklet = false;
> >
> > - if (iir & GT_CONTEXT_SWITCH_INTERRUPT) {
> > - if (READ_ONCE(engine->execlists.active))
>
> What is the thinking behind this change? It used to be that we scheduled
> the tasklet only when we knew we are expecting interrupts and now we
> don't care any more for some reason?
The filtering is done inside process_csb(). We filtered on active
previously as some interrupts were seemingly going astray; now I am much
more confident that all are accounted for.
> > - tasklet = !test_and_set_bit(ENGINE_IRQ_EXECLIST,
> > - &engine->irq_posted);
>
> And this is gone as well. Can you put a paragraph in the commit message
> explaining the change? It doesn't seem immediately connected with direct
> submission.
This removes one heavyweight atomic operation from the latency-sensitive
interrupt handler.
> > + if (iir & GT_CONTEXT_SWITCH_INTERRUPT)
> > + tasklet = true;
> >
> > if (iir & GT_RENDER_USER_INTERRUPT) {
> > notify_ring(engine);
> > @@ -1484,7 +1480,7 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
> > }
> >
> > if (tasklet)
> > - tasklet_hi_schedule(&execlists->tasklet);
> > + tasklet_hi_schedule(&engine->execlists.tasklet);
> > }
> >
> > static void gen8_gt_irq_ack(struct drm_i915_private *i915,
> > @@ -2216,7 +2212,6 @@ static irqreturn_t cherryview_irq_handler(int irq, void *arg)
> >
> > I915_WRITE(VLV_IER, ier);
> > I915_WRITE(GEN8_MASTER_IRQ, GEN8_MASTER_IRQ_CONTROL);
> > - POSTING_READ(GEN8_MASTER_IRQ);
>
> What is this?
Something that I haven't managed to kill yet.
> > diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> > index 7209c22798e6..ace93958689e 100644
> > --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> > +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> > @@ -1353,12 +1353,10 @@ static void intel_engine_print_registers(const struct intel_engine_cs *engine,
> > ptr = I915_READ(RING_CONTEXT_STATUS_PTR(engine));
> > read = GEN8_CSB_READ_PTR(ptr);
> > write = GEN8_CSB_WRITE_PTR(ptr);
> > - drm_printf(m, " Execlist CSB read %d [%d cached], write %d [%d from hws], interrupt posted? %s, tasklet queued? %s (%s)\n",
> > + drm_printf(m, " Execlist CSB read %d [%d cached], write %d [%d from hws], tasklet queued? %s (%s)\n",
> > read, execlists->csb_head,
> > write,
> > intel_read_status_page(engine, intel_hws_csb_write_index(engine->i915)),
> > - yesno(test_bit(ENGINE_IRQ_EXECLIST,
> > - &engine->irq_posted)),
> > yesno(test_bit(TASKLET_STATE_SCHED,
> > &engine->execlists.tasklet.state)),
> > enableddisabled(!atomic_read(&engine->execlists.tasklet.count)));
> > @@ -1570,11 +1568,9 @@ void intel_engine_dump(struct intel_engine_cs *engine,
> > spin_unlock(&b->rb_lock);
> > local_irq_restore(flags);
> >
> > - drm_printf(m, "IRQ? 0x%lx (breadcrumbs? %s) (execlists? %s)\n",
> > + drm_printf(m, "IRQ? 0x%lx (breadcrumbs? %s)\n",
> > engine->irq_posted,
> > yesno(test_bit(ENGINE_IRQ_BREADCRUMB,
> > - &engine->irq_posted)),
> > - yesno(test_bit(ENGINE_IRQ_EXECLIST,
> > &engine->irq_posted)));
> >
> > drm_printf(m, "HWSP:\n");
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > index 5a12b8fc9d8f..c82efa3ac105 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -562,13 +562,15 @@ static void complete_preempt_context(struct intel_engine_execlists *execlists)
> > {
> > GEM_BUG_ON(!execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT));
> >
> > + __unwind_incomplete_requests(container_of(execlists,
> > + typeof(struct intel_engine_cs),
> > + execlists));
> > execlists_cancel_port_requests(execlists);
> > - execlists_unwind_incomplete_requests(execlists);
>
> Is the ordering change significant and why?
Mostly for consistency and reasoning about request reference lifetimes.
(Unwind => we retain the request reference, as it is moved back to the
protected execution lists.)
> > +static void execlists_submission_tasklet(unsigned long data)
> > +{
> > + struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
> > + unsigned long flags;
> > +
> > + GEM_TRACE("%s awake?=%d, active=%x\n",
> > + engine->name,
> > + engine->i915->gt.awake,
> > + engine->execlists.active);
> > +
> > + spin_lock_irqsave(&engine->timeline.lock, flags);
> > +
> > + if (engine->i915->gt.awake) /* we may be delayed until after we idle! */
> > + __execlists_submission_tasklet(engine);
>
> Sounds quite bad! this means we fail to process pending CSB. And going
> idle syncs the tasklets so what am I missing?
That tasklets get kicked randomly was, I think, the culprit.
> > +static void __submit_queue(struct intel_engine_cs *engine)
> > +{
> > + struct intel_engine_execlists * const execlists = &engine->execlists;
> > +
> > + if (reset_in_progress(execlists))
> > + return; /* defer until we restart the engine following reset */
> > +
> > + if (execlists->tasklet.func == execlists_submission_tasklet)
>
> What is this check determining?
That we can call __execlists_submission_tasklet, i.e. it is not guc,
veng or anything weirder.
> Are we always calling it directly even if the ports are busy? Wouldn't
> it be better to schedule in that case?
No, we are only calling it if we have a more important request than
either port (see queue_priority).
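As a standalone sketch of that dispatch check (types and counters here are
illustrative stand-ins for the kernel's tasklet machinery, not the driver's
API): direct invocation only happens when the installed tasklet function is
the execlists one; any other backend falls back to scheduling.

```c
/* Simplified model of the check in __submit_queue(): direct submission
 * is only safe when the engine's tasklet is the execlists one, i.e. not
 * GuC, virtual engine, or anything weirder. */

typedef void (*tasklet_fn)(unsigned long);

struct engine {
	tasklet_fn tasklet_func;
	int direct_calls;	/* times we invoked the tasklet inline */
	int scheduled;		/* times we deferred to tasklet_hi_schedule() */
};

static void execlists_submission_tasklet(unsigned long data)
{
	((struct engine *)data)->direct_calls++;
}

static void guc_submission_tasklet(unsigned long data)
{
	((struct engine *)data)->direct_calls++;	/* never taken below */
}

static void submit_queue(struct engine *e)
{
	if (e->tasklet_func == execlists_submission_tasklet)
		e->tasklet_func((unsigned long)e);	/* direct, locks held */
	else
		e->scheduled++;				/* defer to softirq */
}
```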
> > +static void reset_csb_pointers(struct intel_engine_execlists *execlists)
> > +{
> > + /*
> > + * After a reset, the HW starts writing into CSB entry [0]. We
> > + * therefore have to set our HEAD pointer back one entry so that
> > + * the *first* entry we check is entry 0. To complicate this further,
> > + * as we don't wait for the first interrupt after reset, we have to
> > + * fake the HW write to point back to the last entry so that our
> > + * inline comparison of our cached head position against the last HW
> > + * write works even before the first interrupt.
> > + */
> > + execlists->csb_head = GEN8_CSB_ENTRIES - 1;
> > + WRITE_ONCE(*execlists->csb_write, (GEN8_CSB_ENTRIES - 1) | 0xff << 16);
> > +}
>
> Hmm this makes me think there should be another prep patch before direct
> submission. Need to build a clearer picture before I can suggest how.
For what? This was already covered in the prep patches (handling the CSB
on the reset vs resume paths, and all the fallout from faking the HW
write). I don't actually need it in a function; it was just handy to do
so as iirc I wanted to use it elsewhere, but fortunately killed off that
caller. So the prep patch would just be making it into a function.
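For reference, the wraparound arithmetic the comment describes can be checked
in a tiny userspace model (the walk below is a stand-in for the real
process_csb() loop, not the driver's code): with the head parked on the last
entry and the fake HW write in place, the first entry examined is 0.

```c
#define GEN8_CSB_ENTRIES 6	/* context-status entries per engine */

/* Minimal model of the post-reset CSB walk: head sits on the last
 * entry, so the first increment wraps around and lands on entry 0. */
static int process_csb(unsigned *head, unsigned write, int *order, int max)
{
	int n = 0;

	while (*head != write && n < max) {
		*head = (*head + 1) % GEN8_CSB_ENTRIES;
		order[n++] = *head;	/* entry index we would parse */
	}
	return n;
}
```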
-Chris