[Intel-gfx] [PATCH 07/31] drm/i915/execlists: Direct submission of new requests (avoid tasklet/ksoftirqd)
Chris Wilson
chris at chris-wilson.co.uk
Wed Jun 27 10:58:24 UTC 2018
Quoting Tvrtko Ursulin (2018-06-27 11:40:32)
>
> On 25/06/2018 10:48, Chris Wilson wrote:
> > Back in commit 27af5eea54d1 ("drm/i915: Move execlists irq handler to a
> > bottom half"), we came to the conclusion that running our CSB processing
> > and ELSP submission from inside the irq handler was a bad idea. A really
> > bad idea as we could impose nearly 1s latency on other users of the
> > system, on average! Deferring our work to a tasklet allowed us to do the
> > processing with irqs enabled, reducing the impact to an average of about
> > 50us.
> >
> > We have since eradicated the use of forcewaked mmio from inside the CSB
> > processing and ELSP submission, bringing the impact down to around 5us
> > (on Kabylake); an order of magnitude better than our measurements 2
> > years ago on Broadwell and only about 2x worse on average than the
> > gem_syslatency on an unladen system.
> >
> > In this iteration of the tasklet-vs-direct submission debate, we seek a
> > compromise whereby we submit new requests immediately to the HW but
> > defer processing the CS interrupt onto a tasklet. We gain the advantage
> > of low-latency and ksoftirqd avoidance when waking up the HW, while
> > avoiding the system-wide starvation of our CS irq-storms.
> >
> > Comparing the impact on the maximum latency observed (that is the time
> > stolen from an RT process) over a 120s interval, repeated several times
> > (using gem_syslatency, similar to RT's cyclictest) while the system is
> > fully laden with i915 nops, we see that direct submission can actually
> > improve the worst case.
> >
> > Maximum latency in microseconds of a third party RT thread
> > (gem_syslatency -t 120 -f 2)
> > x Always using tasklets (a couple of >1000us outliers removed)
> > + Only using tasklets from CS irq, direct submission of requests
> > +------------------------------------------------------------------------+
> > | + |
> > | + |
> > | + |
> > | + + |
> > | + + + |
> > | + + + + x x x |
> > | +++ + + + x x x x x x |
> > | +++ + ++ + + *x x x x x x |
> > | +++ + ++ + * *x x * x x x |
> > | + +++ + ++ * * +*xxx * x x xx |
> > | * +++ + ++++* *x+**xx+ * x x xxxx x |
> > | **x++++*++**+*x*x****x+ * +x xx xxxx x x |
> > |x* ******+***************++*+***xxxxxx* xx*x xxx + x+|
> > | |__________MA___________| |
> > | |______M__A________| |
> > +------------------------------------------------------------------------+
> > N Min Max Median Avg Stddev
> > x 118 91 186 124 125.28814 16.279137
> > + 120 92 187 109 112.00833 13.458617
> > Difference at 95.0% confidence
> > -13.2798 +/- 3.79219
> > -10.5994% +/- 3.02677%
> > (Student's t, pooled s = 14.9237)
> >
> > However the mean latency is adversely affected:
> >
> > Mean latency in microseconds of a third party RT thread
> > (gem_syslatency -t 120 -f 1)
> > x Always using tasklets
> > + Only using tasklets from CS irq, direct submission of requests
> > +------------------------------------------------------------------------+
> > | xxxxxx + ++ |
> > | xxxxxx + ++ |
> > | xxxxxx + +++ ++ |
> > | xxxxxxx +++++ ++ |
> > | xxxxxxx +++++ ++ |
> > | xxxxxxx +++++ +++ |
> > | xxxxxxx + ++++++++++ |
> > | xxxxxxxx ++ ++++++++++ |
> > | xxxxxxxx ++ ++++++++++ |
> > | xxxxxxxxxx +++++++++++++++ |
> > | xxxxxxxxxxx x +++++++++++++++ |
> > |x xxxxxxxxxxxxx x + + ++++++++++++++++++ +|
> > | |__A__| |
> > | |____A___| |
> > +------------------------------------------------------------------------+
> > N Min Max Median Avg Stddev
> > x 120 3.506 3.727 3.631 3.6321417 0.02773109
> > + 120 3.834 4.149 4.039 4.0375167 0.041221676
> > Difference at 95.0% confidence
> > 0.405375 +/- 0.00888913
> > 11.1608% +/- 0.244735%
> > (Student's t, pooled s = 0.03513)
> >
> > However, since the mean latency corresponds to the amount of irqsoff
> > processing we have to do for a CS interrupt, we only need to speed that
> > up to benefit not just system latency but our own throughput.
> >
> > v2: Remember to defer submissions when under reset.
> > v4: Only use direct submission for new requests
> > v5: Be aware that with mixing direct tasklet evaluation and deferred
> > tasklets, we may end up idling before running the deferred tasklet.
> >
> > Testcase: igt/gem_exec_latency/*rthog*
> > References: 27af5eea54d1 ("drm/i915: Move execlists irq handler to a bottom half")
> > Suggested-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > ---
> > drivers/gpu/drm/i915/i915_gem.h | 5 +
> > drivers/gpu/drm/i915/i915_irq.c | 11 +-
> > drivers/gpu/drm/i915/intel_engine_cs.c | 8 +-
> > drivers/gpu/drm/i915/intel_lrc.c | 147 ++++++++++++++----------
> > drivers/gpu/drm/i915/intel_ringbuffer.h | 1 -
> > 5 files changed, 98 insertions(+), 74 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
> > index 261da577829a..7892ac773916 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.h
> > +++ b/drivers/gpu/drm/i915/i915_gem.h
> > @@ -88,4 +88,9 @@ static inline void __tasklet_enable_sync_once(struct tasklet_struct *t)
> > tasklet_kill(t);
> > }
> >
> > +static inline bool __tasklet_is_enabled(const struct tasklet_struct *t)
> > +{
> > + return likely(!atomic_read(&t->count));
> > +}
> > +
>
> For the unlikely-likely chain from
> __submit_queue->reset_in_progress->__tasklet_is_enabled I think it would
> be better to drop the likely/unlikely from low-level helpers and put the
> one unlikely into the __submit_queue.
Tasklets are rarely disabled; I think that's quite important to stress.
Tasklets do not function very well (heavy spinning) while disabled.
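To illustrate the suggested shape in a standalone sketch (the struct names,
the reset_in_progress() body and the helper names here are stand-ins, not the
driver's actual code): the low-level predicate stays hint-free, and the single
unlikely() moves up into the submission path, as proposed above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel types under discussion. */
struct tasklet { atomic_int count; };		/* 0 => enabled, >0 => disabled */
struct execlists { struct tasklet tasklet; };

/* Helper kept free of branch hints, per the review suggestion. */
static bool tasklet_is_enabled(const struct tasklet *t)
{
	return atomic_load(&t->count) == 0;
}

static bool reset_in_progress(const struct execlists *e)
{
	/* Modelled directly on the disabled-tasklet state for this sketch. */
	return !tasklet_is_enabled(&e->tasklet);
}

/* The one unlikely() lives at the top-level caller instead. */
#define unlikely(x) __builtin_expect(!!(x), 0)

static bool submit_queue(struct execlists *e)
{
	if (unlikely(reset_in_progress(e)))
		return false;	/* defer until the engine restarts */
	return true;		/* proceed with direct submission */
}
```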
> > #endif /* __I915_GEM_H__ */
> > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> > index 46aaef5c1851..316d0b08d40f 100644
> > --- a/drivers/gpu/drm/i915/i915_irq.c
> > +++ b/drivers/gpu/drm/i915/i915_irq.c
> > @@ -1469,14 +1469,10 @@ static void snb_gt_irq_handler(struct drm_i915_private *dev_priv,
> > static void
> > gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
> > {
> > - struct intel_engine_execlists * const execlists = &engine->execlists;
> > bool tasklet = false;
> >
> > - if (iir & GT_CONTEXT_SWITCH_INTERRUPT) {
> > - if (READ_ONCE(engine->execlists.active))
>
> What is the thinking behind this change? It used to be that we scheduled
> the tasklet only when we knew we are expecting interrupts and now we
> don't care any more for some reason?
The filtering is done inside process_csb(). We filtered on active
previously as some interrupts were seemingly going astray; now I am much
more confident that all are accounted for.
> > - tasklet = !test_and_set_bit(ENGINE_IRQ_EXECLIST,
> > - &engine->irq_posted);
>
> And this is gone as well. Can you put a paragraph in the commit message
> explaining the change? It doesn't seem immediately connected with direct
> submission.
This removes one heavyweight atomic operation from the latency-sensitive
interrupt handler.
> > + if (iir & GT_CONTEXT_SWITCH_INTERRUPT)
> > + tasklet = true;
> >
> > if (iir & GT_RENDER_USER_INTERRUPT) {
> > notify_ring(engine);
> > @@ -1484,7 +1480,7 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
> > }
> >
> > if (tasklet)
> > - tasklet_hi_schedule(&execlists->tasklet);
> > + tasklet_hi_schedule(&engine->execlists.tasklet);
> > }
> >
> > static void gen8_gt_irq_ack(struct drm_i915_private *i915,
> > @@ -2216,7 +2212,6 @@ static irqreturn_t cherryview_irq_handler(int irq, void *arg)
> >
> > I915_WRITE(VLV_IER, ier);
> > I915_WRITE(GEN8_MASTER_IRQ, GEN8_MASTER_IRQ_CONTROL);
> > - POSTING_READ(GEN8_MASTER_IRQ);
>
> What is this?
Something that I haven't managed to kill yet.
> > diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> > index 7209c22798e6..ace93958689e 100644
> > --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> > +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> > @@ -1353,12 +1353,10 @@ static void intel_engine_print_registers(const struct intel_engine_cs *engine,
> > ptr = I915_READ(RING_CONTEXT_STATUS_PTR(engine));
> > read = GEN8_CSB_READ_PTR(ptr);
> > write = GEN8_CSB_WRITE_PTR(ptr);
> > - drm_printf(m, " Execlist CSB read %d [%d cached], write %d [%d from hws], interrupt posted? %s, tasklet queued? %s (%s)\n",
> > + drm_printf(m, " Execlist CSB read %d [%d cached], write %d [%d from hws], tasklet queued? %s (%s)\n",
> > read, execlists->csb_head,
> > write,
> > intel_read_status_page(engine, intel_hws_csb_write_index(engine->i915)),
> > - yesno(test_bit(ENGINE_IRQ_EXECLIST,
> > - &engine->irq_posted)),
> > yesno(test_bit(TASKLET_STATE_SCHED,
> > &engine->execlists.tasklet.state)),
> > enableddisabled(!atomic_read(&engine->execlists.tasklet.count)));
> > @@ -1570,11 +1568,9 @@ void intel_engine_dump(struct intel_engine_cs *engine,
> > spin_unlock(&b->rb_lock);
> > local_irq_restore(flags);
> >
> > - drm_printf(m, "IRQ? 0x%lx (breadcrumbs? %s) (execlists? %s)\n",
> > + drm_printf(m, "IRQ? 0x%lx (breadcrumbs? %s)\n",
> > engine->irq_posted,
> > yesno(test_bit(ENGINE_IRQ_BREADCRUMB,
> > - &engine->irq_posted)),
> > - yesno(test_bit(ENGINE_IRQ_EXECLIST,
> > &engine->irq_posted)));
> >
> > drm_printf(m, "HWSP:\n");
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > index 5a12b8fc9d8f..c82efa3ac105 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -562,13 +562,15 @@ static void complete_preempt_context(struct intel_engine_execlists *execlists)
> > {
> > GEM_BUG_ON(!execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT));
> >
> > + __unwind_incomplete_requests(container_of(execlists,
> > + typeof(struct intel_engine_cs),
> > + execlists));
> > execlists_cancel_port_requests(execlists);
> > - execlists_unwind_incomplete_requests(execlists);
>
> Is the ordering change significant and why?
Mostly for consistency and reasoning about request reference lifetimes.
(Unwind => we retain the request reference, as it is moved back to the
protected execution lists.)
> > +static void execlists_submission_tasklet(unsigned long data)
> > +{
> > + struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
> > + unsigned long flags;
> > +
> > + GEM_TRACE("%s awake?=%d, active=%x\n",
> > + engine->name,
> > + engine->i915->gt.awake,
> > + engine->execlists.active);
> > +
> > + spin_lock_irqsave(&engine->timeline.lock, flags);
> > +
> > + if (engine->i915->gt.awake) /* we may be delayed until after we idle! */
> > + __execlists_submission_tasklet(engine);
>
> Sounds quite bad! this means we fail to process pending CSB. And going
> idle syncs the tasklets so what am I missing?
That tasklets get kicked randomly was, I think, the culprit.
> > +static void __submit_queue(struct intel_engine_cs *engine)
> > +{
> > + struct intel_engine_execlists * const execlists = &engine->execlists;
> > +
> > + if (reset_in_progress(execlists))
> > + return; /* defer until we restart the engine following reset */
> > +
> > + if (execlists->tasklet.func == execlists_submission_tasklet)
>
> What is this check determining?
That we can call __execlists_submission_tasklet, i.e. it is not guc,
veng or anything weirder.
> Are we always calling it directly even if the ports are busy? Wouldn't
> it be better to schedule in that case?
No, we are only calling it if we have a more important request than
either port (see queue_priority).
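As a standalone sketch of that dispatch check (types and counters here are
illustrative stand-ins for the kernel's tasklet machinery, not the driver's
API): direct invocation only happens when the installed tasklet function is
the execlists one; any other backend falls back to scheduling.

```c
/* Simplified model of the check in __submit_queue(): direct submission
 * is only safe when the engine's tasklet is the execlists one, i.e. not
 * GuC, virtual engine, or anything weirder. */

typedef void (*tasklet_fn)(unsigned long);

struct engine {
	tasklet_fn tasklet_func;
	int direct_calls;	/* times we invoked the tasklet inline */
	int scheduled;		/* times we deferred to tasklet_hi_schedule() */
};

static void execlists_submission_tasklet(unsigned long data)
{
	((struct engine *)data)->direct_calls++;
}

static void guc_submission_tasklet(unsigned long data)
{
	((struct engine *)data)->direct_calls++;	/* never taken below */
}

static void submit_queue(struct engine *e)
{
	if (e->tasklet_func == execlists_submission_tasklet)
		e->tasklet_func((unsigned long)e);	/* direct, locks held */
	else
		e->scheduled++;				/* defer to softirq */
}
```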
> > +static void reset_csb_pointers(struct intel_engine_execlists *execlists)
> > +{
> > + /*
> > + * After a reset, the HW starts writing into CSB entry [0]. We
> > + * therefore have to set our HEAD pointer back one entry so that
> > + * the *first* entry we check is entry 0. To complicate this further,
> > + * as we don't wait for the first interrupt after reset, we have to
> > + * fake the HW write to point back to the last entry so that our
> > + * inline comparison of our cached head position against the last HW
> > + * write works even before the first interrupt.
> > + */
> > + execlists->csb_head = GEN8_CSB_ENTRIES - 1;
> > + WRITE_ONCE(*execlists->csb_write, (GEN8_CSB_ENTRIES - 1) | 0xff << 16);
> > +}
>
> Hmm this makes me think there should be another prep patch before direct
> submission. Need to build a clearer picture before I can suggest how.
For what? This was already covered in the prep patches (handling the CSB
on the reset vs resume paths, and all the fallout from faking the HW
write). I don't actually need it in a function; it was just handy to do
so as iirc I wanted to use it elsewhere, but fortunately killed off that
caller. So the prep patch would just be making it into a function.
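For reference, the wraparound arithmetic the comment describes can be checked
in a tiny userspace model (the walk below is a stand-in for the real
process_csb() loop, not the driver's code): with the head parked on the last
entry and the fake HW write in place, the first entry examined is 0.

```c
#define GEN8_CSB_ENTRIES 6	/* context-status entries per engine */

/* Minimal model of the post-reset CSB walk: head sits on the last
 * entry, so the first increment wraps around and lands on entry 0. */
static int process_csb(unsigned *head, unsigned write, int *order, int max)
{
	int n = 0;

	while (*head != write && n < max) {
		*head = (*head + 1) % GEN8_CSB_ENTRIES;
		order[n++] = *head;	/* entry index we would parse */
	}
	return n;
}
```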
-Chris