[Intel-gfx] [PATCH 07/31] drm/i915/execlists: Direct submission of new requests (avoid tasklet/ksoftirqd)
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Wed Jun 27 15:21:24 UTC 2018
On 27/06/2018 14:29, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-06-27 14:15:22)
>>
>> On 27/06/2018 11:58, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2018-06-27 11:40:32)
>>>>
>>>> On 25/06/2018 10:48, Chris Wilson wrote:
>>>>> Back in commit 27af5eea54d1 ("drm/i915: Move execlists irq handler to a
>>>>> bottom half"), we came to the conclusion that running our CSB processing
>>>>> and ELSP submission from inside the irq handler was a bad idea. A really
>>>>> bad idea as we could impose nearly 1s latency on other users of the
>>>>> system, on average! Deferring our work to a tasklet allowed us to do the
>>>>> processing with irqs enabled, reducing the impact to an average of about
>>>>> 50us.
>>>>>
>>>>> We have since eradicated the use of forcewaked mmio from inside the CSB
>>>>> processing and ELSP submission, bringing the impact down to around 5us
>>>>> (on Kabylake); an order of magnitude better than our measurements 2
>>>>> years ago on Broadwell and only about 2x worse on average than the
>>>>> gem_syslatency on an unladen system.
>>>>>
>>>>> In this iteration of the tasklet-vs-direct submission debate, we seek a
>>>>> compromise whereby we submit new requests immediately to the HW but
>>>>> defer processing the CS interrupt onto a tasklet. We gain the advantage
>>>>> of low-latency and ksoftirqd avoidance when waking up the HW, while
>>>>> avoiding the system-wide starvation of our CS irq-storms.
>>>>>
>>>>> Comparing the impact on the maximum latency observed (that is the time
>>>>> stolen from an RT process) over a 120s interval, repeated several times
>>>>> (using gem_syslatency, similar to RT's cyclictest) while the system is
>>>>> fully laden with i915 nops, we see that direct submission can actually
>>>>> improve the worst case.
>>>>>
>>>>> Maximum latency in microseconds of a third party RT thread
>>>>> (gem_syslatency -t 120 -f 2)
>>>>> x Always using tasklets (a couple of >1000us outliers removed)
>>>>> + Only using tasklets from CS irq, direct submission of requests
>>>>> +------------------------------------------------------------------------+
>>>>> | + |
>>>>> | + |
>>>>> | + |
>>>>> | + + |
>>>>> | + + + |
>>>>> | + + + + x x x |
>>>>> | +++ + + + x x x x x x |
>>>>> | +++ + ++ + + *x x x x x x |
>>>>> | +++ + ++ + * *x x * x x x |
>>>>> | + +++ + ++ * * +*xxx * x x xx |
>>>>> | * +++ + ++++* *x+**xx+ * x x xxxx x |
>>>>> | **x++++*++**+*x*x****x+ * +x xx xxxx x x |
>>>>> |x* ******+***************++*+***xxxxxx* xx*x xxx + x+|
>>>>> | |__________MA___________| |
>>>>> | |______M__A________| |
>>>>> +------------------------------------------------------------------------+
>>>>> N Min Max Median Avg Stddev
>>>>> x 118 91 186 124 125.28814 16.279137
>>>>> + 120 92 187 109 112.00833 13.458617
>>>>> Difference at 95.0% confidence
>>>>> -13.2798 +/- 3.79219
>>>>> -10.5994% +/- 3.02677%
>>>>> (Student's t, pooled s = 14.9237)
>>>>>
>>>>> However the mean latency is adversely affected:
>>>>>
>>>>> Mean latency in microseconds of a third party RT thread
>>>>> (gem_syslatency -t 120 -f 1)
>>>>> x Always using tasklets
>>>>> + Only using tasklets from CS irq, direct submission of requests
>>>>> +------------------------------------------------------------------------+
>>>>> | xxxxxx + ++ |
>>>>> | xxxxxx + ++ |
>>>>> | xxxxxx + +++ ++ |
>>>>> | xxxxxxx +++++ ++ |
>>>>> | xxxxxxx +++++ ++ |
>>>>> | xxxxxxx +++++ +++ |
>>>>> | xxxxxxx + ++++++++++ |
>>>>> | xxxxxxxx ++ ++++++++++ |
>>>>> | xxxxxxxx ++ ++++++++++ |
>>>>> | xxxxxxxxxx +++++++++++++++ |
>>>>> | xxxxxxxxxxx x +++++++++++++++ |
>>>>> |x xxxxxxxxxxxxx x + + ++++++++++++++++++ +|
>>>>> | |__A__| |
>>>>> | |____A___| |
>>>>> +------------------------------------------------------------------------+
>>>>> N Min Max Median Avg Stddev
>>>>> x 120 3.506 3.727 3.631 3.6321417 0.02773109
>>>>> + 120 3.834 4.149 4.039 4.0375167 0.041221676
>>>>> Difference at 95.0% confidence
>>>>> 0.405375 +/- 0.00888913
>>>>> 11.1608% +/- 0.244735%
>>>>> (Student's t, pooled s = 0.03513)
>>>>>
>>>>> However, since the mean latency corresponds to the amount of irqsoff
>>>>> processing we have to do for a CS interrupt, we only need to speed that
>>>>> up to benefit not just system latency but our own throughput.
>>>>>
>>>>> v2: Remember to defer submissions when under reset.
>>>>> v4: Only use direct submission for new requests
>>>>> v5: Be aware that with mixing direct tasklet evaluation and deferred
>>>>> tasklets, we may end up idling before running the deferred tasklet.
>>>>>
>>>>> Testcase: igt/gem_exec_latency/*rthog*
>>>>> References: 27af5eea54d1 ("drm/i915: Move execlists irq handler to a bottom half")
>>>>> Suggested-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>> ---
>>>>> drivers/gpu/drm/i915/i915_gem.h | 5 +
>>>>> drivers/gpu/drm/i915/i915_irq.c | 11 +-
>>>>> drivers/gpu/drm/i915/intel_engine_cs.c | 8 +-
>>>>> drivers/gpu/drm/i915/intel_lrc.c | 147 ++++++++++++++----------
>>>>> drivers/gpu/drm/i915/intel_ringbuffer.h | 1 -
>>>>> 5 files changed, 98 insertions(+), 74 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
>>>>> index 261da577829a..7892ac773916 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_gem.h
>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.h
>>>>> @@ -88,4 +88,9 @@ static inline void __tasklet_enable_sync_once(struct tasklet_struct *t)
>>>>> tasklet_kill(t);
>>>>> }
>>>>>
>>>>> +static inline bool __tasklet_is_enabled(const struct tasklet_struct *t)
>>>>> +{
>>>>> + return likely(!atomic_read(&t->count));
>>>>> +}
>>>>> +
>>>>
>>>> For the unlikely-likely chain from
>>>> __submit_queue->reset_in_progress->__tasklet_is_enabled, I think it would
>>>> be better to drop the likely/unlikely from the low-level helpers and put
>>>> the one unlikely into __submit_queue.
>>>
>>> Tasklets are rarely disabled - I think that's quite important to stress.
>>> Tasklets do not function very well (heavy spinning) while disabled.
>>
>> I think we shouldn't be concerned by that. The purpose of this is to wrap
>> internal implementation details we shouldn't even be touching if we could
>> help it, and I feel the correct thing is to express the branching hint
>> higher up the stack. The caller wants to optimize certain scenarios, while
>> the helper doesn't know who is calling it or why. On top of that we have
>> the likely-unlikely chain which I mentioned. Even just one unlikely in
>> reset_in_progress would probably be enough for what you wanted to ensure.
>
> I already acquiesced and did exactly that.
Okay.
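
For illustration, a minimal sketch of the placement being argued for: the
helper stays hint-free and the single unlikely() lives in reset_in_progress(),
which the submission path then tests. Only __tasklet_is_enabled() above comes
from the patch; the other bodies here are condensed assumptions, not the
patch itself.

static inline bool __tasklet_is_enabled(const struct tasklet_struct *t)
{
        return !atomic_read(&t->count); /* no hint inside the helper */
}

static inline bool reset_in_progress(const struct intel_engine_execlists *execlists)
{
        /* Tasklets are only disabled around reset, so bias the common case. */
        return unlikely(!__tasklet_is_enabled(&execlists->tasklet));
}

static void __submit_queue(struct intel_engine_cs *engine)
{
        if (reset_in_progress(&engine->execlists))
                return; /* defer new submissions until after the reset */

        __execlists_submission_tasklet(engine);
}

Either way the compiler sees the same hint on the hot path; the difference is
only about which layer documents the expectation.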
>
>>>>> #endif /* __I915_GEM_H__ */
>>>>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>>>>> index 46aaef5c1851..316d0b08d40f 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_irq.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>>>>> @@ -1469,14 +1469,10 @@ static void snb_gt_irq_handler(struct drm_i915_private *dev_priv,
>>>>> static void
>>>>> gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
>>>>> {
>>>>> - struct intel_engine_execlists * const execlists = &engine->execlists;
>>>>> bool tasklet = false;
>>>>>
>>>>> - if (iir & GT_CONTEXT_SWITCH_INTERRUPT) {
>>>>> - if (READ_ONCE(engine->execlists.active))
>>>>
>>>> What is the thinking behind this change? It used to be that we scheduled
>>>> the tasklet only when we knew we were expecting interrupts, and now we
>>>> don't care any more for some reason?
>>>
>>> The filtering is done inside process_csb(). We filtered on active
>>> previously as some interrupts were seemingly going astray; now I am much
>>> more confident that all are accounted for.
>>
>> Hm, how? We can filter out extra interrupts, but we can't filter in ones
>> that went missing?
>
> Not quite, since we process the CSB more frequently than interrupts, we
> may also get an interrupt after having already processed the CSB.
If it has all been processed already then why schedule the tasklet?
Or, from a different angle - does this belong in this patch?
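
For context, the filtering being referred to amounts to process_csb()
comparing its cached read pointer against the write pointer advertised in the
HWSP and returning early when there is nothing new. A condensed sketch - only
csb_head appears in the quoted code, the csb_write bookkeeping name and the
loop body are assumptions for illustration:

static void process_csb(struct intel_engine_cs *engine)
{
        struct intel_engine_execlists * const execlists = &engine->execlists;
        unsigned int head = execlists->csb_head;
        unsigned int tail = READ_ONCE(*execlists->csb_write);

        /*
         * An interrupt may arrive after the CSB has already been drained by
         * a direct submission; in that case head == tail and the redundant
         * tasklet run does nothing.
         */
        if (head == tail)
                return;

        while (head != tail) {
                head = (head + 1) % GEN8_CSB_ENTRIES;
                /* ... decode the status entry, complete/submit requests ... */
        }

        execlists->csb_head = head;
}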
>
>>>>> - tasklet = !test_and_set_bit(ENGINE_IRQ_EXECLIST,
>>>>> - &engine->irq_posted);
>>>>
>>>> And this is gone as well. Can you put a paragraph in the commit message
>>>> explaining the change? It doesn't seem immediately connected with direct
>>>> submission.
>>>
>>> Removing one heavyweight atomic operation in the latency sensitive
>>> interrupt.
>>
>> But at a higher level - why don't we need this any more?
>
> Because we are using the CSB. Ok, I think that can be a separate step.
Cool.
>
>>>>> + if (iir & GT_CONTEXT_SWITCH_INTERRUPT)
>>>>> + tasklet = true;
>>>>>
>>>>> if (iir & GT_RENDER_USER_INTERRUPT) {
>>>>> notify_ring(engine);
>>>>> @@ -1484,7 +1480,7 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
>>>>> }
>>>>>
>>>>> if (tasklet)
>>>>> - tasklet_hi_schedule(&execlists->tasklet);
>>>>> + tasklet_hi_schedule(&engine->execlists.tasklet);
>>>>> }
>>>>>
>>>>> static void gen8_gt_irq_ack(struct drm_i915_private *i915,
>>>>> @@ -2216,7 +2212,6 @@ static irqreturn_t cherryview_irq_handler(int irq, void *arg)
>>>>>
>>>>> I915_WRITE(VLV_IER, ier);
>>>>> I915_WRITE(GEN8_MASTER_IRQ, GEN8_MASTER_IRQ_CONTROL);
>>>>> - POSTING_READ(GEN8_MASTER_IRQ);
>>>>
>>>> What is this?
>>>
>>> Something that I haven't managed to kill yet.
>>
>> No sneaking in this patch then either! :D
>
> Just close your eyes. Nothing to see here.
You shall not pass! :D
>
>>>>> diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
>>>>> index 7209c22798e6..ace93958689e 100644
>>>>> --- a/drivers/gpu/drm/i915/intel_engine_cs.c
>>>>> +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
>>>>> @@ -1353,12 +1353,10 @@ static void intel_engine_print_registers(const struct intel_engine_cs *engine,
>>>>> ptr = I915_READ(RING_CONTEXT_STATUS_PTR(engine));
>>>>> read = GEN8_CSB_READ_PTR(ptr);
>>>>> write = GEN8_CSB_WRITE_PTR(ptr);
>>>>> - drm_printf(m, " Execlist CSB read %d [%d cached], write %d [%d from hws], interrupt posted? %s, tasklet queued? %s (%s)\n",
>>>>> + drm_printf(m, " Execlist CSB read %d [%d cached], write %d [%d from hws], tasklet queued? %s (%s)\n",
>>>>> read, execlists->csb_head,
>>>>> write,
>>>>> intel_read_status_page(engine, intel_hws_csb_write_index(engine->i915)),
>>>>> - yesno(test_bit(ENGINE_IRQ_EXECLIST,
>>>>> - &engine->irq_posted)),
>>>>> yesno(test_bit(TASKLET_STATE_SCHED,
>>>>> &engine->execlists.tasklet.state)),
>>>>> enableddisabled(!atomic_read(&engine->execlists.tasklet.count)));
>>>>> @@ -1570,11 +1568,9 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>>>>> spin_unlock(&b->rb_lock);
>>>>> local_irq_restore(flags);
>>>>>
>>>>> - drm_printf(m, "IRQ? 0x%lx (breadcrumbs? %s) (execlists? %s)\n",
>>>>> + drm_printf(m, "IRQ? 0x%lx (breadcrumbs? %s)\n",
>>>>> engine->irq_posted,
>>>>> yesno(test_bit(ENGINE_IRQ_BREADCRUMB,
>>>>> - &engine->irq_posted)),
>>>>> - yesno(test_bit(ENGINE_IRQ_EXECLIST,
>>>>> &engine->irq_posted)));
>>>>>
>>>>> drm_printf(m, "HWSP:\n");
>>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>>>>> index 5a12b8fc9d8f..c82efa3ac105 100644
>>>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>>>>> @@ -562,13 +562,15 @@ static void complete_preempt_context(struct intel_engine_execlists *execlists)
>>>>> {
>>>>> GEM_BUG_ON(!execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT));
>>>>>
>>>>> + __unwind_incomplete_requests(container_of(execlists,
>>>>> + typeof(struct intel_engine_cs),
>>>>> + execlists));
>>>>> execlists_cancel_port_requests(execlists);
>>>>> - execlists_unwind_incomplete_requests(execlists);
>>>>
>>>> Is the ordering change significant and why?
>>>
>>> Mostly for consistency and reasoning about request reference lifetimes.
>>> (Unwind => we retain the request reference, as it is moved back to the
>>> protected execution lists.)
>>
>> Remove from this patch then?
>
> Move to an earlier patch then.
Cool.
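
To make the lifetime argument concrete, here is the reordered hunk again with
the reasoning written out as comments; the ordering is taken from the quoted
diff, while the comments, the simplified container_of() spelling and the
trailing elision are editorial:

static void complete_preempt_context(struct intel_engine_execlists *execlists)
{
        struct intel_engine_cs *engine =
                container_of(execlists, struct intel_engine_cs, execlists);

        GEM_BUG_ON(!execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT));

        /*
         * Unwind first: the incomplete requests are moved back onto the
         * engine's protected execution lists while the ELSP ports still
         * hold their references, so a request is never left unowned.
         */
        __unwind_incomplete_requests(engine);

        /* Only now drop the references held by the ports. */
        execlists_cancel_port_requests(execlists);

        /* ... */
}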
>
>>>>> +static void execlists_submission_tasklet(unsigned long data)
>>>>> +{
>>>>> + struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
>>>>> + unsigned long flags;
>>>>> +
>>>>> + GEM_TRACE("%s awake?=%d, active=%x\n",
>>>>> + engine->name,
>>>>> + engine->i915->gt.awake,
>>>>> + engine->execlists.active);
>>>>> +
>>>>> + spin_lock_irqsave(&engine->timeline.lock, flags);
>>>>> +
>>>>> + if (engine->i915->gt.awake) /* we may be delayed until after we idle! */
>>>>> + __execlists_submission_tasklet(engine);
>>>>
>>>> Sounds quite bad! This means we fail to process the pending CSB. And going
>>>> idle syncs the tasklets, so what am I missing?
>>>
>>> That tasklets get kicked randomly was, I think, the culprit.
>>
>> What do you mean? I hope we have the busy-idle transitions well controlled
>> and we know when we should and shouldn't expect a tasklet. If we sync them
>> when transitioning to idle they cannot happen. Otherwise we had better be
>> active! GEM_BUG_ON(!engine->i915->gt.awake) instead? Does that trigger?!
>
> tasklet_schedule() is called off the main path, without locking, so it is
> unsynchronized with parking. Just because.
I need to understand this - which main path? On submission we will have
called mark_busy. After the last request we will idle the engines and sync
the tasklet.
There is even a GEM_BUG_ON(!engine->i915->gt.awake); in
__execlists_submission_tasklet.
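
For reference, the ordering being described is roughly the following;
park_engine()/park_gt() are illustrative names only, not the driver's actual
parking functions:

static void park_engine(struct intel_engine_cs *engine)
{
        /* The last request has been retired; no new submissions can arrive. */
        tasklet_kill(&engine->execlists.tasklet);
        GEM_BUG_ON(test_bit(TASKLET_STATE_SCHED,
                            &engine->execlists.tasklet.state));
}

static void park_gt(struct drm_i915_private *i915)
{
        struct intel_engine_cs *engine;
        enum intel_engine_id id;

        for_each_engine(engine, i915, id)
                park_engine(engine);

        /* Cleared only after every tasklet has drained. */
        i915->gt.awake = false;
}

If parking always drains the tasklet before gt.awake is cleared, a tasklet
body should never observe !awake, which is exactly what the GEM_BUG_ON in
__execlists_submission_tasklet asserts.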
Regards,
Tvrtko