[Intel-gfx] [PATCH 07/31] drm/i915/execlists: Direct submission of new requests (avoid tasklet/ksoftirqd)
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Wed Jun 27 15:21:24 UTC 2018
On 27/06/2018 14:29, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-06-27 14:15:22)
>>
>> On 27/06/2018 11:58, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2018-06-27 11:40:32)
>>>>
>>>> On 25/06/2018 10:48, Chris Wilson wrote:
>>>>> Back in commit 27af5eea54d1 ("drm/i915: Move execlists irq handler to a
>>>>> bottom half"), we came to the conclusion that running our CSB processing
>>>>> and ELSP submission from inside the irq handler was a bad idea. A really
>>>>> bad idea as we could impose nearly 1s latency on other users of the
>>>>> system, on average! Deferring our work to a tasklet allowed us to do the
>>>>> processing with irqs enabled, reducing the impact to an average of about
>>>>> 50us.
>>>>>
>>>>> We have since eradicated the use of forcewaked mmio from inside the CSB
>>>>> processing and ELSP submission, bringing the impact down to around 5us
>>>>> (on Kabylake); an order of magnitude better than our measurements 2
>>>>> years ago on Broadwell and only about 2x worse on average than the
>>>>> gem_syslatency on an unladen system.
>>>>>
>>>>> In this iteration of the tasklet-vs-direct submission debate, we seek a
>>>>> compromise whereby we submit new requests immediately to the HW but
>>>>> defer processing the CS interrupt onto a tasklet. We gain the advantage
>>>>> of low-latency and ksoftirqd avoidance when waking up the HW, while
>>>>> avoiding the system-wide starvation of our CS irq-storms.
>>>>>
>>>>> Comparing the impact on the maximum latency observed (that is the time
>>>>> stolen from an RT process) over a 120s interval, repeated several times
>>>>> (using gem_syslatency, similar to RT's cyclictest) while the system is
>>>>> fully laden with i915 nops, we see that direct submission can actually
>>>>> improve the worst case.
>>>>>
>>>>> Maximum latency in microseconds of a third party RT thread
>>>>> (gem_syslatency -t 120 -f 2)
>>>>> x Always using tasklets (a couple of >1000us outliers removed)
>>>>> + Only using tasklets from CS irq, direct submission of requests
>>>>> +------------------------------------------------------------------------+
>>>>> | + |
>>>>> | + |
>>>>> | + |
>>>>> | + + |
>>>>> | + + + |
>>>>> | + + + + x x x |
>>>>> | +++ + + + x x x x x x |
>>>>> | +++ + ++ + + *x x x x x x |
>>>>> | +++ + ++ + * *x x * x x x |
>>>>> | + +++ + ++ * * +*xxx * x x xx |
>>>>> | * +++ + ++++* *x+**xx+ * x x xxxx x |
>>>>> | **x++++*++**+*x*x****x+ * +x xx xxxx x x |
>>>>> |x* ******+***************++*+***xxxxxx* xx*x xxx + x+|
>>>>> | |__________MA___________| |
>>>>> | |______M__A________| |
>>>>> +------------------------------------------------------------------------+
>>>>> N Min Max Median Avg Stddev
>>>>> x 118 91 186 124 125.28814 16.279137
>>>>> + 120 92 187 109 112.00833 13.458617
>>>>> Difference at 95.0% confidence
>>>>> -13.2798 +/- 3.79219
>>>>> -10.5994% +/- 3.02677%
>>>>> (Student's t, pooled s = 14.9237)
>>>>>
>>>>> However the mean latency is adversely affected:
>>>>>
>>>>> Mean latency in microseconds of a third party RT thread
>>>>> (gem_syslatency -t 120 -f 1)
>>>>> x Always using tasklets
>>>>> + Only using tasklets from CS irq, direct submission of requests
>>>>> +------------------------------------------------------------------------+
>>>>> | xxxxxx + ++ |
>>>>> | xxxxxx + ++ |
>>>>> | xxxxxx + +++ ++ |
>>>>> | xxxxxxx +++++ ++ |
>>>>> | xxxxxxx +++++ ++ |
>>>>> | xxxxxxx +++++ +++ |
>>>>> | xxxxxxx + ++++++++++ |
>>>>> | xxxxxxxx ++ ++++++++++ |
>>>>> | xxxxxxxx ++ ++++++++++ |
>>>>> | xxxxxxxxxx +++++++++++++++ |
>>>>> | xxxxxxxxxxx x +++++++++++++++ |
>>>>> |x xxxxxxxxxxxxx x + + ++++++++++++++++++ +|
>>>>> | |__A__| |
>>>>> | |____A___| |
>>>>> +------------------------------------------------------------------------+
>>>>> N Min Max Median Avg Stddev
>>>>> x 120 3.506 3.727 3.631 3.6321417 0.02773109
>>>>> + 120 3.834 4.149 4.039 4.0375167 0.041221676
>>>>> Difference at 95.0% confidence
>>>>> 0.405375 +/- 0.00888913
>>>>> 11.1608% +/- 0.244735%
>>>>> (Student's t, pooled s = 0.03513)
>>>>>
>>>>> However, since the mean latency corresponds to the amount of irqsoff
>>>>> processing we have to do for a CS interrupt, we only need to speed that
>>>>> up to benefit not just system latency but our own throughput.
>>>>>
>>>>> v2: Remember to defer submissions when under reset.
>>>>> v4: Only use direct submission for new requests
>>>>> v5: Be aware that with mixing direct tasklet evaluation and deferred
>>>>> tasklets, we may end up idling before running the deferred tasklet.
>>>>>
>>>>> Testcase: igt/gem_exec_latency/*rthog*
>>>>> References: 27af5eea54d1 ("drm/i915: Move execlists irq handler to a bottom half")
>>>>> Suggested-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>> ---
>>>>> drivers/gpu/drm/i915/i915_gem.h | 5 +
>>>>> drivers/gpu/drm/i915/i915_irq.c | 11 +-
>>>>> drivers/gpu/drm/i915/intel_engine_cs.c | 8 +-
>>>>> drivers/gpu/drm/i915/intel_lrc.c | 147 ++++++++++++++----------
>>>>> drivers/gpu/drm/i915/intel_ringbuffer.h | 1 -
>>>>> 5 files changed, 98 insertions(+), 74 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
>>>>> index 261da577829a..7892ac773916 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_gem.h
>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.h
>>>>> @@ -88,4 +88,9 @@ static inline void __tasklet_enable_sync_once(struct tasklet_struct *t)
>>>>> tasklet_kill(t);
>>>>> }
>>>>>
>>>>> +static inline bool __tasklet_is_enabled(const struct tasklet_struct *t)
>>>>> +{
>>>>> + return likely(!atomic_read(&t->count));
>>>>> +}
>>>>> +
>>>>
>>>> For the unlikely-likely chain from
>>>> __submit_queue->reset_in_progress->__tasklet_is_enabled, I think it would
>>>> be better to drop the likely/unlikely from the low-level helpers and put
>>>> the one unlikely into __submit_queue.
>>>
>>> Tasklets are rarely disabled - I think that's quite important to stress.
>>> Tasklets do not function very well (heavy spinning) while disabled.
>>
>> I think we shouldn't be concerned by that. The purpose of this is to wrap
>> internal implementation details we shouldn't even be touching if we could
>> help it, and I feel the correct thing is to express the branching hint
>> higher up the stack. The caller wants to optimize certain scenarios, while
>> the helper doesn't know who is calling it or why. On top of that we have
>> the likely-unlikely chain which I mentioned. Even just one unlikely in
>> reset_in_progress would probably be enough for what you wanted to ensure.
>
> I already acquiesced and did exactly that.
Okay.
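
For illustration, a minimal sketch of the placement being argued for: the
helper stays hint-free and the single unlikely() lives in reset_in_progress(),
which the submission path then tests. Only __tasklet_is_enabled() above comes
from the patch; the other bodies here are condensed assumptions, not the
patch itself.

static inline bool __tasklet_is_enabled(const struct tasklet_struct *t)
{
        return !atomic_read(&t->count); /* no hint inside the helper */
}

static inline bool reset_in_progress(const struct intel_engine_execlists *execlists)
{
        /* Tasklets are only disabled around reset, so bias the common case. */
        return unlikely(!__tasklet_is_enabled(&execlists->tasklet));
}

static void __submit_queue(struct intel_engine_cs *engine)
{
        if (reset_in_progress(&engine->execlists))
                return; /* defer new submissions until after the reset */

        __execlists_submission_tasklet(engine);
}

Either way the compiler sees the same hint on the hot path; the difference is
only about which layer documents the expectation.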
>
>>>>> #endif /* __I915_GEM_H__ */
>>>>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>>>>> index 46aaef5c1851..316d0b08d40f 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_irq.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>>>>> @@ -1469,14 +1469,10 @@ static void snb_gt_irq_handler(struct drm_i915_private *dev_priv,
>>>>> static void
>>>>> gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
>>>>> {
>>>>> - struct intel_engine_execlists * const execlists = &engine->execlists;
>>>>> bool tasklet = false;
>>>>>
>>>>> - if (iir & GT_CONTEXT_SWITCH_INTERRUPT) {
>>>>> - if (READ_ONCE(engine->execlists.active))
>>>>
>>>> What is the thinking behind this change? It used to be that we scheduled
>>>> the tasklet only when we knew we were expecting interrupts, and now we
>>>> don't care any more for some reason?
>>>
>>> The filtering is done inside process_csb(). We filtered on active
>>> previously as some interrupts were seemingly going astray; now I am much
>>> more confident that all are accounted for.
>>
>> Hm, how? We can filter out extra interrupts, but we can't filter in ones
>> that went missing?
>
> Not quite, since we process the CSB more frequently than interrupts, we
> may also get an interrupt after having already processed the CSB.
If it has all been processed already then why schedule the tasklet?
Or, from a different angle - does this belong in this patch?
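
For context, the filtering being referred to amounts to process_csb()
comparing its cached read pointer against the write pointer advertised in the
HWSP and returning early when there is nothing new. A condensed sketch - only
csb_head appears in the quoted code, the csb_write bookkeeping name and the
loop body are assumptions for illustration:

static void process_csb(struct intel_engine_cs *engine)
{
        struct intel_engine_execlists * const execlists = &engine->execlists;
        unsigned int head = execlists->csb_head;
        unsigned int tail = READ_ONCE(*execlists->csb_write);

        /*
         * An interrupt may arrive after the CSB has already been drained by
         * a direct submission; in that case head == tail and the redundant
         * tasklet run does nothing.
         */
        if (head == tail)
                return;

        while (head != tail) {
                head = (head + 1) % GEN8_CSB_ENTRIES;
                /* ... decode the status entry, complete/submit requests ... */
        }

        execlists->csb_head = head;
}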
>
>>>>> - tasklet = !test_and_set_bit(ENGINE_IRQ_EXECLIST,
>>>>> - &engine->irq_posted);
>>>>
>>>> And this is gone as well. Can you put a paragraph in the commit message
>>>> explaining the change? It doesn't seem immediately connected with direct
>>>> submission.
>>>
>>> Removing one heavyweight atomic operation in the latency sensitive
>>> interrupt.
>>
>> But at a higher level - why don't we need this any more?
>
> Because we are using the CSB. Ok, I think that can be a separate step.
Cool.
>
>>>>> + if (iir & GT_CONTEXT_SWITCH_INTERRUPT)
>>>>> + tasklet = true;
>>>>>
>>>>> if (iir & GT_RENDER_USER_INTERRUPT) {
>>>>> notify_ring(engine);
>>>>> @@ -1484,7 +1480,7 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
>>>>> }
>>>>>
>>>>> if (tasklet)
>>>>> - tasklet_hi_schedule(&execlists->tasklet);
>>>>> + tasklet_hi_schedule(&engine->execlists.tasklet);
>>>>> }
>>>>>
>>>>> static void gen8_gt_irq_ack(struct drm_i915_private *i915,
>>>>> @@ -2216,7 +2212,6 @@ static irqreturn_t cherryview_irq_handler(int irq, void *arg)
>>>>>
>>>>> I915_WRITE(VLV_IER, ier);
>>>>> I915_WRITE(GEN8_MASTER_IRQ, GEN8_MASTER_IRQ_CONTROL);
>>>>> - POSTING_READ(GEN8_MASTER_IRQ);
>>>>
>>>> What is this?
>>>
>>> Something that I haven't managed to kill yet.
>>
>> No sneaking in this patch then either! :D
>
> Just close your eyes. Nothing to see here.
You shall not pass! :D
>
>>>>> diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
>>>>> index 7209c22798e6..ace93958689e 100644
>>>>> --- a/drivers/gpu/drm/i915/intel_engine_cs.c
>>>>> +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
>>>>> @@ -1353,12 +1353,10 @@ static void intel_engine_print_registers(const struct intel_engine_cs *engine,
>>>>> ptr = I915_READ(RING_CONTEXT_STATUS_PTR(engine));
>>>>> read = GEN8_CSB_READ_PTR(ptr);
>>>>> write = GEN8_CSB_WRITE_PTR(ptr);
>>>>> - drm_printf(m, " Execlist CSB read %d [%d cached], write %d [%d from hws], interrupt posted? %s, tasklet queued? %s (%s)\n",
>>>>> + drm_printf(m, " Execlist CSB read %d [%d cached], write %d [%d from hws], tasklet queued? %s (%s)\n",
>>>>> read, execlists->csb_head,
>>>>> write,
>>>>> intel_read_status_page(engine, intel_hws_csb_write_index(engine->i915)),
>>>>> - yesno(test_bit(ENGINE_IRQ_EXECLIST,
>>>>> - &engine->irq_posted)),
>>>>> yesno(test_bit(TASKLET_STATE_SCHED,
>>>>> &engine->execlists.tasklet.state)),
>>>>> enableddisabled(!atomic_read(&engine->execlists.tasklet.count)));
>>>>> @@ -1570,11 +1568,9 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>>>>> spin_unlock(&b->rb_lock);
>>>>> local_irq_restore(flags);
>>>>>
>>>>> - drm_printf(m, "IRQ? 0x%lx (breadcrumbs? %s) (execlists? %s)\n",
>>>>> + drm_printf(m, "IRQ? 0x%lx (breadcrumbs? %s)\n",
>>>>> engine->irq_posted,
>>>>> yesno(test_bit(ENGINE_IRQ_BREADCRUMB,
>>>>> - &engine->irq_posted)),
>>>>> - yesno(test_bit(ENGINE_IRQ_EXECLIST,
>>>>> &engine->irq_posted)));
>>>>>
>>>>> drm_printf(m, "HWSP:\n");
>>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>>>>> index 5a12b8fc9d8f..c82efa3ac105 100644
>>>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>>>>> @@ -562,13 +562,15 @@ static void complete_preempt_context(struct intel_engine_execlists *execlists)
>>>>> {
>>>>> GEM_BUG_ON(!execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT));
>>>>>
>>>>> + __unwind_incomplete_requests(container_of(execlists,
>>>>> + typeof(struct intel_engine_cs),
>>>>> + execlists));
>>>>> execlists_cancel_port_requests(execlists);
>>>>> - execlists_unwind_incomplete_requests(execlists);
>>>>
>>>> Is the ordering change significant and why?
>>>
>>> Mostly for consistency and reasoning about request reference lifetimes.
>>> (Unwind => we retain the request reference, as it is moved back to the
>>> protected execution lists.)
>>
>> Remove from this patch then?
>
> Move to an earlier patch then.
Cool.
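
To make the lifetime argument concrete, here is the reordered hunk again with
the reasoning written out as comments; the ordering is taken from the quoted
diff, while the comments, the simplified container_of() spelling and the
trailing elision are editorial:

static void complete_preempt_context(struct intel_engine_execlists *execlists)
{
        struct intel_engine_cs *engine =
                container_of(execlists, struct intel_engine_cs, execlists);

        GEM_BUG_ON(!execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT));

        /*
         * Unwind first: the incomplete requests are moved back onto the
         * engine's protected execution lists while the ELSP ports still
         * hold their references, so a request is never left unowned.
         */
        __unwind_incomplete_requests(engine);

        /* Only now drop the references held by the ports. */
        execlists_cancel_port_requests(execlists);

        /* ... */
}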
>
>>>>> +static void execlists_submission_tasklet(unsigned long data)
>>>>> +{
>>>>> + struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
>>>>> + unsigned long flags;
>>>>> +
>>>>> + GEM_TRACE("%s awake?=%d, active=%x\n",
>>>>> + engine->name,
>>>>> + engine->i915->gt.awake,
>>>>> + engine->execlists.active);
>>>>> +
>>>>> + spin_lock_irqsave(&engine->timeline.lock, flags);
>>>>> +
>>>>> + if (engine->i915->gt.awake) /* we may be delayed until after we idle! */
>>>>> + __execlists_submission_tasklet(engine);
>>>>
>>>> Sounds quite bad! This means we fail to process the pending CSB. And going
>>>> idle syncs the tasklets, so what am I missing?
>>>
>>> That tasklets get kicked randomly was, I think, the culprit.
>>
>> What do you mean? I hope we have the busy-idle transitions well controlled
>> and we know when we should and shouldn't expect a tasklet. If we sync them
>> when transitioning to idle they cannot happen. Otherwise we had better be
>> active! GEM_BUG_ON(!engine->i915->gt.awake) instead? Does that trigger?!
>
> tasklet_schedule() is called off the main path, without locking, so it is
> unsynchronized with parking. Just because.
I need to understand this - which main path? On submission we will have
called mark_busy. After the last request we will idle the engines and sync
the tasklet.
There is even a GEM_BUG_ON(!engine->i915->gt.awake); in
__execlists_submission_tasklet.
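
For reference, the ordering being described is roughly the following;
park_engine()/park_gt() are illustrative names only, not the driver's actual
parking functions:

static void park_engine(struct intel_engine_cs *engine)
{
        /* The last request has been retired; no new submissions can arrive. */
        tasklet_kill(&engine->execlists.tasklet);
        GEM_BUG_ON(test_bit(TASKLET_STATE_SCHED,
                            &engine->execlists.tasklet.state));
}

static void park_gt(struct drm_i915_private *i915)
{
        struct intel_engine_cs *engine;
        enum intel_engine_id id;

        for_each_engine(engine, i915, id)
                park_engine(engine);

        /* Cleared only after every tasklet has drained. */
        i915->gt.awake = false;
}

If parking always drains the tasklet before gt.awake is cleared, a tasklet
body should never observe !awake, which is exactly what the GEM_BUG_ON in
__execlists_submission_tasklet asserts.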
Regards,
Tvrtko