[Intel-gfx] [PATCH v7] drm/i915/execlists: Direct submission of new requests (avoid tasklet/ksoftirqd)
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Thu Jun 28 13:21:06 UTC 2018
On 28/06/2018 14:11, Chris Wilson wrote:
> Back in commit 27af5eea54d1 ("drm/i915: Move execlists irq handler to a
> bottom half"), we came to the conclusion that running our CSB processing
> and ELSP submission from inside the irq handler was a bad idea. A really
> bad idea as we could impose nearly 1s latency on other users of the
> system, on average! Deferring our work to a tasklet allowed us to do the
> processing with irqs enabled, reducing the impact to an average of about
> 50us.
>
> We have since eradicated the use of forcewaked mmio from inside the CSB
> processing and ELSP submission, bringing the impact down to around 5us
> (on Kabylake); an order of magnitude better than our measurements 2
> years ago on Broadwell and only about 2x worse on average than the
> gem_syslatency on an unladen system.
>
> In this iteration of the tasklet-vs-direct submission debate, we seek a
> compromise whereby we submit new requests immediately to the HW but
> defer processing the CS interrupt onto a tasklet. We gain the advantage
> of low-latency and ksoftirqd avoidance when waking up the HW, while
> avoiding the system-wide starvation of our CS irq-storms.
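(Aside for readers following along: the compromise boils down to something
like the sketch below. This is not code from the patch, just the shape of
the decision with a hypothetical from_irq flag; the real split lives in
__submit_queue_imm() and the irq handler further down.)

    /*
     * Hedged sketch of the submission split: direct paths run the
     * tasklet body themselves under the engine timeline lock, while
     * the CS interrupt keeps deferring to the tasklet so that irq
     * storms cannot starve the rest of the system.
     */
    static void submit(struct intel_engine_cs *engine, bool from_irq)
    {
    	if (from_irq) {
    		tasklet_hi_schedule(&engine->execlists.tasklet);
    	} else {
    		unsigned long flags;

    		spin_lock_irqsave(&engine->timeline.lock, flags);
    		__execlists_submission_tasklet(engine);
    		spin_unlock_irqrestore(&engine->timeline.lock, flags);
    	}
    }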
>
> Comparing the impact on the maximum latency observed (that is the time
> stolen from an RT process) over a 120s interval, repeated several times
> (using gem_syslatency, similar to RT's cyclictest) while the system is
> fully laden with i915 nops, we see that direct submission can actually
> improve the worst case.
>
> Maximum latency in microseconds of a third party RT thread
> (gem_syslatency -t 120 -f 2)
> x Always using tasklets (a couple of >1000us outliers removed)
> + Only using tasklets from CS irq, direct submission of requests
> +------------------------------------------------------------------------+
> | + |
> | + |
> | + |
> | + + |
> | + + + |
> | + + + + x x x |
> | +++ + + + x x x x x x |
> | +++ + ++ + + *x x x x x x |
> | +++ + ++ + * *x x * x x x |
> | + +++ + ++ * * +*xxx * x x xx |
> | * +++ + ++++* *x+**xx+ * x x xxxx x |
> | **x++++*++**+*x*x****x+ * +x xx xxxx x x |
> |x* ******+***************++*+***xxxxxx* xx*x xxx + x+|
> | |__________MA___________| |
> | |______M__A________| |
> +------------------------------------------------------------------------+
> N Min Max Median Avg Stddev
> x 118 91 186 124 125.28814 16.279137
> + 120 92 187 109 112.00833 13.458617
> Difference at 95.0% confidence
> -13.2798 +/- 3.79219
> -10.5994% +/- 3.02677%
> (Student's t, pooled s = 14.9237)
>
> However the mean latency is adversely affected:
>
> Mean latency in microseconds of a third party RT thread
> (gem_syslatency -t 120 -f 1)
> x Always using tasklets
> + Only using tasklets from CS irq, direct submission of requests
> +------------------------------------------------------------------------+
> | xxxxxx + ++ |
> | xxxxxx + ++ |
> | xxxxxx + +++ ++ |
> | xxxxxxx +++++ ++ |
> | xxxxxxx +++++ ++ |
> | xxxxxxx +++++ +++ |
> | xxxxxxx + ++++++++++ |
> | xxxxxxxx ++ ++++++++++ |
> | xxxxxxxx ++ ++++++++++ |
> | xxxxxxxxxx +++++++++++++++ |
> | xxxxxxxxxxx x +++++++++++++++ |
> |x xxxxxxxxxxxxx x + + ++++++++++++++++++ +|
> | |__A__| |
> | |____A___| |
> +------------------------------------------------------------------------+
> N Min Max Median Avg Stddev
> x 120 3.506 3.727 3.631 3.6321417 0.02773109
> + 120 3.834 4.149 4.039 4.0375167 0.041221676
> Difference at 95.0% confidence
> 0.405375 +/- 0.00888913
> 11.1608% +/- 0.244735%
> (Student's t, pooled s = 0.03513)
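(A quick sanity check of the quoted interval, my arithmetic: with n = 120
per sample and a two-sided Student's t at ~238 degrees of freedom, t is
about 1.96, so:

    difference of means = 4.0375167 - 3.6321417 = 0.405375 us
    half-width          = t * s_pooled * sqrt(1/120 + 1/120)
                        = 1.96 * 0.03513 * 0.12910 = 0.00889 us

which matches the 0.405375 +/- 0.00888913 reported above.)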
>
> However, since the mean latency corresponds to the amount of irqsoff
> processing we have to do for a CS interrupt, we only need to speed that
> up to benefit not just system latency but our own throughput.
>
> v2: Remember to defer submissions when under reset.
> v4: Only use direct submission for new requests
> v5: Be aware that with mixing direct tasklet evaluation and deferred
> tasklets, we may end up idling before running the deferred tasklet.
> v6: Remove the redundant likely() from tasklet_is_enabled(), restrict the
> annotation to reset_in_progress().
> v7: Take the full timeline.lock when enabling perf_pmu stats as the
> tasklet is no longer a valid guard. A consequence is that the stats are
> now only valid for engines also using the timeline.lock to process
> state.
>
> Testcase: igt/gem_exec_latency/*rthog*
> References: 27af5eea54d1 ("drm/i915: Move execlists irq handler to a bottom half")
> Suggested-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> ---
> drivers/gpu/drm/i915/i915_gem.h | 5 ++
> drivers/gpu/drm/i915/intel_engine_cs.c | 8 +-
> drivers/gpu/drm/i915/intel_lrc.c | 102 ++++++++++++++++---------
> 3 files changed, 75 insertions(+), 40 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
> index 261da577829a..e46592956872 100644
> --- a/drivers/gpu/drm/i915/i915_gem.h
> +++ b/drivers/gpu/drm/i915/i915_gem.h
> @@ -88,4 +88,9 @@ static inline void __tasklet_enable_sync_once(struct tasklet_struct *t)
> tasklet_kill(t);
> }
>
> +static inline bool __tasklet_is_enabled(const struct tasklet_struct *t)
> +{
> + return !atomic_read(&t->count);
> +}
> +
> #endif /* __I915_GEM_H__ */
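(For anyone wondering why !atomic_read(&t->count) means "enabled":
tasklet_disable() bumps t->count and tasklet_enable() drops it, so a zero
count means nobody is holding the tasklet disabled. Paraphrasing the core
helpers from include/linux/interrupt.h, not code added by this patch:)

    static inline void tasklet_disable_nosync(struct tasklet_struct *t)
    {
    	atomic_inc(&t->count);	/* non-zero blocks tasklet_action() */
    	smp_mb__after_atomic();
    }

    static inline void tasklet_enable(struct tasklet_struct *t)
    {
    	smp_mb__before_atomic();
    	atomic_dec(&t->count);	/* back to zero => runnable again */
    }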
> diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> index ace93958689e..7fc97bc5362d 100644
> --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> @@ -1619,8 +1619,8 @@ int intel_enable_engine_stats(struct intel_engine_cs *engine)
> if (!intel_engine_supports_stats(engine))
> return -ENODEV;
>
> - tasklet_disable(&execlists->tasklet);
> - write_seqlock_irqsave(&engine->stats.lock, flags);
> + spin_lock_irqsave(&engine->timeline.lock, flags);
> +	write_seqlock(&engine->stats.lock);
>
> if (unlikely(engine->stats.enabled == ~0)) {
> err = -EBUSY;
> @@ -1644,8 +1644,8 @@ int intel_enable_engine_stats(struct intel_engine_cs *engine)
> }
>
> unlock:
> - write_sequnlock_irqrestore(&engine->stats.lock, flags);
> - tasklet_enable(&execlists->tasklet);
> + write_sequnlock(&engine->stats.lock);
> +	spin_unlock_irqrestore(&engine->timeline.lock, flags);
>
> return err;
> }
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 7f8b29684d9d..ab89dabc2965 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -563,12 +563,14 @@ static void complete_preempt_context(struct intel_engine_execlists *execlists)
> GEM_BUG_ON(!execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT));
>
> execlists_cancel_port_requests(execlists);
> - execlists_unwind_incomplete_requests(execlists);
> + __unwind_incomplete_requests(container_of(execlists,
> + struct intel_engine_cs,
> + execlists));
>
> execlists_clear_active(execlists, EXECLISTS_ACTIVE_PREEMPT);
> }
>
> -static void __execlists_dequeue(struct intel_engine_cs *engine)
> +static void execlists_dequeue(struct intel_engine_cs *engine)
> {
> struct intel_engine_execlists * const execlists = &engine->execlists;
> struct execlist_port *port = execlists->port;
> @@ -578,9 +580,8 @@ static void __execlists_dequeue(struct intel_engine_cs *engine)
> struct rb_node *rb;
> bool submit = false;
>
> - lockdep_assert_held(&engine->timeline.lock);
> -
> - /* Hardware submission is through 2 ports. Conceptually each port
> + /*
> + * Hardware submission is through 2 ports. Conceptually each port
> * has a (RING_START, RING_HEAD, RING_TAIL) tuple. RING_START is
> * static for a context, and unique to each, so we only execute
> * requests belonging to a single context from each ring. RING_HEAD
> @@ -770,15 +771,6 @@ static void __execlists_dequeue(struct intel_engine_cs *engine)
> !port_isset(engine->execlists.port));
> }
>
> -static void execlists_dequeue(struct intel_engine_cs *engine)
> -{
> - unsigned long flags;
> -
> - spin_lock_irqsave(&engine->timeline.lock, flags);
> - __execlists_dequeue(engine);
> - spin_unlock_irqrestore(&engine->timeline.lock, flags);
> -}
> -
> void
> execlists_cancel_port_requests(struct intel_engine_execlists * const execlists)
> {
> @@ -957,6 +949,12 @@ static void execlists_cancel_requests(struct intel_engine_cs *engine)
> spin_unlock_irqrestore(&engine->timeline.lock, flags);
> }
>
> +static inline bool
> +reset_in_progress(const struct intel_engine_execlists *execlists)
> +{
> + return unlikely(!__tasklet_is_enabled(&execlists->tasklet));
> +}
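(This reads as: "tasklet disabled" doubles as the reset marker, since
execlists_reset_prepare() below takes the tasklet down with
__tasklet_disable_sync_once() before touching the engine. A hedged sketch
of the pairing, with a hypothetical submit_allowed() helper:)

    /*
     * Why a disabled tasklet can stand in for "reset in progress":
     * execlists_reset_prepare() raises the tasklet disable count, so
     * any submitter seeing a non-zero count knows a reset owns the
     * engine and must defer to the post-reset restart.
     */
    static bool submit_allowed(const struct intel_engine_execlists *execlists)
    {
    	return __tasklet_is_enabled(&execlists->tasklet); /* count == 0 */
    }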
> +
> static void process_csb(struct intel_engine_cs *engine)
> {
> struct intel_engine_execlists * const execlists = &engine->execlists;
> @@ -1108,18 +1106,9 @@ static void process_csb(struct intel_engine_cs *engine)
> execlists->csb_head = head;
> }
>
> -/*
> - * Check the unread Context Status Buffers and manage the submission of new
> - * contexts to the ELSP accordingly.
> - */
> -static void execlists_submission_tasklet(unsigned long data)
> +static void __execlists_submission_tasklet(struct intel_engine_cs *const engine)
> {
> - struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
> -
> - GEM_TRACE("%s awake?=%d, active=%x\n",
> - engine->name,
> - engine->i915->gt.awake,
> - engine->execlists.active);
> + lockdep_assert_held(&engine->timeline.lock);
>
> /*
> * We can skip acquiring intel_runtime_pm_get() here as it was taken
> @@ -1136,6 +1125,28 @@ static void execlists_submission_tasklet(unsigned long data)
> execlists_dequeue(engine);
> }
>
> +/*
> + * Check the unread Context Status Buffers and manage the submission of new
> + * contexts to the ELSP accordingly.
> + */
> +static void execlists_submission_tasklet(unsigned long data)
> +{
> + struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
> + unsigned long flags;
> +
> + GEM_TRACE("%s awake?=%d, active=%x\n",
> + engine->name,
> + engine->i915->gt.awake,
> + engine->execlists.active);
> +
> + spin_lock_irqsave(&engine->timeline.lock, flags);
> +
> + if (engine->i915->gt.awake) /* we may be delayed until after we idle! */
Are there no races between the check and the tasklet call? In other words,
is the code path you described as racy now taking the timeline lock?
> + __execlists_submission_tasklet(engine);
> +
> + spin_unlock_irqrestore(&engine->timeline.lock, flags);
> +}
> +
> static void queue_request(struct intel_engine_cs *engine,
> struct i915_sched_node *node,
> int prio)
> @@ -1144,16 +1155,30 @@ static void queue_request(struct intel_engine_cs *engine,
> &lookup_priolist(engine, prio)->requests);
> }
>
> -static void __submit_queue(struct intel_engine_cs *engine, int prio)
> +static void __update_queue(struct intel_engine_cs *engine, int prio)
> {
> engine->execlists.queue_priority = prio;
> - tasklet_hi_schedule(&engine->execlists.tasklet);
> +}
> +
> +static void __submit_queue_imm(struct intel_engine_cs *engine)
> +{
> + struct intel_engine_execlists * const execlists = &engine->execlists;
> +
> + if (reset_in_progress(execlists))
> + return; /* defer until we restart the engine following reset */
Do we have a tasklet kick somewhere in that path?
> +
> + if (execlists->tasklet.func == execlists_submission_tasklet)
> + __execlists_submission_tasklet(engine);
> + else
> + tasklet_hi_schedule(&execlists->tasklet);
> }
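(My reading of the func comparison above: it keeps the GuC backend on the
tasklet path, since the GuC installs a different tasklet func
(guc_submission_tasklet, iirc), so only the execlists backend takes the
direct call. A hedged paraphrase, with a hypothetical dispatch() helper:)

    static void dispatch(struct intel_engine_cs *engine)
    {
    	struct intel_engine_execlists * const execlists = &engine->execlists;

    	/* only the execlists backend may bypass its own tasklet */
    	if (execlists->tasklet.func == execlists_submission_tasklet)
    		__execlists_submission_tasklet(engine); /* under timeline.lock */
    	else
    		tasklet_hi_schedule(&execlists->tasklet); /* e.g. GuC */
    }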
>
> static void submit_queue(struct intel_engine_cs *engine, int prio)
> {
> - if (prio > engine->execlists.queue_priority)
> - __submit_queue(engine, prio);
> + if (prio > engine->execlists.queue_priority) {
> + __update_queue(engine, prio);
> + __submit_queue_imm(engine);
> + }
> }
>
> static void execlists_submit_request(struct i915_request *request)
> @@ -1165,11 +1190,12 @@ static void execlists_submit_request(struct i915_request *request)
> spin_lock_irqsave(&engine->timeline.lock, flags);
>
> queue_request(engine, &request->sched, rq_prio(request));
> - submit_queue(engine, rq_prio(request));
>
> GEM_BUG_ON(!engine->execlists.first);
> GEM_BUG_ON(list_empty(&request->sched.link));
>
> + submit_queue(engine, rq_prio(request));
> +
> spin_unlock_irqrestore(&engine->timeline.lock, flags);
> }
>
> @@ -1296,8 +1322,11 @@ static void execlists_schedule(struct i915_request *request,
> }
>
> if (prio > engine->execlists.queue_priority &&
> - i915_sw_fence_done(&sched_to_request(node)->submit))
> - __submit_queue(engine, prio);
> + i915_sw_fence_done(&sched_to_request(node)->submit)) {
> + /* defer submission until after all of our updates */
> + __update_queue(engine, prio);
> + tasklet_hi_schedule(&engine->execlists.tasklet);
The _imm suffix on __submit_queue_imm sounds redundant when there is no
plain __submit_queue, which could have been used here. But meh.
> + }
> }
>
> spin_unlock_irq(&engine->timeline.lock);
> @@ -1878,6 +1907,7 @@ execlists_reset_prepare(struct intel_engine_cs *engine)
> {
> struct intel_engine_execlists * const execlists = &engine->execlists;
> struct i915_request *request, *active;
> + unsigned long flags;
>
> GEM_TRACE("%s\n", engine->name);
>
> @@ -1892,6 +1922,8 @@ execlists_reset_prepare(struct intel_engine_cs *engine)
> */
> __tasklet_disable_sync_once(&execlists->tasklet);
>
> + spin_lock_irqsave(&engine->timeline.lock, flags);
> +
> /*
> * We want to flush the pending context switches, having disabled
> * the tasklet above, we can assume exclusive access to the execlists.
> @@ -1909,15 +1941,12 @@ execlists_reset_prepare(struct intel_engine_cs *engine)
> active = NULL;
> request = port_request(execlists->port);
> if (request) {
> - unsigned long flags;
> -
> /*
> * Prevent the breadcrumb from advancing before we decide
> * which request is currently active.
> */
> intel_engine_stop_cs(engine);
>
> - spin_lock_irqsave(&engine->timeline.lock, flags);
> list_for_each_entry_from_reverse(request,
> &engine->timeline.requests,
> link) {
> @@ -1927,9 +1956,10 @@ execlists_reset_prepare(struct intel_engine_cs *engine)
>
> active = request;
> }
> - spin_unlock_irqrestore(&engine->timeline.lock, flags);
> }
>
> + spin_unlock_irqrestore(&engine->timeline.lock, flags);
> +
> return active;
> }
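(Widening the lock to cover the whole flush looks consistent with the rest
of the patch: process_csb() and the dequeue now both run under
engine->timeline.lock, so the reset path has to hold it for the full
section rather than just around the request list walk.)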
>
>
Much easier to read and understand now that it is smaller, thanks!
Just a couple of questions above and no other complaints.
Regards,
Tvrtko