[Intel-gfx] [PATCH] drm/i915/gt: Restrict forced preemption to the active context
Andi Shyti
andi.shyti at linux.intel.com
Wed Sep 21 15:43:10 UTC 2022
Hi Andrzej and Chris,
On Wed, Sep 21, 2022 at 03:52:58PM +0200, Andrzej Hajda wrote:
> From: Chris Wilson <chris.p.wilson at intel.com>
>
> When we submit a new pair of contexts to ELSP for execution, we start a
> timer by which point we expect the HW to have switched execution to the
> pending contexts. If the promotion to the new pair of contexts has not
> occurred, we declare the executing context to have hung and force the
> preemption to take place by resetting the engine and resubmitting the
> new contexts.
>
> This can lead to an unfair situation where almost all of the preemption
> timeout is consumed by the first context which just switches into the
> second context immediately prior to the timer firing and triggering the
> preemption reset (assuming that the timer interrupts before we process
> the CS events for the context switch). The second context hasn't yet had
> a chance to yield to the incoming ELSP (and send the ACk for the
> promotion) and so ends up being blamed for the reset.
>
> If we see that a context switch has occurred since setting the
> preemption timeout, but have not yet received the ACK for the ELSP
> promotion, rearm the preemption timer and check again. This is
> especially significant if the first context was not schedulable and so
> we used the shortest timer possible, greatly increasing the chance of
> accidentally blaming the second innocent context.
>
> Fixes: 3a7a92aba8fb ("drm/i915/execlists: Force preemption")
> Fixes: d12acee84ffb ("drm/i915/execlists: Cancel banned contexts on schedule-out")
> Reported-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Cc: Andi Shyti <andi.shyti at linux.intel.com>
> Reviewed-by: Andrzej Hajda <andrzej.hajda at intel.com>
> Tested-by: Andrzej Hajda <andrzej.hajda at intel.com>
> Cc: <stable at vger.kernel.org> # v5.5+
> ---
> Hi,
>
> This patch is upstreamed from internal branch. So I have removed
> R-B by Andi. Andi let me know if your R-B still apply.
yes, I know this patch and my r-b holds:
Reviewed-by: Andi Shyti <andi.shyti at linux.intel.com>
Anyway, thanks Chris for the comments and the clear explanation
both in the commit log and in between the code.
Andi
> Regards
> Andrzej
> ---
> drivers/gpu/drm/i915/gt/intel_engine_types.h | 15 +++++++++++++
> .../drm/i915/gt/intel_execlists_submission.c | 21 ++++++++++++++++++-
> 2 files changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index 633a7e5dba3b4b..6b5d4ea22b673b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -165,6 +165,21 @@ struct intel_engine_execlists {
> */
> struct timer_list preempt;
>
> + /**
> + * @preempt_target: active request at the time of the preemption request
> + *
> + * We force a preemption to occur if the pending contexts have not
> + * been promoted to active upon receipt of the CS ack event within
> + * the timeout. This timeout maybe chosen based on the target,
> + * using a very short timeout if the context is no longer schedulable.
> + * That short timeout may not be applicable to other contexts, so
> + * if a context switch should happen within before the preemption
> + * timeout, we may shoot early at an innocent context. To prevent this,
> + * we record which context was active at the time of the preemption
> + * request and only reset that context upon the timeout.
> + */
> + const struct i915_request *preempt_target;
> +
> /**
> * @ccid: identifier for contexts submitted to this engine
> */
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index 4b909cb88cdfb7..c718e6dc40b515 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -1241,6 +1241,9 @@ static unsigned long active_preempt_timeout(struct intel_engine_cs *engine,
> if (!rq)
> return 0;
>
> + /* Only allow ourselves to force reset the currently active context */
> + engine->execlists.preempt_target = rq;
> +
> /* Force a fast reset for terminated contexts (ignoring sysfs!) */
> if (unlikely(intel_context_is_banned(rq->context) || bad_request(rq)))
> return INTEL_CONTEXT_BANNED_PREEMPT_TIMEOUT_MS;
> @@ -2427,8 +2430,24 @@ static void execlists_submission_tasklet(struct tasklet_struct *t)
> GEM_BUG_ON(inactive - post > ARRAY_SIZE(post));
>
> if (unlikely(preempt_timeout(engine))) {
> + const struct i915_request *rq = *engine->execlists.active;
> +
> + /*
> + * If after the preempt-timeout expired, we are still on the
> + * same active request/context as before we initiated the
> + * preemption, reset the engine.
> + *
> + * However, if we have processed a CS event to switch contexts,
> + * but not yet processed the CS event for the pending
> + * preemption, reset the timer allowing the new context to
> + * gracefully exit.
> + */
> cancel_timer(&engine->execlists.preempt);
> - engine->execlists.error_interrupt |= ERROR_PREEMPT;
> + if (rq == engine->execlists.preempt_target)
> + engine->execlists.error_interrupt |= ERROR_PREEMPT;
> + else
> + set_timer_ms(&engine->execlists.preempt,
> + active_preempt_timeout(engine, rq));
> }
>
> if (unlikely(READ_ONCE(engine->execlists.error_interrupt))) {
> --
> 2.34.1
More information about the Intel-gfx
mailing list