[PATCH] drm/i915/guc: Fix the fix for reset lock confusion
Nirmoy Das
nirmoy.das at linux.intel.com
Wed Apr 3 09:21:44 UTC 2024
On 3/30/2024 12:53 AM, John.C.Harrison at Intel.com wrote:
> From: John Harrison <John.C.Harrison at Intel.com>
>
> The previous fix for the circlular lock splat about the busyness
> worker wasn't quite complete. Even though the reset-in-progress flag
> is cleared at the start of intel_uc_reset_finish, the entire function
> is still inside the reset mutex lock. Not sure why the patch appeared
> to fix the issue both locally and in CI. However, it is now back
> again.
>
> There is a further complication the wedge code path within
> intel_gt_reset() jumps around so much it results in nested
> reset_prepare/_finish calls. That is, the call sequence is:
> intel_gt_reset
> | reset_prepare
> | __intel_gt_set_wedged
> | | reset_prepare
> | | reset_finish
> | reset_finish
>
> The nested finish means that even if the clear of the in-progress flag
> was moved to the end of _finish, it would still be clear for the
> entire second call. Surprisingly, this does not seem to be causing any
> other problems at present.
>
> As an aside, a wedge on fini does not call the finish functions at
> all. The reset_in_progress flag is left set (twice).
>
> So instead of trying to cancel the worker anywhere at all in the reset
> path, just add a cancel to intel_guc_submission_fini instead. Note
> that it is not a problem if the worker is still active during a reset.
> Either it will run before the reset path starts locking things and
> will simply block the reset code for a tiny amount of time. Or it will
> run after the locks have been acquired and will early exit due to the
> try-lock.
>
> Also, do not use the reset-in-progress flag to decide whether a
> synchronous cancel is safe (from a lockdep perspective) or not.
> Instead, use the actual reset mutex state (both the genuine one and
> the custom rolled BACKOFF one).
>
> Fixes: 0e00a8814eec ("drm/i915/guc: Avoid circular locking issue on busyness flush")
> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
> Cc: Zhanjun Dong <zhanjun.dong at intel.com>
> Cc: John Harrison <John.C.Harrison at Intel.com>
> Cc: Andi Shyti <andi.shyti at linux.intel.com>
> Cc: Daniel Vetter <daniel at ffwll.ch>
> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> Cc: Nirmoy Das <nirmoy.das at intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
> Cc: Andrzej Hajda <andrzej.hajda at intel.com>
> Cc: Matt Roper <matthew.d.roper at intel.com>
> Cc: Jonathan Cavitt <jonathan.cavitt at intel.com>
> Cc: Prathap Kumar Valsan <prathap.kumar.valsan at intel.com>
> Cc: Alan Previn <alan.previn.teres.alexis at intel.com>
> Cc: Madhumitha Tolakanahalli Pradeep <madhumitha.tolakanahalli.pradeep at intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
> Cc: Ashutosh Dixit <ashutosh.dixit at intel.com>
> Cc: Dnyaneshwar Bhadane <dnyaneshwar.bhadane at intel.com>
Thanks for the details, looks good to me:
Reviewed-by: Nirmoy Das <nirmoy.das at intel.com>
> ---
> .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 ++++++++-----------
> drivers/gpu/drm/i915/gt/uc/intel_uc.c | 4 ++++
> 2 files changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 16640d6dd0589..00757d6333e88 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1403,14 +1403,17 @@ static void guc_cancel_busyness_worker(struct intel_guc *guc)
> * Trying to pass a 'need_sync' or 'in_reset' flag all the way down through
> * every possible call stack is unfeasible. It would be too intrusive to many
> * areas that really don't care about the GuC backend. However, there is the
> - * 'reset_in_progress' flag available, so just use that.
> + * I915_RESET_BACKOFF flag and the gt->reset.mutex can be tested for is_locked.
> + * So just use those. Note that testing both is required due to the hideously
> + * complex nature of the i915 driver's reset code paths.
> *
> * And note that in the case of a reset occurring during driver unload
> - * (wedge_on_fini), skipping the cancel in _prepare (when the reset flag is set
> - * is fine because there is another cancel in _finish (when the reset flag is
> - * not).
> + * (wedged_on_fini), skipping the cancel in reset_prepare/reset_fini (when the
> + * reset flag/mutex are set) is fine because there is another explicit cancel in
> + * intel_guc_submission_fini (when the reset flag/mutex are not).
> */
> - if (guc_to_gt(guc)->uc.reset_in_progress)
> + if (mutex_is_locked(&guc_to_gt(guc)->reset.mutex) ||
> + test_bit(I915_RESET_BACKOFF, &guc_to_gt(guc)->reset.flags))
> cancel_delayed_work(&guc->timestamp.work);
> else
> cancel_delayed_work_sync(&guc->timestamp.work);
> @@ -1424,8 +1427,6 @@ static void __reset_guc_busyness_stats(struct intel_guc *guc)
> unsigned long flags;
> ktime_t unused;
>
> - guc_cancel_busyness_worker(guc);
> -
> spin_lock_irqsave(&guc->timestamp.lock, flags);
>
> guc_update_pm_timestamp(guc, &unused);
> @@ -2004,13 +2005,6 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
>
> void intel_guc_submission_reset_finish(struct intel_guc *guc)
> {
> - /*
> - * Ensure the busyness worker gets cancelled even on a fatal wedge.
> - * Note that reset_prepare is not allowed to because it confuses lockdep.
> - */
> - if (guc_submission_initialized(guc))
> - guc_cancel_busyness_worker(guc);
> -
> /* Reset called during driver load or during wedge? */
> if (unlikely(!guc_submission_initialized(guc) ||
> !intel_guc_is_fw_running(guc) ||
> @@ -2136,6 +2130,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
> if (!guc->submission_initialized)
> return;
>
> + guc_fini_engine_stats(guc);
> guc_flush_destroyed_contexts(guc);
> guc_lrc_desc_pool_destroy_v69(guc);
> i915_sched_engine_put(guc->sched_engine);
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> index b47051ddf17f2..7a63abf8f644c 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> @@ -633,6 +633,10 @@ void intel_uc_reset_finish(struct intel_uc *uc)
> {
> struct intel_guc *guc = &uc->guc;
>
> + /*
> + * NB: The wedge code path results in prepare -> prepare -> finish -> finish.
> + * So this function is sometimes called with the in-progress flag not set.
> + */
> uc->reset_in_progress = false;
>
> /* Firmware expected to be running when this function is called */
More information about the Intel-gfx
mailing list