[PATCH] drm/i915/guc: Fix the fix for reset lock confusion

Nirmoy Das nirmoy.das at linux.intel.com
Wed Apr 3 09:21:44 UTC 2024


On 3/30/2024 12:53 AM, John.C.Harrison at Intel.com wrote:
> From: John Harrison <John.C.Harrison at Intel.com>
>
> The previous fix for the circular lock splat about the busyness
> worker wasn't quite complete. Even though the reset-in-progress flag
> is cleared at the start of intel_uc_reset_finish, the entire function
> is still inside the reset mutex lock. Not sure why the patch appeared
> to fix the issue both locally and in CI. However, it is now back
> again.
>
> There is a further complication: the wedge code path within
> intel_gt_reset() jumps around so much it results in nested
> reset_prepare/_finish calls. That is, the call sequence is:
>    intel_gt_reset
>    | reset_prepare
>    | __intel_gt_set_wedged
>    | | reset_prepare
>    | | reset_finish
>    | reset_finish
>
> The nested finish means that even if the clear of the in-progress flag
> was moved to the end of _finish, it would still be clear for the
> entire second call. Surprisingly, this does not seem to be causing any
> other problems at present.
>
> As an aside, a wedge on fini does not call the finish functions at
> all. The reset_in_progress flag is left set (twice).
>
> So instead of trying to cancel the worker anywhere at all in the reset
> path, just add a cancel to intel_guc_submission_fini instead. Note
> that it is not a problem if the worker is still active during a reset.
> Either it will run before the reset path starts locking things and
> simply block the reset code for a tiny amount of time, or it will
> run after the locks have been acquired and exit early due to the
> try-lock.
>
> Also, do not use the reset-in-progress flag to decide whether a
> synchronous cancel is safe (from a lockdep perspective) or not.
> Instead, use the actual reset mutex state (both the genuine one and
> the custom rolled BACKOFF one).
>
> Fixes: 0e00a8814eec ("drm/i915/guc: Avoid circular locking issue on busyness flush")
> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
> Cc: Zhanjun Dong <zhanjun.dong at intel.com>
> Cc: John Harrison <John.C.Harrison at Intel.com>
> Cc: Andi Shyti <andi.shyti at linux.intel.com>
> Cc: Daniel Vetter <daniel at ffwll.ch>
> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> Cc: Nirmoy Das <nirmoy.das at intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
> Cc: Andrzej Hajda <andrzej.hajda at intel.com>
> Cc: Matt Roper <matthew.d.roper at intel.com>
> Cc: Jonathan Cavitt <jonathan.cavitt at intel.com>
> Cc: Prathap Kumar Valsan <prathap.kumar.valsan at intel.com>
> Cc: Alan Previn <alan.previn.teres.alexis at intel.com>
> Cc: Madhumitha Tolakanahalli Pradeep <madhumitha.tolakanahalli.pradeep at intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
> Cc: Ashutosh Dixit <ashutosh.dixit at intel.com>
> Cc: Dnyaneshwar Bhadane <dnyaneshwar.bhadane at intel.com>
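
For anyone following along, the nested prepare/finish sequence described
in the commit message can be modelled with a plain boolean. This is only
a minimal standalone sketch, not driver code: the function and variable
names below are hypothetical stand-ins for intel_gt_reset,
__intel_gt_set_wedged and uc->reset_in_progress. It shows why the outer
finish call is entered with the flag already clear.

```c
#include <stdbool.h>

/* Hypothetical stand-in for uc->reset_in_progress. */
static bool reset_in_progress;

/* Counts finish calls that were entered with the flag already clear. */
static int finish_calls_with_flag_clear;

static void reset_prepare(void)
{
	reset_in_progress = true;
}

static void reset_finish(void)
{
	if (!reset_in_progress)
		finish_calls_with_flag_clear++;

	/* The flag is a bool, not a counter, so nesting is not tracked. */
	reset_in_progress = false;
}

/* Models the wedge path, which runs its own prepare/finish pair. */
static void set_wedged(void)
{
	reset_prepare();
	reset_finish();
}

/* Models the outer reset: prepare -> prepare -> finish -> finish. */
void gt_reset(void)
{
	reset_prepare();
	set_wedged();
	reset_finish();
}
```

After one call to gt_reset() in this model, the counter is 1: the inner
finish cleared the flag, so the outer finish sees it unset.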

Thanks for the details, looks good to me:

Reviewed-by: Nirmoy Das <nirmoy.das at intel.com>
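
As a reading aid, the new decision in guc_cancel_busyness_worker() boils
down to the following. This is a hedged userspace sketch, not the kernel
code: struct reset_state, RESET_BACKOFF_BIT and pick_cancel_mode() are
made-up names, and the bool merely stands in for what
mutex_is_locked(&gt->reset.mutex) would report.

```c
#include <stdbool.h>

/* Hypothetical bit index standing in for I915_RESET_BACKOFF. */
#define RESET_BACKOFF_BIT 0

enum cancel_mode {
	CANCEL_NO_SYNC,	/* like cancel_delayed_work() */
	CANCEL_SYNC,	/* like cancel_delayed_work_sync() */
};

/* Hypothetical model of the state the patch inspects. */
struct reset_state {
	bool mutex_held;	/* stands in for mutex_is_locked() */
	unsigned long flags;	/* stands in for gt->reset.flags */
};

/*
 * A synchronous cancel waits for the worker to finish, and the worker
 * may want the reset lock. So only wait when neither form of the reset
 * lock (the real mutex or the BACKOFF bit) is currently held.
 */
enum cancel_mode pick_cancel_mode(const struct reset_state *rs)
{
	bool backoff = (rs->flags & (1UL << RESET_BACKOFF_BIT)) != 0;

	if (rs->mutex_held || backoff)
		return CANCEL_NO_SYNC;

	return CANCEL_SYNC;
}
```

In this model, either indicator on its own is enough to force the
non-synchronous cancel, which matches the "testing both is required"
note in the updated comment.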

> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 ++++++++-----------
>   drivers/gpu/drm/i915/gt/uc/intel_uc.c         |  4 ++++
>   2 files changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 16640d6dd0589..00757d6333e88 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1403,14 +1403,17 @@ static void guc_cancel_busyness_worker(struct intel_guc *guc)
>   	 * Trying to pass a 'need_sync' or 'in_reset' flag all the way down through
>   	 * every possible call stack is unfeasible. It would be too intrusive to many
>   	 * areas that really don't care about the GuC backend. However, there is the
> -	 * 'reset_in_progress' flag available, so just use that.
> +	 * I915_RESET_BACKOFF flag and the gt->reset.mutex can be tested for is_locked.
> +	 * So just use those. Note that testing both is required due to the hideously
> +	 * complex nature of the i915 driver's reset code paths.
>   	 *
>   	 * And note that in the case of a reset occurring during driver unload
> -	 * (wedge_on_fini), skipping the cancel in _prepare (when the reset flag is set
> -	 * is fine because there is another cancel in _finish (when the reset flag is
> -	 * not).
> +	 * (wedged_on_fini), skipping the cancel in reset_prepare/reset_fini (when the
> +	 * reset flag/mutex are set) is fine because there is another explicit cancel in
> +	 * intel_guc_submission_fini (when the reset flag/mutex are not).
>   	 */
> -	if (guc_to_gt(guc)->uc.reset_in_progress)
> +	if (mutex_is_locked(&guc_to_gt(guc)->reset.mutex) ||
> +	    test_bit(I915_RESET_BACKOFF, &guc_to_gt(guc)->reset.flags))
>   		cancel_delayed_work(&guc->timestamp.work);
>   	else
>   		cancel_delayed_work_sync(&guc->timestamp.work);
> @@ -1424,8 +1427,6 @@ static void __reset_guc_busyness_stats(struct intel_guc *guc)
>   	unsigned long flags;
>   	ktime_t unused;
>   
> -	guc_cancel_busyness_worker(guc);
> -
>   	spin_lock_irqsave(&guc->timestamp.lock, flags);
>   
>   	guc_update_pm_timestamp(guc, &unused);
> @@ -2004,13 +2005,6 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
>   
>   void intel_guc_submission_reset_finish(struct intel_guc *guc)
>   {
> -	/*
> -	 * Ensure the busyness worker gets cancelled even on a fatal wedge.
> -	 * Note that reset_prepare is not allowed to because it confuses lockdep.
> -	 */
> -	if (guc_submission_initialized(guc))
> -		guc_cancel_busyness_worker(guc);
> -
>   	/* Reset called during driver load or during wedge? */
>   	if (unlikely(!guc_submission_initialized(guc) ||
>   		     !intel_guc_is_fw_running(guc) ||
> @@ -2136,6 +2130,7 @@ void intel_guc_submission_fini(struct intel_guc *guc)
>   	if (!guc->submission_initialized)
>   		return;
>   
> +	guc_fini_engine_stats(guc);
>   	guc_flush_destroyed_contexts(guc);
>   	guc_lrc_desc_pool_destroy_v69(guc);
>   	i915_sched_engine_put(guc->sched_engine);
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> index b47051ddf17f2..7a63abf8f644c 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> @@ -633,6 +633,10 @@ void intel_uc_reset_finish(struct intel_uc *uc)
>   {
>   	struct intel_guc *guc = &uc->guc;
>   
> +	/*
> +	 * NB: The wedge code path results in prepare -> prepare -> finish -> finish.
> +	 * So this function is sometimes called with the in-progress flag not set.
> +	 */
>   	uc->reset_in_progress = false;
>   
>   	/* Firmware expected to be running when this function is called */

