[PATCH v2] drm/i915/guc: Handle race condition where wakeref count drops below 0

Daniele Ceraolo Spurio daniele.ceraolospurio at intel.com
Wed May 28 20:35:12 UTC 2025



On 5/23/2025 4:52 PM, Jesus Narvaez wrote:
> There is a rare race condition when preparing for a reset where
> guc_lrc_desc_unpin() could be in the process of deregistering a context
> while a different thread is scrubbing outstanding contexts and it alters
> the context state and does a wakeref put. Then, if there is a failure
> with deregister_context(), a second wakeref put could occur. As a result
> the wakeref count could drop below 0 and fail an INTEL_WAKEREF_BUG_ON()
> check.
>
> Therefore if there is a failure with deregister_context(), undo the
> context state changes and do a wakeref put only if the context was set
> to be destroyed earlier.
>
> v2: Expand comment to better explain change. (Daniele)
>
> Fixes: 2f2cc53b5fe7 ("drm/i915/guc: Close deregister-context race against CT-loss")
> Signed-off-by: Jesus Narvaez <jesus.narvaez at intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
> Cc: Alan Previn <alan.previn.teres.alexis at intel.com>
> Cc: Anshuman Gupta <anshuman.gupta at intel.com>
> Cc: Mousumi Jana <mousumi.jana at intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> Cc: Matt Roper <matthew.d.roper at intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 22 ++++++++++++++-----
>   1 file changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index f8cb7c630d5b..a20a9969c4e7 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -3441,20 +3441,32 @@ static inline int guc_lrc_desc_unpin(struct intel_context *ce)
>   
>   	/*
>   	 * GuC is active, lets destroy this context, but at this point we can still be racing
> -	 * with suspend, so we undo everything if the H2G fails in deregister_context so
> -	 * that GuC reset will find this context during clean up.
> +	 * with suspend, so we undo everything if the H2G fails in deregister_context
> +	 * and if the context was scheduled to be destroyed so that GuC reset will
> +	 * find this context during clean up.
> +	 *
> +	 * There is a race condition where the reset code could have altered
> +	 * this context's state and done a wakeref put before we try to
> +	 * deregister it here. So check if the context is still set to be
> +	 * destroyed before undoing earlier changes, to avoid two wakeref puts
> +	 * on the same context.

The new chunk is good, but you've kept the modification to the previous 
comment which is still confusing. I suggest leaving the original comment 
as-is and just have the second chunk added. With that:

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>

Daniele

>   	 */
>   	ret = deregister_context(ce, ce->guc_id.id);
>   	if (ret) {
> +		bool pending_destroyed;
>   		spin_lock_irqsave(&ce->guc_state.lock, flags);
> -		set_context_registered(ce);
> -		clr_context_destroyed(ce);
> +		pending_destroyed = context_destroyed(ce);
> +		if (pending_destroyed) {
> +			set_context_registered(ce);
> +			clr_context_destroyed(ce);
> +		}
>   		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>   		/*
>   		 * As gt-pm is awake at function entry, intel_wakeref_put_async merely decrements
>   		 * the wakeref immediately but per function spec usage call this after unlock.
>   		 */
> -		intel_wakeref_put_async(&gt->wakeref);
> +		if (pending_destroyed)
> +			intel_wakeref_put_async(&gt->wakeref);
>   	}
>   
>   	return ret;



More information about the Intel-gfx mailing list