[PATCH] drm/i915/guc: Handle race condition where wakeref count drops below 0
Daniele Ceraolo Spurio
daniele.ceraolospurio at intel.com
Thu May 22 00:00:00 UTC 2025
On 5/15/2025 11:37 AM, Jesus Narvaez wrote:
> There is a rare race condition when preparing for a reset where
> guc_lrc_desc_unpin() could be in the process of deregistering a context
> while a different thread is scrubbing outstanding contexts and it alters
> the context state and does a wakeref put. Then, if there is a failure
> with deregister_context(), a second wakeref put could occur. As a result
> the wakeref count could drop below 0 and fail an INTEL_WAKEREF_BUG_ON()
> check.
>
> Therefore if there is a failure with deregister_context(), undo the
> context state changes and do a wakeref put only if the context was set
> to be destroyed earlier.
>
> Fixes: 2f2cc53b5fe7 ("drm/i915/guc: Close deregister-context race against CT-loss")
> Signed-off-by: Jesus Narvaez <jesus.narvaez at intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
> Cc: Alan Previn <alan.previn.teres.alexis at intel.com>
> Cc: Anshuman Gupta <anshuman.gupta at intel.com>
> Cc: Mousumi Jana <mousumi.jana at intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> Cc: Matt Roper <matthew.d.roper at intel.com>
> ---
> .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 16 +++++++++++-----
> 1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index f8cb7c630d5b..f066c2dd7114 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -3441,20 +3441,26 @@ static inline int guc_lrc_desc_unpin(struct intel_context *ce)
>
> /*
> * GuC is active, lets destroy this context, but at this point we can still be racing
> - * with suspend, so we undo everything if the H2G fails in deregister_context so
> - * that GuC reset will find this context during clean up.
> + * with suspend, so we undo everything if the H2G fails in deregister_context
> + * and if the context was scheduled to be destroyed so that GuC reset will
> + * find this context during clean up.
Just adding "and if the context was scheduled to be destroyed" to the
comment is not very clear, because looking at this function it is hard
to consider that the state might change due to the reset code kicking
in. This needs to be expanded a bit to explain the race.
Daniele
> */
> ret = deregister_context(ce, ce->guc_id.id);
> if (ret) {
> + bool pending_destroyed;
> spin_lock_irqsave(&ce->guc_state.lock, flags);
> - set_context_registered(ce);
> - clr_context_destroyed(ce);
> + pending_destroyed = context_destroyed(ce);
> + if (pending_destroyed) {
> + set_context_registered(ce);
> + clr_context_destroyed(ce);
> + }
> spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> /*
> * As gt-pm is awake at function entry, intel_wakeref_put_async merely decrements
> * the wakeref immediately but per function spec usage call this after unlock.
> */
> - intel_wakeref_put_async(>->wakeref);
> + if (pending_destroyed)
> + intel_wakeref_put_async(>->wakeref);
> }
>
> return ret;
More information about the Intel-gfx
mailing list