[PATCH 1/9] drm/i915/guc: Fix blocked context accounting

Daniel Vetter daniel at ffwll.ch
Wed Aug 11 10:30:52 UTC 2021


On Wed, Aug 11, 2021 at 01:16:14AM +0000, Matthew Brost wrote:
> Prior to this patch the blocked context counter was cleared on
> init_sched_state (used during registering a context & resets), which is
> incorrect. This state needs to be persistent or the counter can read an
> incorrect value, resulting in scheduling never getting enabled again.
> 
> Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")

Tiny bikeshed on that commit, but for SCHED_STATE_BLOCKED_MASK you want
GENMASK. Also SCHED_STATE_BLOCKED is usually called
SCHED_STATE_BLOCKED_BIAS or similar if you put a counter into that field.
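
Something like this is what I mean, with made-up shift/width values since
I didn't recheck the actual field layout in the patch:

	#define SCHED_STATE_BLOCKED_SHIFT	4	/* made-up value */
	#define SCHED_STATE_BLOCKED_BIAS	BIT(SCHED_STATE_BLOCKED_SHIFT)
	#define SCHED_STATE_BLOCKED_MASK	GENMASK(31, SCHED_STATE_BLOCKED_SHIFT)

	/* bumping the packed counter then reads as */
	ce->guc_state.sched_state += SCHED_STATE_BLOCKED_BIAS;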

But also ... we can't afford a separate counter? Is all the bitshifting
actually worth the space savings? With a separate counter your bugfix
below would look a lot more reasonable too.
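
I.e. a sketch of what I mean, with invented struct/field names:

	struct intel_context_guc_state {
		spinlock_t lock;
		u32 sched_state;
		/* how many times the context has been blocked, protected by @lock */
		u32 blocked;
	};

Then init_sched_state() wouldn't have to carefully preserve any bits at
all, it would just leave the separate counter alone.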


> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> Cc: <stable at vger.kernel.org>
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 87d8dc8f51b9..69faa39da178 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -152,7 +152,7 @@ static inline void init_sched_state(struct intel_context *ce)
>  {
>  	/* Only should be called from guc_lrc_desc_pin() */
>  	atomic_set(&ce->guc_sched_state_no_lock, 0);

atomic_t without barriers or anything like that. tsk, tsk.

Do we really need this?

Also, I looked through the call chain of init_sched_state and I couldn't
find any locking, nor any description of ownership rules that would explain
why there's no locking.

E.g. scrub_guc_desc_for_outstanding_g2h has an xa_for_each with no
protection. No idea how that one works. I also couldn't figure out how
anything else in there is protected (no spinlocks to be seen anywhere at
least).
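
For comparison, the pattern I'd have expected for a walk like that is to
hold the xarray's own lock (I'm guessing at the xarray and variable names
here):

	unsigned long index, flags;
	struct intel_context *ce;

	xa_lock_irqsave(&guc->context_lookup, flags);
	xa_for_each(&guc->context_lookup, index, ce) {
		/* entries can't be added/removed while we hold the xa_lock */
	}
	xa_unlock_irqrestore(&guc->context_lookup, flags);

Or at least a comment explaining why lockless iteration is fine there.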

And then there's stuff like this:

		/* Flush context */
		spin_lock_irqsave(&ce->guc_state.lock, flags);
		spin_unlock_irqrestore(&ce->guc_state.lock, flags);

This pattern seems very common, and it freaks me out. As far as I can tell
an empty lock/unlock pair like this only acts as a one-shot barrier against
critical sections that already hold the lock, and reasoning about whether
that is actually sufficient is really hard.

Finally, none of the locking or consistency rules are explained in the
kerneldoc (or even in comments) on the relevant data structures, which is
not great.
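
Even just something along these lines in the kerneldoc of the struct would
help a lot (made-up example, not the actual fields):

	/**
	 * struct guc_ctx_state - GuC specific per-context state
	 * @lock: protects all fields in this structure, may be taken from
	 *	hardirq context
	 * @sched_state: scheduling state bits plus the blocked counter, only
	 *	read or written while holding @lock
	 */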

> -	ce->guc_state.sched_state = 0;
> +	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;

The patch itself matches the commit message and makes sense. But like I
said, I think it would be cleaner with just a separate counter.

Reviewed-by: Daniel Vetter <daniel.vetter at ffwll.ch>

>  }
>  
>  static inline bool
> -- 
> 2.32.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

