[PATCH v2] drm/xe/guc: Enable WA_DUAL_QUEUE for newer platforms

Wed Dec 11 22:35:05 UTC 2024

-----Original Message-----
From: Ceraolo Spurio, Daniele <daniele.ceraolospurio at intel.com> 
Sent: Wednesday, December 11, 2024 1:56 PM
To: intel-xe at lists.freedesktop.org
Cc: Ceraolo Spurio, Daniele <daniele.ceraolospurio at intel.com>; Harrison, John C <john.c.harrison at intel.com>; Narvaez, Jesus <jesus.narvaez at intel.com>; Cavitt, Jonathan <jonathan.cavitt at intel.com>
Subject: [PATCH v2] drm/xe/guc: Enable WA_DUAL_QUEUE for newer platforms
> 
> The DUAL_QUEUE_WA tells the GuC to not allow concurrent submissions
> on RCS and CCSes with different address spaces, which on DG2 is
> required as a WA for an HW bug. On newer platforms, this block has
> been moved in HW at the CS level, by stalling the RCS/CCS context
> switch when one of the other RCS/CCSes is busy with a different
> address space. While functionally correct, having a submission
> stalled on the HW limits the GuC ability to shuffle things around and
> can cause complications if the non-stalled submission runs for a long
> time, because the GuC doesn't know that the stalled submission isn't
> actually running and might declare it as hung. Therefore, we enable
> the DUAL_QUEUE_WA on all newer platforms to move management back to
> the GuC.
> 
> Note that the GuC specs also recommend enabling this for all platforms
> starting from MTL that have a CCS.
> 
> v2: only set the flag on GTs that have CCS engines
> 
> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
> Cc: John Harrison <John.C.Harrison at Intel.com>
> Cc: Jesus Narvaez <jesus.narvaez at intel.com>
> Cc: Jonathan Cavitt <jonathan.cavitt at intel.com>
> ---
>  drivers/gpu/drm/xe/xe_guc.c | 27 ++++++++++++++++++++++++++-
>  1 file changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> index 4e2868efb620..1d87933d7d0f 100644
> --- a/drivers/gpu/drm/xe/xe_guc.c
> +++ b/drivers/gpu/drm/xe/xe_guc.c
> @@ -147,6 +147,31 @@ static u32 guc_ctl_ads_flags(struct xe_guc *guc)
>  	return flags;
>  }
>  
> +static bool needs_wa_dual_queue(struct xe_gt *gt)
> +{
> +	/*
> +	 * The DUAL_QUEUE_WA tells the GuC to not allow concurrent submissions
> +	 * on RCS and CCSes with different address spaces, which on DG2 is
> +	 * required as a WA for an HW bug. On newer platforms, this block has
> +	 * been moved in HW at the CS level, by stalling the RCS/CCS context
> +	 * switch when one of the other RCS/CCSes is busy with a different
> +	 * address space. While functionally correct, having a submission
> +	 * stalled on the HW limits the GuC ability to shuffle things around and
> +	 * can cause complications if the non-stalled submission runs for a long
> +	 * time, because the GuC doesn't know that the stalled submission isn't
> +	 * actually running and might declare it as hung. Therefore, we enable
> +	 * the DUAL_QUEUE_WA on all newer platforms on GTs that have CCS engines
> +	 * to move management back to the GuC.
> +	 */

LGTM, and my RB still stands.

However, we may want to split this comment up into explaining the two cases where
we'd want to apply the workaround.  Specifically, we'd explain here that we're applying
the workaround for hardware reasons.  Then, below, we'd explain that we need the functionality
of the workaround extended to other platforms for non-hardware (GuC related) reasons.

So, the two comments would look something like:
"""
	/*
	 * The DUAL_QUEUE_WA tells the GuC to not allow concurrent submissions
	 * on RCS and CCSes with different address spaces, which on DG2 is
	 * required as a WA for an HW bug.
	 */
	if (XE_WA(gt, 22011391025))
		return true;

	/*
	 * On newer platforms, this block has
	 * been moved in HW at the CS level, by stalling the RCS/CCS context
	 * switch when one of the other RCS/CCSes is busy with a different
	 * address space. While functionally correct, having a submission
	 * stalled on the HW limits the GuC ability to shuffle things around and
	 * can cause complications if the non-stalled submission runs for a long
	 * time, because the GuC doesn't know that the stalled submission isn't
	 * actually running and might declare it as hung. Therefore, we enable
	 * the DUAL_QUEUE_WA on all newer platforms on GTs that have CCS engines
	 * to move management back to the GuC.
	 */
	if (CCS_MASK(gt) && GRAPHICS_VERx100(gt_to_xe(gt)) >= 1270)
		return true;
"""
Except, you know, cleaner than that.
-Jonathan Cavitt

> +	if (XE_WA(gt, 22011391025))
> +		return true;
> +
> +	if (CCS_MASK(gt) && GRAPHICS_VERx100(gt_to_xe(gt)) >= 1270)
> +		return true;
> +
> +	return false;
> +}
> +
>  static u32 guc_ctl_wa_flags(struct xe_guc *guc)
>  {
>  	struct xe_device *xe = guc_to_xe(guc);
> @@ -159,7 +184,7 @@ static u32 guc_ctl_wa_flags(struct xe_guc *guc)
>  	if (XE_WA(gt, 14014475959))
>  		flags |= GUC_WA_HOLD_CCS_SWITCHOUT;
>  
> -	if (XE_WA(gt, 22011391025))
> +	if (needs_wa_dual_queue(gt))
>  		flags |= GUC_WA_DUAL_QUEUE;
>  
>  	/*
> -- 
> 2.43.0
> 
>