[Intel-gfx] [PATCH 1/2] drm/i915: limit double GT reset to pre-MTL

Wed Mar 22 19:44:56 UTC 2023

On 3/20/2023 14:10, Daniele Ceraolo Spurio wrote:
> Commit 3db9d590557d ("drm/i915/gt: Reset twice") modified the code to
> always hit the GDRST register twice when doing a reset, with the
> reported aim to fix invalid post-reset engine state on some platforms
> (Jasperlake being the only one actually mentioned).
It still concerns me that there are no actual details about this issue 
from a hardware perspective as to what/why it goes wrong, the comment is 
fully of non-definitive language - 'appears to', 'should', 'is still a 
concern that'. And there is no w/a number associated with it. It all 
feels extremely suspect and warrants a great big FIXME tag as a minimum.

John.

>
> This is a problem on MTL, due to the fact that we have to apply a time
> consuming WA (coming in the next patch) every time we hit the GDRST
> register in a way that can include the GSC engine. Even post MTL, the
> expectation is that we'll have some work to do before and after hitting
> the GDRST if the GSC is involved.
>
> Since the issue requiring the double reset seems to be limited to older
> platforms, instead of trying to handle the double-reset on MTL and
> future platforms it is just easier to turn it off. The default on MTL is
> also for GuC to own engine reset, with i915 only covering full-GT reset.
>
> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Andi Shyti <andi.shyti at linux.intel.com>
> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> Cc: Gwan-gyeong Mun <gwan-gyeong.mun at intel.com>
> Cc: John Harrison <John.C.Harrison at Intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_reset.c | 35 +++++++++++++++------------
>   1 file changed, 20 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index 0bb9094fdacd..2c3463f77e5c 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -268,9 +268,27 @@ static int ilk_do_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask,
>   static int gen6_hw_domain_reset(struct intel_gt *gt, u32 hw_domain_mask)
>   {
>   	struct intel_uncore *uncore = gt->uncore;
> -	int loops = 2;
> +	int loops;
>   	int err;
>   
> +	/*
> +	 * On some platforms, e.g. Jasperlake, we see that the engine register
> +	 * state is not cleared until shortly after GDRST reports completion,
> +	 * causing a failure as we try to immediately resume while the internal
> +	 * state is still in flux. If we immediately repeat the reset, the
> +	 * second reset appears to serialise with the first, and since it is a
> +	 * no-op, the registers should retain their reset value. However, there
> +	 * is still a concern that upon leaving the second reset, the internal
> +	 * engine state is still in flux and not ready for resuming.
> +	 *
> +	 * Starting on MTL, there are some prep steps that we need to do when
> +	 * resetting some engines that need to be applied every time we write to
> +	 * GEN6_GDRST. As those are time consuming (tens of ms), we don't want
> +	 * to perform that twice, so, since the Jasperlake issue hasn't been
> +	 * observed on MTL, we avoid repeating the reset on newer platforms.
> +	 */
> +	loops = GRAPHICS_VER_FULL(gt->i915) < IP_VER(12, 70) ? 2 : 1;
> +
>   	/*
>   	 * GEN6_GDRST is not in the gt power well, no need to check
>   	 * for fifo space for the write or forcewake the chip for
> @@ -279,20 +297,7 @@ static int gen6_hw_domain_reset(struct intel_gt *gt, u32 hw_domain_mask)
>   	do {
>   		intel_uncore_write_fw(uncore, GEN6_GDRST, hw_domain_mask);
>   
> -		/*
> -		 * Wait for the device to ack the reset requests.
> -		 *
> -		 * On some platforms, e.g. Jasperlake, we see that the
> -		 * engine register state is not cleared until shortly after
> -		 * GDRST reports completion, causing a failure as we try
> -		 * to immediately resume while the internal state is still
> -		 * in flux. If we immediately repeat the reset, the second
> -		 * reset appears to serialise with the first, and since
> -		 * it is a no-op, the registers should retain their reset
> -		 * value. However, there is still a concern that upon
> -		 * leaving the second reset, the internal engine state
> -		 * is still in flux and not ready for resuming.
> -		 */
> +		/* Wait for the device to ack the reset requests. */
>   		err = __intel_wait_for_register_fw(uncore, GEN6_GDRST,
>   						   hw_domain_mask, 0,
>   						   2000, 0,