[PATCH 5/5] drm/amdgpu: reduce reset time

Mon Jul 25 21:17:43 UTC 2022

On 2022-07-22 03:34, Victor Zhao wrote:
> In multi container use case, reset time is important, so skip ring
> tests and cp halt wait during ip suspending for reset as they are
> going to fail and cost more time on reset

Why are they failing in this case ? Skipping ring tests is not the best 
idea as you
loose important indicator of system's sanity. Is there any way to make 
them work ?

>
> Signed-off-by: Victor Zhao <Victor.Zhao at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c |  2 +-
>   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 26 +++++++++++++++----------
>   2 files changed, 17 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index 222d3d7ea076..f872495ccc3a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -477,7 +477,7 @@ int amdgpu_gfx_disable_kcq(struct amdgpu_device *adev)
>   		kiq->pmf->kiq_unmap_queues(kiq_ring, &adev->gfx.compute_ring[i],
>   					   RESET_QUEUES, 0, 0);
>   
> -	if (adev->gfx.kiq.ring.sched.ready)
> +	if (adev->gfx.kiq.ring.sched.ready && !amdgpu_in_reset(adev))
>   		r = amdgpu_ring_test_helper(kiq_ring);
>   	spin_unlock(&adev->gfx.kiq.ring_lock);
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index fafbad3cf08d..9ae29023e38f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -5971,16 +5971,19 @@ static int gfx_v10_0_cp_gfx_enable(struct amdgpu_device *adev, bool enable)
>   		WREG32_SOC15(GC, 0, mmCP_ME_CNTL, tmp);
>   	}
>   
> -	for (i = 0; i < adev->usec_timeout; i++) {
> -		if (RREG32_SOC15(GC, 0, mmCP_STAT) == 0)
> -			break;
> -		udelay(1);
> -	}
> -
> -	if (i >= adev->usec_timeout)
> -		DRM_ERROR("failed to %s cp gfx\n", enable ? "unhalt" : "halt");
> +	if (!amdgpu_in_reset(adev)) {
> +		for (i = 0; i < adev->usec_timeout; i++) {
> +			if (RREG32_SOC15(GC, 0, mmCP_STAT) == 0)
> +				break;
> +			udelay(1);
> +		}
>   
> +		if (i >= adev->usec_timeout)
> +			DRM_ERROR("failed to %s cp gfx\n",
> +				  enable ? "unhalt" : "halt");
> +	}
>   	return 0;
> +
>   }

This change has impact beyond container case no ? We had no issue
with this code during regular reset cases so why we would give up on 
this code
which confirms CP is idle ? What is the side effect of skipping this 
during all GPU resets ?

Andrey

>   
>   static int gfx_v10_0_cp_gfx_load_pfp_microcode(struct amdgpu_device *adev)
> @@ -7569,8 +7572,10 @@ static int gfx_v10_0_kiq_disable_kgq(struct amdgpu_device *adev)
>   	for (i = 0; i < adev->gfx.num_gfx_rings; i++)
>   		kiq->pmf->kiq_unmap_queues(kiq_ring, &adev->gfx.gfx_ring[i],
>   					   PREEMPT_QUEUES, 0, 0);
> -
> -	return amdgpu_ring_test_helper(kiq_ring);
> +	if (!amdgpu_in_reset(adev))
> +		return amdgpu_ring_test_helper(kiq_ring);
> +	else
> +		return 0;
>   }
>   #endif
>   
> @@ -7610,6 +7615,7 @@ static int gfx_v10_0_hw_fini(void *handle)
>   
>   		return 0;
>   	}
> +
>   	gfx_v10_0_cp_enable(adev, false);
>   	gfx_v10_0_enable_gui_idle_interrupt(adev, false);
>