[PATCH] drm/amdgpu: Do gpu reset if we lost some gpu reset requests

Tue Aug 6 14:23:53 UTC 2019

On 8/5/19 2:02 AM, Pan, Xinhui wrote:
> As the race of gpu reset with ras interrupts. we might lose a chance to
> do gpu recovery. To guarantee the gpu has recovered successfully, we use
> atomic to save the counts of gpu reset requests, and issue another gpu
> reset if there are any pending requests.
>
> Signed-off-by: xinhui pan <xinhui.pan at amd.com>

How this protects against RAS triggered amdgpu_device_gpu_recover being 
dropped because there was another non RAS recover GPU reset in progress 
such as due to job timeout?

And reiterating a question I already asked before, why do you have to 
do  schedule_work for GPU resets from amdgpu_ras_reset_gpu when you 
already in non interrupt context for any of the ras_ih_if.cb handlers. I 
see why you need it in amdgpu_ras_resume but not for the other call sites.

Andrey

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 +++++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  2 +-
>   2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index a96b0f17c619..c1f444b74b19 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1220,7 +1220,15 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
>   		container_of(work, struct amdgpu_ras, recovery_work);
>   
>   	amdgpu_device_gpu_recover(ras->adev, 0);
> -	atomic_set(&ras->in_recovery, 0);
> +	/* if there is no competiton, in_recovery changes from 1 to 0.
> +	 * if ras_reset_gpu is called while we are doing gpu recvoery,
> +	 * bacause of the atomic protection, we may lose some recovery
> +	 * requests.
> +	 * So we use atomic_xchg to check the count of requests, and
> +	 * issue another gpu reset request to perform the gpu recovery.
> +	 */
> +	if (atomic_xchg(&ras->in_recovery, 0) > 1)
> +		amdgpu_ras_reset_gpu(ras->adev, 0);
>   }
>   
>   static int amdgpu_ras_release_vram(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 2765f2dbb1e6..ba423a4a3013 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -498,7 +498,7 @@ static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev,
>   {
>   	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
>   
> -	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
> +	if (atomic_inc_return(&ras->in_recovery) == 1)
>   		schedule_work(&ras->recovery_work);
>   	return 0;
>   }