[PATCH] drm/amdgpu: Fix two reset triggered in a row

Tue Apr 23 05:50:46 UTC 2024

On 22.04.24 at 21:45, Yunxiang Li wrote:
> The reset request from KFD is missing a check for whether a reset is
> already in progress. This causes a second reset to be triggered right after
> the previous one finishes. Add the check to align with the other reset sources.

NAK, that isn't how this should be handled.

Instead, all reset sources that were already handled by a previous reset
should be canceled.

In other words, there should be a cancel_work(&adev->kfd.reset_work);
somewhere in the KFD code. If this doesn't work correctly, then that call
is missing somewhere.
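
To illustrate the difference, here is a minimal userspace sketch of the idea,
not driver code. All model_* names are hypothetical stand-ins: model_work for
a work item such as adev->kfd.reset_work, model_cancel_work() for
cancel_work(). It assumes a KFD reset request gets queued while a reset is
already running, and compares finishing the reset with and without dropping
that stale request.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: none of these are the real amdgpu structures. */
struct model_work {
	bool pending;	/* queued on the reset domain, not yet started */
};

struct model_dev {
	struct model_work kfd_reset_work;	/* stands in for adev->kfd.reset_work */
	int resets_run;				/* number of full resets performed */
};

/* Stand-in for queuing work on the reset domain; queuing twice is a no-op. */
static void model_schedule(struct model_work *w)
{
	w->pending = true;
}

/* Stand-in for cancel_work(): drop a queued item, report if one was pending. */
static bool model_cancel_work(struct model_work *w)
{
	bool was_pending = w->pending;

	w->pending = false;
	return was_pending;
}

/*
 * One reset pass. If kfd_request_during_reset is set, a KFD reset request
 * arrives while this reset is running. If cancel_stale is set, the reset
 * drops requests it has already covered before it finishes.
 */
static void model_do_reset(struct model_dev *dev, bool kfd_request_during_reset,
			   bool cancel_stale)
{
	dev->resets_run++;
	if (kfd_request_during_reset)
		model_schedule(&dev->kfd_reset_work);
	if (cancel_stale)
		model_cancel_work(&dev->kfd_reset_work);
}

/* Drain the queue: every item still pending triggers another reset. */
static void model_drain(struct model_dev *dev, bool cancel_stale)
{
	while (dev->kfd_reset_work.pending) {
		dev->kfd_reset_work.pending = false;
		model_do_reset(dev, false, cancel_stale);
	}
}

/* Run one reset with a KFD request racing in; return the total reset count. */
static int model_scenario(bool cancel_stale)
{
	struct model_dev dev = {0};

	model_do_reset(&dev, true, cancel_stale);
	model_drain(&dev, cancel_stale);
	assert(!dev.kfd_reset_work.pending);	/* queue is empty once drained */
	return dev.resets_run;
}
```

In this model, model_scenario(false) returns 2 (the stale request causes a
second reset right after the first) and model_scenario(true) returns 1. The
request side needs no "is a reset in progress" check for that.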

If you see amdgpu_in_reset() used outside of the low-level functions,
then that is clearly a bug.

Regards,
Christian.

>
> Signed-off-by: Yunxiang Li <Yunxiang.Li at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 3b4591f554f1..ce3dbb1cc2da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -283,7 +283,7 @@ int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev)
>   
>   void amdgpu_amdkfd_gpu_reset(struct amdgpu_device *adev)
>   {
> -	if (amdgpu_device_should_recover_gpu(adev))
> +	if (amdgpu_device_should_recover_gpu(adev) && !amdgpu_in_reset(adev))
>   		amdgpu_reset_domain_schedule(adev->reset_domain,
>   					     &adev->kfd.reset_work);
>   }