[PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery

Christian König christian.koenig at amd.com
Thu Aug 20 08:59:58 UTC 2020


Yes, that is a perfectly valid case. The same thing happens with multiple 
timeouts from different queues.

Christian.

On 20.08.20 at 10:40, Li, Dennis wrote:
> [AMD Public Use]
>
> Hi, Hawking,
>        When a RAS uncorrectable error happens, the RAS interrupt triggers a GPU recovery.  If a GFX or compute job times out at the same time, the driver triggers another recovery.
>
> Best Regards
> Dennis Li
> -----Original Message-----
> From: Zhang, Hawking <Hawking.Zhang at amd.com>
> Sent: Thursday, August 20, 2020 4:24 PM
> To: Li, Dennis <Dennis.Li at amd.com>; amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Kuehling, Felix <Felix.Kuehling at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
> Cc: Li, Dennis <Dennis.Li at amd.com>
> Subject: RE: [PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery
>
> [AMD Public Use]
>
> Hi Dennis,
>
> Can you elaborate on the case where the driver re-enters GPU recovery in an sGPU system? I'm wondering whether this is a valid case or whether we should prevent it from the beginning.
>
> Regards,
> Hawking
>
> -----Original Message-----
> From: Dennis Li <Dennis.Li at amd.com>
> Sent: Thursday, August 20, 2020 10:21
> To: amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Kuehling, Felix <Felix.Kuehling at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
> Cc: Li, Dennis <Dennis.Li at amd.com>
> Subject: [PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery
>
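> In a single-GPU system, if the driver re-enters GPU recovery, amdgpu_device_lock_adev returns false, but hive is a null pointer in that case, so the unconditional mutex_unlock(&hive->hive_lock) on the bail-out path dereferences NULL. Jump to a common exit label that checks hive before unlocking instead.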
>
> Signed-off-by: Dennis Li <Dennis.Li at amd.com>
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 82242e2f5658..81b1d9a1dca0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4371,8 +4371,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   		if (!amdgpu_device_lock_adev(tmp_adev)) {
>   			DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress",
>   				  job ? job->base.id : -1);
> -			mutex_unlock(&hive->hive_lock);
> -			return 0;
> +			r = 0;
> +			goto skip_recovery;
>   		}
>   
>   		/*
> @@ -4505,6 +4505,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   		amdgpu_device_unlock_adev(tmp_adev);
>   	}
>   
> +skip_recovery:
>   	if (hive) {
>   		atomic_set(&hive->in_reset, 0);
>   		mutex_unlock(&hive->hive_lock);
> --
> 2.17.1
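
For readers following the thread, the control flow the patch moves to can be sketched in isolation. This is only a minimal userspace model under stated assumptions, not the driver code: struct hive, struct device, device_trylock(), device_unlock() and gpu_recover() are hypothetical stand-ins, the hive lock is reduced to a counter, and the real function takes the hive lock differently. The sketch shows only that the bail-out path goes through one exit label where hive is checked before it is touched.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real amdgpu structures. */
struct hive {
	int lock_depth;	/* models hive_lock: 1 while held, 0 otherwise */
	int in_reset;	/* models the hive's in_reset flag */
};

struct device {
	atomic_flag in_recovery;	/* models the per-device recovery lock */
	struct hive *hive;		/* NULL on a single-GPU system */
};

/* Returns false if another recovery already owns the device. */
static bool device_trylock(struct device *dev)
{
	return !atomic_flag_test_and_set(&dev->in_recovery);
}

static void device_unlock(struct device *dev)
{
	atomic_flag_clear(&dev->in_recovery);
}

/*
 * Returns 0 both after a completed recovery and when bailing out because
 * another recovery is in progress; *did_reset tells the two cases apart.
 */
static int gpu_recover(struct device *dev, bool *did_reset)
{
	struct hive *hive;
	int r = 0;

	assert(dev != NULL);
	hive = dev->hive;
	*did_reset = false;

	if (hive) {
		hive->lock_depth++;
		hive->in_reset = 1;
	}

	if (!device_trylock(dev)) {
		/* Re-entered recovery: do not touch hive here, it may be NULL. */
		r = 0;
		goto skip_recovery;
	}

	*did_reset = true;	/* the actual reset work would go here */
	device_unlock(dev);

skip_recovery:
	/* Single exit path: hive state is released only if a hive exists. */
	if (hive) {
		hive->in_reset = 0;
		hive->lock_depth--;
	}
	return r;
}
```

With dev->hive set to NULL and the device already locked by a first recovery, a second call to gpu_recover() returns 0 without dereferencing hive, which is the single-GPU re-entry case described above.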


