[PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset

Liu, Shaoyun Shaoyun.Liu at amd.com
Fri Jan 26 17:56:30 UTC 2018


Sorry , I meant if I use the "false" flag and  gpu_recovery is not enabled , the  reset will be  ignored.  

-----Original Message-----
From: Liu, Shaoyun 
Sent: Friday, January 26, 2018 12:54 PM
To: Koenig, Christian; amd-gfx at lists.freedesktop.org
Subject: RE: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset

If we did use  force flag and amdgpu_gpu_recovery = 0 , the   reset will be ignored .  I'm kind of like this reset can go through like sriov .  If we depends on the  parameter  amdgpu_gpu_recovery , it may think the GPU is hang and  trigger the GPU reset when rocm submit  some heavy compute stuff running  and actually not hang  . 

Regards
Shaoyun.liu

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com]
Sent: Friday, January 26, 2018 12:41 PM
To: Liu, Shaoyun; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset

Am 26.01.2018 um 18:38 schrieb Shaoyun Liu:
> Change-Id: I222f4bb2c9a91c7a4764e6aa706e7d7f2e6d948d
> Signed-off-by: Shaoyun Liu <Shaoyun.Liu at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 19 +++++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  6 ++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 +++++
>   3 files changed, 30 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 2d99099..cb1ee26 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -239,6 +239,25 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
>   	return r;
>   }
>   
> +void amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev) {
> +	if (adev->kfd)
> +		kgd2kfd->pre_reset(adev->kfd);
> +}
> +
> +void amdgpu_amdkfd_post_reset(struct amdgpu_device *adev) {
> +	if (adev->kfd)
> +		kgd2kfd->post_reset(adev->kfd);
> +}
> +
> +void amdgpu_amdkfd_gpu_reset(struct kgd_dev *kgd) {
> +	struct amdgpu_device *adev = (struct amdgpu_device *)kgd;
> +
> +	amdgpu_device_gpu_recover(adev, NULL, true);

Use false for the force parameter here, apart from that the set looks good to me.

Regards,
Christian.

> +}
> +
>   int amdgpu_amdkfd_submit_ib(struct kgd_dev *kgd, enum kgd_engine_type engine,
>   				uint32_t vmid, uint64_t gpu_addr,
>   				uint32_t *ib_cmd, uint32_t ib_len) diff --git 
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 7c36e52..230761f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -155,6 +155,12 @@ int amdgpu_amdkfd_copy_mem_to_mem(struct kgd_dev *kgd, struct kgd_mem *src_mem,
>   bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev,
>   			u32 vmid);
>   
> +int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev);
> +
> +int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev);
> +
> +void amdgpu_amdkfd_gpu_reset(struct kgd_dev *kgd);
> +
>   /* Shared API */
>   int map_bo(struct amdgpu_device *rdev, uint64_t va, void *vm,
>   		struct amdgpu_bo *bo, struct amdgpu_bo_va **bo_va); diff --git 
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 94f837b..61e7d35 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2660,6 +2660,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   	atomic_inc(&adev->gpu_reset_counter);
>   	adev->in_gpu_reset = 1;
>   
> +	/* Block kfd */
> +	amdgpu_amdkfd_pre_reset(adev);
> +
>   	/* block TTM */
>   	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
>   	/* store modesetting */
> @@ -2765,6 +2768,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   		amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
>   	} else {
>   		dev_info(adev->dev, "GPU reset(%d) 
> successed!\n",atomic_read(&adev->gpu_reset_counter));
> +		/*unlock kfd after a successfully recovery*/
> +		amdgpu_amdkfd_post_reset(adev);
>   	}
>   
>   	amdgpu_vf_error_trans_all(adev);



More information about the amd-gfx mailing list