[PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset

Liu, Shaoyun Shaoyun.Liu at amd.com
Fri Jan 26 18:37:55 UTC 2018


OK, the false hang detection when amdgpu_gpu_recovery is enabled may be a separate issue; we can fix it later.

Changed to the 'false' flag as suggested and submitted.

Regards
Shaoyun.liu
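
For readers following the thread, the interaction between the force argument and the amdgpu_gpu_recovery parameter discussed below can be sketched as a small standalone model. This is only an illustration under stated assumptions, not the driver's code: `reset_proceeds` and `recovery_enabled` are hypothetical names, the parameter is simplified to a boolean, and the rule that a forced reset bypasses the parameter is inferred from the discussion rather than quoted from amdgpu_device_gpu_recover().

```c
#include <stdbool.h>

/*
 * Hypothetical model of the gate discussed in this thread. It is NOT the
 * actual amdgpu_device_gpu_recover() implementation.
 *
 * force            - the "force" argument passed by the caller
 * recovery_enabled - simplified stand-in for the amdgpu_gpu_recovery
 *                    module parameter (true = enabled, false = disabled)
 *
 * Returns true when the reset would go ahead, false when it is ignored.
 */
static bool reset_proceeds(bool force, bool recovery_enabled)
{
	/* Assumption: a forced reset does not consult the parameter. */
	if (force)
		return true;

	/* From the thread: force == false with recovery disabled is ignored. */
	return recovery_enabled;
}
```

Under this model, `reset_proceeds(false, false)` is the case Christian describes as intended: a KFD-requested reset is dropped when the user has disabled GPU recovery.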


-----Original Message-----
From: Koenig, Christian 
Sent: Friday, January 26, 2018 1:28 PM
To: Liu, Shaoyun; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset

> Sorry, I meant that if I use the "false" flag and gpu_recovery is not enabled, the reset will be ignored.
And that is exactly the intention here. So please use the false flag; apart from that, the patch looks good to me.

Regards,
Christian.

On 26.01.2018 at 18:56, Liu, Shaoyun wrote:
> Sorry, I meant that if I use the "false" flag and gpu_recovery is not enabled, the reset will be ignored.
>
> -----Original Message-----
> From: Liu, Shaoyun
> Sent: Friday, January 26, 2018 12:54 PM
> To: Koenig, Christian; amd-gfx at lists.freedesktop.org
> Subject: RE: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset
>
> If we use the force flag and amdgpu_gpu_recovery = 0, the reset will be ignored. I would rather this reset go through, as it does for SR-IOV. If we depend on the amdgpu_gpu_recovery parameter, the driver may decide the GPU is hung and trigger a GPU reset when ROCm has submitted heavy compute work that is still running and not actually hung.
>
> Regards
> Shaoyun.liu
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com]
> Sent: Friday, January 26, 2018 12:41 PM
> To: Liu, Shaoyun; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset
>
> On 26.01.2018 at 18:38, Shaoyun Liu wrote:
>> Change-Id: I222f4bb2c9a91c7a4764e6aa706e7d7f2e6d948d
>> Signed-off-by: Shaoyun Liu <Shaoyun.Liu at amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 19 +++++++++++++++++++
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  6 ++++++
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 +++++
>>    3 files changed, 30 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> index 2d99099..cb1ee26 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> @@ -239,6 +239,25 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
>>    	return r;
>>    }
>>    
>> +void amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
>> +{
>> +	if (adev->kfd)
>> +		kgd2kfd->pre_reset(adev->kfd);
>> +}
>> +
>> +void amdgpu_amdkfd_post_reset(struct amdgpu_device *adev)
>> +{
>> +	if (adev->kfd)
>> +		kgd2kfd->post_reset(adev->kfd);
>> +}
>> +
>> +void amdgpu_amdkfd_gpu_reset(struct kgd_dev *kgd)
>> +{
>> +	struct amdgpu_device *adev = (struct amdgpu_device *)kgd;
>> +
>> +	amdgpu_device_gpu_recover(adev, NULL, true);
> Use false for the force parameter here; apart from that, the set looks good to me.
>
> Regards,
> Christian.
>
>> +}
>> +
>>    int amdgpu_amdkfd_submit_ib(struct kgd_dev *kgd, enum kgd_engine_type engine,
>>    				uint32_t vmid, uint64_t gpu_addr,
>>    				uint32_t *ib_cmd, uint32_t ib_len)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> index 7c36e52..230761f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> @@ -155,6 +155,12 @@ int amdgpu_amdkfd_copy_mem_to_mem(struct kgd_dev *kgd, struct kgd_mem *src_mem,
>>    bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev,
>>    			u32 vmid);
>>    
>> +void amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev);
>> +
>> +void amdgpu_amdkfd_post_reset(struct amdgpu_device *adev);
>> +
>> +void amdgpu_amdkfd_gpu_reset(struct kgd_dev *kgd);
>> +
>>    /* Shared API */
>>    int map_bo(struct amdgpu_device *rdev, uint64_t va, void *vm,
>>    		struct amdgpu_bo *bo, struct amdgpu_bo_va **bo_va);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 94f837b..61e7d35 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2660,6 +2660,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    	atomic_inc(&adev->gpu_reset_counter);
>>    	adev->in_gpu_reset = 1;
>>    
>> +	/* Block kfd */
>> +	amdgpu_amdkfd_pre_reset(adev);
>> +
>>    	/* block TTM */
>>    	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
>>    	/* store modesetting */
>> @@ -2765,6 +2768,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    		amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
>>    	} else {
>>    		dev_info(adev->dev, "GPU reset(%d) successed!\n",atomic_read(&adev->gpu_reset_counter));
>> +		/* unlock kfd after a successful recovery */
>> +		amdgpu_amdkfd_post_reset(adev);
>>    	}
>>    
>>    	amdgpu_vf_error_trans_all(adev);
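
The ordering that the quoted diff adds to amdgpu_device_gpu_recover() can be sketched as a small standalone model: KFD is blocked before the reset starts and unblocked only when the reset succeeds. This is a rough illustration, not the driver's code. Every name below (`kfd_model`, `device_model`, `model_gpu_recover`, and so on) is hypothetical, and the hardware reset is replaced by a caller-supplied result code.

```c
#include <stddef.h>

/* Hypothetical stand-in for the KFD device; it only counts hook calls. */
struct kfd_model {
	int pre_calls;
	int post_calls;
};

/* Hypothetical stand-in for struct amdgpu_device. */
struct device_model {
	struct kfd_model *kfd;	/* NULL when no KFD device is attached */
};

/* Mirrors the NULL check in the patch's amdgpu_amdkfd_pre_reset(). */
static void model_pre_reset(struct device_model *dev)
{
	if (dev->kfd)
		dev->kfd->pre_calls++;
}

/* Mirrors the NULL check in the patch's amdgpu_amdkfd_post_reset(). */
static void model_post_reset(struct device_model *dev)
{
	if (dev->kfd)
		dev->kfd->post_calls++;
}

/*
 * reset_result stands in for the outcome of the real reset:
 * 0 means success, anything else means failure.
 */
static int model_gpu_recover(struct device_model *dev, int reset_result)
{
	/* Block KFD first, before anything else is torn down. */
	model_pre_reset(dev);

	/* ... the actual GPU reset would run here ... */

	/* As in the diff, KFD is unblocked only on the success path. */
	if (reset_result == 0)
		model_post_reset(dev);

	return reset_result;
}
```

Calling `model_gpu_recover(&dev, 0)` increments both counters, while a nonzero result increments only `pre_calls`. That matches the diff as posted, where a failed reset leaves KFD blocked.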
