[PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset

Christian König ckoenig.leichtzumerken at gmail.com
Fri Jan 26 18:41:31 UTC 2018


Alternatively if you really want to add some distinction here you could 
add an enum like trigger source.

Something like manual, SRIOV host, GPU scheduler, KFD, interrupt etc... 
And then use the user configurable option as bitmask to enable/disable 
GPU recovery for each trigger source.

Regards,
Christian.

Am 26.01.2018 um 19:37 schrieb Liu, Shaoyun:
> Ok, the wrong hang detection when amdgpu_gpu_recovery is enabled may be another issue , we can fix it later .
>
> Changed to 'false' flag as suggested and submitted .
>
> Regards
> Shaoyun.liu
>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: Friday, January 26, 2018 1:28 PM
> To: Liu, Shaoyun; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset
>
>> Sorry , I meant if I use the "false" flag and  gpu_recovery is not enabled , the  reset will be  ignored.
> And exactly that is the intention here. So please use the false flag, apart from that the patch looks good to me.
>
> Regards,
> Christian.
>
> Am 26.01.2018 um 18:56 schrieb Liu, Shaoyun:
>> Sorry , I meant if I use the "false" flag and  gpu_recovery is not enabled , the  reset will be  ignored.
>>
>> -----Original Message-----
>> From: Liu, Shaoyun
>> Sent: Friday, January 26, 2018 12:54 PM
>> To: Koenig, Christian; amd-gfx at lists.freedesktop.org
>> Subject: RE: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset
>>
>> If we did use  force flag and amdgpu_gpu_recovery = 0 , the   reset will be ignored .  I'm kind of like this reset can go through like sriov .  If we depends on the  parameter  amdgpu_gpu_recovery , it may think the GPU is hang and  trigger the GPU reset when rocm submit  some heavy compute stuff running  and actually not hang  .
>>
>> Regards
>> Shaoyun.liu
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com]
>> Sent: Friday, January 26, 2018 12:41 PM
>> To: Liu, Shaoyun; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset
>>
>> Am 26.01.2018 um 18:38 schrieb Shaoyun Liu:
>>> Change-Id: I222f4bb2c9a91c7a4764e6aa706e7d7f2e6d948d
>>> Signed-off-by: Shaoyun Liu <Shaoyun.Liu at amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 19 +++++++++++++++++++
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  6 ++++++
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 +++++
>>>     3 files changed, 30 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> index 2d99099..cb1ee26 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> @@ -239,6 +239,25 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
>>>     	return r;
>>>     }
>>>     
>>> +void amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev) {
>>> +	if (adev->kfd)
>>> +		kgd2kfd->pre_reset(adev->kfd);
>>> +}
>>> +
>>> +void amdgpu_amdkfd_post_reset(struct amdgpu_device *adev) {
>>> +	if (adev->kfd)
>>> +		kgd2kfd->post_reset(adev->kfd);
>>> +}
>>> +
>>> +void amdgpu_amdkfd_gpu_reset(struct kgd_dev *kgd) {
>>> +	struct amdgpu_device *adev = (struct amdgpu_device *)kgd;
>>> +
>>> +	amdgpu_device_gpu_recover(adev, NULL, true);
>> Use false for the force parameter here, apart from that the set looks good to me.
>>
>> Regards,
>> Christian.
>>
>>> +}
>>> +
>>>     int amdgpu_amdkfd_submit_ib(struct kgd_dev *kgd, enum kgd_engine_type engine,
>>>     				uint32_t vmid, uint64_t gpu_addr,
>>>     				uint32_t *ib_cmd, uint32_t ib_len) diff --git
>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> index 7c36e52..230761f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> @@ -155,6 +155,12 @@ int amdgpu_amdkfd_copy_mem_to_mem(struct kgd_dev *kgd, struct kgd_mem *src_mem,
>>>     bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev,
>>>     			u32 vmid);
>>>     
>>> +int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev);
>>> +
>>> +int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev);
>>> +
>>> +void amdgpu_amdkfd_gpu_reset(struct kgd_dev *kgd);
>>> +
>>>     /* Shared API */
>>>     int map_bo(struct amdgpu_device *rdev, uint64_t va, void *vm,
>>>     		struct amdgpu_bo *bo, struct amdgpu_bo_va **bo_va); diff --git
>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 94f837b..61e7d35 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2660,6 +2660,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>     	atomic_inc(&adev->gpu_reset_counter);
>>>     	adev->in_gpu_reset = 1;
>>>     
>>> +	/* Block kfd */
>>> +	amdgpu_amdkfd_pre_reset(adev);
>>> +
>>>     	/* block TTM */
>>>     	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
>>>     	/* store modesetting */
>>> @@ -2765,6 +2768,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>     		amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
>>>     	} else {
>>>     		dev_info(adev->dev, "GPU reset(%d)
>>> successed!\n",atomic_read(&adev->gpu_reset_counter));
>>> +		/*unlock kfd after a successfully recovery*/
>>> +		amdgpu_amdkfd_post_reset(adev);
>>>     	}
>>>     
>>>     	amdgpu_vf_error_trans_all(adev);
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx



More information about the amd-gfx mailing list