[PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset

Fri Jan 26 19:50:19 UTC 2018

Yes, understood .  I already rebased the changes on amd-kfd-staging  and  add you and  Christian as reviewer. 

Regards
Shaoyun.liu

-----Original Message-----
From: Kuehling, Felix 
Sent: Friday, January 26, 2018 2:47 PM
To: Liu, Shaoyun; Koenig, Christian; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset

Hi Shaoyun,

As I'm upstreaming GPUVM support for KFD right now, a bunch of amdgpu_amdkfd_... changes are going to go through Oded's tree. To avoid conflicts when merging with Alex's tree, I'd recommend we don't submit your changes to amd-staging-drm-next.

Instead submit them to amd-kfd-staging, and I can take care of upstreaming them via Oded later.

We can still do code review on the amd-gfx mailing list to get valuable feedback.

Regards,
  Felix

On 2018-01-26 01:53 PM, Liu, Shaoyun wrote:
> It could be  done in next step .
> I just notice my change is based on amd-kfd-staging ,  I will add you as the reviewer again .  
>
> Regards
> Shaoyun.liu
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com]
> Sent: Friday, January 26, 2018 1:42 PM
> To: Liu, Shaoyun; Koenig, Christian; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset
>
> Alternatively if you really want to add some distinction here you could add an enum like trigger source.
>
> Something like manual, SRIOV host, GPU scheduler, KFD, interrupt etc... 
> And then use the user configurable option as bitmask to enable/disable GPU recovery for each trigger source.
>
> Regards,
> Christian.
>
> Am 26.01.2018 um 19:37 schrieb Liu, Shaoyun:
>> Ok, the wrong hang detection when amdgpu_gpu_recovery is enabled may be another issue , we can fix it later .
>>
>> Changed to 'false' flag as suggested and submitted .
>>
>> Regards
>> Shaoyun.liu
>>
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: Friday, January 26, 2018 1:28 PM
>> To: Liu, Shaoyun; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset
>>
>>> Sorry , I meant if I use the "false" flag and  gpu_recovery is not enabled , the  reset will be  ignored.
>> And exactly that is the intention here. So please use the false flag, apart from that the patch looks good to me.
>>
>> Regards,
>> Christian.
>>
>> Am 26.01.2018 um 18:56 schrieb Liu, Shaoyun:
>>> Sorry , I meant if I use the "false" flag and  gpu_recovery is not enabled , the  reset will be  ignored.
>>>
>>> -----Original Message-----
>>> From: Liu, Shaoyun
>>> Sent: Friday, January 26, 2018 12:54 PM
>>> To: Koenig, Christian; amd-gfx at lists.freedesktop.org
>>> Subject: RE: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset
>>>
>>> If we did use  force flag and amdgpu_gpu_recovery = 0 , the   reset will be ignored .  I'm kind of like this reset can go through like sriov .  If we depends on the  parameter  amdgpu_gpu_recovery , it may think the GPU is hang and  trigger the GPU reset when rocm submit  some heavy compute stuff running  and actually not hang  .
>>>
>>> Regards
>>> Shaoyun.liu
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com]
>>> Sent: Friday, January 26, 2018 12:41 PM
>>> To: Liu, Shaoyun; amd-gfx at lists.freedesktop.org
>>> Subject: Re: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset
>>>
>>> Am 26.01.2018 um 18:38 schrieb Shaoyun Liu:
>>>> Change-Id: I222f4bb2c9a91c7a4764e6aa706e7d7f2e6d948d
>>>> Signed-off-by: Shaoyun Liu <Shaoyun.Liu at amd.com>
>>>> ---
>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 19 +++++++++++++++++++
>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  6 ++++++
>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 +++++
>>>>     3 files changed, 30 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>> index 2d99099..cb1ee26 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>> @@ -239,6 +239,25 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
>>>>     	return r;
>>>>     }
>>>>     
>>>> +void amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev) {
>>>> +	if (adev->kfd)
>>>> +		kgd2kfd->pre_reset(adev->kfd);
>>>> +}
>>>> +
>>>> +void amdgpu_amdkfd_post_reset(struct amdgpu_device *adev) {
>>>> +	if (adev->kfd)
>>>> +		kgd2kfd->post_reset(adev->kfd);
>>>> +}
>>>> +
>>>> +void amdgpu_amdkfd_gpu_reset(struct kgd_dev *kgd) {
>>>> +	struct amdgpu_device *adev = (struct amdgpu_device *)kgd;
>>>> +
>>>> +	amdgpu_device_gpu_recover(adev, NULL, true);
>>> Use false for the force parameter here, apart from that the set looks good to me.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> +}
>>>> +
>>>>     int amdgpu_amdkfd_submit_ib(struct kgd_dev *kgd, enum kgd_engine_type engine,
>>>>     				uint32_t vmid, uint64_t gpu_addr,
>>>>     				uint32_t *ib_cmd, uint32_t ib_len) diff --git 
>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>> index 7c36e52..230761f 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>> @@ -155,6 +155,12 @@ int amdgpu_amdkfd_copy_mem_to_mem(struct kgd_dev *kgd, struct kgd_mem *src_mem,
>>>>     bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev,
>>>>     			u32 vmid);
>>>>     
>>>> +int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev);
>>>> +
>>>> +int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev);
>>>> +
>>>> +void amdgpu_amdkfd_gpu_reset(struct kgd_dev *kgd);
>>>> +
>>>>     /* Shared API */
>>>>     int map_bo(struct amdgpu_device *rdev, uint64_t va, void *vm,
>>>>     		struct amdgpu_bo *bo, struct amdgpu_bo_va **bo_va); diff 
>>>> --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 94f837b..61e7d35 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -2660,6 +2660,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>     	atomic_inc(&adev->gpu_reset_counter);
>>>>     	adev->in_gpu_reset = 1;
>>>>     
>>>> +	/* Block kfd */
>>>> +	amdgpu_amdkfd_pre_reset(adev);
>>>> +
>>>>     	/* block TTM */
>>>>     	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
>>>>     	/* store modesetting */
>>>> @@ -2765,6 +2768,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>     		amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
>>>>     	} else {
>>>>     		dev_info(adev->dev, "GPU reset(%d) 
>>>> successed!\n",atomic_read(&adev->gpu_reset_counter));
>>>> +		/*unlock kfd after a successfully recovery*/
>>>> +		amdgpu_amdkfd_post_reset(adev);
>>>>     	}
>>>>     
>>>>     	amdgpu_vf_error_trans_all(adev);
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx