[PATCH] drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov

Fri Nov 5 21:10:50 UTC 2021

On 2021-11-05 1:59 p.m., Liu, Shaoyun wrote:
> [AMD Official Use Only]
>
> Yes, a lot has changed since then. pre_reset and post_reset are no longer inside the lock/unlock. With my previous change, kfd_pre_reset avoids touching the HW. It is now pure SW handling, so it should be safe to move it out of full access.
> Anyway, thanks for bringing this up. It reminds us to verify the XGMI configuration on SRIOV.
OK. Assuming it doesn't break in your testing, consider the patch

Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>
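
For anyone following the thread, here is a minimal sketch of the ordering being discussed. It is a toy model, not driver code: every toy_* name is hypothetical and none of this is the amdgpu or KFD API. It assumes only what the thread states, namely that pre_reset is pure SW handling that blocks new submissions, and that after the patch it runs in the common recovery path whether or not the device is a VF.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-device KFD state; not the real struct. */
struct toy_kfd {
	bool locked;      /* set by pre_reset, cleared by post_reset */
	int  hiq_packets; /* packets that reached the simulated HIQ */
};

/* Software-only: blocks new submissions, touches no hardware. */
static void toy_kfd_pre_reset(struct toy_kfd *kfd)
{
	kfd->locked = true;
}

static void toy_kfd_post_reset(struct toy_kfd *kfd)
{
	kfd->locked = false;
}

/* Returns 0 on success, -1 if rejected because a reset is in progress. */
static int toy_submit_to_hiq(struct toy_kfd *kfd)
{
	if (kfd->locked)
		return -1;
	kfd->hiq_packets++;
	return 0;
}

/*
 * Toy recovery path. pre_reset is called unconditionally before the
 * reset, so is_vf no longer decides whether it runs at this point.
 * Returns the number of submissions rejected during the reset window.
 */
static int toy_gpu_recover(struct toy_kfd *kfd, bool is_vf)
{
	int rejected = 0;

	toy_kfd_pre_reset(kfd);
	assert(kfd->locked); /* submissions must be blocked from here on */

	/* A submission racing with the reset is turned away. */
	if (toy_submit_to_hiq(kfd) != 0)
		rejected++;

	(void)is_vf; /* the reset itself would differ (host vs. bare metal) */

	toy_kfd_post_reset(kfd);
	return rejected;
}
```

Calling toy_gpu_recover() with is_vf set to true or false rejects the racing submission in both cases, which is the behaviour the patch is after.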


>
> Regards
> shaoyun.liu
>
> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling at amd.com>
> Sent: Friday, November 5, 2021 1:48 PM
> To: Liu, Shaoyun <Shaoyun.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov
>
> There was a reason why pre_reset was done differently on SRIOV. However, the code has changed a lot since then. Is this concern still valid?
>
>> commit 7b184b006185215daf4e911f8de212964c99a514
>> Author: wentalou <Wentao.Lou at amd.com>
>> Date:   Fri Dec 7 13:53:18 2018 +0800
>>
>>      drm/amdgpu: kfd_pre_reset outside req_full_gpu cause sriov hang
>>
>>      XGMI hive put kfd_pre_reset into amdgpu_device_lock_adev,
>>      but outside req_full_gpu of sriov.
>>      It would make sriov hang during reset.
>>
>>      Signed-off-by: Wentao Lou <Wentao.Lou at amd.com>
>>      Reviewed-by: Shaoyun Liu <Shaoyun.Liu at amd.com>
>>      Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> Regards,
>     Felix
>
>
> On 2021-11-05 12:57 p.m., shaoyunl wrote:
>> KFD pre_reset should be called before the reset is executed. It holds
>> the lock that prevents other ROCm processes from sending packets to the
>> HIQ while the host executes the real reset on the HW.
>>
>> Signed-off-by: shaoyunl <shaoyun.liu at amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +----
>>    1 file changed, 1 insertion(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 95fec36e385e..d7c9dce17cad 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4278,8 +4278,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>>    	if (r)
>>    		return r;
>>    
>> -	amdgpu_amdkfd_pre_reset(adev);
>> -
>>    	/* Resume IP prior to SMC */
>>    	r = amdgpu_device_ip_reinit_early_sriov(adev);
>>    	if (r)
>> @@ -5015,8 +5013,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    
>>    		cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
>>    
>> -		if (!amdgpu_sriov_vf(tmp_adev))
>> -			amdgpu_amdkfd_pre_reset(tmp_adev);
>> +		amdgpu_amdkfd_pre_reset(tmp_adev);
>>    
>>    		/*
>>    		 * Mark these ASICs to be reseted as untracked first