[PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

Felix Kuehling felix.kuehling at amd.com
Tue Dec 17 21:14:19 UTC 2019


I agree. Removing the call to pre-reset probably breaks GPU reset for KFD.

We call the KFD suspend function in pre-reset, which uses the HIQ to
stop any user mode queues still running. If that is not possible because
the HIQ is hanging, it should fail with a timeout. If we know that the
HIQ is hanging, we may be able to update only the KFD-internal queue
state without actually sending anything to the HIQ.
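
To make the idea concrete, here is a rough user-space sketch of the control
flow, not actual amdkfd code. Every name in it (sketch_dev, sketch_pre_reset,
hiq_known_hung, and so on) is hypothetical and exists only for illustration;
the real suspend path is more involved. It assumes the driver has some way of
knowing in advance that the HIQ is hung.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model. None of these are real amdkfd symbols. */
enum queue_state { QUEUE_ACTIVE, QUEUE_STOPPED };

#define MAX_QUEUES 8

struct sketch_dev {
	bool hiq_responds;     /* models whether the hardware HIQ still works */
	bool hiq_known_hung;   /* assumption: the driver already knows the HIQ is hung */
	int hiq_packets_sent;  /* counts requests actually sent to the HIQ */
	size_t num_queues;
	enum queue_state queues[MAX_QUEUES];
};

/* Stand-in for sending an unmap request via the HIQ and waiting for it. */
static int sketch_hiq_unmap_queues(struct sketch_dev *dev)
{
	dev->hiq_packets_sent++;
	return dev->hiq_responds ? 0 : -ETIMEDOUT;
}

/* Update only the driver-internal bookkeeping for the user mode queues. */
static void sketch_mark_queues_stopped(struct sketch_dev *dev)
{
	for (size_t i = 0; i < dev->num_queues; i++)
		dev->queues[i] = QUEUE_STOPPED;
}

static int sketch_pre_reset(struct sketch_dev *dev)
{
	assert(dev != NULL && dev->num_queues <= MAX_QUEUES);

	if (!dev->hiq_known_hung) {
		/* Normal path: ask the HIQ to stop the queues; may time out. */
		int r = sketch_hiq_unmap_queues(dev);

		if (r)
			return r;
	}

	/* Known-hung path skips the HIQ and falls through to here directly. */
	sketch_mark_queues_stopped(dev);
	return 0;
}
```

In this sketch a hung HIQ that is not yet known to be hung still produces a
timeout error and leaves the queue state untouched, which matches the current
behaviour described above.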

Regards,
   Felix

On 2019-12-17 10:37, shaoyunl wrote:
> I think the amdkfd side depends on this call to stop the user queues.
> Without this call, the user queues can submit to HW during the reset,
> which could cause a hang again.
> Do we know the root cause of why this function would ruin the MEC? From
> the logic, I think this function should be called before FLR since we
> need to disable user queue submission first.
> As I remember, the function should use the HIQ to communicate with HW
> and shouldn't use the KIQ to access HW registers. Has this been changed?
>
>
> Regards
> shaoyun.liu
>
>
> On 2019-12-17 5:19 a.m., Monk Liu wrote:
>> issue:
>> The MEC is ruined by amdkfd_pre_reset() after the VF FLR is done.
>>
>> fix:
>> amdkfd_pre_reset() ruins the MEC when it runs after the hypervisor has
>> finished the VF FLR. The correct sequence is to do amdkfd_pre_reset()
>> before the VF FLR, but one limitation blocks this sequence:
>> if we do pre_reset() before the VF FLR, it goes the KIQ way to do register
>> access and gets stuck there, because the KIQ probably won't work by that
>> time (e.g. you have already made GFX hang).
>>
>> So the best way right now is to simply remove it.
>>
>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
>>   1 file changed, 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 605cef6..ae962b9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>>       if (r)
>>           return r;
>> -     amdgpu_amdkfd_pre_reset(adev);
>> -
>>       /* Resume IP prior to SMC */
>>       r = amdgpu_device_ip_reinit_early_sriov(adev);
>>       if (r)

