[PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

Liu, Monk Monk.Liu at amd.com
Thu Dec 19 03:49:26 UTC 2019


Hi Shaoyun

>>> Do we know the root cause of why this function would ruin MEC? From the logic, I think this function should be called before FLR since we need to disable the user queue submission first.
Right now I don't know exactly which step leads to the KIQ ring test failure. I totally agree with you that this function should be called before VF FLR, but we cannot do that, for the reason described in the patch description:

> if we do pre_reset() before VF FLR, it would take the KIQ path for register
> access and get stuck there, because KIQ probably won't work by that time
> (e.g. you already made GFX hang)


>>> As I remember, the function should use HIQ to communicate with HW and shouldn't use KIQ to access HW registers. Has this been changed?
This function uses WREG32/RREG32 for register access, like all other functions in KMD, and WREG32/RREG32 route the register access through KIQ if we are in dynamic SRIOV mode (meaning we are an SRIOV VF and are not in full exclusive mode).

So if you call this function before EVENT_5 (event 5 triggers VF FLR), it runs in dynamic mode and KIQ handles the register access, which is not an option since ME/MEC has probably already hung (for example, if we are testing quark on gfx/compute rings).
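
To illustrate the routing described above, here is a minimal sketch in plain C. It is not the amdgpu code: struct vf_state, pick_reg_path() and reg_access_would_complete() are hypothetical names, the kiq_alive flag is an assumption standing in for "ME/MEC has not hung", and treating every non-dynamic case as direct access is also an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the VF state; not the real amdgpu structures. */
struct vf_state {
	bool is_sriov_vf;    /* running as an SRIOV virtual function */
	bool exclusive_mode; /* VF currently holds full exclusive access */
	bool kiq_alive;      /* assumption: KIQ can still process requests */
};

enum reg_path { REG_PATH_DIRECT, REG_PATH_KIQ };

/* Dynamic mode as described in this mail: SRIOV VF and not exclusive. */
static bool in_dynamic_mode(const struct vf_state *s)
{
	assert(s != NULL);
	return s->is_sriov_vf && !s->exclusive_mode;
}

/* In dynamic mode the register access is handed to KIQ. */
static enum reg_path pick_reg_path(const struct vf_state *s)
{
	return in_dynamic_mode(s) ? REG_PATH_KIQ : REG_PATH_DIRECT;
}

/*
 * Returns false when the access would get stuck: it is routed to KIQ
 * but KIQ can no longer service it (e.g. GFX already hung).
 */
static bool reg_access_would_complete(const struct vf_state *s)
{
	if (pick_reg_path(s) == REG_PATH_KIQ)
		return s->kiq_alive;
	return true;
}
```

With is_sriov_vf set, exclusive_mode clear and kiq_alive clear (the state before EVENT_5 after a GFX hang), pick_reg_path() selects KIQ and reg_access_would_complete() reports that the access would not finish, which is the limitation described above.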

Do you have a good suggestion?

Thanks
_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of shaoyunl
Sent: Tuesday, December 17, 2019 11:38 PM
To: amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV

I think the amdkfd side depends on this call to stop the user queues. Without this call, the user queues can submit to HW during the reset, which could cause a hang again.
Do we know the root cause of why this function would ruin MEC? From the logic, I think this function should be called before FLR since we need to disable the user queue submission first.
As I remember, the function should use HIQ to communicate with HW and shouldn't use KIQ to access HW registers. Has this been changed?


Regards
shaoyun.liu


On 2019-12-17 5:19 a.m., Monk Liu wrote:
> issues:
> MEC is ruined by amdkfd_pre_reset() after VF FLR is done
>
> fix:
> amdkfd_pre_reset() would ruin MEC after the hypervisor has finished the VF
> FLR. The correct sequence is to do amdkfd_pre_reset() before VF FLR, but
> there is a limitation blocking this sequence:
> if we do pre_reset() before VF FLR, it would take the KIQ path for register
> access and get stuck there, because KIQ probably won't work by that time
> (e.g. you already made GFX hang)
>
> so the best way right now is to simply remove it.
>
> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
>   1 file changed, 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 605cef6..ae962b9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3672,8 +3672,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>   	if (r)
>   		return r;
>   
> -	amdgpu_amdkfd_pre_reset(adev);
> -
>   	/* Resume IP prior to SMC */
>   	r = amdgpu_device_ip_reinit_early_sriov(adev);
>   	if (r)