[PATCH] drm/amdgpu: set default noretry=1 to fix kfd SVM issues for raven

Felix Kuehling felix.kuehling at amd.com
Wed Jul 28 14:21:31 UTC 2021


Doesn't this break IOMMUv2? Applications that run using IOMMUv2 for
system memory access depend on correct retry handling in the SQ.
Therefore noretry must be 0 on Raven.
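
As a rough illustration of that constraint, here is a standalone sketch of
the default selection, not the driver code. Only the "-1 means auto"
convention comes from the quoted patch; `enum asic`, `asic_default_noretry`
and `resolve_noretry` are hypothetical names made up for this example.

```c
#include <assert.h>

/* Hypothetical stand-ins for the driver's ASIC identifiers. */
enum asic { ASIC_RAVEN, ASIC_OTHER_NORETRY };

/*
 * Per-ASIC default used when the parameter is left at -1 (auto).
 * Raven stays at 0 so the SQ keeps retrying accesses whose faults
 * are resolved by IOMMUv2.
 */
static int asic_default_noretry(enum asic a)
{
	return a == ASIC_RAVEN ? 0 : 1;
}

/* Mirrors the "-1 means auto" check seen in the quoted patch. */
static int resolve_noretry(int param, enum asic a)
{
	assert(param == -1 || param == 0 || param == 1);
	return param == -1 ? asic_default_noretry(a) : param;
}
```

An explicit setting of 0 or 1 still wins over the per-ASIC default in
this sketch; only the automatic choice for Raven is pinned to 0.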

I believe the reason that SVM has trouble with retry enabled is that
IOMMUv2 is catching the page faults, so the driver never gets to handle
the page fault interrupts. That breaks page-fault based migration in the
SVM code. I think the better solution is to disable SVM on APUs where
IOMMUv2 is enabled.
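
A minimal sketch of that check, assuming a simplified device description.
`struct dev_caps` and `svm_supported` are hypothetical and do not
correspond to existing amdgpu or KFD symbols.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device description; not the amdgpu structures. */
struct dev_caps {
	bool is_apu;
	bool iommuv2_enabled;
};

/*
 * SVM migration depends on the driver receiving page-fault interrupts.
 * When IOMMUv2 catches the faults instead, report SVM as unavailable.
 */
static bool svm_supported(const struct dev_caps *caps)
{
	assert(caps);
	return !(caps->is_apu && caps->iommuv2_enabled);
}
```

With this gating, an APU running with IOMMUv2 keeps retry-based system
memory access, and SVM remains available wherever the driver handles the
faults itself.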

Alternatively, we could give up on IOMMUv2 entirely and always rely on
SVM to provide that functionality. But that requires more changes in the
amdgpu_vm code.

Regards,
  Felix


On 2021-07-28 at 2:36 a.m., Changfeng wrote:
> From: changzhu <Changfeng.Zhu at amd.com>
>
> From: Changfeng <Changfeng.Zhu at amd.com>
>
> No issues were found with noretry=1 apart from two SVM migrate failures.
> By contrast, noretry=0 causes most SVM test cases to fail.
> The two SVM migrate failures also occur with noretry=0, so set the
> default to noretry=1 for Raven for now to fix most of the SVM failures.
>
> Change-Id: Idb5cb3c1a04104013e4ab8aed2ad4751aaec4bbc
> Signed-off-by: Changfeng <Changfeng.Zhu at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 09edfb64cce0..d7f69dbd48e6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -606,19 +606,20 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
>  		 * noretry = 0 will cause kfd page fault tests fail
>  		 * for some ASICs, so set default to 1 for these ASICs.
>  		 */
> +	case CHIP_RAVEN:
> +		/*
> +		 * TODO: noretry = 1 fixes most SVM issues on Raven.
> +		 * However, two kfd migrate tests still fail with
> +		 * noretry = 1. These two migrate failures on Raven
> +		 * still need to be root-caused.
> +		 */
>  		if (amdgpu_noretry == -1)
>  			gmc->noretry = 1;
>  		else
>  			gmc->noretry = amdgpu_noretry;
>  		break;
> -	case CHIP_RAVEN:
>  	default:
> -		/* Raven currently has issues with noretry
> -		 * regardless of what we decide for other
> -		 * asics, we should leave raven with
> -		 * noretry = 0 until we root cause the
> -		 * issues.
> -		 *
> +		/*
>  		 * default this to 0 for now, but we may want
>  		 * to change this in the future for certain
>  		 * GPUs as it can increase performance in

