[PATCH] drm/amdgpu: set default noretry=1 to fix kfd SVM issues for raven

Felix Kuehling felix.kuehling at amd.com
Thu Aug 5 15:10:25 UTC 2021


On 2021-08-05 at 4:51 a.m., Zhu, Changfeng wrote:
> [AMD Official Use Only]
>
> Hi Felix,
>
> Can we set noretry=1 for the dGPU path (ignore_crat=1), which doesn't go through the IOMMUv2 path on Raven, as below:

There are other possible reasons than ignore_crat for Raven to work in
dGPU mode (broken CRAT, disabled IOMMU). However, those are not known
until later in the initialization.
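
A minimal sketch of the point above, not actual amdgpu code: the struct, its fields, and both function names are hypothetical, and only ignore_crat, a broken CRAT, and a disabled IOMMU come from this thread. It assumes the proposal under discussion (noretry=1 in dGPU mode, noretry=0 when IOMMUv2 is in use) and shows that the default can only be resolved once all three conditions are known:

```c
#include <stdbool.h>

/* Hypothetical, simplified model; none of these names are real amdgpu symbols. */
struct raven_state {
	int noretry_param;   /* -1 = auto, 0 or 1 = forced by the user */
	bool ignore_crat;    /* known early, it is a module parameter */
	bool crat_valid;     /* assumed known only after the CRAT is parsed */
	bool iommu_enabled;  /* assumed known only after IOMMU setup */
};

/* True when Raven ends up in dGPU mode, i.e. not using IOMMUv2. */
bool raven_runs_as_dgpu(const struct raven_state *s)
{
	return s->ignore_crat || !s->crat_valid || !s->iommu_enabled;
}

/* Resolve noretry only after all three inputs above are known. */
int raven_resolve_noretry(const struct raven_state *s)
{
	if (s->noretry_param != -1)
		return s->noretry_param;        /* explicit user choice wins */
	return raven_runs_as_dgpu(s) ? 1 : 0;   /* IOMMUv2 path keeps retry on */
}
```

Checking ignore_crat alone would pick noretry=0 for a Raven with a broken CRAT, even though it runs in dGPU mode just like the ignore_crat=1 case.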

Regards,
  Felix


>> +	case CHIP_RAVEN:
>> +		/*
>> +		 * TODO: Raven currently fixes most SVM issues with
>> +		 * noretry = 1. However, two kfd migrate tests still
>> +		 * fail with noretry = 1. These two migrate failures
>> +		 * on Raven still need to be root caused.
>> +		 */
>>  		if (amdgpu_noretry == -1) {
>> 			if (ignore_crat)
>>  				gmc->noretry = 1;
>> 			else
>> 				gmc->noretry = 0;
>> 		}
>>  		else
>>  			gmc->noretry = amdgpu_noretry;
>>  		break;
> BR,
> Changfeng.
>
> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling at amd.com> 
> Sent: Wednesday, July 28, 2021 10:22 PM
> To: Zhu, Changfeng <Changfeng.Zhu at amd.com>; amd-gfx at lists.freedesktop.org; Huang, Ray <Ray.Huang at amd.com>; Zhang, Yifan <Yifan1.Zhang at amd.com>
> Subject: Re: [PATCH] drm/amdgpu: set default noretry=1 to fix kfd SVM issues for raven
>
> Doesn't this break IOMMUv2? Applications that run using IOMMUv2 for system memory access depend on correct retry handling in the SQ.
> Therefore noretry must be 0 on Raven.
>
> I believe the reason that SVM has trouble with retry enabled is that
> IOMMUv2 is catching the page faults, so the driver never gets to handle the page fault interrupts. That breaks page-fault based migration in the SVM code. I think the better solution is to disable SVM on APUs where
> IOMMUv2 is enabled.
>
> Alternatively, we could give up on IOMMUv2 entirely and always rely on SVM to provide that functionality. But that requires more changes in the amdgpu_vm code.
>
> Regards,
>   Felix
>
>
> On 2021-07-28 at 2:36 a.m., Changfeng wrote:
>> From: changzhu <Changfeng.Zhu at amd.com>
>>
>> From: Changfeng <Changfeng.Zhu at amd.com>
>>
>> No issues were found with noretry=1 except two SVM migrate issues.
>> In contrast, noretry=0 causes most SVM cases to fail.
>> The two SVM migrate issues also happen with noretry=0. So set the
>> default to noretry=1 for Raven for now to fix most SVM failures.
>>
>> Change-Id: Idb5cb3c1a04104013e4ab8aed2ad4751aaec4bbc
>> Signed-off-by: Changfeng <Changfeng.Zhu at amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 15 ++++++++-------
>>  1 file changed, 8 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> index 09edfb64cce0..d7f69dbd48e6 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> @@ -606,19 +606,20 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
>>  		 * noretry = 0 will cause kfd page fault tests fail
>>  		 * for some ASICs, so set default to 1 for these ASICs.
>>  		 */
>> +	case CHIP_RAVEN:
>> +		/*
>> +		 * TODO: Raven currently fixes most SVM issues with
>> +		 * noretry = 1. However, two kfd migrate tests still
>> +		 * fail with noretry = 1. These two migrate failures
>> +		 * on Raven still need to be root caused.
>> +		 */
>>  		if (amdgpu_noretry == -1)
>>  			gmc->noretry = 1;
>>  		else
>>  			gmc->noretry = amdgpu_noretry;
>>  		break;
>> -	case CHIP_RAVEN:
>>  	default:
>> -		/* Raven currently has issues with noretry
>> -		 * regardless of what we decide for other
>> -		 * asics, we should leave raven with
>> -		 * noretry = 0 until we root cause the
>> -		 * issues.
>> -		 *
>> +		/*
>>  		 * default this to 0 for now, but we may want
>>  		 * to change this in the future for certain
>>  		 * GPUs as it can increase performance in

