[PATCH v3] drm/amd/amdgpu: set the default value of noretry to 1 for some dGPUs

Felix Kuehling felix.kuehling at amd.com
Mon Nov 30 19:31:30 UTC 2020


Another related thought: I think some chips failed VM fault tests with
noretry=0 because of a dependency on IH rerouting of retry faults.
Christian fixed this dependency recently:

commit 849c62248ee84c1e304a9ce2f673c79e23f29bf9
Author: Christian König <christian.koenig at amd.com>
Date:   Sat Oct 31 18:39:54 2020 +0100

    drm/amdgpu: enabled software IH ring for Vega

    Seems like we won't get the hardware IH1/2 rings on Vega20 working.

    Signed-off-by: Christian König <christian.koenig at amd.com>
    Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>

 drivers/gpu/drm/amd/amdgpu/vega10_ih.c | 7 +++++++
 1 file changed, 7 insertions(+)

commit 198237744d85c4a23914de56d78fba0acf5a2803
Author: Christian König <christian.koenig at amd.com>
Date:   Tue Nov 3 14:22:50 2020 +0100

    drm/amdgpu: enabled software IH ring for Navi

    Felix pointed out that we need this for Navi as well.

    Signed-off-by: Christian König <christian.koenig at amd.com>
    Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>

 drivers/gpu/drm/amd/amdgpu/navi10_ih.c | 7 +++++++
 1 file changed, 7 insertions(+)

So it should now be safe to enable retry faults on most chips. Only
GFXv9 may see a performance advantage from disabling retry.
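
To make the suggestion concrete, here is a minimal sketch of how the default
could be resolved. It is not the driver code: the enum and the function name
are hypothetical, and defaulting GFXv9 to noretry=1 is only an assumption
about where the possible performance advantage might justify it. The -1
"auto" convention follows the amdgpu_noretry handling in the quoted patch.

```c
/* Hypothetical stand-ins for the driver's ASIC identifiers. */
enum gfx_family {
	FAMILY_GFX9,	/* e.g. Vega-class parts */
	FAMILY_NAVI,	/* Navi and later */
};

/*
 * Resolve the effective noretry setting.
 * param == -1 means "auto"; any other value is an explicit user
 * override and is returned unchanged, as in the quoted patch.
 */
static int resolve_noretry(int param, enum gfx_family family)
{
	if (param != -1)
		return param;

	switch (family) {
	case FAMILY_GFX9:
		/* Assumption: disable retry only where it may help performance. */
		return 1;
	case FAMILY_NAVI:
	default:
		/* No performance benefit from disabling retry: keep retry faults on. */
		return 0;
	}
}
```

With this shape, an explicit noretry=0 or noretry=1 from the user always wins,
and only the automatic case depends on the chip family.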

Regards,
  Felix

Am 2020-11-30 um 12:35 p.m. schrieb Felix Kuehling:
> As I stated elsewhere, I would recommend noretry=0 for Navi and later
> GPUs because there is no performance advantage from disabling retry on
> those GPUs.
>
>
> Regards,
>   Felix
>
>
> Am 2020-11-30 um 12:22 p.m. schrieb Deucher, Alexander:
>> [AMD Public Use]
>>
>>
>> We need to figure out what the root cause is then.  If we can't figure
>> it out soon, we should revert the change for navi1x and continue to
>> debug it until we can find the root cause and we can safely re-enable it.
>>
>> Alex
>> ------------------------------------------------------------------------
>> *From:* Chen, Guchun <Guchun.Chen at amd.com>
>> *Sent:* Sunday, November 29, 2020 2:22 AM
>> *To:* Bas Nieuwenhuizen <bas at basnieuwenhuizen.nl>; Kuehling, Felix
>> <Felix.Kuehling at amd.com>
>> *Cc:* Gui, Jack <Jack.Gui at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>;
>> amd-gfx mailing list <amd-gfx at lists.freedesktop.org>; Huang, Ray
>> <Ray.Huang at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>;
>> Zhang, Hawking <Hawking.Zhang at amd.com>
>> *Subject:* RE: [PATCH v3] drm/amd/amdgpu: set the default value of
>> noretry to 1 for some dGPUs
>>  
>> [AMD Public Use]
>>
>> Hi Bas Nieuwenhuizen,
>>
>> I don't think a direct revert is the right approach, though it would
>> fix your problem.  noretry=0 will cause other tests to fail on several
>> ASICs.
>>
>> Regards,
>> Guchun
>>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Bas
>> Nieuwenhuizen
>> Sent: Sunday, November 29, 2020 8:38 AM
>> To: Kuehling, Felix <Felix.Kuehling at amd.com>
>> Cc: Gui, Jack <Jack.Gui at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>;
>> Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx mailing list
>> <amd-gfx at lists.freedesktop.org>; Huang, Ray <Ray.Huang at amd.com>;
>> Deucher, Alexander <Alexander.Deucher at amd.com>; Zhang, Hawking
>> <Hawking.Zhang at amd.com>
>> Subject: Re: [PATCH v3] drm/amd/amdgpu: set the default value of
>> noretry to 1 for some dGPUs
>>
>> Can we revert this patch to fix
>> https://gitlab.freedesktop.org/drm/amd/-/issues/1374 ?
>>
>> On Thu, Oct 15, 2020 at 4:30 PM Felix Kuehling
>> <felix.kuehling at amd.com> wrote:
>>> Am 2020-10-14 um 11:35 p.m. schrieb Chengming Gui:
>>>> noretry = 0 causes kfd page fault tests to fail on some dGPUs, so set
>>>> noretry to 1 for these ASICs:
>>>> vega20/navi10/navi14/ARCTURUS
>>>>
>>>> v2: merge raven and default case due to the same setting
>>>> v3: remove ARCTURUS
>>>>
>>>> Signed-off-by: Chengming Gui <Jack.Gui at amd.com>
>>>> Change-Id: I3be70f463a49b0cd5c56456431d6c2cb98b13872
>>> Acked-by: Felix Kuehling <Felix.Kuehling at amd.com>
>>>
>>>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 23 +++++++++++++++--------
>>>>   1 file changed, 15 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>> index 36604d751d62..f26eb4e54b12 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>> @@ -425,20 +425,27 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
>>>>        struct amdgpu_gmc *gmc = &adev->gmc;
>>>>
>>>>        switch (adev->asic_type) {
>>>> -     case CHIP_RAVEN:
>>>> -             /* Raven currently has issues with noretry
>>>> -              * regardless of what we decide for other
>>>> -              * asics, we should leave raven with
>>>> -              * noretry = 0 until we root cause the
>>>> -              * issues.
>>>> +     case CHIP_VEGA20:
>>>> +     case CHIP_NAVI10:
>>>> +     case CHIP_NAVI14:
>>>> +             /*
>>>> +              * noretry = 0 will cause kfd page fault tests fail
>>>> +              * for some ASICs, so set default to 1 for these ASICs.
>>>>                 */
>>>>                if (amdgpu_noretry == -1)
>>>> -                     gmc->noretry = 0;
>>>> +                     gmc->noretry = 1;
>>>>                else
>>>>                        gmc->noretry = amdgpu_noretry;
>>>>                break;
>>>> +     case CHIP_RAVEN:
>>>>        default:
>>>> -             /* default this to 0 for now, but we may want
>>>> +             /* Raven currently has issues with noretry
>>>> +              * regardless of what we decide for other
>>>> +              * asics, we should leave raven with
>>>> +              * noretry = 0 until we root cause the
>>>> +              * issues.
>>>> +              *
>>>> +              * default this to 0 for now, but we may want
>>>>                 * to change this in the future for certain
>>>>                 * GPUs as it can increase performance in
>>>>                 * certain cases.
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

