[PATCH] drm/amdkfd: enable heavy-weight TLB flush on Arcturus

Wed Jan 19 15:00:41 UTC 2022


On 2022-01-19 09:50, Russell, Kent wrote:
> [AMD Official Use Only]
>
>> -----Original Message-----
>> From: Kuehling, Felix <Felix.Kuehling at amd.com>
>> Sent: Tuesday, January 18, 2022 7:16 PM
>> To: Russell, Kent <Kent.Russell at amd.com>; Huang, JinHuiEric
>> <JinHuiEric.Huang at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdkfd: enable heavy-weight TLB flush on Arcturus
>>
>> Am 2022-01-18 um 7:08 p.m. schrieb Russell, Kent:
>>> One question inline
>>>
>>>
>>> *KENT RUSSELL***
>>>
>>> Sr. Software Engineer | Linux Compute Kernel
>>>
>>> 1 Commerce Valley Drive East
>>>
>>> Markham, ON L3T 7X6
>>>
>>> *O*+(1) 289-695-2122**| Ext 72122
>>>
>>>
>>> ------------------------------------------------------------------------
>>> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
>>> Felix Kuehling <felix.kuehling at amd.com>
>>> *Sent:* Tuesday, January 18, 2022 6:36 PM
>>> *To:* Huang, JinHuiEric <JinHuiEric.Huang at amd.com>;
>>> amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
>>> *Subject:* Re: [PATCH] drm/amdkfd: enable heavy-weight TLB flush on
>>> Arcturus
>>>
>>> Am 2022-01-18 um 5:45 p.m. schrieb Eric Huang:
>>>> SDMA FW fixes the hang issue for adding heavy-weight TLB
>>>> flush on Arcturus, so we can enable it.
>>>>
>>>> Signed-off-by: Eric Huang <jinhuieric.huang at amd.com>
>>> Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>
>>>
>>>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c |  6 ------
>>>>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c         | 10 ++++++++--
>>>>   2 files changed, 8 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> index a64cbbd943ba..acb4fd973e60 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> @@ -1892,12 +1892,6 @@ int amdgpu_amdkfd_gpuvm_map_memory_to_gpu(
>>>>                                 true);
>>>>         ret = unreserve_bo_and_vms(&ctx, false, false);
>>>>
>>>> -     /* Only apply no TLB flush on Aldebaran to
>>>> -      * workaround regressions on other Asics.
>>>> -      */
>>>> -     if (table_freed && (adev->asic_type != CHIP_ALDEBARAN))
>>>> -             *table_freed = true;
>>>> -
>>>>         goto out;
>>>>
>>>>   out_unreserve:
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>>> index b570c0454ce9..485d4c52c7de 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>>> @@ -1596,6 +1596,12 @@ static int
>>> kfd_ioctl_free_memory_of_gpu(struct file *filep,
>>>>         return ret;
>>>>   }
>>>>
>>>> +static bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev) {
>>>> +     return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2)
>>> Do we need to add a check for sdma ver >=8 here?
>> What's the significance of version 8 for Aldebaran? This code was
>> working on Aldebaran without a version check before. Did we ever
>> publicly release an SDMA firmware older than version 8 that for Aldebaran?
> We released v5 for Aldebaran SDMA in ROCm 4.5 . If I remember the ticket correctly, the same fix for Arcturus was required for Aldebaran and was part of SDMA v8. But Eric is obviously watching the ticket more closely than I, so I'll defer to him there.
Yes. Aldebaran has the same bug as Arcturus in SDMA, but the bug doesn't 
cause GPU hang on Aldebaran. As Felix said heavy-weight TLB flush have 
been working on Aldebaran since it was enabled, so we don't need to 
check the version for it.

Regards,
Eric
>
>   Kent
>
>> Regards,
>>    Felix
>>
>>
>>> ||
>>>> +            (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) &&
>>>> +             dev->adev->sdma.instance[0].fw_version >= 18);
>>>> +}
>>>> +
>>>>   static int kfd_ioctl_map_memory_to_gpu(struct file *filep,
>>>>                                         struct kfd_process *p, void
>>> *data)
>>>>   {
>>>> @@ -1692,7 +1698,7 @@ static int kfd_ioctl_map_memory_to_gpu(struct
>>> file *filep,
>>>>         }
>>>>
>>>>         /* Flush TLBs after waiting for the page table updates to
>>> complete */
>>>> -     if (table_freed) {
>>>> +     if (table_freed || !kfd_flush_tlb_after_unmap(dev)) {
>>>>                 for (i = 0; i < args->n_devices; i++) {
>>>>                         peer = kfd_device_by_id(devices_arr[i]);
>>>>                         if (WARN_ON_ONCE(!peer))
>>>> @@ -1806,7 +1812,7 @@ static int
>>> kfd_ioctl_unmap_memory_from_gpu(struct file *filep,
>>>>         }
>>>>         mutex_unlock(&p->mutex);
>>>>
>>>> -     if (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2)) {
>>>> +     if (kfd_flush_tlb_after_unmap(dev)) {
>>>>                 err = amdgpu_amdkfd_gpuvm_sync_memory(dev->adev,
>>>>                                 (struct kgd_mem *) mem, true);
>>>>                 if (err) {