[PATCH] drm/amdgpu: check vm bo eviction valuable at last

Qiang Yu yuq825 at gmail.com
Thu Feb 17 10:58:06 UTC 2022


On Thu, Feb 17, 2022 at 6:39 PM Christian König
<christian.koenig at amd.com> wrote:
>
>
>
> Am 17.02.22 um 11:13 schrieb Qiang Yu:
> > On Thu, Feb 17, 2022 at 5:46 PM Christian König
> > <christian.koenig at amd.com> wrote:
> >> Am 17.02.22 um 10:40 schrieb Qiang Yu:
> >>> On Thu, Feb 17, 2022 at 5:15 PM Christian König
> >>> <christian.koenig at amd.com> wrote:
> >>>> Am 17.02.22 um 10:04 schrieb Qiang Yu:
> >>>>> Workstation application ANSA/META get this error dmesg:
> >>>>> [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-16)
> >>>>>
> >>>>> This is caused by:
> >>>>> 1. create a 256MB buffer in invisible VRAM
> >>>>> 2. CPU map the buffer and access it causes vm_fault and try to move
> >>>>>       it to visible VRAM
> >>>>> 3. force visible VRAM space and traverse all VRAM bos to check if
> >>>>>       evicting this bo is valuable
> >>>>> 4. when checking a VM bo (in invisible VRAM), amdgpu_vm_evictable()
> >>>>>       will set amdgpu_vm->evicting, but latter due to not in visible
> >>>>>       VRAM, won't really evict it so not add it to amdgpu_vm->evicted
> >>>>> 5. before next CS to clear the amdgpu_vm->evicting, user VM ops
> >>>>>       ioctl will pass amdgpu_vm_ready() (check amdgpu_vm->evicted)
> >>>>>       but fail in amdgpu_vm_bo_update_mapping() (check
> >>>>>       amdgpu_vm->evicting) and get this error log
> >>>>>
> >>>>> This error won't affect functionality as next CS will finish the
> >>>>> waiting VM ops. But we'd better make the amdgpu_vm->evicting
> >>>>> correctly reflact the vm status and clear the error log.
> >>>> Well NAK, that is intentional behavior.
> >>>>
> >>>> The VM page tables where considered for eviction, so setting the flag is
> >>>> correct even when the page tables later on are not actually evicted.
> >>>>
> >>> But this will unnecessarily stop latter user VM ops in ioctl before CS
> >>> even when the VM bos are not evicted.
> >>> Won't this have any negative effect when could do better?
> >> No, this will have a positive effect. See the VM was already considered
> >> for eviction because it is idle.
> >>
> >> Updating it immediately doesn't necessarily make sense, we should wait
> >> with that until its next usage.
> >>
> >> Additional to that this patch doesn't really fix the problem, it just
> >> mitigates it.
> >>
> >> Eviction can fail later on for a couple of reasons and we absolutely
> >> need to check the flag instead of the list in amdgpu_vm_ready().
> > The flag only for both flag and list? Looks like should be both as
> > the list indicate some vm page table need to be updated and could
> > delay the user update with the same logic as you described above.
>
> I think checking the flag should be enough. The issue is that the list
> was there initially, but to avoid race conditions we added the flag with
> separate lock protection later on.
>
But list and flag does not align always, there are cases like
list-empty/flag-set (this problem) and list-non-empty/flag-unset (non-vm bo
eviction). If only check flag list-non-empty/flag-unset change behavior.

Regards,
Qiang

> Regards,
> Christian.
>
> >
> > Regards,
> > Qiang
> >
> >> Regards,
> >> Christian.
> >>
> >>> Regards,
> >>> Qiang
> >>>
> >>>> What we should rather do is to fix amdgpu_vm_ready() to take a look at
> >>>> the flag instead of the linked list.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>> Signed-off-by: Qiang Yu <qiang.yu at amd.com>
> >>>>> ---
> >>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 85 ++++++++++++++-----------
> >>>>>     1 file changed, 47 insertions(+), 38 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> >>>>> index 5a32ee66d8c8..88a27911054f 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> >>>>> @@ -1306,45 +1306,11 @@ uint64_t amdgpu_ttm_tt_pte_flags(struct amdgpu_device *adev, struct ttm_tt *ttm,
> >>>>>         return flags;
> >>>>>     }
> >>>>>
> >>>>> -/*
> >>>>> - * amdgpu_ttm_bo_eviction_valuable - Check to see if we can evict a buffer
> >>>>> - * object.
> >>>>> - *
> >>>>> - * Return true if eviction is sensible. Called by ttm_mem_evict_first() on
> >>>>> - * behalf of ttm_bo_mem_force_space() which tries to evict buffer objects until
> >>>>> - * it can find space for a new object and by ttm_bo_force_list_clean() which is
> >>>>> - * used to clean out a memory space.
> >>>>> - */
> >>>>> -static bool amdgpu_ttm_bo_eviction_valuable(struct ttm_buffer_object *bo,
> >>>>> -                                         const struct ttm_place *place)
> >>>>> +static bool amdgpu_ttm_mem_eviction_valuable(struct ttm_buffer_object *bo,
> >>>>> +                                          const struct ttm_place *place)
> >>>>>     {
> >>>>>         unsigned long num_pages = bo->resource->num_pages;
> >>>>>         struct amdgpu_res_cursor cursor;
> >>>>> -     struct dma_resv_list *flist;
> >>>>> -     struct dma_fence *f;
> >>>>> -     int i;
> >>>>> -
> >>>>> -     /* Swapout? */
> >>>>> -     if (bo->resource->mem_type == TTM_PL_SYSTEM)
> >>>>> -             return true;
> >>>>> -
> >>>>> -     if (bo->type == ttm_bo_type_kernel &&
> >>>>> -         !amdgpu_vm_evictable(ttm_to_amdgpu_bo(bo)))
> >>>>> -             return false;
> >>>>> -
> >>>>> -     /* If bo is a KFD BO, check if the bo belongs to the current process.
> >>>>> -      * If true, then return false as any KFD process needs all its BOs to
> >>>>> -      * be resident to run successfully
> >>>>> -      */
> >>>>> -     flist = dma_resv_shared_list(bo->base.resv);
> >>>>> -     if (flist) {
> >>>>> -             for (i = 0; i < flist->shared_count; ++i) {
> >>>>> -                     f = rcu_dereference_protected(flist->shared[i],
> >>>>> -                             dma_resv_held(bo->base.resv));
> >>>>> -                     if (amdkfd_fence_check_mm(f, current->mm))
> >>>>> -                             return false;
> >>>>> -             }
> >>>>> -     }
> >>>>>
> >>>>>         switch (bo->resource->mem_type) {
> >>>>>         case AMDGPU_PL_PREEMPT:
> >>>>> @@ -1377,10 +1343,53 @@ static bool amdgpu_ttm_bo_eviction_valuable(struct ttm_buffer_object *bo,
> >>>>>                 return false;
> >>>>>
> >>>>>         default:
> >>>>> -             break;
> >>>>> +             return ttm_bo_eviction_valuable(bo, place);
> >>>>>         }
> >>>>> +}
> >>>>>
> >>>>> -     return ttm_bo_eviction_valuable(bo, place);
> >>>>> +/*
> >>>>> + * amdgpu_ttm_bo_eviction_valuable - Check to see if we can evict a buffer
> >>>>> + * object.
> >>>>> + *
> >>>>> + * Return true if eviction is sensible. Called by ttm_mem_evict_first() on
> >>>>> + * behalf of ttm_bo_mem_force_space() which tries to evict buffer objects until
> >>>>> + * it can find space for a new object and by ttm_bo_force_list_clean() which is
> >>>>> + * used to clean out a memory space.
> >>>>> + */
> >>>>> +static bool amdgpu_ttm_bo_eviction_valuable(struct ttm_buffer_object *bo,
> >>>>> +                                         const struct ttm_place *place)
> >>>>> +{
> >>>>> +     struct dma_resv_list *flist;
> >>>>> +     struct dma_fence *f;
> >>>>> +     int i;
> >>>>> +
> >>>>> +     /* Swapout? */
> >>>>> +     if (bo->resource->mem_type == TTM_PL_SYSTEM)
> >>>>> +             return true;
> >>>>> +
> >>>>> +     /* If bo is a KFD BO, check if the bo belongs to the current process.
> >>>>> +      * If true, then return false as any KFD process needs all its BOs to
> >>>>> +      * be resident to run successfully
> >>>>> +      */
> >>>>> +     flist = dma_resv_shared_list(bo->base.resv);
> >>>>> +     if (flist) {
> >>>>> +             for (i = 0; i < flist->shared_count; ++i) {
> >>>>> +                     f = rcu_dereference_protected(flist->shared[i],
> >>>>> +                             dma_resv_held(bo->base.resv));
> >>>>> +                     if (amdkfd_fence_check_mm(f, current->mm))
> >>>>> +                             return false;
> >>>>> +             }
> >>>>> +     }
> >>>>> +
> >>>>> +     /* Check by different mem type. */
> >>>>> +     if (!amdgpu_ttm_mem_eviction_valuable(bo, place))
> >>>>> +             return false;
> >>>>> +
> >>>>> +     /* VM bo should be checked at last because it will mark VM evicting. */
> >>>>> +     if (bo->type == ttm_bo_type_kernel)
> >>>>> +             return amdgpu_vm_evictable(ttm_to_amdgpu_bo(bo));
> >>>>> +
> >>>>> +     return true;
> >>>>>     }
> >>>>>
> >>>>>     static void amdgpu_ttm_vram_mm_access(struct amdgpu_device *adev, loff_t pos,
>


More information about the dri-devel mailing list