Failed to find memory space for buffer eviction

Christian König ckoenig.leichtzumerken at gmail.com
Mon Jul 20 09:25:15 UTC 2020


On 16.07.20 at 19:05, Felix Kuehling wrote:
> On 2020-07-16 at 2:58 a.m., Christian König wrote:
>> On 15.07.20 at 17:14, Felix Kuehling wrote:
>>> On 2020-07-15 at 5:28 a.m., Christian König wrote:
>>> [SNIP]
>>>>>> What could be problematic and result in an overrun is that TTM was
>>>>>> buggy and called put_node twice for the same memory.
>>>>>>
>>>>>> So I've seen that the code needs fixing as well, but I'm not 100%
>>>>>> sure how you ran into your problem.
>>>>> This is in the KFD eviction test, which deliberately overcommits
>>>>> VRAM in order to trigger lots of evictions. It will use some GTT
>>>>> space while BOs are evicted. But shouldn't it move them further out
>>>>> of GTT and into SYSTEM to free up GTT space?
>>>> Yes, exactly that should happen.
>>>>
>>>> But for some reason it couldn't find a candidate to evict and the
>>>> 14371 pages left are just a bit too small for the buffer.
>>> That would be a nested eviction. A VRAM to GTT eviction requires a GTT
>>> to SYSTEM eviction to make space in GTT. Is that even possible?
>> Yes, this is the core of the TTM design problem which I talked about
>> in my FOSDEM presentation in February.
>>
>> Question: do we still have this crude workaround that KFD is not taking
>> all reservations of the current process when allocating new BOs?
> Not sure if you're referring to the workarounds we had to remove
> eviction fences from reservations temporarily. Those are all gone. We're
> making full use of the sync-object fence owner logic to avoid triggering
> eviction fences unintentionally.

I was talking about this check here in amdgpu_ttm_bo_eviction_valuable():
>         /* If bo is a KFD BO, check if the bo belongs to the current process.
>          * If true, then return false as any KFD process needs all its BOs to
>          * be resident to run successfully
>          */
>         flist = dma_resv_get_list(bo->base.resv);
>         if (flist) {
>                 for (i = 0; i < flist->shared_count; ++i) {
>                         f = rcu_dereference_protected(flist->shared[i],
>                                 dma_resv_held(bo->base.resv));
>                         if (amdkfd_fence_check_mm(f, current->mm))
>                                 return false;
>                 }
>         }

What can happen is that the allocating process owns too much of GTT as
well, and as a result we can't evict anything from GTT to make room for
the VRAM eviction.
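
To make the failure mode concrete, here is a minimal userspace sketch of
it. This is not the TTM or amdgpu code: struct model_bo, the model_*
helpers, the integer owner ids and the 16384-page request are all
hypothetical and only mirror the idea of the check quoted above, where a
BO carrying an eviction fence of the current process is never treated as
an eviction candidate.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model. None of these names exist in the kernel. */
enum model_domain { MODEL_SYSTEM, MODEL_GTT, MODEL_VRAM };

struct model_bo {
    int owner;                /* process whose eviction fence is on the BO */
    enum model_domain domain; /* where the BO currently lives */
    size_t pages;             /* size of the BO in pages */
};

/* Same idea as the quoted check: BOs of the current process are skipped. */
static bool model_eviction_valuable(const struct model_bo *bo, int current_owner)
{
    return bo->owner != current_owner;
}

/* Move evictable BOs from 'from' to 'to' until 'needed' pages are freed.
 * Returns the number of pages actually freed. */
static size_t model_evict(struct model_bo *bos, size_t n,
                          enum model_domain from, enum model_domain to,
                          int current_owner, size_t needed)
{
    size_t freed = 0;

    for (size_t i = 0; i < n && freed < needed; ++i) {
        if (bos[i].domain != from)
            continue;
        if (!model_eviction_valuable(&bos[i], current_owner))
            continue;
        bos[i].domain = to;
        freed += bos[i].pages;
    }
    return freed;
}

/* A VRAM -> GTT eviction first needs room in GTT, which may in turn
 * require a GTT -> SYSTEM eviction. Returns true if room could be made. */
static bool model_make_gtt_space(struct model_bo *bos, size_t n,
                                 size_t gtt_free, int current_owner,
                                 size_t needed)
{
    if (gtt_free >= needed)
        return true;

    size_t freed = model_evict(bos, n, MODEL_GTT, MODEL_SYSTEM,
                               current_owner, needed - gtt_free);
    return gtt_free + freed >= needed;
}
```

In this model, with 14371 free GTT pages, a 16384-page request and every
GTT BO owned by the allocating process, model_make_gtt_space() finds no
candidate and fails. As soon as one GTT BO belongs to another process it
gets moved to SYSTEM and the request fits.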

> I don't know why we would need to take all reservations when we allocate
> a new BO. I'm probably misunderstanding you.

Taking all reservations when you change the set of BOs allocated in a 
working context is mandatory for correct operation.

I've already noted multiple times that working around this the way we
currently do is just a hack, and what you see here is one of its symptoms.

Regards,
Christian.

>
> Regards,
>    Felix
>
>
>> That could maybe cause this as well.
>>
>> Regards,
>> Christian.
>>


