[PATCH 1/2] [WIP]drm/ttm: add waiter list to prevent allocation not in order
Christian König
christian.koenig at amd.com
Wed Jan 31 10:54:07 UTC 2018
> So I think preventing validation on same place is a simpler way:
> process B bo's place is fpfn~lpfn, it will only try to evict LRU BOs
> in that range, while eviction, we just prevent those validation to
> this range(fpfn~lpfn), if out of this range, the allocation/validation
> still can be go on.
>
> Any negative?
That won't work either. The most common use of fpfn~lpfn range is to
limit a BO to visible VRAM, the other use cases are to fullfil hardware
limitations.
So blocking this would result in blocking all normal GTT and VRAM
allocations, adding a mutex to validate would have the same effect.
Regards,
Christian.
Am 31.01.2018 um 11:30 schrieb Chunming Zhou:
>
>
>
> On 2018年01月26日 22:35, Christian König wrote:
>> I just realized that a change I'm thinking about for a while would
>> solve your problem as well, but keep concurrent allocation possible.
>>
>> See ttm_mem_evict_first() unlocks the BO after evicting it:
>>> ttm_bo_del_from_lru(bo);
>>> spin_unlock(&glob->lru_lock);
>>>
>>> ret = ttm_bo_evict(bo, ctx);
>>> if (locked) {
>>> ttm_bo_unreserve(bo); <-------- here
>>> } else {
>>> spin_lock(&glob->lru_lock);
>>> ttm_bo_add_to_lru(bo);
>>> spin_unlock(&glob->lru_lock);
>>> }
>>>
>>> kref_put(&bo->list_kref, ttm_bo_release_list);
>>
>> The effect is that in your example process C can not only beat
>> process B once, but many many times because we run into a ping/pong
>> situation where B evicts resources while C moves them back in.
> For ping/pong case, I want to disable busy placement for allocation
> period, only enable it for cs bo validation.
>
>>
>> For a while now I'm thinking about dropping those reservations only
>> after the original allocation succeeded.
>>
>> The effect would be that process C can still beat process B
>> initially, but sooner or process B would evict some resources from
>> process C as well and then it can succeed with its allocation.
> If it is from process C cs validation, process B still need evict the
> resource only after process C command submission completion.
>
>>
>> The problem is for this approach to work we need to core change to
>> the ww_mutexes to be able to handle this efficiently.
> Yes, ww_mutex doesn't support this net lock, which easily deadlock
> without ticket and class.
>
> So I think preventing validation on same place is a simpler way:
> process B bo's place is fpfn~lpfn, it will only try to evict LRU BOs
> in that range, while eviction, we just prevent those validation to
> this range(fpfn~lpfn), if out of this range, the allocation/validation
> still can be go on.
>
> Any negative?
>
> Regards,
> David Zhou
>>
>> Regards,
>> Christian.
>>
>> Am 26.01.2018 um 14:59 schrieb Christian König:
>>> I know, but this has the same effect. You prevent concurrent
>>> allocation from happening.
>>>
>>> What we could do is to pipeline reusing of deleted memory as well,
>>> this makes it less likely to cause the problem you are seeing
>>> because the evicting processes doesn't need to block for deleted BOs
>>> any more.
>>>
>>> But that other processes can grab memory during eviction is
>>> intentional. Otherwise greedy processes would completely dominate
>>> command submission.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 26.01.2018 um 14:50 schrieb Zhou, David(ChunMing):
>>>> I don't want to prevent all, my new approach is to prevent the
>>>> later allocation is trying and ahead of front to get the memory
>>>> space that the front made from eviction.
>>>>
>>>>
>>>> 发自坚果 Pro
>>>>
>>>> Christian K鰊ig <ckoenig.leichtzumerken at gmail.com> 于 2018年1月26日
>>>> 下午9:24写道:
>>>>
>>>> Yes, exactly that's the problem.
>>>>
>>>> See when you want to prevent a process B from allocating the memory
>>>> process A has evicted, you need to prevent all concurrent allocation.
>>>>
>>>> And we don't do that because it causes a major performance drop.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 26.01.2018 um 14:21 schrieb Zhou, David(ChunMing):
>>>>> You patch will prevent concurrent allocation, and will result in
>>>>> allocation performance drop much.
>>>>>
>>>>> 发自坚果 Pro
>>>>>
>>>>> Christian K鰊ig <ckoenig.leichtzumerken at gmail.com> 于 2018年1月26日
>>>>> 下午9:04写道:
>>>>>
>>>>> Attached is what you actually want to do cleanly implemented. But
>>>>> as I said this is a NO-GO.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 26.01.2018 um 13:43 schrieb Christian König:
>>>>>>> After my investigation, this issue should be detect of TTM
>>>>>>> design self, which breaks scheduling balance.
>>>>>> Yeah, but again. This is indented design we can't change easily.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> Am 26.01.2018 um 13:36 schrieb Zhou, David(ChunMing):
>>>>>>> I am off work, so reply mail by phone, the format could not be text.
>>>>>>>
>>>>>>> back to topic itself:
>>>>>>> the problem indeed happen on amdgpu driver, someone reports me
>>>>>>> that application runs with two instances, the performance are
>>>>>>> different.
>>>>>>> I also reproduced the issue with unit test(bo_eviction_test).
>>>>>>> They always think our scheduler isn't working as expected.
>>>>>>>
>>>>>>> After my investigation, this issue should be detect of TTM
>>>>>>> design self, which breaks scheduling balance.
>>>>>>>
>>>>>>> Further, if we run containers for our gpu, container A could run
>>>>>>> high score, container B runs low score with same benchmark.
>>>>>>>
>>>>>>> So this is bug that we need fix.
>>>>>>>
>>>>>>> Regards,
>>>>>>> David Zhou
>>>>>>>
>>>>>>> 发自坚果 Pro
>>>>>>>
>>>>>>> Christian K鰊ig <ckoenig.leichtzumerken at gmail.com> 于 2018年1月26日
>>>>>>> 下午6:31写道:
>>>>>>>
>>>>>>> Am 26.01.2018 um 11:22 schrieb Chunming Zhou:
>>>>>>> > there is a scheduling balance issue about get node like:
>>>>>>> > a. process A allocates full memory and use it for submission.
>>>>>>> > b. process B tries to allocates memory, will wait for process
>>>>>>> A BO idle in eviction.
>>>>>>> > c. process A completes the job, process B eviction will put
>>>>>>> process A BO node,
>>>>>>> > but in the meantime, process C is comming to allocate BO,
>>>>>>> whill directly get node successfully, and do submission,
>>>>>>> > process B will again wait for process C BO idle.
>>>>>>> > d. repeat the above setps, process B could be delayed much more.
>>>>>>> >
>>>>>>> > later allocation must not be ahead of front in same place.
>>>>>>>
>>>>>>> Again NAK to the whole approach.
>>>>>>>
>>>>>>> At least with amdgpu the problem you described above never occurs
>>>>>>> because evictions are pipelined operations. We could only block for
>>>>>>> deleted regions to become free.
>>>>>>>
>>>>>>> But independent of that incoming memory requests while we make
>>>>>>> room for
>>>>>>> eviction are intended to be served first.
>>>>>>>
>>>>>>> Changing that is certainly a no-go cause that would favor memory
>>>>>>> hungry
>>>>>>> applications over small clients.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> >
>>>>>>> > Change-Id: I3daa892e50f82226c552cc008a29e55894a98f18
>>>>>>> > Signed-off-by: Chunming Zhou <david1.zhou at amd.com>
>>>>>>> > ---
>>>>>>> > drivers/gpu/drm/ttm/ttm_bo.c | 69
>>>>>>> +++++++++++++++++++++++++++++++++++++++--
>>>>>>> > include/drm/ttm/ttm_bo_api.h | 7 +++++
>>>>>>> > include/drm/ttm/ttm_bo_driver.h | 7 +++++
>>>>>>> > 3 files changed, 80 insertions(+), 3 deletions(-)
>>>>>>> >
>>>>>>> > diff --git a/drivers/gpu/drm/ttm/ttm_bo.c
>>>>>>> b/drivers/gpu/drm/ttm/ttm_bo.c
>>>>>>> > index d33a6bb742a1..558ec2cf465d 100644
>>>>>>> > --- a/drivers/gpu/drm/ttm/ttm_bo.c
>>>>>>> > +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>>>>>>> > @@ -841,6 +841,58 @@ static int ttm_bo_add_move_fence(struct
>>>>>>> ttm_buffer_object *bo,
>>>>>>> > return 0;
>>>>>>> > }
>>>>>>> >
>>>>>>> > +static void ttm_man_init_waiter(struct ttm_bo_waiter *waiter,
>>>>>>> > + struct ttm_buffer_object *bo,
>>>>>>> > + const struct ttm_place *place)
>>>>>>> > +{
>>>>>>> > + waiter->tbo = bo;
>>>>>>> > + memcpy((void *)&waiter->place, (void *)place,
>>>>>>> sizeof(*place));
>>>>>>> > + INIT_LIST_HEAD(&waiter->list);
>>>>>>> > +}
>>>>>>> > +
>>>>>>> > +static void ttm_man_add_waiter(struct ttm_mem_type_manager *man,
>>>>>>> > + struct ttm_bo_waiter *waiter)
>>>>>>> > +{
>>>>>>> > + if (!waiter)
>>>>>>> > + return;
>>>>>>> > + spin_lock(&man->wait_lock);
>>>>>>> > + list_add_tail(&waiter->list, &man->waiter_list);
>>>>>>> > + spin_unlock(&man->wait_lock);
>>>>>>> > +}
>>>>>>> > +
>>>>>>> > +static void ttm_man_del_waiter(struct ttm_mem_type_manager *man,
>>>>>>> > + struct ttm_bo_waiter *waiter)
>>>>>>> > +{
>>>>>>> > + if (!waiter)
>>>>>>> > + return;
>>>>>>> > + spin_lock(&man->wait_lock);
>>>>>>> > + if (!list_empty(&waiter->list))
>>>>>>> > + list_del(&waiter->list);
>>>>>>> > + spin_unlock(&man->wait_lock);
>>>>>>> > + kfree(waiter);
>>>>>>> > +}
>>>>>>> > +
>>>>>>> > +int ttm_man_check_bo(struct ttm_mem_type_manager *man,
>>>>>>> > + struct ttm_buffer_object *bo,
>>>>>>> > + const struct ttm_place *place)
>>>>>>> > +{
>>>>>>> > + struct ttm_bo_waiter *waiter, *tmp;
>>>>>>> > +
>>>>>>> > + spin_lock(&man->wait_lock);
>>>>>>> > + list_for_each_entry_safe(waiter, tmp, &man->waiter_list,
>>>>>>> list) {
>>>>>>> > + if ((bo != waiter->tbo) &&
>>>>>>> > + ((place->fpfn >= waiter->place.fpfn &&
>>>>>>> > + place->fpfn <= waiter->place.lpfn) ||
>>>>>>> > + (place->lpfn <= waiter->place.lpfn &&
>>>>>>> place->lpfn >=
>>>>>>> > + waiter->place.fpfn)))
>>>>>>> > + goto later_bo;
>>>>>>> > + }
>>>>>>> > + spin_unlock(&man->wait_lock);
>>>>>>> > + return true;
>>>>>>> > +later_bo:
>>>>>>> > + spin_unlock(&man->wait_lock);
>>>>>>> > + return false;
>>>>>>> > +}
>>>>>>> > /**
>>>>>>> > * Repeatedly evict memory from the LRU for @mem_type until
>>>>>>> we create enough
>>>>>>> > * space, or we've evicted everything and there isn't enough
>>>>>>> space.
>>>>>>> > @@ -853,17 +905,26 @@ static int ttm_bo_mem_force_space(struct
>>>>>>> ttm_buffer_object *bo,
>>>>>>> > {
>>>>>>> > struct ttm_bo_device *bdev = bo->bdev;
>>>>>>> > struct ttm_mem_type_manager *man = &bdev->man[mem_type];
>>>>>>> > + struct ttm_bo_waiter waiter;
>>>>>>> > int ret;
>>>>>>> >
>>>>>>> > + ttm_man_init_waiter(&waiter, bo, place);
>>>>>>> > + ttm_man_add_waiter(man, &waiter);
>>>>>>> > do {
>>>>>>> > ret = (*man->func->get_node)(man, bo, place, mem);
>>>>>>> > - if (unlikely(ret != 0))
>>>>>>> > + if (unlikely(ret != 0)) {
>>>>>>> > + ttm_man_del_waiter(man, &waiter);
>>>>>>> > return ret;
>>>>>>> > - if (mem->mm_node)
>>>>>>> > + }
>>>>>>> > + if (mem->mm_node) {
>>>>>>> > + ttm_man_del_waiter(man, &waiter);
>>>>>>> > break;
>>>>>>> > + }
>>>>>>> > ret = ttm_mem_evict_first(bdev, mem_type,
>>>>>>> place, ctx);
>>>>>>> > - if (unlikely(ret != 0))
>>>>>>> > + if (unlikely(ret != 0)) {
>>>>>>> > + ttm_man_del_waiter(man, &waiter);
>>>>>>> > return ret;
>>>>>>> > + }
>>>>>>> > } while (1);
>>>>>>> > mem->mem_type = mem_type;
>>>>>>> > return ttm_bo_add_move_fence(bo, man, mem);
>>>>>>> > @@ -1450,6 +1511,8 @@ int ttm_bo_init_mm(struct ttm_bo_device
>>>>>>> *bdev, unsigned type,
>>>>>>> > man->use_io_reserve_lru = false;
>>>>>>> > mutex_init(&man->io_reserve_mutex);
>>>>>>> > spin_lock_init(&man->move_lock);
>>>>>>> > + spin_lock_init(&man->wait_lock);
>>>>>>> > + INIT_LIST_HEAD(&man->waiter_list);
>>>>>>> > INIT_LIST_HEAD(&man->io_reserve_lru);
>>>>>>> >
>>>>>>> > ret = bdev->driver->init_mem_type(bdev, type, man);
>>>>>>> > diff --git a/include/drm/ttm/ttm_bo_api.h
>>>>>>> b/include/drm/ttm/ttm_bo_api.h
>>>>>>> > index 2cd025c2abe7..0fce4dbd02e7 100644
>>>>>>> > --- a/include/drm/ttm/ttm_bo_api.h
>>>>>>> > +++ b/include/drm/ttm/ttm_bo_api.h
>>>>>>> > @@ -40,6 +40,7 @@
>>>>>>> > #include <linux/mm.h>
>>>>>>> > #include <linux/bitmap.h>
>>>>>>> > #include <linux/reservation.h>
>>>>>>> > +#include <drm/ttm/ttm_placement.h>
>>>>>>> >
>>>>>>> > struct ttm_bo_device;
>>>>>>> >
>>>>>>> > @@ -232,6 +233,12 @@ struct ttm_buffer_object {
>>>>>>> > struct mutex wu_mutex;
>>>>>>> > };
>>>>>>> >
>>>>>>> > +struct ttm_bo_waiter {
>>>>>>> > + struct ttm_buffer_object *tbo;
>>>>>>> > + struct ttm_place place;
>>>>>>> > + struct list_head list;
>>>>>>> > +};
>>>>>>> > +
>>>>>>> > /**
>>>>>>> > * struct ttm_bo_kmap_obj
>>>>>>> > *
>>>>>>> > diff --git a/include/drm/ttm/ttm_bo_driver.h
>>>>>>> b/include/drm/ttm/ttm_bo_driver.h
>>>>>>> > index 9b417eb2df20..dc6b8b4c9e06 100644
>>>>>>> > --- a/include/drm/ttm/ttm_bo_driver.h
>>>>>>> > +++ b/include/drm/ttm/ttm_bo_driver.h
>>>>>>> > @@ -293,6 +293,10 @@ struct ttm_mem_type_manager {
>>>>>>> > bool io_reserve_fastpath;
>>>>>>> > spinlock_t move_lock;
>>>>>>> >
>>>>>>> > + /* waiters in list */
>>>>>>> > + spinlock_t wait_lock;
>>>>>>> > + struct list_head waiter_list;
>>>>>>> > +
>>>>>>> > /*
>>>>>>> > * Protected by @io_reserve_mutex:
>>>>>>> > */
>>>>>>> > @@ -748,6 +752,9 @@ int ttm_bo_mem_space(struct
>>>>>>> ttm_buffer_object *bo,
>>>>>>> > struct ttm_mem_reg *mem,
>>>>>>> > struct ttm_operation_ctx *ctx);
>>>>>>> >
>>>>>>> > +int ttm_man_check_bo(struct ttm_mem_type_manager *man,
>>>>>>> > + struct ttm_buffer_object *bo,
>>>>>>> > + const struct ttm_place *place);
>>>>>>> > void ttm_bo_mem_put(struct ttm_buffer_object *bo, struct
>>>>>>> ttm_mem_reg *mem);
>>>>>>> > void ttm_bo_mem_put_locked(struct ttm_buffer_object *bo,
>>>>>>> > struct ttm_mem_reg *mem);
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> amd-gfx mailing list
>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> dri-devel mailing list
>>>>> dri-devel at lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20180131/14aaf9f3/attachment-0001.html>
More information about the amd-gfx
mailing list