[PATCH 1/2] drm/amdgpu: return bo itself if userptr is cpu addr of bo (v3)

Zhang, Jerry (Junwei) Jerry.Zhang at amd.com
Thu Aug 2 06:20:16 UTC 2018


On 08/02/2018 08:00 AM, Marek Olšák wrote:
> On Wed, Aug 1, 2018 at 2:29 PM, Christian König
> <christian.koenig at amd.com> wrote:
>> Am 01.08.2018 um 19:59 schrieb Marek Olšák:
>>>
>>> On Wed, Aug 1, 2018 at 1:52 PM, Christian König
>>> <christian.koenig at amd.com> wrote:
>>>>
>>>> Am 01.08.2018 um 19:39 schrieb Marek Olšák:
>>>>>
>>>>> On Wed, Aug 1, 2018 at 2:32 AM, Christian König
>>>>> <christian.koenig at amd.com> wrote:
>>>>>>
>>>>>> Am 01.08.2018 um 00:07 schrieb Marek Olšák:
>>>>>>>
>>>>>>> Can this be implemented as a wrapper on top of libdrm? So that the
>>>>>>> tree (or hash table) isn't created for UMDs that don't need it.
>>>>>>
>>>>>>
>>>>>> No, the problem is that an application gets a CPU pointer from one API
>>>>>> and
>>>>>> tries to import that pointer into another one.
>>>>>>
>>>>>> In other words, we need to implement this independently of which UMD
>>>>>> mapped the BO.
>>>>>
>>>>> Yeah, it could be an optional feature of libdrm, and other components
>>>>> should be able to disable it to remove the overhead.
>>>>
>>>>
>>>> The overhead is negligible, the real problem is the memory footprint.
>>>>
>>>> A brief look at the hash implementation in libdrm showed that this is
>>>> actually really inefficient.
>>>>
>>>> I think we have the choice of implementing an r/b tree to map the
>>>> CPU pointer addresses or a quadratic tree to map the handles.
>>>>
>>>> The latter is easy to do and would also allow us to get rid of the
>>>> hash table.
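
For illustration, a minimal sketch of the second option, assuming dense GEM
handles: a fixed-fanout, radix-style tree where each level indexes a few bits
of the handle, so lookups take a constant number of steps and nothing ever
needs rebalancing. None of these names exist in libdrm; they are made up for
the example.

/* Illustrative sketch only -- not existing libdrm code.  A fixed-fanout
 * tree indexed by dense GEM handles: each level consumes FANOUT_SHIFT
 * bits of the handle, so lookups are O(LEVELS) with no rebalancing. */
#include <stdint.h>
#include <stdlib.h>

#define FANOUT_SHIFT 8                     /* 256 slots per level */
#define FANOUT       (1u << FANOUT_SHIFT)
#define LEVELS       3                     /* covers 24-bit handles */

struct handle_node {
    void *slot[FANOUT];                    /* children, or values at level 0 */
};

/* Return the value stored for 'handle', or NULL if absent. */
static void *handle_tree_lookup(struct handle_node *root, uint32_t handle)
{
    struct handle_node *node = root;

    for (int level = LEVELS - 1; node && level > 0; level--)
        node = node->slot[(handle >> (level * FANOUT_SHIFT)) & (FANOUT - 1)];

    return node ? node->slot[handle & (FANOUT - 1)] : NULL;
}

/* Store 'value' for 'handle', allocating interior nodes on demand. */
static int handle_tree_insert(struct handle_node *root, uint32_t handle,
                              void *value)
{
    struct handle_node *node = root;

    for (int level = LEVELS - 1; level > 0; level--) {
        unsigned idx = (handle >> (level * FANOUT_SHIFT)) & (FANOUT - 1);

        if (!node->slot[idx]) {
            node->slot[idx] = calloc(1, sizeof(struct handle_node));
            if (!node->slot[idx])
                return -1;
        }
        node = node->slot[idx];
    }
    node->slot[handle & (FANOUT - 1)] = value;
    return 0;
}

Since the handles are dense, a flat growable array would work too; the tree
merely bounds worst-case memory when the handle range has holes.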
>>>
>>> We can also use the hash table from mesa/src/util.
>>>
>>> I don't think the overhead would be negligible. It would be a log(n)
>>> insertion in bo_map and a log(n) deletion in bo_unmap. If you did
>>> bo_map+bo_unmap 10000 times, would it be negligible?
>>
>>
>> Compared to what the kernel needs to do for updating the page tables, it
>> is less than 1% of the total work.
>>
>> The real question is if it wouldn't be simpler to use a tree for the
>> handles. Since the handles are dense you can just use an unbalanced tree
>> which is really easy.
>>
>> For a tree of the CPU mappings we would need an r/b interval tree, which
>> is hard to implement and quite a bit of overkill.
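
To make concrete why the CPU-mapping side is harder: a lookup must match any
address inside [cpu_addr, cpu_addr + size), not just the start of the range,
so a plain hash on the pointer value cannot answer it. Below is a minimal
stand-in sketch using a sorted array plus binary search instead of a full
r/b interval tree; all struct and function names are hypothetical.

/* Illustrative sketch only.  A CPU-pointer lookup needs interval
 * semantics: the query address may point anywhere inside
 * [cpu_addr, cpu_addr + size), so hashing the pointer value alone is
 * not enough.  A sorted array stands in for the r/b interval tree. */
#include <stddef.h>
#include <stdint.h>

struct cpu_mapping {
    uintptr_t cpu_addr;    /* start of the mmap'ed range */
    size_t    size;        /* length of the range */
    uint32_t  bo_handle;   /* GEM handle backing this range */
};

/* 'maps' is sorted by cpu_addr; ranges do not overlap. */
static struct cpu_mapping *find_mapping(struct cpu_mapping *maps, int count,
                                        uintptr_t addr)
{
    int lo = 0, hi = count - 1;

    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;

        if (addr < maps[mid].cpu_addr)
            hi = mid - 1;
        else if (addr >= maps[mid].cpu_addr + maps[mid].size)
            lo = mid + 1;
        else
            return &maps[mid];  /* addr falls inside this range */
    }
    return NULL;
}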
>>
>> Do you have any numbers how many BOs really get a CPU mapping in a real
>> world application?
>
> Without our suballocator, we sometimes exceeded the max. mmap limit
> (~64K). It should be much lower with the suballocator and its 128KB
> slabs, probably a few thousand.

Is there a way to verify whether there would be a performance issue if we
moved the CPU-mapping lookup from the kernel side into a libdrm hash table?
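
As a strawman for such an experiment, a userspace lookup table could be
sketched with libdrm's public drmHash API from xf86drm.h, inserting one entry
per page of each mapping so that interior pointers resolve as well. The bo
type, names, and page size here are assumptions, and error handling is
omitted; this is not how libdrm currently tracks mappings.

/* Rough sketch of the experiment using libdrm's public drmHash API.
 * One entry per page lets interior pointers resolve, but a 128KB slab
 * already costs 32 entries -- the footprint concern raised above. */
#include <xf86drm.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_PAGE_SIZE 4096ul

static void *cpu_map_table;  /* created once with drmHashCreate() */

static void track_mapping(void *cpu_addr, size_t size, void *bo)
{
    uintptr_t start = (uintptr_t)cpu_addr & ~(MAP_PAGE_SIZE - 1);
    uintptr_t end = (uintptr_t)cpu_addr + size;

    for (uintptr_t page = start; page < end; page += MAP_PAGE_SIZE)
        drmHashInsert(cpu_map_table, page, bo);
}

static void *lookup_mapping(void *cpu_addr)
{
    void *bo = NULL;

    drmHashLookup(cpu_map_table,
                  (uintptr_t)cpu_addr & ~(MAP_PAGE_SIZE - 1), &bo);
    return bo;
}

The per-page entries are exactly where the memory footprint bites: a few
thousand 128KB slabs would already mean tens of thousands of hash entries.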

AFAIK, only one game will use this, with the closed-source OpenGL driver.

Regards,
Jerry

>
> Marek
>

