[PATCH libdrm] amdgpu: add a faster BO list API

Marek Olšák maraeo at gmail.com
Thu Jan 10 11:41:16 UTC 2019


On Thu, Jan 10, 2019, 4:15 AM Koenig, Christian <Christian.Koenig at amd.com
wrote:

> Am 10.01.19 um 00:39 schrieb Marek Olšák:
>
> On Wed, Jan 9, 2019 at 1:41 PM Christian König <
> ckoenig.leichtzumerken at gmail.com> wrote:
>
>> Am 09.01.19 um 17:14 schrieb Marek Olšák:
>>
>> On Wed, Jan 9, 2019 at 8:09 AM Christian König <
>> ckoenig.leichtzumerken at gmail.com> wrote:
>>
>>> Am 09.01.19 um 13:36 schrieb Marek Olšák:
>>>
>>>
>>>
>>> On Wed, Jan 9, 2019, 5:28 AM Christian König <
>>> ckoenig.leichtzumerken at gmail.com wrote:
>>>
>>>> Looks good, but I'm wondering what's the actual improvement?
>>>>
>>>
>>> No malloc calls and 1 less for loop copying the bo list.
>>>
>>>
>>> Yeah, but didn't we want to get completely rid of the bo list?
>>>
>>
>> If we have multiple IBs (e.g. gfx + compute) that share a BO list, I
>> think it's faster to send the BO list to the kernel only once.
>>
>>
>> That's not really faster.
>>
>> The only thing we safe us is a single loop over all BOs to lockup the
>> handle into a pointer and that is only a tiny fraction of the overhead.
>>
>> The majority of the overhead is locking the BOs and reserving space for
>> the submission.
>>
>> What could really help here is to submit gfx+comput together in just one
>> CS IOCTL. This way we would need the locking and space reservation only
>> once.
>>
>> It's a bit of work in the kernel side, but certainly doable.
>>
>
> OK. Any objections to this patch?
>
>
> In general I'm wondering if we couldn't avoid adding so much new interface.
>

There are Vulkan drivers that still use the bo_list interface.


> For example we can avoid the malloc() when we just cache the last freed
> bo_list structure in the device. We would just need an atomic pointer
> exchange operation for that.
>

> This way we even don't need to change mesa at all.
>

There is still the for loop that we need to get rid of.


> Regarding optimization, this chunk can be replaced by a cast on 64bit:
>
> +	chunk_array = alloca(sizeof(uint64_t) * num_chunks);
> +	for (i = 0; i < num_chunks; i++)
> +		chunk_array[i] = (uint64_t)(uintptr_t)&chunks[i];
>
> It can't. The input is an array of structures. The ioctl takes an array of
pointers.

Marek


> Regards,
> Christian.
>
>
> Thanks,
> Marek
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20190110/721dd3f9/attachment.html>


More information about the amd-gfx mailing list