[PATCH libdrm] amdgpu: add a faster BO list API
Christian König
ckoenig.leichtzumerken at gmail.com
Thu Jan 10 11:51:03 UTC 2019
Am 10.01.19 um 12:41 schrieb Marek Olšák:
>
>
> On Thu, Jan 10, 2019, 4:15 AM Koenig, Christian
> <Christian.Koenig at amd.com <mailto:Christian.Koenig at amd.com> wrote:
>
> Am 10.01.19 um 00:39 schrieb Marek Olšák:
>> On Wed, Jan 9, 2019 at 1:41 PM Christian König
>> <ckoenig.leichtzumerken at gmail.com
>> <mailto:ckoenig.leichtzumerken at gmail.com>> wrote:
>>
>> Am 09.01.19 um 17:14 schrieb Marek Olšák:
>>> On Wed, Jan 9, 2019 at 8:09 AM Christian König
>>> <ckoenig.leichtzumerken at gmail.com
>>> <mailto:ckoenig.leichtzumerken at gmail.com>> wrote:
>>>
>>> Am 09.01.19 um 13:36 schrieb Marek Olšák:
>>>>
>>>>
>>>> On Wed, Jan 9, 2019, 5:28 AM Christian König
>>>> <ckoenig.leichtzumerken at gmail.com
>>>> <mailto:ckoenig.leichtzumerken at gmail.com> wrote:
>>>>
>>>> Looks good, but I'm wondering what's the actual
>>>> improvement?
>>>>
>>>>
>>>> No malloc calls and 1 less for loop copying the bo list.
>>>
>>> Yeah, but didn't we want to get completely rid of the bo
>>> list?
>>>
>>>
>>> If we have multiple IBs (e.g. gfx + compute) that share a BO
>>> list, I think it's faster to send the BO list to the kernel
>>> only once.
>>
>> That's not really faster.
>>
>> The only thing we safe us is a single loop over all BOs to
>> lockup the handle into a pointer and that is only a tiny
>> fraction of the overhead.
>>
>> The majority of the overhead is locking the BOs and reserving
>> space for the submission.
>>
>> What could really help here is to submit gfx+comput together
>> in just one CS IOCTL. This way we would need the locking and
>> space reservation only once.
>>
>> It's a bit of work in the kernel side, but certainly doable.
>>
>>
>> OK. Any objections to this patch?
>
> In general I'm wondering if we couldn't avoid adding so much new
> interface.
>
>
> There are Vulkan drivers that still use the bo_list interface.
>
>
> For example we can avoid the malloc() when we just cache the last
> freed bo_list structure in the device. We would just need an
> atomic pointer exchange operation for that.
>
>
> This way we even don't need to change mesa at all.
>
>
> There is still the for loop that we need to get rid of.
Yeah, but that I'm fine to handle with a amdgpu_bo_list_create_raw which
only takes the handles and still returns the amdgpu_bo_list structure we
are used to.
See what I'm mostly concerned about is having another CS function to
maintain.
>
>
> Regarding optimization, this chunk can be replaced by a cast on 64bit:
>> + chunk_array = alloca(sizeof(uint64_t) * num_chunks);
>> + for (i = 0; i < num_chunks; i++)
>> + chunk_array[i] = (uint64_t)(uintptr_t)&chunks[i];
>
> It can't. The input is an array of structures. The ioctl takes an
> array of pointers.
Ah! Haven't seen this, sorry for the noise.
Christian.
>
> Marek
>
>
> Regards,
> Christian.
>
>>
>> Thanks,
>> Marek
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20190110/db7fb88e/attachment-0001.html>
More information about the amd-gfx
mailing list