[PATCH 2/3] dma-buf: sort fences in dma_fence_unwrap_merge
Tvrtko Ursulin
tursulin at ursulin.net
Fri Nov 8 16:12:59 UTC 2024
On 08/11/2024 14:58, Christian König wrote:
> Am 08.11.24 um 12:22 schrieb Tvrtko Ursulin:
>> On 07/11/2024 16:00, Tvrtko Ursulin wrote:
>>> On 24/10/2024 13:41, Christian König wrote:
>>>> The merge function initially handled only individual fences and
>>>> arrays which in turn were created by the merge function. This allowed
>>>> to create the new array by a simple merge sort based on the fence
>>>> context number.
>>>>
>>>> The problem is now that since the addition of timeline sync objects
>>>> userspace can create chain containers in basically any fence context
>>>> order.
>>>>
>>>> If those are merged together it can happen that we create really
>>>> large arrays since the merge sort algorithm doesn't work any more.
>>>>
>>>> So put an insert sort behind the merge sort which kicks in when the
>>>> input fences are not in the expected order. This isn't as efficient
>>>> as a heap sort, but has better properties for the most common use
>>>> case.
>>>>
>>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>>> ---
>>>> drivers/dma-buf/dma-fence-unwrap.c | 39
>>>> ++++++++++++++++++++++++++----
>>>> 1 file changed, 34 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/dma-buf/dma-fence-unwrap.c
>>>> b/drivers/dma-buf/dma-fence-unwrap.c
>>>> index 628af51c81af..d9aa280d9ff6 100644
>>>> --- a/drivers/dma-buf/dma-fence-unwrap.c
>>>> +++ b/drivers/dma-buf/dma-fence-unwrap.c
>>>> @@ -106,7 +106,7 @@ struct dma_fence
>>>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>>>> fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
>>>> count = 0;
>>>> - do {
>>>> + while (true) {
>>>> unsigned int sel;
>>>> restart:
>>>> @@ -144,11 +144,40 @@ struct dma_fence
>>>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>>>> }
>>>> }
>>>> - if (tmp) {
>>>> - array[count++] = dma_fence_get(tmp);
>>>> - fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>>>> + if (!tmp)
>>>> + break;
>>>> +
>>>> + /*
>>>> + * We could use a binary search here, but since the assumption
>>>> + * is that the main input are already sorted dma_fence_arrays
>>>> + * just looking from end has a higher chance of finding the
>>>> + * right location on the first try
>>>> + */
>>>> +
>>>> + for (i = count; i--;) {
>>>> + if (likely(array[i]->context < tmp->context))
>>>> + break;
>>>> +
>>>> + if (array[i]->context == tmp->context) {
>>>> + if (dma_fence_is_later(tmp, array[i])) {
>>>> + dma_fence_put(array[i]);
>>>> + array[i] = dma_fence_get(tmp);
>>>> + }
>>>> + fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>>>> + goto restart;
>>>> + }
>>>> }
>>>> - } while (tmp);
>>>> +
>>>> + ++i;
>>>> + /*
>>>> + * Make room for the fence, this should be a nop most of the
>>>> + * time.
>>>> + */
>>>> + memcpy(&array[i + 1], &array[i], (count - i) *
>>>> sizeof(*array));
>>>> + array[i] = dma_fence_get(tmp);
>>>> + fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>>>> + count++;
>>>
>>> Having ventured into this function for the first time, I can say that
>>> this is some smart code which is not easy to grasp. It could
>>> definitely benefit from a high level comment before the do-while loop
>>> to explain what it is going to do.
>>>
>>> Next and tmp local variable names I also wonder if could be renamed
>>> to something more descriptive.
>>>
>>> And the algorithmic complexity of the end result, given the multiple
>>> loops and gotos, I have no idea what it could be.
>>>
>>> Has a dumb solution been considered like a two-pass with a
>>> pessimistically allocated fence array been considered? Like:
>>>
>>> 1) Populate array with all unsignalled unwrapped fences. (O(count))
>>>
>>> 2) Bog standard include/linux/sort.h by context and seqno.
>>> (O(count*log (count)))
>>>
>>> 3) Walk array and squash same context to latest fence. (Before this
>>> patch that wasn't there, right?). (O(count)) (Overwrite in place, no
>>> memcpy needed.)
>>>
>>> Algorithmic complexity of that would be obvious and code much simpler.
>>
>> FWIW something like the below passes selftests. How does it look to
>> you? Do you think more or less efficient and more or less readable?
>
> Yeah I was considering the exact same thing.
>
> What hold me back was the fact that the heap sort() implementation is
> really inefficient for the most common use case of this. In other words
> two arrays with fences already sorted is basically just O(count).
>
> And I'm also not sure how many fences we see in those arrays in
> practice. With Vulkan basically trying to feed multiple contexts to keep
> all CPUs busy we might have quite a number here.
I can add some instrumentation and run some games next week.
Another option is adding the sort algorithm you want with the same API
as kernel's sort. Even if a local implementation that may already
increase readability of the merging process.
Regards,
Tvrtko
>
> Regards,
> Christian.
>
>>
>> commit 8a7c3ea7e7af85e813bf5fc151537ae37be1d6d9
>> Author: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>> Date: Fri Nov 8 10:14:15 2024 +0000
>>
>> __dma_fence_unwrap_merge
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>
>> diff --git a/drivers/dma-buf/dma-fence-unwrap.c
>> b/drivers/dma-buf/dma-fence-unwrap.c
>> index 628af51c81af..47d67e482e96 100644
>> --- a/drivers/dma-buf/dma-fence-unwrap.c
>> +++ b/drivers/dma-buf/dma-fence-unwrap.c
>> @@ -12,6 +12,7 @@
>> #include <linux/dma-fence-chain.h>
>> #include <linux/dma-fence-unwrap.h>
>> #include <linux/slab.h>
>> +#include <linux/sort.h>
>>
>> /* Internal helper to start new array iteration, don't use directly */
>> static struct dma_fence *
>> @@ -59,17 +60,39 @@ struct dma_fence *dma_fence_unwrap_next(struct
>> dma_fence_unwrap *cursor)
>> }
>> EXPORT_SYMBOL_GPL(dma_fence_unwrap_next);
>>
>> +
>> +static int fence_cmp(const void *_a, const void *_b)
>> +{
>> + const struct dma_fence *a = *(const struct dma_fence **)_a;
>> + const struct dma_fence *b = *(const struct dma_fence **)_b;
>> +
>> + if (a->context < b->context)
>> + return -1;
>> + else if (a->context > b->context)
>> + return 1;
>> +
>> + if (a->seqno < b->seqno)
>> + return -1;
>> + else if (a->seqno > b->seqno)
>> + return 1;
>> +
>> + return 0;
>> +}
>> +
>> /* Implementation for the dma_fence_merge() marco, don't use directly */
>> struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>> struct dma_fence **fences,
>> struct dma_fence_unwrap *iter)
>> {
>> - struct dma_fence_array *result;
>> struct dma_fence *tmp, **array;
>> + struct dma_fence_array *result;
>> ktime_t timestamp;
>> - unsigned int i;
>> - size_t count;
>> + int i, j, count;
>>
>> + /*
>> + * Count number of unwrapped fences and fince the latest signaled
>> + * timestamp.
>> + */
>> count = 0;
>> timestamp = ns_to_ktime(0);
>> for (i = 0; i < num_fences; ++i) {
>> @@ -92,63 +115,41 @@ struct dma_fence
>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>> if (count == 0)
>> return dma_fence_allocate_private_stub(timestamp);
>>
>> + /*
>> + * Allocate and populate the array.
>> + */
>> array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
>> if (!array)
>> return NULL;
>>
>> - /*
>> - * This trashes the input fence array and uses it as position for
>> the
>> - * following merge loop. This works because the dma_fence_merge()
>> - * wrapper macro is creating this temporary array on the stack
>> together
>> - * with the iterators.
>> - */
>> - for (i = 0; i < num_fences; ++i)
>> - fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
>> -
>> count = 0;
>> - do {
>> - unsigned int sel;
>> -
>> -restart:
>> - tmp = NULL;
>> - for (i = 0; i < num_fences; ++i) {
>> - struct dma_fence *next;
>> -
>> - while (fences[i] && dma_fence_is_signaled(fences[i]))
>> - fences[i] = dma_fence_unwrap_next(&iter[i]);
>> -
>> - next = fences[i];
>> - if (!next)
>> - continue;
>> -
>> - /*
>> - * We can't guarantee that inpute fences are ordered by
>> - * context, but it is still quite likely when this
>> - * function is used multiple times. So attempt to order
>> - * the fences by context as we pass over them and merge
>> - * fences with the same context.
>> - */
>> - if (!tmp || tmp->context > next->context) {
>> - tmp = next;
>> - sel = i;
>> -
>> - } else if (tmp->context < next->context) {
>> - continue;
>> -
>> - } else if (dma_fence_is_later(tmp, next)) {
>> - fences[i] = dma_fence_unwrap_next(&iter[i]);
>> - goto restart;
>> - } else {
>> - fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> - goto restart;
>> - }
>> + for (i = 0; i < num_fences; ++i) {
>> + dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
>> + if (!dma_fence_is_signaled(tmp))
>> + array[count++] = tmp;
>> }
>> + }
>> +
>> + /*
>> + * Sort in context and seqno order.
>> + */
>> + sort(array, count, sizeof(*array), fence_cmp, NULL);
>>
>> - if (tmp) {
>> - array[count++] = dma_fence_get(tmp);
>> - fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> + /*
>> + * Only keep the most recent fence for each context.
>> + */
>> + j = 0;
>> + tmp = array[0];
>> + for (i = 1; i < count; i++) {
>> + if (array[i]->context != tmp->context) {
>> + array[j++] = dma_fence_get(tmp);
>> }
>> - } while (tmp);
>> + tmp = array[i];
>> + }
>> + if (tmp->context != array[j - 1]->context) {
>> + array[j++] = dma_fence_get(tmp);
>> + }
>> + count = j;
>>
>> if (count == 0) {
>> tmp = dma_fence_allocate_private_stub(ktime_get());
>>
>>
>> Regards,
>>
>> Tvrtko
>>
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>> + };
>>>> if (count == 0) {
>>>> tmp = dma_fence_allocate_private_stub(ktime_get());
>
More information about the dri-devel
mailing list