[RFC 2/2] dma-fence: Use kernel's sort for merging fences
Tvrtko Ursulin
tvrtko.ursulin at igalia.com
Thu Nov 14 11:12:46 UTC 2024
On 14/11/2024 09:05, Christian König wrote:
> Am 13.11.24 um 18:19 schrieb Tvrtko Ursulin:
>> From: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>
>> One alternative to the fix Christian proposed in
>> https://lore.kernel.org/dri-devel/20241024124159.4519-3-christian.koenig@amd.com/
>> is to replace the rather complex open-coded sorting loops with the
>> kernel's standard sort() followed by a context squashing pass.
>>
>> The proposed advantage of this would be readability, but one concern
>> Christian raised was that there could be many fences, that they are
>> typically mostly sorted, and that the kernel's heap sort could therefore
>> perform much worse than the existing open-coded algorithm.
>>
>> I had a look running some games and vkcube to see what the typical
>> numbers of input fences are. Tested scenarios:
>>
>> 1) Hogwarts Legacy under Gamescope
>>
>> 450 calls per second to __dma_fence_unwrap_merge.
>>
>> Percentages per number of fences buckets, before and after checking for
>> signalled status, sorting and flattening:
>>
>>   N     Before    After
>>   0      0.91%
>>   1     69.40%
>>   2-3   28.72%     9.4% (90.6% resolved to one fence)
>>   4-5    0.93%
>>   6-9    0.03%
>>   10+
>>
>> 2) Cyberpunk 2077 under Gamescope
>>
>> 1050 calls per second, amounting to 0.01% CPU time according to perf top.
>>
>>   N     Before    After
>>   0      1.13%
>>   1     52.30%
>>   2-3   40.34%    55.57%
>>   4-5    1.46%     0.50%
>>   6-9    2.44%
>>   10+    2.34%
>>
>> 3) vkcube under Plasma
>>
>> 90 calls per second.
>>
>>   N     Before    After
>>   0
>>   1
>>   2-3   100%       0% (i.e. all resolved to a single fence)
>>   4-5
>>   6-9
>>   10+
>>
>> In the case of vkcube all invocations in the 2-3 bucket were actually
>> just two input fences.
>>
>> From these numbers it looks like the heap sort should not be a
>> disadvantage, given that the dominant case is <= 2 input fences, which
>> heap sort handles with just one compare and swap. (And for the case of
>> one input fence we have a fast path in the previous patch.)
>>
>> A complementary possibility is to implement a different sorting algorithm
>> under the same API as the kernel's sort() and so keep the simplicity,
>> potentially moving the new sort under lib/ if it would be found more
>> widely useful.
>
> Well the API would need to be different from sort() since a merge sort
> always works with multiple inputs and a single output.
I was thinking insertion sort could be good for small arrays if they are
mostly already sorted. The reference I found was
https://dl.acm.org/doi/pdf/10.1145/359024.359026, and although they do
not look at lists below 50 elements, I think the advantage over heap sort
would hold even better there.
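
For illustration, such an insertion sort could keep the same
comparator-callback shape as the kernel's sort() (minus the optional swap
callback). This is just an untested userspace sketch with made-up names
(inssort, u64_cmp), not a proposal for lib/:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*cmp_func_t)(const void *a, const void *b);

/*
 * Insertion sort over an array of num elements of the given size,
 * using the same comparator contract as the kernel's sort(). For
 * mostly-sorted input the inner loop exits almost immediately, which
 * is the property discussed above.
 */
static void inssort(void *base, size_t num, size_t size, cmp_func_t cmp)
{
	char tmp[64]; /* assumes size <= 64; plenty for pointer arrays */
	char *b = base;
	size_t i;

	assert(size <= sizeof(tmp));

	for (i = 1; i < num; i++) {
		char *p = b + i * size;
		char *q = p;

		memcpy(tmp, p, size);
		/* shift larger elements right until tmp's slot is found */
		while (q > b && cmp(q - size, tmp) > 0) {
			memcpy(q, q - size, size);
			q -= size;
		}
		memcpy(q, tmp, size);
	}
}

/* Example comparator, analogous in shape to fence_cmp() below. */
static int u64_cmp(const void *_a, const void *_b)
{
	unsigned long long a = *(const unsigned long long *)_a;
	unsigned long long b = *(const unsigned long long *)_b;

	return (a > b) - (a < b);
}
```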
>
> But from the number you gathered I really don't think we are going to
> need that.
>
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>> Cc: Christian König <christian.koenig at amd.com>
>> Cc: Friedrich Vock <friedrich.vock at gmx.de>
>> ---
>> drivers/dma-buf/dma-fence-unwrap.c | 129 ++++++++++++++++-------------
>> 1 file changed, 73 insertions(+), 56 deletions(-)
>>
>> diff --git a/drivers/dma-buf/dma-fence-unwrap.c
>> b/drivers/dma-buf/dma-fence-unwrap.c
>> index 75c3e37fd617..750dc20a9e9d 100644
>> --- a/drivers/dma-buf/dma-fence-unwrap.c
>> +++ b/drivers/dma-buf/dma-fence-unwrap.c
>> @@ -12,6 +12,7 @@
>> #include <linux/dma-fence-chain.h>
>> #include <linux/dma-fence-unwrap.h>
>> #include <linux/slab.h>
>> +#include <linux/sort.h>
>> /* Internal helper to start new array iteration, don't use directly */
>> static struct dma_fence *
>> @@ -59,6 +60,25 @@ struct dma_fence *dma_fence_unwrap_next(struct
>> dma_fence_unwrap *cursor)
>> }
>> EXPORT_SYMBOL_GPL(dma_fence_unwrap_next);
>> +
>> +static int fence_cmp(const void *_a, const void *_b)
>> +{
>> + struct dma_fence *a = *(struct dma_fence **)_a;
>> + struct dma_fence *b = *(struct dma_fence **)_b;
>> +
>> + if (a->context < b->context)
>> + return -1;
>> + else if (a->context > b->context)
>> + return 1;
>> +
>> + if (dma_fence_is_later(b, a))
>> + return -1;
>> + else if (dma_fence_is_later(a, b))
>> + return 1;
>> +
>> + return 0;
>> +}
>> +
>> /* Implementation for the dma_fence_merge() marco, don't use
>> directly */
>> struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>> struct dma_fence **fences,
>> @@ -67,9 +87,12 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned
>> int num_fences,
>> struct dma_fence *tmp, *signaled, **array;
>> struct dma_fence_array *result;
>> ktime_t timestamp;
>> - unsigned int i;
>> - size_t count;
>> + int i, j, count;
>> + /*
>> + * Count number of unwrapped fences and find the latest signaled
>> + * timestamp.
>> + */
>
> What is done should be obvious from the code, only why something is done
> needs code comment and explanation.
I was going for completing the narrative so each logical block in the
function has a comment, versus just some. IMO it makes it easier to
follow by making the steps nicely visually separated. But I don't feel
strongly about this so have removed it in v2.
>
>> count = 0;
>> timestamp = ns_to_ktime(0);
>> for (i = 0; i < num_fences; ++i) {
>> @@ -98,74 +121,68 @@ struct dma_fence
>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>> else if (count == 1)
>> return dma_fence_get(signaled);
>> + /*
>> + * Allocate and populate the array.
>> + */
>> array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
>> if (!array)
>> return NULL;
>> - /*
>> - * This trashes the input fence array and uses it as position for
>> the
>> - * following merge loop. This works because the dma_fence_merge()
>> - * wrapper macro is creating this temporary array on the stack
>> together
>> - * with the iterators.
>> - */
>> - for (i = 0; i < num_fences; ++i)
>> - fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
>> -
>> count = 0;
>> - do {
>> - unsigned int sel;
>> -
>> -restart:
>> - tmp = NULL;
>> - for (i = 0; i < num_fences; ++i) {
>> - struct dma_fence *next;
>> -
>> - while (fences[i] && dma_fence_is_signaled(fences[i]))
>> - fences[i] = dma_fence_unwrap_next(&iter[i]);
>> -
>> - next = fences[i];
>> - if (!next)
>> - continue;
>> -
>> - /*
>> - * We can't guarantee that inpute fences are ordered by
>> - * context, but it is still quite likely when this
>> - * function is used multiple times. So attempt to order
>> - * the fences by context as we pass over them and merge
>> - * fences with the same context.
>> - */
>> - if (!tmp || tmp->context > next->context) {
>> - tmp = next;
>> - sel = i;
>> -
>> - } else if (tmp->context < next->context) {
>> - continue;
>> -
>> - } else if (dma_fence_is_later(tmp, next)) {
>> - fences[i] = dma_fence_unwrap_next(&iter[i]);
>> - goto restart;
>> - } else {
>> - fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> - goto restart;
>> - }
>> + for (i = 0; i < num_fences; ++i) {
>> + dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
>> + if (!dma_fence_is_signaled(tmp))
>> + array[count++] = tmp;
>
> Same problem as in patch #1, you need to grab a reference to tmp here.
Yep. v2 will appear on the list shortly and given my sloppy quality
track record when jumping between the topics I would appreciate if
someone else would smoke test it too.
Regards,
Tvrtko
>
> Apart from that the patch looks good to me, but I would reduce the
> comments.
>
> When we need to explain what code does then the code need to be improved
> and not documented.
>
> Regards,
> Christian
>
>> }
>> -
>> - if (tmp) {
>> - array[count++] = dma_fence_get(tmp);
>> - fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> + }
>> +
>> + /*
>> + * Same fast-path as the ones above, in case some fences got
>> signalled
>> + * in the meantime.
>> + */
>> + if (count == 0) {
>> + tmp = dma_fence_allocate_private_stub(timestamp);
>> + goto return_tmp;
>> + } else if (count == 1) {
>> + tmp = dma_fence_get(array[0]);
>> + goto return_tmp;
>> + }
>> +
>> + /*
>> + * Sort in context and seqno order.
>> + */
>> + sort(array, count, sizeof(*array), fence_cmp, NULL);
>> +
>> + /*
>> + * Only keep the most recent fence for each context.
>> + */
>> + j = 0;
>> + tmp = array[0];
>> + for (i = 1; i < count; i++) {
>> + if (array[i]->context != tmp->context) {
>> + array[j++] = dma_fence_get(tmp);
>> }
>> - } while (tmp);
>> -
>> + tmp = array[i];
>> + }
>> + if (j == 0 || tmp->context != array[j - 1]->context) {
>> + array[j++] = dma_fence_get(tmp);
>> + }
>> + count = j;
>> +
>> + /*
>> + * And another fast-path as the earlier ones.
>> + */
>> if (count == 0) {
>> tmp = dma_fence_allocate_private_stub(ktime_get());
>> goto return_tmp;
>> - }
>> -
>> - if (count == 1) {
>> + } else if (count == 1) {
>> tmp = array[0];
>> goto return_tmp;
>> }
>> + /*
>> + * Finally create the output fence array.
>> + */
>> result = dma_fence_array_create(count, array,
>> dma_fence_context_alloc(1),
>> 1, false);
>
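
P.S. For anyone wanting to play with the idea outside the kernel, the
sort-then-squash pass from the patch can be sketched in plain userspace C.
struct fence here is a hypothetical stand-in for struct dma_fence, qsort(3)
stands in for the kernel's sort(), and reference counting as well as the
seqno wraparound handling of dma_fence_is_later() are deliberately omitted:

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for struct dma_fence (hypothetical). */
struct fence {
	unsigned long long context;
	unsigned long long seqno;
};

/* Order by context, then by seqno, like fence_cmp() in the patch. */
static int fence_cmp(const void *_a, const void *_b)
{
	const struct fence *a = *(const struct fence * const *)_a;
	const struct fence *b = *(const struct fence * const *)_b;

	if (a->context != b->context)
		return a->context < b->context ? -1 : 1;
	if (a->seqno != b->seqno)
		return a->seqno < b->seqno ? -1 : 1;
	return 0;
}

/*
 * Sort the pointer array and then keep only the most recent fence for
 * each context; returns the new count. After the sort, fences with the
 * same context are adjacent in increasing seqno order, so a single pass
 * where "later seqno wins" is enough.
 */
static int squash(struct fence **array, int count)
{
	int i, j = 0;

	if (count == 0)
		return 0;

	qsort(array, count, sizeof(*array), fence_cmp);
	for (i = 1; i < count; i++) {
		if (array[i]->context != array[j]->context)
			array[++j] = array[i];	/* new context: keep it */
		else
			array[j] = array[i];	/* same context: later wins */
	}
	return j + 1;
}
```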