[PATCH 1/2] dma-fence: Add a single fence fast path for fence merging
Tvrtko Ursulin
tvrtko.ursulin at igalia.com
Thu Nov 14 09:05:18 UTC 2024
On 14/11/2024 08:53, Christian König wrote:
> Am 13.11.24 um 18:19 schrieb Tvrtko Ursulin:
>> From: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>
>> Testing some workloads in two different scenarios, such as games running
>> under Gamescope on a Steam Deck, or vkcube under a Plasma desktop, shows
>> that in a significant portion of calls the dma_fence_unwrap_merge helper
>> is called with just a single unsignalled fence.
>>
>> Therefore it is worthwhile to add a fast path for that case, bypassing
>> the memory allocation and insertion sort attempts.
>
> You should probably re-order the patches since we need to backport the
> second as bug fix while the first is just an improvement.
Ok.
> There is also a bug in this patch which needs to be fixed.
>
>> Tested scenarios:
>>
>> 1) Hogwarts Legacy under Gamescope
>>
>> 450 calls per second to __dma_fence_unwrap_merge.
>>
>> Percentages per number of fences buckets, before and after checking for
>> signalled status, sorting and flattening:
>>
>> N      Before    After
>> 0      0.85%
>> 1      69.80%              -> The new fast path.
>> 2-9    29.36%    9%        (Ie. 91% of this bucket flattened to 1 fence)
>> 10-19
>> 20-40
>> 50+
>>
>> 2) Cyberpunk 2077 under Gamescope
>>
>> 1050 calls per second.
>>
>> N      Before    After
>> 0      0.71%
>> 1      52.53%              -> The new fast path.
>> 2-9    44.38%    50.60%    (Ie. half resolved to a single fence)
>> 10-19  2.34%
>> 20-40  0.06%
>> 50+
>>
>> 3) vkcube under Plasma
>>
>> 90 calls per second.
>>
>> N      Before    After
>> 0
>> 1
>> 2-9    100%      0%        (Ie. all resolved to a single fence)
>> 10-19
>> 20-40
>> 50+
>>
>> In the case of vkcube all invocations in the 2-9 bucket were actually
>> just two input fences.
>
> Nice to have some numbers at hand.
>
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>> Cc: Christian König <christian.koenig at amd.com>
>> Cc: Friedrich Vock <friedrich.vock at gmx.de>
>> ---
>> drivers/dma-buf/dma-fence-unwrap.c | 8 +++++++-
>> 1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
>> index 628af51c81af..75c3e37fd617 100644
>> --- a/drivers/dma-buf/dma-fence-unwrap.c
>> +++ b/drivers/dma-buf/dma-fence-unwrap.c
>> @@ -64,8 +64,8 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>>                                            struct dma_fence **fences,
>>                                            struct dma_fence_unwrap *iter)
>> {
>> + struct dma_fence *tmp, *signaled, **array;
>
> I would name that unsignaled instead.
Indeed. And a polite way of pointing out my brain was completely
reversed. :)
>> struct dma_fence_array *result;
>> - struct dma_fence *tmp, **array;
>> ktime_t timestamp;
>> unsigned int i;
>> size_t count;
>> @@ -75,6 +75,7 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>> for (i = 0; i < num_fences; ++i) {
>> dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
>> if (!dma_fence_is_signaled(tmp)) {
>> + signaled = tmp;
>
> You need to grab a reference to tmp here if you want to keep it.
>
> It's perfectly possible that tmp is garbage collected as soon as you go
> to the next iteration or leave the loop.
Yep.. same bug in the second patch.
Regards,
Tvrtko
>> ++count;
>> } else {
>> ktime_t t = dma_fence_timestamp(tmp);
>> @@ -88,9 +89,14 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>> /*
>> * If we couldn't find a pending fence just return a private signaled
>> * fence with the timestamp of the last signaled one.
>> + *
>> + * Or if there was a single unsignaled fence left we can return it
>> + * directly and early since that is a major path on many workloads.
>> */
>> if (count == 0)
>> return dma_fence_allocate_private_stub(timestamp);
>> + else if (count == 1)
>> + return dma_fence_get(signaled);
>> array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
>> if (!array)
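To illustrate the review feedback about reference counting, here is a minimal standalone sketch of the fast path logic. It uses a hypothetical `mock_fence` type (not the kernel's `struct dma_fence`) so it can be compiled in userspace; `mock_fence_get`/`mock_fence_put` stand in for `dma_fence_get`/`dma_fence_put`. The key change relative to the patch above is that the reference is taken inside the scan loop, while the candidate fence is known to be alive, rather than at return time.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for struct dma_fence (hypothetical; only a refcount
 * and a signaled flag are modelled). */
struct mock_fence {
	int refcount;
	bool signaled;
};

static struct mock_fence *mock_fence_get(struct mock_fence *f)
{
	f->refcount++;
	return f;
}

static void mock_fence_put(struct mock_fence *f)
{
	f->refcount--;
}

/*
 * Sketch of the single-fence fast path with the review feedback applied:
 * grab the reference while scanning, not when returning, since the fence
 * could otherwise be freed between the loop and the return. Returns the
 * single pending fence with an extra reference, or NULL when the caller
 * must fall back to the allocate-and-sort slow path.
 */
static struct mock_fence *single_fence_fast_path(struct mock_fence **fences,
						 unsigned int num_fences)
{
	struct mock_fence *unsignaled = NULL;
	unsigned int count = 0, i;

	for (i = 0; i < num_fences; ++i) {
		if (!fences[i]->signaled) {
			/* Reference taken here, while the fence is alive. */
			if (!unsignaled)
				unsignaled = mock_fence_get(fences[i]);
			++count;
		}
	}

	if (count == 1)
		return unsignaled;

	/* Zero or many pending fences: drop the speculative reference and
	 * let the caller handle it (stub fence or full merge). */
	if (unsignaled)
		mock_fence_put(unsignaled);
	return NULL;
}
```

This is only a model of the control flow under discussion, not the actual fix; the real code also has to handle signaled-fence timestamps and the container unwrapping done by `dma_fence_unwrap_for_each`.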
>