[PATCH 1/2] dma-fence: Add a single fence fast path for fence merging

Christian König christian.koenig at amd.com
Thu Nov 14 08:53:49 UTC 2024


Am 13.11.24 um 18:19 schrieb Tvrtko Ursulin:
> From: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>
> Testing some workloads in two different scenarios, such as games running
> under Gamescope on a Steam Deck, or vkcube under a Plasma desktop, shows
> that in a significant portion of calls the dma_fence_unwrap_merge helper
> is called with just a single unsignalled fence.
>
> Therefore it is worthwhile to add a fast path for that case and so bypass
> the memory allocation and insertion sort attempts.

You should probably re-order the patches since we need to backport the 
second as a bug fix, while the first is just an improvement.

There is also a bug in this patch which needs to be fixed.

> Tested scenarios:
>
> 1) Hogwarts Legacy under Gamescope
>
> 450 calls per second to __dma_fence_unwrap_merge.
>
> Percentages per number of fences buckets, before and after checking for
> signalled status, sorting and flattening:
>
>     N       Before      After
>     0       0.85%
>     1      69.80%        ->   The new fast path.
>    2-9     29.36%        9%   (Ie. 91% of this bucket flattened to 1 fence)
>   10-19
>   20-40
>    50+
>
> 2) Cyberpunk 2077 under Gamescope
>
> 1050 calls per second.
>
>     N       Before      After
>     0       0.71%
>     1      52.53%        ->    The new fast path.
>    2-9     44.38%      50.60%  (Ie. half resolved to a single fence)
>   10-19     2.34%
>   20-40     0.06%
>    50+
>
> 3) vkcube under Plasma
>
> 90 calls per second.
>
>     N       Before      After
>     0
>     1
>    2-9      100%         0%   (Ie. all resolved to a single fence)
>   10-19
>   20-40
>    50+
>
> In the case of vkcube all invocations in the 2-9 bucket were actually
> just two input fences.

Nice to have some numbers at hand.

>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
> Cc: Christian König <christian.koenig at amd.com>
> Cc: Friedrich Vock <friedrich.vock at gmx.de>
> ---
>   drivers/dma-buf/dma-fence-unwrap.c | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
> index 628af51c81af..75c3e37fd617 100644
> --- a/drivers/dma-buf/dma-fence-unwrap.c
> +++ b/drivers/dma-buf/dma-fence-unwrap.c
> @@ -64,8 +64,8 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>   					   struct dma_fence **fences,
>   					   struct dma_fence_unwrap *iter)
>   {
> +	struct dma_fence *tmp, *signaled, **array;

I would name that unsignaled instead.

>   	struct dma_fence_array *result;
> -	struct dma_fence *tmp, **array;
>   	ktime_t timestamp;
>   	unsigned int i;
>   	size_t count;
> @@ -75,6 +75,7 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>   	for (i = 0; i < num_fences; ++i) {
>   		dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
>   			if (!dma_fence_is_signaled(tmp)) {
> +				signaled = tmp;

You need to grab a reference to tmp here if you want to keep it.

It's perfectly possible that tmp is garbage collected as soon as you go 
to the next iteration or leave the loop.
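
Something along these lines maybe, completely untested and just a sketch 
(naming and the exact placement of the puts is up to you):

	struct dma_fence *unsignaled = NULL;
	...
	if (!dma_fence_is_signaled(tmp)) {
		/* Remember the last unsignaled fence, with a reference held */
		dma_fence_put(unsignaled);
		unsignaled = dma_fence_get(tmp);
		++count;
	} else {
	...
	if (count == 0)
		return dma_fence_allocate_private_stub(timestamp);
	else if (count == 1)
		/* Reference was already taken in the loop above */
		return unsignaled;

	/* The slow path grabs its own references, drop ours again */
	dma_fence_put(unsignaled);

dma_fence_put() already handles NULL, so no extra checks should be 
needed for the count == 0 and slow path cases.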

Regards,
Christian.

>   				++count;
>   			} else {
>   				ktime_t t = dma_fence_timestamp(tmp);
> @@ -88,9 +89,14 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>   	/*
>   	 * If we couldn't find a pending fence just return a private signaled
>   	 * fence with the timestamp of the last signaled one.
> +	 *
> +	 * Or if there was a single unsignaled fence left we can return it
> +	 * directly and early since that is a major path on many workloads.
>   	 */
>   	if (count == 0)
>   		return dma_fence_allocate_private_stub(timestamp);
> +	else if (count == 1)
> +		return dma_fence_get(signaled);
>   
>   	array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
>   	if (!array)
