[PATCH] drm/amdgpu: restore wait-free fastpath for GEM_WAIT_IDLE

Mon Jan 27 18:50:15 UTC 2025

Am 27.01.25 um 17:33 schrieb Lucas Stach:
> Hi Christian,
>
> Am Montag, dem 27.01.2025 um 17:14 +0100 schrieb Christian König:
>> Am 27.01.25 um 17:02 schrieb Lucas Stach:
>>> This effectively reverts 0fea2ed61e7f ("drm/amdgpu: Remove call to
>>> reservation_object_test_signaled_rcu before wait"), as the premise of
>>> that commit is wrong. dma_resv_wait_timeout() without a timeout will
>>> not turn into a wait-free dma_resv_test_signaled, but will wait a
>>> jiffy for unsignaled fences, which is not at all what userspace expects
>>> when it calls GEM_WAIT_IDLE with a timeout of 0.
>> Marek pinged me about that strange behavior as well. That sounds like a
>> separate bug in dma_resv_wait_timeout() to me.
>>
>> I don't see why the function should be waiting with a timeout of 0 and
>> we have multiple cases where that is assumed.
>>
>> What should happen is that dma_resv_wait_timeout() should return 1 when
>> all fences are signaled even when the timeout is 0.
>>
>> My educated guess is that this behavior is somehow broken and instead we
>> wait for at least 1 jiffies.
>>
>> This here is probably just the tip of the iceberg.
>>
> dma_resv_wait_timeout() explicitly sets timeout to a single jiffy when
> it is entered with timeout == 0. This behavior was introduced with your
> commit 06a66b5c77ed ("reservation: revert "wait only with non-zero
> timeout specified (v3)" v2"), which seems to fix a real bug.

Ah, yes! I see where the problem is now.

> Between the two behaviors I think it is wrong for callers of
> dma_resv_wait_timeout() to assume that the call is necessarily wait-
> free in case of timeout == 0. If you want wait-free but are able to
> deal with false positive unsignaled returns use dma_resv_test_signaled,
> otherwise use dma_resv_wait_timeout to get accurate answers with a
> possible wait.

Well dma_resv_wait_timeout() with a zero timeout is indeed supposed to 
return immediately. Anything else would be quite a difference to all 
other kernel functions which take a jiffies timeout.

But the difference between dma_resv_test_signaled() and 
dma_resv_wait_timeout() with a zero timeout is that the later should 
still make sure to enable signaling!

A couple of people have pinged me about that as well and we also 
discussed the desired behavior on a bug report with Faith and a few others.

Going to take a look, thanks for pointing this out.

Regards,
Christian.

>
> Regards,
> Lucas
>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Lucas Stach <l.stach at pengutronix.de>
>>> ---
>>> This is most likely the correct kernel-side solution for the unexpected
>>> slowdown worked around with in userspace with this Mesa series:
>>> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32877
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 9 +++++++--
>>>    1 file changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> index 1a5df8b94661..75b5d5149911 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> @@ -567,8 +567,13 @@ int amdgpu_gem_wait_idle_ioctl(struct drm_device *dev, void *data,
>>>    		return -ENOENT;
>>>    
>>>    	robj = gem_to_amdgpu_bo(gobj);
>>> -	ret = dma_resv_wait_timeout(robj->tbo.base.resv, DMA_RESV_USAGE_READ,
>>> -				    true, timeout);
>>> +	if (timeout == 0)
>>> +		ret = dma_resv_test_signaled(robj->tbo.base.resv,
>>> +					     DMA_RESV_USAGE_READ);
>>> +	else
>>> +		ret = dma_resv_wait_timeout(robj->tbo.base.resv,
>>> +					    DMA_RESV_USAGE_READ,
>>> +					    true, timeout);
>>>    
>>>    	/* ret == 0 means not signaled,
>>>    	 * ret > 0 means signaled