[Intel-gfx] [PATCH v3 2/2] drm/i915: Never return 0 if not all requests retired

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Thu Nov 24 12:27:37 UTC 2022


On 23/11/2022 16:21, Janusz Krzysztofik wrote:
> On Wednesday, 23 November 2022 13:57:26 CET Tvrtko Ursulin wrote:
>>
>> On 23/11/2022 09:28, Janusz Krzysztofik wrote:
>>> Hi Tvrtko,
>>>
>>> Thanks for your comments.
>>>
>>> On Tuesday, 22 November 2022 11:50:38 CET Tvrtko Ursulin wrote:
>>>>
>>>> On 21/11/2022 14:56, Janusz Krzysztofik wrote:
>>>>> Users of intel_gt_retire_requests_timeout() expect 0 return value on
>>>>> success.  However, we have no protection from passing back 0 potentially
>>>>> returned by a call to dma_fence_wait_timeout() when it succedes right
>>>>> after its timeout has expired.
>>>>
>>>> Is this talking about a potential weakness, or ambiguous kerneldoc, of
>>>> dma_fence_wait_timeout, dma_fence_default_wait and
>>>> i915_request_wait_timeout? They appear to say 0 return means timeout,
>>>> implying unsignaled fence. In other words signaled must return positive
>>>> remaining timeout. Implementations seems to allow a race which indeed
>>>> appears that return 0 and signaled fence is possible.
>>>
>>> While my initial analysis was indeed focused on inconsistent semantics of 0
>>> return values from different dma_fence_default_wait() backends, I should have
>>> also mentioned in this commit description that users may perfectly
>>> call intel_gt_retire_requests_timeout() with 0 timeout, in which case the
>>> false positive 0 value can be returned regardless of dma_fence_wait_timeout()
>>> potential issues.  Would you like me to reword and resubmit?
>>
>> Not sure yet.
>>
>> So the only caller which passes in zero to
>> intel_gt_retire_requests_timeout appears to be intel_gt_retire_requests
>> and it eats the return value anyway so this patch is immaterial for that
>> one.
> 
> Right.
> 
>> I guess it can change how intel_gt_wait_for_idle behaves with short-ish
>> timeouts. In case this race is hit. But then wouldn't it make sense to
>> follow up with a patch which addresses this race by re-checking the "is
>> signaled" when timeout expires,
> 
> But inside intel_gt_retire_requests_timeout() we generally don't care if
> fences have been signaled.  As long as user requested timeout hasn't expired,
> we use dma_fence_wait_timeout() as an aid, otherwise we keep trying to retire
> requests without waiting on fences. If the retirement succeeds then we return
> 0 (success) regardless of what the return value from the last called
> dma_fence_wait_timeout() was.  If it was 0 then the only useful information is
> that no more time has been left, and no matter if that 0 meant signaled or not
> signaled, we must return an error if there are still some requests not
> retired, I believe.

Yes I agree with all that. Sorry if my reply was misleading, I mentioned 
a follow-up, didn't mean that these two are not ready to go.

>> either in dma_fence_wait_timeout, or to
>> intel_gt_retire_requests_timeout. Or if not that at least better
>> document the dma_fence_wait_timeout and/or
>> intel_gt_retire_requests_timeout. Makes sense?
> 
> Documenting -- yes, as soon as we get into an agreement on what's the core of
> this issue -- whether that potential weakness, or ambiguous kerneldoc, of
> dma_fence_wait_timeout, dma_fence_default_wait and i915_request_wait_timeout,
> as you've stated, that we have to address somehow, or potentially incorrect
> direct use of the timeout variable, intended for storing time left to spend on
> fence waits, as our return value when timeout has expired.  And if the former
> then maybe we should also try to finally resolve that over a year old conflict
> (whether 0 means signaled on not signaled) inside our implementation of
> dma_fence_ops.wait, and simply use a wrapper around it for either our internal
> use if we decide to follow the reference implementation, or for dma_fence_ops
> use otherwise.  Or maybe the reference implementation should be fixed if
> problematic.  I don't feel competent enough to decide.

Right, we can leave that for later. I'll pull these two in if someone 
hasn't already.

Regards,

Tvrtko

> 
> Thanks,
> Janusz
>   
>>
>> Regards,
>>
>> Tvrtko
>>
>>>
>>>> If dma_fence_wait can indeed return 0 even when a request is signaled,
>>>> then how is timeout ?: -ETIME below correct? It isn't a chance for false
>>>> negative in its' callers?
>>>
>>> The goal of intel_gt_retire_requests_timeout() is to retire requests.  When
>>> that goal has been reached, i.e., all requests have been retired, active count
>>> is 0, and 0 is correctly returned, regardless of timeout value.
>>>
>>> The value of timeout is used only when there are still pending requests, which
>>> means that the goal hasn't been reached and the function hasn't succeeded.
>>> Then, no false negative is possible, unlike the false positive that we now
>>> have when we return 0  while some requests are still pending.
>>>
>>> Thanks,
>>> Janusz
>>>
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>>> Replace 0 with -ETIME before potentially using the timeout value as return
>>>>> code, so -ETIME is returned if there are still some requests not retired
>>>>> after timeout, 0 otherwise.
>>>>>
>>>>> v3: Use conditional expression, more compact but also better reflecting
>>>>>        intention standing behind the change.
>>>>>
>>>>> v2: Move the added lines down so flush_submission() is not affected.
>>>>>
>>>>> Fixes: f33a8a51602c ("drm/i915: Merge wait_for_timelines with retire_request")
>>>>> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik at linux.intel.com>
>>>>> Reviewed-by: Andrzej Hajda <andrzej.hajda at intel.com>
>>>>> Cc: stable at vger.kernel.org # v5.5+
>>>>> ---
>>>>>     drivers/gpu/drm/i915/gt/intel_gt_requests.c | 2 +-
>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_requests.c b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
>>>>> index edb881d756309..1dfd01668c79c 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/intel_gt_requests.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
>>>>> @@ -199,7 +199,7 @@ out_active:	spin_lock(&timelines->lock);
>>>>>     	if (remaining_timeout)
>>>>>     		*remaining_timeout = timeout;
>>>>>     
>>>>> -	return active_count ? timeout : 0;
>>>>> +	return active_count ? timeout ?: -ETIME : 0;
>>>>>     }
>>>>>     
>>>>>     static void retire_work_handler(struct work_struct *work)
>>>>
>>>
>>>
>>>
>>>
>>
> 
> 
> 
> 


More information about the Intel-gfx mailing list