[Intel-gfx] [PATCH v2 2/2] drm/i915: Never return 0 if not all requests retired

Mon Nov 21 23:13:47 UTC 2022

Hi Andrzej,

Thanks for providing your R-b, however, I'd still like to convince you that my 
approach, which you accepted anyway, is better justified than if we updated 0 
timeout with -ETIME immediately after returned by dma_fence_wait_timeout().

On Monday, 21 November 2022 13:12:00 CET Andrzej Hajda wrote:
> On 21.11.2022 11:59, Janusz Krzysztofik wrote:
> > On Monday, 21 November 2022 11:51:15 CET Janusz Krzysztofik wrote:
> >> Hi Andrzej,
> >>
> >> Thanks for your comment.
> >>
> >> On Monday, 21 November 2022 11:17:42 CET Andrzej Hajda wrote:
> >>>
> >>> On 21.11.2022 09:30, Janusz Krzysztofik wrote:
> >>>> Hi Nimroy,
> >>>>
> >>>> Thanks for looking at this.
> >>>>
> >>>> On Friday, 18 November 2022 20:56:50 CET Das, Nirmoy wrote:
> >>>>> On 11/18/2022 11:42 AM, Janusz Krzysztofik wrote:
> >>>>>> Users of intel_gt_retire_requests_timeout() expect 0 return value on
> >>>>>> success.  However, we have no protection from passing back 0 potentially
> >>>>>> returned by a call to dma_fence_wait_timeout() when it succedes right
> >>>>>> after its timeout has expired.
> >>>>>>
> >>>>>> Replace 0 with -ETIME before potentially using the timeout value as return
> >>>>>> code, so -ETIME is returned if there are still some requests not retired
> >>>>>> after timeout, 0 otherwise.
> >>>>>>
> >>>>>> v2: Move the added lines down so flush_submission() is not affected.
> >>>>>>
> >>>>>> Fixes: f33a8a51602c ("drm/i915: Merge wait_for_timelines with
> >>>> retire_request")
> >>>>>> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik at linux.intel.com>
> >>>>>> Cc: stable at vger.kernel.org # v5.5+
> >>>>>> ---
> >>>>>>     drivers/gpu/drm/i915/gt/intel_gt_requests.c | 3 +++
> >>>>>>     1 file changed, 3 insertions(+)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_requests.c b/drivers/gpu/
> >>>> drm/i915/gt/intel_gt_requests.c
> >>>>>> index edb881d756309..3ac4603eeb4ee 100644
> >>>>>> --- a/drivers/gpu/drm/i915/gt/intel_gt_requests.c
> >>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
> >>>>>> @@ -199,6 +199,9 @@ out_active:	spin_lock(&timelines->lock);
> >>>>>>     	if (remaining_timeout)
> >>>>>>     		*remaining_timeout = timeout;
> >>>>>>     
> >>>>>> +	if (!timeout)
> >>>>>> +		timeout = -ETIME;
> >>>>> This will return error, -ETIME when 0 timeout is passed,
> >>>>> intel_gt_retire_requests().
> >>>> Yes, but only when active_count is not 0 after we loop through
> >>>> timelines->active_list calling retire_requests() on each and counting up
> >>>> failures in active_count.
> >>>
> >>> Moving this line just after the call to dma_fence_wait_timeout should
> >>> solve the controversy.
> >>
> >> But that would break our need to pass 0, not -ETIME, to flush_submission() in
> >> case the initial value of timeout was 0, as pointed out by Chris during our
> >> discussion on v2.
> >>
> >> Maybe an inline comment above the added lines that explains why we are doing
> >> this could help?
> > 
> > How about not adding those two lines but modifying the return line instead?
> > 
> > -	return active_count ? timeout : 0;
> > +	return active_count ? timeout ?: -ETIME : 0;
> 
> Personally I would translate ret value from dma_fence* API ASAP, 

I think that would suggest we are trying to fix a problematic 0 response from 
dma_fence_wait_timeout() on success, while we already agreed with Chris' 
opinion that 0 is perfectl OK in that case, and returning 1 should be rather 
considered as problematic, since 0 just means success but no time left, and 
-ETIME means no success within timeout.  That's what had been implemented one 
time in our i915_request_wait_timeuout() backend, regardless of any breakage 
potentially introduced by later patches.

Then, fixing 0 return value from dma_fence_wait_timeout(), which is OK, is not 
what this patch is about.  The real problem is inconsistency between our 
declared API of i915_retire_reqiests_wait_timeout(), which promises to return 
0 on success, and that 0 remaining timeout value from dma_fence_wait_timeout() 
that we can potentially return when not all requests have been retired.  
That's what the patch is trying to fix, regardless of what that 0 timeout 
value can tell us about success or failure of a single call to 
dma_fence_wait_timeout(), not even speaking of a case when the function is 
called with timeout already equal 0.  Focused on success of retire_requests() 
rather than dma_fence_wait_timeout(), we generally ignore error codes from the 
latter, using them only for skipping next calls to that function, based on an 
assumption that no more time has been left.

Then, clearly fixing just our return value in the problematic case of 0 time 
left while not all requests have been retired seems the best option to me.

I've added your R-b to my v3 which implements just what you've accepted -- I 
hope you don't mind.

Thanks,
Janusz

> and 
> call flush_submission conditionally - to limit coexistence of both APIs.
> But this looks correct to me, as well.
> 
> Reviewed-by: Andrzej Hajda <andrzej.hajda at intel.com>
> 
> Regards
> Andrzej
> 
> > 
> > Would that be self explanatory?
> > 
> > Thanks,
> > Janusz
> > 
> >>
> >> Thanks,
> >> Janusz
> >>
> >>>
> >>> Regards
> >>> Andrzej
> >>>
> >>>>
> >>>>> We don't want that.
> >>>> When 0 timeout is passed to intel_gt_retire_requests(), do we really want it
> >>>> to return 0 unconditionally, or are we rather interested if those calls to
> >>>> retire_requests() succeeded?
> >>>>
> >>>>> I think you can use a separate variable to store
> >>>>> return val from the dma_fence_wait_timeout()
> >>>>>
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>> Nirmoy
> >>>>>
> >>>>>> +
> >>>>>>     	return active_count ? timeout : 0;
> >>>> If active count is 0, we return 0 regardless of timeout value, and that's OK.
> >>>> However, if active_count is not 0, we shouldn't return 0, I believe, we should
> >>>> return either remaining time if some left, or error (-ETIME) if not.  If you
> >>>> think I'm wrong, please explain why.
> >>>>
> >>>> Thanks,
> >>>> Janusz
> >>>>
> >>>>>>     }
> >>>>>>     
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> > 
> > 
> > 
> > 
> 
>