[Intel-gfx] [PATCH 2/5] drm/i915/selftests: Flush the active callbacks

Fri Nov 22 14:25:42 UTC 2019

Quoting Tvrtko Ursulin (2019-11-22 14:17:42)
> 
> On 22/11/2019 13:55, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2019-11-22 13:43:17)
> >>
> >> On 22/11/2019 13:09, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2019-11-22 13:01:56)
> >>>>
> >>>> On 22/11/2019 11:21, Chris Wilson wrote:
> >>>>> Before checking the current i915_active state for the asynchronous work
> >>>>> we submitted, flush any ongoing callback. This ensures that our sampling
> >>>>> is robust and does not sporadically fail due to bad timing as the work
> >>>>> is running on another cpu.
> >>>>>
> >>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>>>> ---
> >>>>>     drivers/gpu/drm/i915/gt/selftest_context.c | 19 +++++++++++++------
> >>>>>     1 file changed, 13 insertions(+), 6 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/i915/gt/selftest_context.c b/drivers/gpu/drm/i915/gt/selftest_context.c
> >>>>> index 3586af636304..939798338242 100644
> >>>>> --- a/drivers/gpu/drm/i915/gt/selftest_context.c
> >>>>> +++ b/drivers/gpu/drm/i915/gt/selftest_context.c
> >>>>> @@ -48,20 +48,25 @@ static int context_sync(struct intel_context *ce)
> >>>>>     
> >>>>>         mutex_lock(&tl->mutex);
> >>>>>         do {
> >>>>> -             struct dma_fence *fence;
> >>>>> +             struct i915_request *rq;
> >>>>>                 long timeout;
> >>>>>     
> >>>>> -             fence = i915_active_fence_get(&tl->last_request);
> >>>>> -             if (!fence)
> >>>>> +             if (list_empty(&tl->requests))
> >>>>>                         break;
> >>>>>     
> >>>>> -             timeout = dma_fence_wait_timeout(fence, false, HZ / 10);
> >>>>> +             rq = list_last_entry(&tl->requests, typeof(*rq), link);
> >>>>> +             i915_request_get(rq);
> >>>>> +
> >>>>> +             timeout = i915_request_wait(rq, 0, HZ / 10);
> >>>>>                 if (timeout < 0)
> >>>>>                         err = timeout;
> >>>>>                 else
> >>>>> -                     i915_request_retire_upto(to_request(fence));
> >>>>> +                     i915_request_retire_upto(rq);
> >>>>>     
> >>>>> -             dma_fence_put(fence);
> >>>>> +             spin_lock_irq(&rq->lock);
> >>>>> +             spin_unlock_irq(&rq->lock);
> >>>>
> >>>> Ugh.. this at least needs a comment.
> >>>>
> >>>> What is the actual race?
> >>>
> >>> We complete the sync before the interrupt handler's irq work has
> >>> finished executing the callback to mark the barrier as completed.
> >>> So when we look at whether the engine is parked or not, we see the state
> >>> before the request has finished processing and find it still active.
> >>
> >> You mean i915_active_fence_get returns not really last request? How can
> >> that be? Or the key is just intel_engine_pm_flush?
> > 
> > The active ingredient in this patch should be using the tl->requests
> > instead of tl->last_request and then using pm_flush()
> > 
> > Looking at what was there before:
> >>>>> -             fence = i915_active_fence_get(&tl->last_request);
> > 
> > is definitely where I went "yikes, if we are skipping because
> > last_request is NULL, we had better make sure that the barrier callbacks
> > on the rq->fence.cb_list were executed" which means waiting until the
> > interrupt handler dropped rq->lock.
> > 
> > Looking at the code that is there now, retiring the known list of
> > requests (and then looping the engine_park addition) should be accurate.
> 
> Hm I hate this very intimate knowledge of operation. And I am still not 
> following. :( Is some path bypassing adding to tl->last_request?

No, the i915_request_retire_upto() here may add to tl->requests (and
setting tl->last_request). But tl->last_request is unset by the
interrupt callback (if there is one, and there may be on a whim). So if
it is unset, we forget to loop to wait on the engine parking and so our
check that the engine is parked is too early.

> I see that the selftest does two things, first has i915_active_is_idle 
> which needs the context to be retired.

Right. And context retirement needs an idle barrier, hence waiting for
the kernel_context to provide the barrier on parking.

This could be a wait for intel_engine_pm_is_idle instead, I choose to
try and keep the wait explicit to the idle barrier, as the underlying
goal here is to prove control over the idle barrier for the ordinary
struct intel_context.

> Second bit is intel_engine_pm_is_awake which alone would be handled by 
> intel_engine_gt_pm_flush.

Right, so long as we sure the engine is in the process of parking (which
it should be having sent the idle barrier observed above).

> So the first i915_active_is_idle check is the problematic one I think.

It's definitely the first one to trip over ;)
-Chris