[Intel-gfx] [PATCH 2/5] drm/i915/selftests: Flush the active callbacks
Chris Wilson
chris at chris-wilson.co.uk
Fri Nov 22 13:55:19 UTC 2019
Quoting Tvrtko Ursulin (2019-11-22 13:43:17)
>
> On 22/11/2019 13:09, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2019-11-22 13:01:56)
> >>
> >> On 22/11/2019 11:21, Chris Wilson wrote:
> >>> Before checking the current i915_active state for the asynchronous work
> >>> we submitted, flush any ongoing callback. This ensures that our sampling
> >>> is robust and does not sporadically fail due to bad timing as the work
> >>> is running on another cpu.
> >>>
> >>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>> ---
> >>> drivers/gpu/drm/i915/gt/selftest_context.c | 19 +++++++++++++------
> >>> 1 file changed, 13 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/gt/selftest_context.c b/drivers/gpu/drm/i915/gt/selftest_context.c
> >>> index 3586af636304..939798338242 100644
> >>> --- a/drivers/gpu/drm/i915/gt/selftest_context.c
> >>> +++ b/drivers/gpu/drm/i915/gt/selftest_context.c
> >>> @@ -48,20 +48,25 @@ static int context_sync(struct intel_context *ce)
> >>>
> >>> mutex_lock(&tl->mutex);
> >>> do {
> >>> - struct dma_fence *fence;
> >>> + struct i915_request *rq;
> >>> long timeout;
> >>>
> >>> - fence = i915_active_fence_get(&tl->last_request);
> >>> - if (!fence)
> >>> + if (list_empty(&tl->requests))
> >>> break;
> >>>
> >>> - timeout = dma_fence_wait_timeout(fence, false, HZ / 10);
> >>> + rq = list_last_entry(&tl->requests, typeof(*rq), link);
> >>> + i915_request_get(rq);
> >>> +
> >>> + timeout = i915_request_wait(rq, 0, HZ / 10);
> >>> if (timeout < 0)
> >>> err = timeout;
> >>> else
> >>> - i915_request_retire_upto(to_request(fence));
> >>> + i915_request_retire_upto(rq);
> >>>
> >>> - dma_fence_put(fence);
> >>> + spin_lock_irq(&rq->lock);
> >>> + spin_unlock_irq(&rq->lock);
> >>
> >> Ugh.. this at least needs a comment.
> >>
> >> What is the actual race?
> >
> > We complete the sync before the interrupt handler's irq work has
> > finished executing the callback to mark the barrier as completed.
> > So when we look at whether the engine is parked or not, we see the state
> > before the request has finished processing and find it still active.
>
> You mean i915_active_fence_get does not really return the last request?
> How can that be? Or is the key just intel_engine_pm_flush?
The active ingredient in this patch should be using tl->requests instead
of tl->last_request, and then using intel_engine_pm_flush().
Looking at what was there before:
> >>> - fence = i915_active_fence_get(&tl->last_request);
is definitely where I went "yikes, if we are skipping because
last_request is NULL, we had better make sure that the barrier callbacks
on the rq->fence.cb_list were executed" which means waiting until the
interrupt handler dropped rq->lock.
Looking at the code that is there now, retiring the known list of
requests (and then looping the engine_park addition) should be accurate.
-Chris