[Intel-gfx] [PATCH 2/5] drm/i915/selftests: Flush the active callbacks
Chris Wilson
chris at chris-wilson.co.uk
Fri Nov 22 14:25:42 UTC 2019
Quoting Tvrtko Ursulin (2019-11-22 14:17:42)
>
> On 22/11/2019 13:55, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2019-11-22 13:43:17)
> >>
> >> On 22/11/2019 13:09, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2019-11-22 13:01:56)
> >>>>
> >>>> On 22/11/2019 11:21, Chris Wilson wrote:
> >>>>> Before checking the current i915_active state for the asynchronous work
> >>>>> we submitted, flush any ongoing callback. This ensures that our sampling
> >>>>> is robust and does not sporadically fail due to bad timing as the work
> >>>>> is running on another cpu.
> >>>>>
> >>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>>>> ---
> >>>>> drivers/gpu/drm/i915/gt/selftest_context.c | 19 +++++++++++++------
> >>>>> 1 file changed, 13 insertions(+), 6 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/i915/gt/selftest_context.c b/drivers/gpu/drm/i915/gt/selftest_context.c
> >>>>> index 3586af636304..939798338242 100644
> >>>>> --- a/drivers/gpu/drm/i915/gt/selftest_context.c
> >>>>> +++ b/drivers/gpu/drm/i915/gt/selftest_context.c
> >>>>> @@ -48,20 +48,25 @@ static int context_sync(struct intel_context *ce)
> >>>>>
> >>>>> mutex_lock(&tl->mutex);
> >>>>> do {
> >>>>> - struct dma_fence *fence;
> >>>>> + struct i915_request *rq;
> >>>>> long timeout;
> >>>>>
> >>>>> - fence = i915_active_fence_get(&tl->last_request);
> >>>>> - if (!fence)
> >>>>> + if (list_empty(&tl->requests))
> >>>>> break;
> >>>>>
> >>>>> - timeout = dma_fence_wait_timeout(fence, false, HZ / 10);
> >>>>> + rq = list_last_entry(&tl->requests, typeof(*rq), link);
> >>>>> + i915_request_get(rq);
> >>>>> +
> >>>>> + timeout = i915_request_wait(rq, 0, HZ / 10);
> >>>>> if (timeout < 0)
> >>>>> err = timeout;
> >>>>> else
> >>>>> - i915_request_retire_upto(to_request(fence));
> >>>>> + i915_request_retire_upto(rq);
> >>>>>
> >>>>> - dma_fence_put(fence);
> >>>>> + spin_lock_irq(&rq->lock);
> >>>>> + spin_unlock_irq(&rq->lock);
> >>>>
> >>>> Ugh.. this at least needs a comment.
> >>>>
> >>>> What is the actual race?
> >>>
> >>> We complete the sync before the interrupt handler's irq work has
> >>> finished executing the callback to mark the barrier as completed.
> >>> So when we look at whether the engine is parked or not, we see the state
> >>> before the request has finished processing and find it still active.
> >>
> >> You mean i915_active_fence_get returns not really last request? How can
> >> that be? Or the key is just intel_engine_pm_flush?
> >
> > The active ingredient in this patch should be using the tl->requests
> > instead of tl->last_request and then using pm_flush()
> >
> > Looking at what was there before:
> >>>>> - fence = i915_active_fence_get(&tl->last_request);
> >
> > is definitely where I went "yikes, if we are skipping because
> > last_request is NULL, we had better make sure that the barrier callbacks
> > on the rq->fence.cb_list were executed" which means waiting until the
> > interrupt handler dropped rq->lock.
> >
> > Looking at the code that is there now, retiring the known list of
> > requests (and then looping the engine_park addition) should be accurate.
>
> Hm I hate this very intimate knowledge of operation. And I am still not
> following. :( Is some path bypassing adding to tl->last_request?
No, the i915_request_retire_upto() here may add to tl->requests (and
setting tl->last_request). But tl->last_request is unset by the
interrupt callback (if there is one, and there may be on a whim). So if
it is unset, we forget to loop to wait on the engine parking and so our
check that the engine is parked is too early.
> I see that the selftest does two things, first has i915_active_is_idle
> which needs the context to be retired.
Right. And context retirement needs an idle barrier, hence waiting for
the kernel_context to provide the barrier on parking.
This could be a wait for intel_engine_pm_is_idle instead, I choose to
try and keep the wait explicit to the idle barrier, as the underlying
goal here is to prove control over the idle barrier for the ordinary
struct intel_context.
> Second bit is intel_engine_pm_is_awake which alone would be handled by
> intel_engine_gt_pm_flush.
Right, so long as we sure the engine is in the process of parking (which
it should be having sent the idle barrier observed above).
> So the first i915_active_is_idle check is the problematic one I think.
It's definitely the first one to trip over ;)
-Chris
More information about the Intel-gfx
mailing list