[Intel-gfx] [PATCH 2/5] drm/i915/selftests: Flush the active callbacks
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Fri Nov 22 14:17:42 UTC 2019
On 22/11/2019 13:55, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2019-11-22 13:43:17)
>>
>> On 22/11/2019 13:09, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2019-11-22 13:01:56)
>>>>
>>>> On 22/11/2019 11:21, Chris Wilson wrote:
>>>>> Before checking the current i915_active state for the asynchronous work
>>>>> we submitted, flush any ongoing callback. This ensures that our sampling
>>>>> is robust and does not sporadically fail due to bad timing as the work
>>>>> is running on another cpu.
>>>>>
>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>> ---
>>>>> drivers/gpu/drm/i915/gt/selftest_context.c | 19 +++++++++++++------
>>>>> 1 file changed, 13 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/selftest_context.c b/drivers/gpu/drm/i915/gt/selftest_context.c
>>>>> index 3586af636304..939798338242 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/selftest_context.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/selftest_context.c
>>>>> @@ -48,20 +48,25 @@ static int context_sync(struct intel_context *ce)
>>>>>
>>>>> mutex_lock(&tl->mutex);
>>>>> do {
>>>>> - struct dma_fence *fence;
>>>>> + struct i915_request *rq;
>>>>> long timeout;
>>>>>
>>>>> - fence = i915_active_fence_get(&tl->last_request);
>>>>> - if (!fence)
>>>>> + if (list_empty(&tl->requests))
>>>>> break;
>>>>>
>>>>> - timeout = dma_fence_wait_timeout(fence, false, HZ / 10);
>>>>> + rq = list_last_entry(&tl->requests, typeof(*rq), link);
>>>>> + i915_request_get(rq);
>>>>> +
>>>>> + timeout = i915_request_wait(rq, 0, HZ / 10);
>>>>> if (timeout < 0)
>>>>> err = timeout;
>>>>> else
>>>>> - i915_request_retire_upto(to_request(fence));
>>>>> + i915_request_retire_upto(rq);
>>>>>
>>>>> - dma_fence_put(fence);
>>>>> + spin_lock_irq(&rq->lock);
>>>>> + spin_unlock_irq(&rq->lock);
>>>>
>>>> Ugh.. this at least needs a comment.
>>>>
>>>> What is the actual race?
>>>
>>> We complete the sync before the interrupt handler's irq work has
>>> finished executing the callback to mark the barrier as completed.
>>> So when we look at whether the engine is parked or not, we see the state
>>> before the request has finished processing and find it still active.
>>
>> You mean i915_active_fence_get does not really return the last request?
>> How can that be? Or is the key just intel_engine_pm_flush?
>
> The active ingredient in this patch should be using the tl->requests
> instead of tl->last_request and then using pm_flush()
>
> Looking at what was there before:
>>>>> - fence = i915_active_fence_get(&tl->last_request);
>
> is definitely where I went "yikes, if we are skipping because
> last_request is NULL, we had better make sure that the barrier callbacks
> on the rq->fence.cb_list were executed" which means waiting until the
> interrupt handler dropped rq->lock.
>
> Looking at the code that is there now, retiring the known list of
> requests (and then looping the engine_park addition) should be accurate.
Hmm, I hate relying on such intimate knowledge of the internals. And I am
still not following. :( Is some path bypassing the addition to
tl->last_request?
I see that the selftest checks two things: first, i915_active_is_idle,
which needs the context to be retired.
The second is intel_engine_pm_is_awake, which alone would be handled by
intel_engine_gt_pm_flush.
So the first check, i915_active_is_idle, is the problematic one I think.
Regards,
Tvrtko