[Intel-gfx] [PATCH 3/4] drm/i915/gt: Perform an arbitration check before busywaiting

Tue Jan 12 09:21:17 UTC 2021

On 11/01/2021 21:54, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2021-01-11 17:12:57)
>>
>> On 11/01/2021 16:27, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2021-01-11 16:19:40)
>>>>
>>>> On 11/01/2021 10:57, Chris Wilson wrote:
>>>>> During igt_reset_nop_engine, it was observed that an unexpected failed
>>>>> engine reset lead to us busywaiting on the stop-ring semaphore (set
>>>>> during the reset preparations) on the first request afterwards. There was
>>>>> no explicit MI_ARB_CHECK in this sequence as the presumption was that
>>>>> the failed MI_SEMAPHORE_WAIT would itself act as an arbitration point.
>>>>> It did not in this circumstance, so force it.
>>>>
>>>> In other words MI_SEMAPHORE_POLL is not a preemption point? Can't
>>>> remember if I knew that or not..
>>>
>>> MI_SEMAPHORE_WAIT | POLL is most definitely a preemption point on a
>>> miss.
>>>
>>>> 1)
>>>> Why not the same handling in !gen12 version?
>>>
>>> Because I think it's a bug in tgl [a0 at least]. I think I've seen the
>>> same symptoms on tgl before, but not earlier. This is the first time the
>>> sequence clicked as to why it was busy spinning. Random engine reset
>>> failures are rare enough -- I was meant to also write a test case to
>>> inject failure.
>>
>> Random engine reset failure you think is a TGL issue?
> 
> The MI_SEMAPHORE_WAIT | POLL miss not generating an arbitration point.
> We have quite a few selftests and IGT that use this feature.
> 
> So I was wondering if this was similar to one of those tgl issues with
> semaphores and CS events.
> 
> The random engine reset failure here is also decidedly odd. The engine
> was idle!
> 
>>>> 2)
>>>> Failed reset leads to busy-hang in following request _tail_? But there
>>>> is an arb check at the start of following request as well. Or in cases
>>>> where we context switch into the middle of a previously executing request?
>>>
>>> It was the first request submitted after the failed reset. We expect to
>>> clear the ring-stop flag on the CS IDLE->ACTIVE event.
>>>
>>>> But why would that busy hang? Hasn't the failed request unpaused the ring?
>>>
>>> The engine was idle at the time of the failed reset. We left the
>>> ring-stop set, and submitted the next batch of requests. We hit the
>>> MI_SEMAPHORE_WAIT(ring-stop) at the end of the first request, but
>>> without hitting an arbitration point (first request, no init-breadcrumb
>>> in this case), the semaphore was stuck.
>>
>> So a kernel context request?
> 
> Ish. The selftest is using empty requests, and not emitting the
> initial breadcrumb. (So acting like a kernel context.)
> 
>> Why hasn't IDLE->ACTIVE cleared ring stop?
> 
> There hasn't been an idle->active event, not a single CS event after
> writing to ELSP and timing out while still spinning on the semaphore.
> 
>> Presumably this CSB must come after the first request has been submitted
>> so apparently I am still not getting how it hangs.
> 
> It was never sent. The context is still in pending[0] (not active[0])
> and there's no sign in the trace of any interrupts/tasklet handing other
> than the semaphore-wait interrupt.
>   
>> Just because igt_reset_nop_engine does things "quickly"? It prevents the
>> CSB from arriving?
> 
> More that the since we do very little we hit the semaphore before the CS
> has recovered from the shock of being asked to do something.
> 
>> So ARB_CHECK pickups up on the fact ELSP has been
>> rewritten before the IDLE->ACTIVE even received and/or engine reset
>> prevented it from arriving?
> 
> The ARB_CHECK should trigger the CS to generate the IDLE->ACTIVE event.
> (Of course assuming that the bug is in the semaphore not triggering the
> event due to strange circumstances and not a bug in the event generator
> itself.) I'm suspicious of the semaphore due to the earlier CS bugs with
> lite-restores + semaphores, and am expecting that since the MI_ARB_CHECK
> is explicit, it actually works.

Okay got it, thanks. I suggest it would be good to slightly improve the 
commit message so it is clear what are the suspected TGL quirks. But in 
general:

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>

Regards,

Tvrtko