[Intel-gfx] [PATCH 3/4] drm/i915/gt: Perform an arbitration check before busywaiting

Mon Jan 11 17:12:57 UTC 2021

On 11/01/2021 16:27, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2021-01-11 16:19:40)
>>
>> On 11/01/2021 10:57, Chris Wilson wrote:
>>> During igt_reset_nop_engine, it was observed that an unexpected failed
>>> engine reset lead to us busywaiting on the stop-ring semaphore (set
>>> during the reset preparations) on the first request afterwards. There was
>>> no explicit MI_ARB_CHECK in this sequence as the presumption was that
>>> the failed MI_SEMAPHORE_WAIT would itself act as an arbitration point.
>>> It did not in this circumstance, so force it.
>>
>> In other words MI_SEMAPHORE_POLL is not a preemption point? Can't
>> remember if I knew that or not..
> 
> MI_SEMAPHORE_WAIT | POLL is most definitely a preemption point on a
> miss.
> 
>> 1)
>> Why not the same handling in !gen12 version?
> 
> Because I think it's a bug in tgl [a0 at least]. I think I've seen the
> same symptoms on tgl before, but not earlier. This is the first time the
> sequence clicked as to why it was busy spinning. Random engine reset
> failures are rare enough -- I was meant to also write a test case to
> inject failure.

Random engine reset failure you think is a TGL issue?

>   
>> 2)
>> Failed reset leads to busy-hang in following request _tail_? But there
>> is an arb check at the start of following request as well. Or in cases
>> where we context switch into the middle of a previously executing request?
> 
> It was the first request submitted after the failed reset. We expect to
> clear the ring-stop flag on the CS IDLE->ACTIVE event.
> 
>> But why would that busy hang? Hasn't the failed request unpaused the ring?
> 
> The engine was idle at the time of the failed reset. We left the
> ring-stop set, and submitted the next batch of requests. We hit the
> MI_SEMAPHORE_WAIT(ring-stop) at the end of the first request, but
> without hitting an arbitration point (first request, no init-breadcrumb
> in this case), the semaphore was stuck.

So a kernel context request? Why hasn't IDLE->ACTIVE cleared ring stop? 
Presumably this CSB must come after the first request has been submitted 
so apparently I am still not getting how it hangs.

Just because igt_reset_nop_engine does things "quickly"? It prevents the 
CSB from arriving? So ARB_CHECK pickups up on the fact ELSP has been 
rewritten before the IDLE->ACTIVE even received and/or engine reset 
prevented it from arriving?

Regards,

Tvrtko