[Intel-gfx] [PATCH 3/4] drm/i915/gt: Perform an arbitration check before busywaiting

Mon Jan 11 16:27:48 UTC 2021

Quoting Tvrtko Ursulin (2021-01-11 16:19:40)
> 
> On 11/01/2021 10:57, Chris Wilson wrote:
> > During igt_reset_nop_engine, it was observed that an unexpected failed
> > engine reset lead to us busywaiting on the stop-ring semaphore (set
> > during the reset preparations) on the first request afterwards. There was
> > no explicit MI_ARB_CHECK in this sequence as the presumption was that
> > the failed MI_SEMAPHORE_WAIT would itself act as an arbitration point.
> > It did not in this circumstance, so force it.
> 
> In other words MI_SEMAPHORE_POLL is not a preemption point? Can't 
> remember if I knew that or not..

MI_SEMAPHORE_WAIT | POLL is most definitely a preemption point on a
miss.

> 1)
> Why not the same handling in !gen12 version?

Because I think it's a bug in tgl [a0 at least]. I think I've seen the
same symptoms on tgl before, but not earlier. This is the first time the
sequence clicked as to why it was busy spinning. Random engine reset
failures are rare enough -- I was meant to also write a test case to
inject failure.

> 2)
> Failed reset leads to busy-hang in following request _tail_? But there 
> is an arb check at the start of following request as well. Or in cases 
> where we context switch into the middle of a previously executing request?

It was the first request submitted after the failed reset. We expect to
clear the ring-stop flag on the CS IDLE->ACTIVE event.

> But why would that busy hang? Hasn't the failed request unpaused the ring?

The engine was idle at the time of the failed reset. We left the
ring-stop set, and submitted the next batch of requests. We hit the
MI_SEMAPHORE_WAIT(ring-stop) at the end of the first request, but
without hitting an arbitration point (first request, no init-breadcrumb
in this case), the semaphore was stuck.
-Chris