[Intel-gfx] [PATCH 3/4] drm/i915/gt: Perform an arbitration check before busywaiting

Mon Jan 11 21:54:36 UTC 2021

Quoting Tvrtko Ursulin (2021-01-11 17:12:57)
> 
> On 11/01/2021 16:27, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2021-01-11 16:19:40)
> >>
> >> On 11/01/2021 10:57, Chris Wilson wrote:
> >>> During igt_reset_nop_engine, it was observed that an unexpected failed
> >>> engine reset led to us busywaiting on the stop-ring semaphore (set
> >>> during the reset preparations) on the first request afterwards. There was
> >>> no explicit MI_ARB_CHECK in this sequence as the presumption was that
> >>> the failed MI_SEMAPHORE_WAIT would itself act as an arbitration point.
> >>> It did not in this circumstance, so force it.
> >>
> >> In other words MI_SEMAPHORE_POLL is not a preemption point? Can't
> >> remember if I knew that or not..
> > 
> > MI_SEMAPHORE_WAIT | POLL is most definitely a preemption point on a
> > miss.
> > 
> >> 1)
> >> Why not the same handling in !gen12 version?
> > 
> > Because I think it's a bug in tgl [a0 at least]. I think I've seen the
> > same symptoms on tgl before, but not earlier. This is the first time the
> > sequence clicked as to why it was busy spinning. Random engine reset
> > failures are rare enough -- I was meant to also write a test case to
> > inject failure.
> 
> Random engine reset failure you think is a TGL issue?

The MI_SEMAPHORE_WAIT | POLL miss not generating an arbitration point.
We have quite a few selftests and IGT that use this feature.

So I was wondering if this was similar to one of those tgl issues with
semaphores and CS events.

The random engine reset failure here is also decidedly odd. The engine
was idle!

> >> 2)
> >> Failed reset leads to busy-hang in following request _tail_? But there
> >> is an arb check at the start of following request as well. Or in cases
> >> where we context switch into the middle of a previously executing request?
> > 
> > It was the first request submitted after the failed reset. We expect to
> > clear the ring-stop flag on the CS IDLE->ACTIVE event.
> > 
> >> But why would that busy hang? Hasn't the failed request unpaused the ring?
> > 
> > The engine was idle at the time of the failed reset. We left the
> > ring-stop set, and submitted the next batch of requests. We hit the
> > MI_SEMAPHORE_WAIT(ring-stop) at the end of the first request, but
> > without hitting an arbitration point (first request, no init-breadcrumb
> > in this case), the semaphore was stuck.
> 
> So a kernel context request?

Ish. The selftest is using empty requests, and not emitting the
initial breadcrumb. (So acting like a kernel context.)

> Why hasn't IDLE->ACTIVE cleared ring stop? 

There hasn't been an idle->active event, not a single CS event after
writing to ELSP and timing out while still spinning on the semaphore.

> Presumably this CSB must come after the first request has been submitted 
> so apparently I am still not getting how it hangs.

It was never sent. The context is still in pending[0] (not active[0])
and there's no sign in the trace of any interrupts/tasklet handling other
than the semaphore-wait interrupt.
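
To make the stuck state concrete, here is a toy model of the sequence as
described above. It is only a sketch: struct toy_engine, its fields and both
functions are hypothetical names made up for illustration, not i915 code, and
it models the hypothesis (no arbitration point means no IDLE->ACTIVE event)
rather than verified hardware behaviour.

```c
#include <stdbool.h>

/* Hypothetical toy state; not an i915 structure. */
struct toy_engine {
	bool ring_stop;   /* left set by the failed reset */
	bool ctx_active;  /* false: still pending[0]; true: promoted to active[0] */
};

/* What processing the IDLE->ACTIVE CS event is expected to do. */
static void toy_handle_idle_to_active(struct toy_engine *e)
{
	e->ctx_active = true;
	e->ring_stop = false; /* ring-stop is cleared on this event */
}

/*
 * Run the semaphore wait at the tail of the first request.
 * arbitration_point: whether the CS reaches an arbitration point (and so,
 * by assumption, generates the IDLE->ACTIVE event) while waiting.
 * Returns true if the wait completes within max_polls.
 */
static bool toy_run_tail(struct toy_engine *e, bool arbitration_point,
			 int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (!e->ring_stop)
			return true;
		if (arbitration_point && !e->ctx_active)
			toy_handle_idle_to_active(e);
	}
	return !e->ring_stop;
}
```

Without an arbitration point the model never leaves pending[0] and the wait
times out, which matches the trace; with one, the event arrives and the
semaphore is released.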

> Just because igt_reset_nop_engine does things "quickly"? It prevents the 
> CSB from arriving?

More that, since we do very little, we hit the semaphore before the CS
has recovered from the shock of being asked to do something.

> So ARB_CHECK picks up on the fact ELSP has been
> rewritten before the IDLE->ACTIVE event is received, and/or the engine
> reset prevented it from arriving?

The ARB_CHECK should trigger the CS to generate the IDLE->ACTIVE event.
(Of course assuming that the bug is in the semaphore not triggering the
event due to strange circumstances and not a bug in the event generator
itself.) I'm suspicious of the semaphore due to the earlier CS bugs with
lite-restores + semaphores, and am expecting that since the MI_ARB_CHECK
is explicit, it actually works.
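
For illustration, the shape of the change is roughly the following. This is a
simplified, self-contained sketch and not the patch itself: the SKETCH_*
encodings are placeholders rather than the real command values, and
sketch_emit_stop_wait() and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder encodings for illustration only; NOT the real i915 values. */
#define SKETCH_MI_ARB_CHECK      0x05000000u
#define SKETCH_MI_SEMAPHORE_WAIT 0x1c000000u
#define SKETCH_SEMAPHORE_POLL    (1u << 15)

/*
 * Emit the busywait on the ring-stop dword, preceded by an explicit
 * arbitration check. Returns the advanced write pointer.
 */
static uint32_t *sketch_emit_stop_wait(uint32_t *cs, uint32_t *end,
				       uint32_t stop_addr)
{
	assert(end - cs >= 5); /* room for all five dwords */

	/* Explicit arbitration point; do not rely on the semaphore miss. */
	*cs++ = SKETCH_MI_ARB_CHECK;

	*cs++ = SKETCH_MI_SEMAPHORE_WAIT | SKETCH_SEMAPHORE_POLL;
	*cs++ = 0;         /* semaphore data: proceed once the dword reads 0 */
	*cs++ = stop_addr; /* address, low 32 bits */
	*cs++ = 0;         /* address, high 32 bits */

	return cs;
}
```

The only point being made is the ordering: the arbitration check is emitted
immediately before the semaphore wait.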
-Chris