[Intel-gfx] [PATCH 3/4] drm/i915/gt: Perform an arbitration check before busywaiting
chris at chris-wilson.co.uk
Mon Jan 11 21:54:36 UTC 2021
Quoting Tvrtko Ursulin (2021-01-11 17:12:57)
> On 11/01/2021 16:27, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2021-01-11 16:19:40)
> >> On 11/01/2021 10:57, Chris Wilson wrote:
> >>> During igt_reset_nop_engine, it was observed that an unexpected failed
> >>> engine reset led to us busywaiting on the stop-ring semaphore (set
> >>> during the reset preparations) on the first request afterwards. There was
> >>> no explicit MI_ARB_CHECK in this sequence as the presumption was that
> >>> the failed MI_SEMAPHORE_WAIT would itself act as an arbitration point.
> >>> It did not in this circumstance, so force it.
> >> In other words MI_SEMAPHORE_POLL is not a preemption point? Can't
> >> remember if I knew that or not..
> > MI_SEMAPHORE_WAIT | POLL is most definitely a preemption point on a
> > miss.
> >> 1)
> >> Why not the same handling in !gen12 version?
> > Because I think it's a bug in tgl [a0 at least]. I think I've seen the
> > same symptoms on tgl before, but not earlier. This is the first time the
> > sequence clicked as to why it was busy spinning. Random engine reset
> > failures are rare enough -- I was meant to also write a test case to
> > inject failure.
> Random engine reset failure you think is a TGL issue?
The MI_SEMAPHORE_WAIT | POLL miss not generating an arbitration point.
We have quite a few selftests and IGT that use this feature.
So I was wondering if this was similar to one of those tgl issues with
semaphores and CS events.
The random engine reset failure here is also decidedly odd. The engine
> >> 2)
> >> Failed reset leads to busy-hang in following request _tail_? But there
> >> is an arb check at the start of following request as well. Or in cases
> >> where we context switch into the middle of a previously executing request?
> > It was the first request submitted after the failed reset. We expect to
> > clear the ring-stop flag on the CS IDLE->ACTIVE event.
> >> But why would that busy hang? Hasn't the failed request unpaused the ring?
> > The engine was idle at the time of the failed reset. We left the
> > ring-stop set, and submitted the next batch of requests. We hit the
> > MI_SEMAPHORE_WAIT(ring-stop) at the end of the first request, but
> > without hitting an arbitration point (first request, no init-breadcrumb
> > in this case), the semaphore was stuck.
> So a kernel context request?
Ish. The selftest is using empty requests, and not emitting the
initial breadcrumb. (So acting like a kernel context.)
> Why hasn't IDLE->ACTIVE cleared ring stop?
There hasn't been an idle->active event, not a single CS event after
writing to ELSP and timing out while still spinning on the semaphore.
> Presumably this CSB must come after the first request has been submitted
> so apparently I am still not getting how it hangs.
It was never sent. The context is still in pending (not active)
and there's no sign in the trace of any interrupts/tasklet handling other
than the semaphore-wait interrupt.
> Just because igt_reset_nop_engine does things "quickly"? It prevents the
> CSB from arriving?
More that, since we do very little, we hit the semaphore before the CS
has recovered from the shock of being asked to do something.
> So ARB_CHECK picks up on the fact ELSP has been
> rewritten before the IDLE->ACTIVE event is received and/or engine reset
> prevented it from arriving?
The ARB_CHECK should trigger the CS to generate the IDLE->ACTIVE event.
(Of course assuming that the bug is in the semaphore not triggering the
event due to strange circumstances and not a bug in the event generator
itself.) I'm suspicious of the semaphore due to the earlier CS bugs with
lite-restores + semaphores, and am expecting that since the MI_ARB_CHECK
is explicit, it actually works.
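
To make the suspected sequence concrete, here is a toy model in C. It is only
a sketch of the hypothesis above, not driver code: every name (toy_engine,
toy_run_tail, and so on) is hypothetical, the commands are symbolic rather
than real MI opcode encodings, and it assumes that an arbitration point is
what lets the CS report IDLE->ACTIVE so that ring-stop gets cleared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Symbolic commands only; these are NOT the real MI opcode encodings. */
enum toy_cmd { TOY_NOOP, TOY_ARB_CHECK, TOY_SEMAPHORE_WAIT };

struct toy_engine {
	bool ring_stop;            /* left set by the failed reset */
	bool pending_submit;       /* ELSP written, context still pending */
	bool sema_miss_arbitrates; /* false models the suspected tgl behaviour */
};

/*
 * Assumed behaviour of an arbitration point: the CS notices the pending
 * submission and reports IDLE->ACTIVE, and handling that event clears
 * ring-stop.
 */
static void toy_arbitrate(struct toy_engine *e)
{
	if (e->pending_submit) {
		e->pending_submit = false;
		e->ring_stop = false;
	}
}

/* Returns true if the request tail completes, false if it is left
 * spinning on the ring-stop semaphore. */
static bool toy_run_tail(struct toy_engine *e,
			 const enum toy_cmd *cmds, size_t n)
{
	assert(cmds != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		switch (cmds[i]) {
		case TOY_ARB_CHECK:
			toy_arbitrate(e);
			break;
		case TOY_SEMAPHORE_WAIT:
			/* A miss is expected to arbitrate; the bug is that it may not. */
			if (e->ring_stop && e->sema_miss_arbitrates)
				toy_arbitrate(e);
			if (e->ring_stop)
				return false; /* busywait, no CS event ever arrives */
			break;
		case TOY_NOOP:
			break;
		}
	}
	return true;
}
```

With ring_stop and pending_submit set and sema_miss_arbitrates false, a tail
of just { TOY_SEMAPHORE_WAIT } gets stuck, while
{ TOY_ARB_CHECK, TOY_SEMAPHORE_WAIT } completes, which is the ordering the
patch forces.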