[Intel-gfx] [PATCH] drm/i915/execlists: Reset ring registers on rebinding contexts

Chris Wilson chris at chris-wilson.co.uk
Wed Mar 28 08:32:10 UTC 2018


Quoting Mika Kuoppala (2018-03-28 08:58:38)
> Chris Wilson <chris at chris-wilson.co.uk> writes:
> 
> > Tvrtko uncovered a fun issue with recovering from a wedged device. In
> > his tests, he wedged the driver by injecting an unrecoverable hang
> > whilst a batch was spinning. As we reset the gpu in the middle of the
> > spinner, when resumed it would continue on from the next instruction
> > in the ring and write its breadcrumb. However, on wedging we updated
> > our bookkeeping to indicate that the GPU had completed executing and
> > would restart from after the breadcrumb; so the emission of the stale
> > breadcrumb from before the reset came as a bit of a surprise.
> >
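To make the bookkeeping mismatch concrete, here is a toy userspace
model (not the i915 code; every name in it is invented for
illustration):

/* Toy model of the wedge/resume mismatch. The hw resumes from the
 * RING_HEAD saved in the context image, while the driver's seqno
 * bookkeeping was fast-forwarded past the breadcrumb on wedging.
 */
#include <stdio.h>

struct toy_ctx_image {
	unsigned int ring_head;	/* where the hw will resume from */
};

int main(void)
{
	struct toy_ctx_image img = { .ring_head = 32 }; /* inside spinner */
	unsigned int breadcrumb_at = 64; /* stale seqno write in the ring */
	unsigned int sw_head = 96; /* driver: everything has completed */

	/* hw reset: the saved context image is left untouched */

	/* resume without resetting the ring registers */
	if (img.ring_head < breadcrumb_at)
		printf("hw replays the stale breadcrumb - surprise!\n");

	/* the fix: reset the ring registers on rebinding the context */
	img.ring_head = sw_head;
	if (img.ring_head >= breadcrumb_at)
		printf("hw resumes from the driver's head, no stale write\n");

	return 0;
}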
> 
> OK, trying to make sense of the above and of how the wedging works.
> Here are my assertions.
> 
> The spinning batch was never found to be guilty of anything.

It was definitely guilty.

> On wedge we fast-forwarded all engine seqnos to be what
> was last submitted.

Correct.

> We did hw reset.

Correct.

> In the context image, RING_HEAD was pointing to the BB start
> of the spin batch (or the instruction after it).

Instruction after.

> On resubmitting the context, we saw a seqno write from the
> pre-reset era.

Correct.
 
> So this doesn't affect only spinning batches, but any busy
> batch that was running when we wedged?

Correct. Any execlists recovery from _wedged_ would be prone to hitting
this bug. Legacy submission already applies the ring register reset on
recovery.
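
For reference, the fix amounts to something along these lines in the
execlists context-pin path; a sketch against the contemporary
lrc_reg_state layout (value stored at reg index + 1), with a made-up
helper name, not necessarily the exact patch:

/* Sketch (hypothetical helper): when (re)pinning the context, rewrite
 * the ring registers in the saved context image so the hw resumes
 * from the driver's view of the ring, not from whatever RING_HEAD was
 * saved before the reset.
 */
static void __execlists_reset_ring_regs(struct intel_context *ce)
{
	u32 *regs = ce->lrc_reg_state;

	regs[CTX_RING_BUFFER_START + 1] = i915_ggtt_offset(ce->ring->vma);
	regs[CTX_RING_HEAD + 1] = ce->ring->head;
	regs[CTX_RING_TAIL + 1] = ce->ring->tail;
}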

Come to think of it: should we ban all contexts on wedging, if we can?
Or at least run the ban accounting for a failed reset. That sounds
more plausible (set_wedge() is a nasty lockless affair).
-Chris
