[Intel-gfx] [PATCH 41/53] drm/i915/bdw: Avoid non-lite-restore preemptions

Mon Jun 23 13:52:11 CEST 2014

> -----Original Message-----
> From: Daniel Vetter [mailto:daniel.vetter at ffwll.ch] On Behalf Of Daniel
> Vetter
> Sent: Wednesday, June 18, 2014 9:49 PM
> To: Mateo Lozano, Oscar
> Cc: intel-gfx at lists.freedesktop.org
> Subject: Re: [Intel-gfx] [PATCH 41/53] drm/i915/bdw: Avoid non-lite-restore
> preemptions
> 
> On Fri, Jun 13, 2014 at 04:37:59PM +0100, oscar.mateo at intel.com wrote:
> > From: Oscar Mateo <oscar.mateo at intel.com>
> >
> > In the current Execlists feeding mechanism, full preemption is not
> > supported yet: only lite-restores are allowed (this is: the GPU simply
> > samples a new tail pointer for the context currently in execution).
> >
> > But we have identified an scenario in which a full preemption occurs:
> > 1) We submit two contexts for execution (A & B).
> > 2) The GPU finishes with the first one (A), switches to the second one
> > (B) and informs us.
> > 3) We submit B again (hoping to cause a lite restore) together with C,
> > but in the time we spend writing to the ELSP, the GPU finishes B.
> > 4) The GPU start executing B again (since we told it so).
> > 5) We receive a B finished interrupt and, mistakenly, we submit C
> > (again) and D, causing a full preemption of B.
> >
> > By keeping a better track of our submissions, we can avoid the
> > scenario described above.
> 
> How? I don't see a way to fundamentally avoid the above race, and I don't
> really see an issue with it - the gpu should notice that there's not really any
> work done and then switch to C.
> 
> Or am I completely missing the point here?
> 
> With no clue at all this looks really scary.

The race is avoided by keeping track of how many times a context has been submitted to the hardware and by better discriminating the received context switch interrupts: in the example, when we have submitted B twice, we won´t submit C and D as soon as we receive the notification that B is completed because we were expecting to get a LITE_RESTORE and we didn´t, so we know a second completion will be received shortly.

Without this explicit checking, the race condition happens and, somehow, the batch buffer execution order gets messed with. This can be verified with the IGT test I sent together with the series. I don´t know the exact mechanism by which the pre-emption messes with the execution order but, since other people is working on the Scheduler + Preemption on Execlists, I didn´t try to fix it. In these series, only Lite Restores are supported (other kind of preemptions WARN).

I´ll add this clarification to the commit message.

-- Oscar