[Intel-gfx] [PATCH 2/2] drm/i915: Protect engine request list with spinlock

Daniel Vetter daniel at ffwll.ch
Tue Feb 24 04:57:56 PST 2015


On Tue, Feb 24, 2015 at 10:52:49AM +0000, Chris Wilson wrote:
> On Tue, Feb 24, 2015 at 11:39:08AM +0100, Daniel Vetter wrote:
> > On Tue, Feb 24, 2015 at 08:31:18AM +0000, Chris Wilson wrote:
> > > On Tue, Feb 24, 2015 at 12:58:19AM +0100, Daniel Vetter wrote:
> > > > On Thu, Feb 19, 2015 at 04:41:12PM +0000, Chris Wilson wrote:
> > > > > On Thu, Feb 19, 2015 at 06:18:55PM +0200, Mika Kuoppala wrote:
> > > > > > There are multiple players interested in the ring->request_list
> > > > > > state. Request submission can happen in kernel or user context,
> > > > > > idle worker is going through request list to free items. And then there
> > > > > > is hangcheck worker which tries to figure out if particular ring is
> > > > > > healthy by peeking at the request list among other things. And if
> > > > > > judged stuck by hangcheck, error state is colleted. Which in turns
> > > > > > needs access to ring->request_list.
> > > > > 
> > > > > We have discussed this before. Hangcheck does not need the lock so long
> > > > > as it is serialised with deletion. List processing with hangcheck during
> > > > > concurrent addition is safe.
> > > > > 
> > > > > For example, I expect the request locking to look like
> > > > > 
> > > > > http://cgit.freedesktop.org/~ickle/linux-2.6/tree/drivers/gpu/drm/i915/i915_gem_request.c#n691
> > > > 
> > > > I think longer-term with per-engine reset and fun stuff like that we
> > > > probably want the spinlock, just to avoid too many headaches with locking
> > > > auditing. For the execbuf fastpath it should just be one more spinlock per
> > > > ioctl, so hopefully bearable.
> > > 
> > > But it is not even the locking bug that breaks capture, so what's the
> > > point?
> > 
> > Oh I've read the patch as general prep work for more finegrained reset
> > support not as a fix for the referenced bug. I guess the bug is just the
> > usual incoherent seqno/irq thing that's been plagueing us ever since gen6?
> 
> I presumed Mika wants to fix that hangcheck and capture may explode as
> requests are completed concurrently. The bug that I expect will remain
> is that we peek at the bo without locks during capture.

Well my idea was that if we prevent request retiring with some minimal
spinlock then that should be enough to prevent objects from getting
retired. Which I hoped should be all we need to prevent everything else
from going poof too.

But thinking a bit more about this we need some additional checks too in
the retire code: If it grabs the request spinlock then it also needs to
check for in-progress reset. And if that's signalled it may not retire any
objects.

I think for hangcheck itself the spinlock alone should give sufficient
protection, as long as we store any state needed by hangcheck somewhere in
the request struct (or stuff hanging off it like contexts). But there's
indeed more trouble in the error capture code on top of that.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


More information about the Intel-gfx mailing list