[Intel-gfx] [PATCH] RFC drm/i915: Stop the machine whilst capturing the GPU crash dump

Chris Wilson chris at chris-wilson.co.uk
Tue Oct 13 05:24:53 PDT 2015


On Tue, Oct 13, 2015 at 02:09:59PM +0200, Daniel Vetter wrote:
> On Fri, Oct 09, 2015 at 06:55:23PM +0100, Chris Wilson wrote:
> > On Fri, Oct 09, 2015 at 07:33:23PM +0200, Daniel Vetter wrote:
> > > On Fri, Oct 09, 2015 at 01:21:45PM +0100, Chris Wilson wrote:
> > > > The error state is purposefully racy as we expect it to be called at any
> > > > time and so have avoided any locking whilst capturing the crash dump.
> > > > However, with multi-engine GPUs and multiple CPUs, those races can
> > > > manifest into OOPSes as we attempt to chase dangling pointers freed on
> > > > other CPUs. Under discussion are lots of ways to slow down normal
> > > > operation in order to protect the post-mortem error capture, but what it
> > > > we take the opposite approach and freeze the machine whilst the error
> > > > catpure runs (note the GPU may still running, but as long as we don't
> > > > process any of the results the driver's bookkeeping will be static).
> > > > 
> > > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > > 
> > > One risk I see is that the list walking might still go off the rails when
> > > we stop the machine right in the middle of a list_move. With that we might
> > > start scanning the active list (of objects) on one engine and then midway
> > > through get to another engine and so never see the list_head again,
> > > looping forever. No idea yet what to do with that.
> > 
> > A move is a del followed by an insertion, you cannot step into an entry
> > that is the middle of moving lists - don't forget that stop_machine() is
> > a very heavy memory barrier. Similarly, the list_add() should ensure we
> > can't step forward into an element that will not lead back to the list.
> > So I am not convinced that it would be suspectible to that style of
> > hijacking.
> 
> The compiler could do havoc, so I think we need at least somewhat ordered
> lists updates. Using rcu lists primitives but stop_machine instead of
> kfree_rcu might do the trick.

I'd take the compiler barriers, but I don't want the mb() inside every
list update. And with a barrier, only walking the lists forwards in the
error capture, and the error capture being inside a stop_machine (so
mb() and no concurrent access) is safe. (Quite a list of brittle
caveats.)
 
> > The only alternative I see to list walking, is not to do any from the
> > error capture and rely on attaching enough information to the request
> > (along with register state) to be able to do postmortems.
> 
> That still means we need to at least protect the request lists to get at
> said request. And it sounded like you wouldn't like a kfree_rcu in there
> that much.

The burden has to be on the error capture side as having to do any atomic
operations whilst processing the requests quickly show up in the
profiles (at the moment here those profiles are dominated by the memory
access required to update the lists, where once those accesses were
dwarfed by the locked operations.) so I don't even relish the prospect
of adding atomic operations around list walking in the normal case.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


More information about the Intel-gfx mailing list