[Intel-gfx] [PATCH 32/38] drm/i915: Stop the machine whilst capturing the GPU crash dump

Chris Wilson chris at chris-wilson.co.uk
Wed Jun 8 11:37:17 UTC 2016


On Wed, Jun 08, 2016 at 12:06:05PM +0200, Daniel Vetter wrote:
> On Fri, Jun 03, 2016 at 05:55:47PM +0100, Chris Wilson wrote:
> > The error state is purposefully racy as we expect it to be called at any
> > time and so have avoided any locking whilst capturing the crash dump.
> > However, with multi-engine GPUs and multiple CPUs, those races can
> > manifest into OOPSes as we attempt to chase dangling pointers freed on
> > other CPUs. Under discussion are lots of ways to slow down normal
> > operation in order to protect the post-mortem error capture, but what it
> > we take the opposite approach and freeze the machine whilst the error
> > capture runs (note the GPU may still running, but as long as we don't
> > process any of the results the driver's bookkeeping will be static).
> > 
> > Note that by of itself, this is not a complete fix. It also depends on
> > the compiler barriers in list_add/list_del to prevent traversing the
> > lists into the void.
> > 
> > v2: Avoid drm_clflush_pages() inside stop_machine() as it may use
> > stop_machine() itself for its wbinvd fallback.
> > 
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> 
> rt folks will hate us for this I think. But yeah the only other options is
> mass-rcu-ifying everything, which is much more fragile. Ack on the general
> idea at least, need to look at what's all needed for list manipulation
> still.

General answer, if you have a problem with a GPU hang talk to whoever
caused it and tell them to stop ;)

Last inspection of list.h, suggests they are safe for our usage:

static inline void __list_del(struct list_head * prev, struct list_head * next)
{
	next->prev = prev;
	WRITE_ONCE(prev->next, next);
}


static inline void __list_add(struct list_head *new,
                              struct list_head *prev,
                              struct list_head *next)
{
        next->prev = new;
        new->next = next;
        new->prev = prev;
        WRITE_ONCE(prev->next, new);
}

i.e. they have gained the compiler barriers to prevent us from seeing a
partial list manipulation (they are basically RCU-safe by default now).

I also do passes over the error capture trying to minimise our list
usage so that we only have to verify the request list (and all GPU state
associated with the request should then be derivable from the request).
E.g that saves having to iterate over the context lists looking for the
request->ctx!
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


More information about the Intel-gfx mailing list