[Intel-gfx] [PATCH 02/13] drm/i915: Stop the machine whilst capturing the GPU crash dump

Fri Aug 5 19:57:03 UTC 2016

On Fri, Aug 05, 2016 at 08:01:58PM +0100, Chris Wilson wrote:
> On Fri, Aug 05, 2016 at 08:50:11PM +0200, Daniel Vetter wrote:
> > On Fri, Aug 05, 2016 at 10:05:53AM +0100, Chris Wilson wrote:
> > > The error state is purposefully racy as we expect it to be called at any
> > > time and so have avoided any locking whilst capturing the crash dump.
> > > However, with multi-engine GPUs and multiple CPUs, those races can
> > > manifest into OOPSes as we attempt to chase dangling pointers freed on
> > > other CPUs. Under discussion are lots of ways to slow down normal
> > > operation in order to protect the post-mortem error capture, but what it
> > > we take the opposite approach and freeze the machine whilst the error
> > > capture runs (note the GPU may still running, but as long as we don't
> > > process any of the results the driver's bookkeeping will be static).
> > > 
> > > Note that by of itself, this is not a complete fix. It also depends on
> > > the compiler barriers in list_add/list_del to prevent traversing the
> > > lists into the void.
> > 
> > The other important bit I think are NULL checks. I think the commit
> > message should mention that too.
> 
> Not so convinced here. My idea for GPU error capture is that we
> basically grab the bad request and then probe from there. The only list
> (and pointer chasing) I want to walk in a dangerous manner is the request
> list. Everything we need to report should be derivable from that
> request, and as the request is reachable from the list we know it is
> intact. I guess we could err on !request->fence.refcount and run away.

Yeah, this story is more convincing, would be good to add it.
-Daniel

> 
> > > v2: Avoid drm_clflush_pages() inside stop_machine() as it may use
> > > stop_machine() itself for its wbinvd fallback.
> > 
> > Droppingt he clflush from error capture seems like a pretty big
> > regression, at least at a glance. Why does this not result in piles of
> > corrupted error state captures?
> 
> We replace the clflush with an ever bigger complete cache flush.
> 
> However...
> 
> > I guess since we need to flush on incoherent platforms before gpu access
> > anyway this is mostly true (as long as userspace doesn't do something
> > silly), but I think it should be captured in a comment at the place of the
> > clflush.
> 
> Indeed. All the data we read has been flushed out of the cpu caches.
> If not, then right there is a reason for a hang.  It's moot as I
> propose just using the GTT for accessing every page since it is the only
> universal method we have - and it has been my preference for GPU capture
> because it gives us coherent data as the GPU would see it. We only
> stopped doing so more recently when the batches were not accessible from
> the GTT.
> -Chris
> 
> -- 
> Chris Wilson, Intel Open Source Technology Centre

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch