[Intel-gfx] Question about how to troubleshoot sandybridge kernel oops and subsequent GPU lockup

Tue Oct 25 09:49:49 CEST 2011

On Tue, Oct 25, 2011 at 09:15:58AM +0200, Jesse Barnes wrote:
> On Mon, 24 Oct 2011 19:43:44 -0700
> Kenneth Graunke <kenneth at whitecape.org> wrote:
> 
> > On 10/24/2011 05:58 PM, James R. Leu wrote:
> > > Debug output attached
> > 
> > You're in luck!  I fixed this GPU hang today in Mesa master.
> > 
> > This commit fixes the hang:
> > 
> > commit 3cc0a7be23ab603ed40d602595f673a44e079885
> > Author: Kenneth Graunke <kenneth at whitecape.org>
> > Date:   Fri Oct 21 01:03:37 2011 -0700
> > 
> >     i965: Apply post-sync non-zero workaround to homebrew workaround.
> > 
> >     In commit 3e5d3626, Eric added a homebrew workaround to fix GPU
> >     hangs in the Mesa "engine" demo and oglc's api-texcoord test.
> > 
> >     Unfortunately, his PIPE_CONTROL contains a Depth Stall, which
> >     necessitates the post-sync non-zero workaround.
> > 
> >     Fixes GPU hangs in Civilization 4, PlaneShift, and 3DMMES.
> >     Hopefully Heroes of Newerth as well, though I haven't tested that.
> > 
> >     NOTE: This is a candidate for the 7.11 branch.
> > 
> >     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=40324
> >     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=41096
> >     Signed-off-by: Kenneth Graunke <kenneth at whitecape.org>
> >     Reviewed-and-tested-by: Eric Anholt <eric at anholt.net>
> > 
> > I'm planning on cherry-picking it to the 7.11 branch in the next few
> > days, so it ought to make the upcoming 7.11.1 release.
> 
> It's good that we have so many ways and opportunities to test our GPU
> reset reliability.
> 
> Gordon, can you make sure our regular QA covers GPU hang detect and
> reset using a few different methods (e.g. the ones above but without
> the fix from Ken in Mesa)?  It's important that reset work really well
> and ideally w/o even being noticed by the user, so the more ways we
> have to wedge things, the better we can test the reset path's
> invisibility.

I'm thinking about adding a debugfs file that stops ringbuffer tail writes
on the specified ring to simulate a GPU hang. This way we can really
stress-test the hangcheck and error_state capture code. And by throwing
random workloads at the GPU while we "hang" it, we can hopefully give the
GPU reset code a decent workout and see whether it properly resets the GPU
(or just takes down the entire system).
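
To make the idea concrete, here is a rough userspace toy model, not actual
i915 code. Every name in it (Ring, Hangcheck, stop_tail_writes, simulate)
is hypothetical and only for illustration, and the real hangcheck looks at
more state than a single head pointer. It assumes one command is submitted
and at most one is consumed per tick:

```python
class Ring:
    """Toy ring: the driver advances the tail, the GPU chases it with head."""

    def __init__(self, size=64):
        self.size = size
        self.head = 0                  # GPU read pointer
        self.tail = 0                  # tail value the GPU can actually see
        self.sw_tail = 0               # tail as tracked by the driver
        self.stop_tail_writes = False  # hypothetical debugfs knob

    def submit(self, ncmds):
        """Queue commands; skip the tail 'register write' when the knob is set."""
        self.sw_tail = (self.sw_tail + ncmds) % self.size
        if not self.stop_tail_writes:
            self.tail = self.sw_tail

    def gpu_step(self):
        """The GPU consumes one command if one is visible to it."""
        if self.head != self.tail:
            self.head = (self.head + 1) % self.size

    def pending(self):
        """Commands the driver queued that the GPU has not consumed yet."""
        return (self.sw_tail - self.head) % self.size


class Hangcheck:
    """Declare a hang when work is pending but head has not moved for
    `threshold` consecutive checks."""

    def __init__(self, ring, threshold=2):
        self.ring = ring
        self.threshold = threshold
        self.last_head = ring.head
        self.stalled = 0

    def check(self):
        if self.ring.pending() and self.ring.head == self.last_head:
            self.stalled += 1
        else:
            self.stalled = 0
        self.last_head = self.ring.head
        return self.stalled >= self.threshold


def simulate(ticks, stop_at=None):
    """Return the tick at which the hang is detected, or None if never."""
    ring = Ring()
    hangcheck = Hangcheck(ring)
    for tick in range(ticks):
        if tick == stop_at:
            ring.stop_tail_writes = True  # "write to the debugfs file"
        ring.submit(1)
        ring.gpu_step()
        if hangcheck.check():
            return tick
    return None
```

With the knob left alone, simulate(20) never reports a hang. Calling
simulate(20, stop_at=5) reports one shortly after the tail writes stop,
which is the point where the reset path would take over.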
-Daniel
-- 
Daniel Vetter
Mail: daniel at ffwll.ch
Mobile: +41 (0)79 365 57 48