[Intel-gfx] [PATCH 2/2] tests/gem_eio: Resilience against "hanging too fast"

Thu Nov 26 07:51:13 PST 2015

On Thu, Nov 26, 2015 at 03:34:05PM +0000, Chris Wilson wrote:
> On Thu, Nov 26, 2015 at 03:46:06PM +0100, Daniel Vetter wrote:
> > On Thu, Nov 26, 2015 at 12:59:37PM +0000, Chris Wilson wrote:
> > > On Thu, Nov 26, 2015 at 12:34:35PM +0100, Daniel Vetter wrote:
> > > > Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> > > > hang we need to be careful to not run into the "hanging too fast"
> > > > check:
> > > > 
> > > > - don't restore the ban period, but instead keep it at 0.
> > > > - make sure we idle the gpu fully before hanging it again (wait
> > > >   subtest missed that).
> > > > 
> > > > With this gem_eio now works reliably even when I don't run the
> > > > subtests individually.
> > > > 
> > > > Of course it's a bit fishy that the default ctx gets blamed for
> > > > essentially doing nothing, but until that's figured out in upstream
> > > > it's better to make the test work for now.
> > > 
> > > This used to be reliable. And just disabling all banning in the kernel
> > > forever more is silly.
> > > 
> > > During igt_post_hang_ring:
> > > 1. we wait upon the hanging batch
> > >  - this returns when hangcheck fires
> > > 2. reset the ban period to normal
> > >  - this takes mutex_lock_interruptible and so must wait for the reset
> > >    handler to run before it can make the change,
> > >  - ergo the hanging batch never triggers a ban for itself.
> > >  - (a subsequent nonsimulated GPU hang may trigger the ban though)
> > 
> > This isn't where it dies. It dies when we do the echo 1 > i915_wedged.
> 
> That is not where it dies.

Well, at least it happens after we start the hang recovery from i915_wedged.

> > I suspect quiescent_gpu or whatever is getting in the way, but I really only
> > wanted to get things to run first. And since i915_wedged is a developer
> > feature, and it does work perfectly if you don't intend to reuse contexts
> > I didn't see any point in first trying to fix it up.
> > 
> > So I still maintain that this is a good enough approach, at least if
> > there's no obvious fix in-flight already.
> 
> No way. This is a kernel regression since 4.0, having just tested with
> v4.0 on snb/ivb/hsw.

Ok, I didn't realize that. I figured that since i915_wedged will return
-EAGAIN anyway when we are terminally wedged, and that seems to have been
the case ever since we started with reset_counter, this has been broken
forever. I guess I missed something.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch