[Intel-gfx] [PATCH i-g-t] tests/initial_state: Add a test to capture the state of the GPU

Tue May 16 10:35:54 UTC 2017

On Tue, May 16, 2017 at 10:07:41AM +0000, Lofstedt, Marta wrote:
> 
> 
> > -----Original Message-----
> > From: Chris Wilson [mailto:chris at chris-wilson.co.uk]
> > Sent: Tuesday, May 16, 2017 12:48 PM
> > To: Lofstedt, Marta <marta.lofstedt at intel.com>
> > Cc: Daniel Vetter <daniel at ffwll.ch>; Martin Peres
> > <martin.peres at linux.intel.com>; intel-gfx at lists.freedesktop.org
> > Subject: Re: [Intel-gfx] [PATCH i-g-t] tests/initial_state: Add a test to capture
> > the state of the GPU
> > 
> > On Tue, May 16, 2017 at 09:43:52AM +0000, Lofstedt, Marta wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Chris Wilson [mailto:chris at chris-wilson.co.uk]
> > > > Sent: Tuesday, May 16, 2017 12:04 PM
> > > > To: Lofstedt, Marta <marta.lofstedt at intel.com>
> > > > Cc: Daniel Vetter <daniel at ffwll.ch>; Martin Peres
> > > > <martin.peres at linux.intel.com>; intel-gfx at lists.freedesktop.org
> > > > Subject: Re: [Intel-gfx] [PATCH i-g-t] tests/initial_state: Add a
> > > > test to capture the state of the GPU
> > > >
> > > > On Tue, May 16, 2017 at 08:54:51AM +0000, Lofstedt, Marta wrote:
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Chris Wilson [mailto:chris at chris-wilson.co.uk]
> > > > > > Sent: Tuesday, May 16, 2017 11:21 AM
> > > > > > To: Lofstedt, Marta <marta.lofstedt at intel.com>
> > > > > > Cc: Daniel Vetter <daniel at ffwll.ch>; Martin Peres
> > > > > > <martin.peres at linux.intel.com>; intel-gfx at lists.freedesktop.org
> > > > > > Subject: Re: [Intel-gfx] [PATCH i-g-t] tests/initial_state: Add
> > > > > > a test to capture the state of the GPU
> > > > > >
> > > > > > On Tue, May 16, 2017 at 07:42:51AM +0000, Lofstedt, Marta wrote:
> > > > > > > I hereby pull-out this patch.
> > > > > > > The idea of it was to know if we were already wedged at the
> > > > > > > beginning of
> > > > > > testing, that would give us information on how to interpret
> > > > > > silly results; such that test starting to get skipped and/or we
> > > > > > got dmesg-warns/incomplete on tests that usually should be skipped.
> > > > > > > Also, we are planning to soon deploy a piglit.conf solution
> > > > > > > where testing
> > > > > > will be terminated on wedged, so I agree that my test isn't really
> > needed.
> > > > > >
> > > > > > Not everything is broken by wedged; internally we just use that
> > > > > > as an indicator that GEM is hosed. KMS should still work, we
> > > > > > must still be able to drive the displays to show the error and
> > > > > > keep the servers alive until the data is saved (and hopefully
> > > > > > gracefully degrade that we don't have to interrupt their immediate
> > session).
> > > > >
> > > > > It doesn't matter if it is broken or not, if we are terminally
> > > > > wedged the rest
> > > > of the result may be silly. Look for example at CI_DRM_2612, the
> > > > fi-elk-e7500 is wedged at igt@gem_busy@basic-hang-default, then all
> > > > tests are skipped until gem_exec_reloc@basic-cpu-gtt-noreloc where
> > > > the machine hangs, but it is a gem test so it should have been
> > > > skipped, right. My conclusion from seeing this pattern multiple
> > > > times is that after terminally wedged, silly things can happen, i.e.
> > > > we can't trust the results, and since we don't want silly bugs, the CI
> > testing should be stopped.
> > > >
> > > > The machine didn't hang, it was remotely killed because the run timed
> > out.
> > > How do you know that?
> > 
> > The dmesg is a stream of flip timeouts until we run out of total BAT runtime
> > (12 minutes + some startup slack).
> > -Chris
> 
> Then look at CI_DRM_2602, wedged at igt@gem_busy@basic-hang-default, after a lot of skipping, we get an incomplete result for another test, this time gem_exec_reloc@basic-gtt-cpu-noreloc
> 
> So, gem_exec_reloc@basic-cpu-gtt-noreloc and gem_exec_reloc@basic-gtt-cpu-noreloc are falsely getting blamed and my conclusion is that this is due to the permanent wedging started at gem_busy@basic-hang-default. So, to avoid bug reports for gem_exec_reloc@basic-cpu-gtt-noreloc and gem_exec_reloc@basic-gtt-cpu-noreloc the suggestion is to stop testing after we are terminally wedged.

The conclusion is still wrong though: it doesn't derive from the wedged
state itself but from another bug that can be triggered by a recovered
gpu hang. The issue is that post-processing isn't yet smart enough to
determine the real cause and picks on the victim. We are always going to
have that problem in some form; the question is what can be done to make
the process smart enough to avoid false positives, or else to just say
this is a general bug and be careful not to be too precise with the blame.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre