[Intel-gfx] [PATCH 1/6] drm/i915: hangcheck robustification

Wed Oct 19 17:02:57 CEST 2011

On Wed, 19 Oct 2011 12:32:25 +0100
Chris Wilson <chris at chris-wilson.co.uk> wrote:

> On Tue, 11 Oct 2011 16:39:09 +0200, Daniel Vetter <daniel.vetter at ffwll.ch> wrote:
> > From: Ben Widawsky <ben at bwidawsk.net>
> > 
> > This was pulled out of the per ring error handling patch series as it
> > actually fixes two issues, and bikeshedding appears to be going on
> > there.
> > 
> > First, remove setting hangcheck_count when we do notify ring. While it
> > seems counterintuitive to be setting up a timer to catch hangcheck_count
> > greater than 0 with hangcheck_count already greater than 0, actually
> > when we go to check if the GPU is hung we clear that value if the gpu is
> > still alive. Leaving this is actually harmful as submitting work could
> > falsely clear the count while the hangcheck code is checking the count.
> > I can't think of a case where this doesn't just delay the inevitable
> > reset... but I didn't spend too much time thinking about it.
> > 
> > Second, for Gen5+ we have more information to be considered when
> > determining if the GPU is stuck, primarily the media ring (and blitter
> > ring in gen6). This patch will check all available rings, and also updates
> > error state with the new information. It theoretically can fix false
> > positives, but I haven't actually come across such a case.
> > 
> > Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
> > [danvet: remove remnants of an unrelated cleanup patch]
> > Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>
> 
> NAK: This failed to detect a hang, leaving my box frozen. I suspect that
> the value of INSTDONE was fluctuating on the render ring even though we
> had no requests pending and so could assume that it was idle.
> -Chris
> 
How is that different than the previous behavior? We checked instdone on
the render ring before this patch too.
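
To make the point under discussion concrete, here is a minimal userspace
sketch of a multi-ring check that skips the registers of any ring with no
pending requests, so a fluctuating INSTDONE on an idle ring cannot mask a
hang elsewhere. This is not the i915 code: struct ring_sample,
struct hangcheck_state, hangcheck_sample(), NUM_RINGS and HANG_THRESHOLD
are all hypothetical names made up for illustration, and the threshold of
two consecutive stuck samples is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_RINGS      3  /* assumed: render, media, blitter */
#define HANG_THRESHOLD 2  /* assumed: stuck samples needed to declare a hang */

/* Hypothetical snapshot of one ring, taken each time the timer fires. */
struct ring_sample {
	uint32_t acthd;        /* active head pointer */
	uint32_t instdone;     /* INSTDONE register value */
	bool     has_requests; /* true if work is still pending on this ring */
};

/* Hypothetical hangcheck bookkeeping kept between timer invocations. */
struct hangcheck_state {
	struct ring_sample last[NUM_RINGS];
	int count; /* consecutive samples without progress */
};

/*
 * Compare the current samples with the previous ones.
 * Returns true once HANG_THRESHOLD consecutive samples show no progress
 * on any ring that still has requests pending.
 */
static bool hangcheck_sample(struct hangcheck_state *hc,
			     const struct ring_sample now[NUM_RINGS])
{
	bool any_pending = false;
	bool progress = false;
	int i;

	assert(hc != NULL && now != NULL);

	for (i = 0; i < NUM_RINGS; i++) {
		/* An idle ring is ignored, whatever its registers say. */
		if (!now[i].has_requests)
			continue;

		any_pending = true;
		if (now[i].acthd != hc->last[i].acthd ||
		    now[i].instdone != hc->last[i].instdone)
			progress = true;
	}

	memcpy(hc->last, now, sizeof(hc->last));

	/* Only the check itself clears the count, never work submission. */
	if (!any_pending || progress) {
		hc->count = 0;
		return false;
	}

	return ++hc->count >= HANG_THRESHOLD;
}
```

In this sketch the count is cleared only from inside the check, which
mirrors the first change in the patch description; whether skipping idle
rings is the right policy for the real driver is exactly what is being
argued here.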