[Intel-gfx] [PATCH 1/6] drm/i915: hangcheck robustification

Wed Oct 19 13:32:25 CEST 2011

On Tue, 11 Oct 2011 16:39:09 +0200, Daniel Vetter <daniel.vetter at ffwll.ch> wrote:
> From: Ben Widawsky <ben at bwidawsk.net>
> 
> This was pulled out of the per ring error handling patch series as it
> actually fixes two issues, and bikeshedding appears to be going on
> there.
> 
> First, remove setting hangcheck_count when we do notify ring. While it
> seems counterintuitive to be setting up a timer to catch hangcheck_count
> greater than 0 with hangcheck_count already greater than 0, actually
> when we go to check if the GPU is hung we clear that value if the GPU is
> still alive. Leaving this is actually harmful as submitting work could
> falsely clear the count while the hangcheck code is checking the count.
> I can't think of a case where this doesn't just delay the inevitable
> reset... but I didn't spend too much time thinking about it.
> 
> Second, for Gen5+ we have more information to be considered when
> determining if the GPU is stuck, primarily the media ring (and blitter
> ring in gen6). This patch will check all available rings, and also updates
> error state with the new information. It theoretically cant fix false
> positives, but I haven't actually come across such a case.
> 
> Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
> [danvet: remove remnants of an unrelated cleanup patch]
> Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>

NAK: This failed to detect a hang, leaving my box frozen. I suspect that
the value of INSTDONE was fluctuating on the render ring even though we
had no requests pending and so could assume that it was idle.
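
To make the suspected failure mode concrete, here is a rough sketch of a
check that ignores INSTDONE changes on a ring with nothing outstanding.
It is only an illustration of the idea, not code from the patch or from
i915: struct ring_sample, its fields, and both function names are
hypothetical, and the real driver state looks different.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-ring snapshot taken by a hangcheck timer.
 * These are NOT the i915 structures, just stand-ins. */
struct ring_sample {
	uint32_t instdone;      /* INSTDONE as sampled on this tick */
	uint32_t prev_instdone; /* INSTDONE as sampled on the previous tick */
	bool has_pending;       /* true if requests are outstanding */
};

/* A change in INSTDONE only counts as progress when the ring actually
 * has work queued; an idle ring is treated as idle even if its
 * INSTDONE value fluctuates. */
static bool ring_made_progress(const struct ring_sample *ring)
{
	if (!ring->has_pending)
		return false;
	return ring->instdone != ring->prev_instdone;
}

/* Report a suspected hang when at least one ring has pending work and
 * none of the busy rings made progress since the previous tick. */
static bool gpu_looks_hung(const struct ring_sample *rings, int count)
{
	bool any_pending = false;
	int i;

	for (i = 0; i < count; i++) {
		if (!rings[i].has_pending)
			continue;
		any_pending = true;
		if (ring_made_progress(&rings[i]))
			return false;
	}
	return any_pending;
}
```

With a check that simply compares INSTDONE across all rings, an idle
render ring whose value fluctuates would keep masking a stuck ring that
does have work queued, which would match what I saw.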
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre