[Intel-gfx] [PATCH 1/2] drm/i915: Convert hangcheck from a timer into a delayed work item

Fri Jan 23 07:43:46 PST 2015

On Fri, Jan 23, 2015 at 02:44:07PM +0200, Mika Kuoppala wrote:
> From: Chris Wilson <chris at chris-wilson.co.uk>
> 
> When run as a timer, i915_hangcheck_elapsed() must adhere to all the
> rules of running in a softirq context. This is advantageous to us as we
> want to minimise the risk that a driver bug will prevent us from
> detecting a hung GPU. However, that is irrelevant if the driver bug
> prevents us from resetting and recovering. Still it is prudent not to
> rely on mutexes inside the checker, but given the coarseness of
> dev->struct_mutex doing so is extremely hard.
> 
> Give in and run from a work queue, i.e. outside of softirq.
> 
> v2:
> 
> The conversion does have one significant change: moving from mod_timer
> to schedule_delayed_work means that the time at which we execute the
> first hangcheck is fixed and not continually deferred by later work.
> This has the advantage of not allowing userspace to fill the ring before
> hangcheck can finally run. At the same time, it removes the ability for
> the interrupt to defer the hangcheck as well. This is sensible because
> an interrupt is only for a single engine, whereas we perform hangcheck
> globally, so whilst one ring may have hung, the other could be running
> normally and preventing the hangcheck from firing.
> 
> Cc: Jani Nikula <jani.nikula at intel.com>
> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk> (v2)
> Signed-off-by: Mika Kuoppala <mika.kuoppala at intel.com>
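
To make the v2 note above concrete, here is a rough userspace model of the
scheduling difference it describes. This is only a sketch and not driver
code: every model_* name is hypothetical, and the only kernel behaviour it
assumes is that mod_timer() moves the expiry of an already-armed timer,
while schedule_delayed_work() does nothing if the work is already pending.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_deferred {
	bool pending;     /* is an expiry currently armed? */
	long expires_ms;  /* absolute time at which it fires */
};

/* Models mod_timer(): always moves the expiry, even if already armed. */
void model_mod_timer(struct model_deferred *d, long now_ms, long delay_ms)
{
	d->pending = true;
	d->expires_ms = now_ms + delay_ms;
}

/* Models schedule_delayed_work(): a no-op if the work is already pending. */
bool model_schedule_delayed_work(struct model_deferred *d, long now_ms,
				 long delay_ms)
{
	if (d->pending)
		return false;
	d->pending = true;
	d->expires_ms = now_ms + delay_ms;
	return true;
}

/*
 * Request a hangcheck every step_ms over [0, end_ms) and return the time
 * of the first firing, or -1 if nothing fired inside the window.
 */
long model_first_fire(bool use_timer, long step_ms, long delay_ms,
		      long end_ms)
{
	struct model_deferred d = { false, 0 };
	long now;

	assert(step_ms > 0);
	for (now = 0; now < end_ms; now++) {
		/* Fire before handling any new request at this instant. */
		if (d.pending && now >= d.expires_ms)
			return now;
		if (now % step_ms == 0) {
			if (use_timer)
				model_mod_timer(&d, now, delay_ms);
			else
				model_schedule_delayed_work(&d, now, delay_ms);
		}
	}
	return -1;
}
```

With an assumed delay of 1500 ms and a new request every 100 ms,
model_first_fire(true, 100, 1500, 10000) never fires inside the window,
because each request pushes the expiry out again. The delayed-work
variant, model_first_fire(false, 100, 1500, 10000), fires at 1500 ms.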

One thing special with timers is that they'll always run, and if you do a
del_timer_sync from process context you can't deadlock with timer A
because you hold some locks for timer B that's in front of A in the
queues. With workqueues that's not the case, and it's really easy to cause
deadlocks by blocking some random work item in front of the queue by
accident.

I think for this switch we need our own, dedicated hangcheck work queue,
with its own thread to make sure it gets run reliably. But besides that I
really like this change.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch