[Intel-gfx] [PATCH 1/2] drm/i915: Convert hangcheck from a timer into a delayed work item

Fri Jan 23 12:58:35 PST 2015

On Fri, Jan 23, 2015 at 02:44:07PM +0200, Mika Kuoppala wrote:
> From: Chris Wilson <chris at chris-wilson.co.uk>
> 
> When run as a timer, i915_hangcheck_elapsed() must adhere to all the
> rules of running in a softirq context. This is advantageous to us as we
> want to minimise the risk that a driver bug will prevent us from
> detecting a hung GPU. However, that is irrelevant if the driver bug
> prevents us from resetting and recovering. Still it is prudent not to
> rely on mutexes inside the checker, but given the coarseness of
> dev->struct_mutex doing so is extremely hard.
> 
> Give in and run from a work queue, i.e. outside of softirq.
> 
> v2:
> 
> The conversion does have one significant change, from the use of
> mod_timer to schedule_delayed_work, means that the time that we execute
> the first hangcheck is fixed and not continually deferred by later work.
> This has the advantage of not allowing userspace to fill the ring before
> hangcheck can finally run. At the same time, it removes the ability for
> the interrupt to defer the hangcheck as well. This is sensible for that
> an interrupt is only for a single engine, whereas we perform hangcheck
> globally, so whilst one ring may have hung, the other could be running
> normally and preventing the hangcheck from firing.

We can drop this comment since we have already applied this change in an
earlier patch.

> @@ -3097,17 +3099,11 @@ static void i915_hangcheck_elapsed(unsigned long data)
>  
>  void i915_queue_hangcheck(struct drm_device *dev)
>  {
> -	struct drm_i915_private *dev_priv = dev->dev_private;
> -	struct timer_list *timer = &dev_priv->gpu_error.hangcheck_timer;
> -
>  	if (!i915.enable_hangcheck)
>  		return;
>  
> -	/* Don't continually defer the hangcheck, but make sure it is active */

Keep the comment, or more preferrably let's capture the reason why we
don't want to continually defer the hangcheck:

/* Don't continually defer the hangcheck so that it is always run at
 * least once after work has been scheduled on any ring. Otherwise,
 * we will ignore a hung ring if a second ring is kept busy.
 */

> -	if (timer_pending(timer))
> -		return;
> -	mod_timer(timer,
> -		  round_jiffies_up(jiffies + DRM_I915_HANGCHECK_JIFFIES));
> +	schedule_delayed_work(&to_i915(dev)->gpu_error.hangcheck_work,
> +			      round_jiffies_up_relative(DRM_I915_HANGCHECK_JIFFIES));
>  }

-- 
Chris Wilson, Intel Open Source Technology Centre