[Intel-gfx] [PATCH 28/28] drm/i915: Replace hangcheck by heartbeats

Chris Wilson chris at chris-wilson.co.uk
Mon Aug 26 16:56:32 UTC 2019


Quoting Bloomfield, Jon (2019-08-26 15:08:02)
> > -----Original Message-----
> > From: Chris Wilson <chris at chris-wilson.co.uk>
> > + * While the engine is active, we send a periodic pulse along the entire
> 
> along the entire what?

s/entire/engine/

> > + * to check on its health and to flush any idle-barriers. If that request
> > + * is stuck, and we fail to preempt it, we declare the engine hung and
> > + * issue a reset -- in the hope that restores progress.
> > + */
> > +
> > +static long delay(void)
> 
> Probably NOT the best function name in the world, ever. 'delay' could equally be a verb, so it's not self-descriptive.

Is it a noun or verb in this case? Or just shorthand.

> > +{
> > +     const long t =
> > msecs_to_jiffies(CONFIG_DRM_I915_HEARTBEAT_INTERVAL);
> > +
> > +     return round_jiffies_up_relative(t);
> > +}

> > +static void heartbeat(struct work_struct *wrk)
> > +{
> > +     struct i915_sched_attr attr = {
> > +             .priority = I915_USER_PRIORITY(I915_PRIORITY_MIN),
> > +     };
> > +     struct intel_engine_cs *engine =
> > +             container_of(wrk, typeof(*engine), heartbeat.work);
> > +     struct intel_context *ce = engine->kernel_context;
> > +     struct i915_request *rq;
> > +
> > +     if (!intel_engine_pm_get_if_awake(engine))
> > +             return;
> > +
> > +     rq = engine->last_heartbeat;
> > +     if (rq && i915_request_completed(rq)) {
> > +             i915_request_put(rq);
> > +             engine->last_heartbeat = NULL;
> > +     }
> > +
> > +     if (intel_gt_is_wedged(engine->gt))
> > +             goto out;
> > +
> > +     if (engine->last_heartbeat) {
> > +             if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)) {
> > +                     struct drm_printer p = drm_debug_printer(__func__);
> > +
> > +                     intel_engine_dump(engine, &p,
> > +                                       "%s heartbeat not ticking\n",
> > +                                       engine->name);
> > +             }
> > +
> > +             if (engine->schedule &&
> > +                 rq->sched.attr.priority < I915_PRIORITY_BARRIER) {
> > +                     attr.priority =
> > +
> >       I915_USER_PRIORITY(I915_PRIORITY_HEARTBEAT);
> > +                     if (rq->sched.attr.priority >= attr.priority)
> > +                             attr.priority = I915_PRIORITY_BARRIER;
> > +
> > +                     local_bh_disable();
> > +                     engine->schedule(rq, &attr);
> > +                     local_bh_enable();
> > +             } else {
> > +                     intel_gt_handle_error(engine->gt, engine->mask,
> > +                                           I915_ERROR_CAPTURE,
> > +                                           "stopped heartbeat on %s",
> > +                                           engine->name);
> > +             }
> > +             goto out;
> > +     }
> > +
> > +     if (engine->wakeref_serial == engine->serial)
> > +             goto out;
> > +
> > +     mutex_lock(&ce->timeline->mutex);
> > +
> > +     intel_context_enter(ce);
> > +     rq = __i915_request_create(ce, GFP_NOWAIT | __GFP_NOWARN);
> > +     intel_context_exit(ce);
> > +     if (IS_ERR(rq))
> > +             goto unlock;
> > +
> > +     idle_pulse(engine, rq);
> > +     if (i915_modparams.enable_hangcheck)
> > +             engine->last_heartbeat = i915_request_get(rq);
> > +
> > +     __i915_request_commit(rq);
> > +     __i915_request_queue(rq, &attr);
> > +
> > +unlock:
> > +     mutex_unlock(&ce->timeline->mutex);
> > +out:
> > +     schedule_delayed_work(&engine->heartbeat, delay());
> 
> Isn't engine->heartbeat now NULL in some cases?

engine->heartbeat, the worker
vs
engine->last_heartbeat

Maybe,

struct intel_engine_heartbeat {
	work;
	systole;
};


> > diff --git a/drivers/gpu/drm/i915/i915_getparam.c
> > b/drivers/gpu/drm/i915/i915_getparam.c
> > index 5d9101376a3d..e6c351080593 100644
> > --- a/drivers/gpu/drm/i915/i915_getparam.c
> > +++ b/drivers/gpu/drm/i915/i915_getparam.c
> > @@ -78,8 +78,7 @@ int i915_getparam_ioctl(struct drm_device *dev, void
> > *data,
> >                       return -ENODEV;
> >               break;
> >       case I915_PARAM_HAS_GPU_RESET:
> > -             value = i915_modparams.enable_hangcheck &&
> > -                     intel_has_gpu_reset(i915);
> > +             value = intel_has_gpu_reset(i915);
> 
> Don't understand this tweak. We haven't really changed the essence of hangcheck, just improved it, so why do we change this get_param?

I deleted the modparam in earlier patches. But anticipated you would
object...
-Chris


More information about the Intel-gfx mailing list