[Intel-gfx] [PATCH 28/28] drm/i915: Replace hangcheck by heartbeats
Chris Wilson
chris at chris-wilson.co.uk
Mon Aug 26 16:56:32 UTC 2019
Quoting Bloomfield, Jon (2019-08-26 15:08:02)
> > -----Original Message-----
> > From: Chris Wilson <chris at chris-wilson.co.uk>
> > + * While the engine is active, we send a periodic pulse along the entire
>
> along the entire what?
s/entire/engine/
> > + * to check on its health and to flush any idle-barriers. If that request
> > + * is stuck, and we fail to preempt it, we declare the engine hung and
> > + * issue a reset -- in the hope that restores progress.
> > + */
> > +
> > +static long delay(void)
>
> Probably NOT the best function name in the world, ever. 'delay' could equally be a verb, so it's not self-descriptive.
Is it a noun or verb in this case? Or just shorthand.
> > +{
> > + const long t =
> > msecs_to_jiffies(CONFIG_DRM_I915_HEARTBEAT_INTERVAL);
> > +
> > + return round_jiffies_up_relative(t);
> > +}
> > +static void heartbeat(struct work_struct *wrk)
> > +{
> > + struct i915_sched_attr attr = {
> > + .priority = I915_USER_PRIORITY(I915_PRIORITY_MIN),
> > + };
> > + struct intel_engine_cs *engine =
> > + container_of(wrk, typeof(*engine), heartbeat.work);
> > + struct intel_context *ce = engine->kernel_context;
> > + struct i915_request *rq;
> > +
> > + if (!intel_engine_pm_get_if_awake(engine))
> > + return;
> > +
> > + rq = engine->last_heartbeat;
> > + if (rq && i915_request_completed(rq)) {
> > + i915_request_put(rq);
> > + engine->last_heartbeat = NULL;
> > + }
> > +
> > + if (intel_gt_is_wedged(engine->gt))
> > + goto out;
> > +
> > + if (engine->last_heartbeat) {
> > + if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)) {
> > + struct drm_printer p = drm_debug_printer(__func__);
> > +
> > + intel_engine_dump(engine, &p,
> > + "%s heartbeat not ticking\n",
> > + engine->name);
> > + }
> > +
> > + if (engine->schedule &&
> > + rq->sched.attr.priority < I915_PRIORITY_BARRIER) {
> > + attr.priority =
> > +
> > I915_USER_PRIORITY(I915_PRIORITY_HEARTBEAT);
> > + if (rq->sched.attr.priority >= attr.priority)
> > + attr.priority = I915_PRIORITY_BARRIER;
> > +
> > + local_bh_disable();
> > + engine->schedule(rq, &attr);
> > + local_bh_enable();
> > + } else {
> > + intel_gt_handle_error(engine->gt, engine->mask,
> > + I915_ERROR_CAPTURE,
> > + "stopped heartbeat on %s",
> > + engine->name);
> > + }
> > + goto out;
> > + }
> > +
> > + if (engine->wakeref_serial == engine->serial)
> > + goto out;
> > +
> > + mutex_lock(&ce->timeline->mutex);
> > +
> > + intel_context_enter(ce);
> > + rq = __i915_request_create(ce, GFP_NOWAIT | __GFP_NOWARN);
> > + intel_context_exit(ce);
> > + if (IS_ERR(rq))
> > + goto unlock;
> > +
> > + idle_pulse(engine, rq);
> > + if (i915_modparams.enable_hangcheck)
> > + engine->last_heartbeat = i915_request_get(rq);
> > +
> > + __i915_request_commit(rq);
> > + __i915_request_queue(rq, &attr);
> > +
> > +unlock:
> > + mutex_unlock(&ce->timeline->mutex);
> > +out:
> > + schedule_delayed_work(&engine->heartbeat, delay());
>
> Isn't engine->heartbeat now NULL in some cases?
engine->heartbeat, the worker
vs
engine->last_heartbeat
Maybe,
struct intel_engine_heartbeat {
work;
systole;
};
> > diff --git a/drivers/gpu/drm/i915/i915_getparam.c
> > b/drivers/gpu/drm/i915/i915_getparam.c
> > index 5d9101376a3d..e6c351080593 100644
> > --- a/drivers/gpu/drm/i915/i915_getparam.c
> > +++ b/drivers/gpu/drm/i915/i915_getparam.c
> > @@ -78,8 +78,7 @@ int i915_getparam_ioctl(struct drm_device *dev, void
> > *data,
> > return -ENODEV;
> > break;
> > case I915_PARAM_HAS_GPU_RESET:
> > - value = i915_modparams.enable_hangcheck &&
> > - intel_has_gpu_reset(i915);
> > + value = intel_has_gpu_reset(i915);
>
> Don't understand this tweak. We haven't really changed the essence of hangcheck, just improved it, so why do we change this get_param?
I deleted the modparam in earlier patches. But anticipated you would
object...
-Chris
More information about the Intel-gfx
mailing list