[Intel-gfx] [PATCH 09/10] drm/i915: Replace hangcheck by heartbeats

Chris Wilson chris at chris-wilson.co.uk
Fri Oct 11 15:06:03 UTC 2019


Quoting Tvrtko Ursulin (2019-10-11 15:24:21)
> 
> On 10/10/2019 08:14, Chris Wilson wrote:
> > +config DRM_I915_HEARTBEAT_INTERVAL
> > +     int "Interval between heartbeat pulses (ms)"
> > +     default 2500 # milliseconds
> > +     help
> > +       While active the driver uses a periodic request, a heartbeat, to
> > +       check the wellness of the GPU and to regularly flush state changes
> > +       (idle barriers).
> 
> Should this be somehow reworded to be more end user friendly? My idea, 
> may need to be corrected for bad English:

End user friendly. Sure, but that means I didn't hide this well enough
;)
 
> The driver sends a periodic heartbeat down all active GT engines to 
> check the health of the GPU and undertake regular house-keeping of 
> internal driver state.
> 
> Main points from the user perspective: "request" - whaat? "idle 
> barriers" - ditto. "Wellness" - a bit unusual in this context, no?

> > +static void heartbeat(struct work_struct *wrk)
> > +{
> > +     struct i915_sched_attr attr = {
> > +             .priority = I915_USER_PRIORITY(I915_PRIORITY_MIN),
> 
> You were saying it's better to start from zero, right?

The first bump. Starting at lowest, means run when first idle. Then we
jump to 0 and be scheduled like any other normal user.

> > +     };
> > +     struct intel_engine_cs *engine =
> > +             container_of(wrk, typeof(*engine), heartbeat.work.work);
> > +     struct intel_context *ce = engine->kernel_context;
> > +     struct i915_request *rq;
> > +
> > +     if (!intel_engine_pm_get_if_awake(engine))
> > +             return;
> > +
> > +     rq = engine->heartbeat.systole;
> > +     if (rq && i915_request_completed(rq)) {
> > +             i915_request_put(rq);
> > +             engine->heartbeat.systole = NULL;
> > +     }
> > +
> > +     if (intel_gt_is_wedged(engine->gt))
> > +             goto out;
> > +
> > +     if (engine->heartbeat.systole) {
> > +             if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)) {
> > +                     struct drm_printer p = drm_debug_printer(__func__);
> > +
> > +                     intel_engine_dump(engine, &p,
> > +                                       "%s heartbeat not ticking\n",
> > +                                       engine->name);
> 
> This could perhaps be better only when we have reached a higher priority 
> attempt. Okay it's under DEBUG_GEM but still, not sure there is value in 
> being so panicky if for any reason preemption does not work. Heartbeat 
> does not depend on preemption as far as I could spot, right?

The challenge is evident by the else path where we immediately reset.
If we cause a preemption event from the heartbeat (even strictly at min
prio we could cause a timeslice to expire) it is useful to have the
debug in dmesg (as in CI we don't get error-state very often).

Yes, I've tried trimming it to only on the vital paths, but so far
haven't found a satisfactory means.

To make me happy I think I need to push it down into the reset routines
themselves. Hmm. Except those we definitely don't want dmesg spam as
they get runs 10s of thousands times during CI.

It'll do for now. I'm sure we'll get tired of it and find it a new home.

> > +static struct kobj_attribute heartbeat_interval_attr =
> > +__ATTR(heartbeat_interval_ms, 0600, heartbeat_interval_show, heartbeat_interval_store);
> >   
> >   static void kobj_engine_release(struct kobject *kobj)
> >   {
> > @@ -115,6 +141,9 @@ void intel_engines_add_sysfs(struct drm_i915_private *i915)
> >               &class_attr.attr,
> >               &inst_attr.attr,
> >               &mmio_attr.attr,
> > +#if CONFIG_DRM_I915_HEARTBEAT_INTERVAL
> > +             &heartbeat_interval_attr.attr,
> > +#endif
> 
> Presumably compiler is happy (or the linker) with only this part getting 
> the #ifdef treatment? (The show/store functions above don't have it.)

Yup, it's not annoying enough to complain about the dead globals. Although
it should be more than smart enough to remove them.
-Chris


More information about the Intel-gfx mailing list