[Intel-gfx] [PATCH 2/3] drm/i915/gt: Don't declare hangs if engine is stalled
Chris Wilson
chris at chris-wilson.co.uk
Thu May 28 16:52:15 UTC 2020
Quoting Chris Wilson (2020-05-28 17:50:55)
> Quoting Mika Kuoppala (2020-05-28 17:23:18)
> > Chris Wilson <chris at chris-wilson.co.uk> writes:
> >
> > > If the ring submission is stalled on an external request, nothing can be
> > > submitted, not even the heartbeat in the kernel context. Since nothing
> > > is running, resetting the engine/device does not unblock the system and
> > > is pointless. We can see if the heartbeat is supposed to be running
> > > before declaring foul.
> > >
> > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > > ---
> > > .../gpu/drm/i915/gt/intel_engine_heartbeat.c | 19 ++++++++++++++++---
> > > 1 file changed, 16 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> > > index 5136c8bf112d..f67ad937eefb 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> > > @@ -48,8 +48,10 @@ static void show_heartbeat(const struct i915_request *rq,
> > >  	struct drm_printer p = drm_debug_printer("heartbeat");
> > >
> > >  	intel_engine_dump(engine, &p,
> > > -			  "%s heartbeat {prio:%d} not ticking\n",
> > > +			  "%s heartbeat {seqno:%llx:%lld, prio:%d} not ticking\n",
> > >  			  engine->name,
> > > +			  rq->fence.context,
> > > +			  rq->fence.seqno,
> > >  			  rq->sched.attr.priority);
> > >  }
> > >
> > > @@ -76,8 +78,19 @@ static void heartbeat(struct work_struct *wrk)
> > >  		goto out;
> > >
> > >  	if (engine->heartbeat.systole) {
> > > -		if (engine->schedule &&
> > > -		    rq->sched.attr.priority < I915_PRIORITY_BARRIER) {
> > > +		if (!i915_sw_fence_signaled(&rq->submit)) {
> > > +			/*
> > > +			 * Not yet submitted, system is stalled.
> > > +			 *
> > > +			 * This more often happens for ring submission,
> > > +			 * where all contexts are funnelled into a common
> > > +			 * ringbuffer. If one context is blocked on an
> > > +			 * external fence, not only is it not submitted,
> > > +			 * but all other contexts, including the kernel
> > > +			 * context are stuck waiting for the signal.
> > > +			 */
> >
> > How to actually save the system evades me.
> > But piling the heartbeat on top does not help with it in
> > any case.
>
> Last resort could be hangcheck again, but over a much, much longer
> interval: if, say, 2 minutes pass with work queued to the engine while
> it remains idle, mark the device as wedged (and stop using it
> altogether). We have to be really confident that the cure is worth it.
To be effective, we would also need to forcibly complete the requests
waiting on external fences so that we could power down the device. Hmm,
that reminds me: I need something similar to power down an active device
at suspend.
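
Roughly, the decision being discussed could be modelled as below. This is
only a userspace sketch, not i915 code: every name in it (hb_state,
hb_check, HB_WEDGE_TIMEOUT_MS and the hb_action values) is made up for
illustration, the submitted flag merely stands in for what
i915_sw_fence_signaled(&rq->submit) reports, and the threshold is just the
"say 2 minutes" figure floated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the heartbeat decision; not part of i915. */
enum hb_action {
	HB_OK,      /* previous heartbeat completed, nothing to do */
	HB_RESET,   /* heartbeat was submitted but never completed */
	HB_STALLED, /* heartbeat never submitted, a reset cannot help */
	HB_WEDGE,   /* stalled for too long, stop using the device */
};

struct hb_state {
	bool submitted;     /* stands in for the submit fence being signaled */
	bool completed;     /* the heartbeat request has finished */
	uint64_t queued_ms; /* time at which the heartbeat was queued */
};

/* Assumed last-resort interval, taken from the "say 2 minutes" above. */
#define HB_WEDGE_TIMEOUT_MS (2u * 60u * 1000u)

/* Called on each heartbeat tick while a previous heartbeat is outstanding. */
static enum hb_action hb_check(const struct hb_state *s, uint64_t now_ms)
{
	assert(s && now_ms >= s->queued_ms);

	if (s->completed)
		return HB_OK;

	if (!s->submitted) {
		/* Nothing is running on the engine, so resetting is pointless. */
		if (now_ms - s->queued_ms >= HB_WEDGE_TIMEOUT_MS)
			return HB_WEDGE;
		return HB_STALLED;
	}

	/* Submitted but not ticking: the engine itself is stuck. */
	return HB_RESET;
}
```

The wedge case is where the requests still waiting on external fences
would have to be forcibly completed; the sketch does not attempt to model
that part.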
-Chris