[Intel-gfx] [PATCH] drm/i915: Replace hangcheck by heartbeats

Chris Wilson chris at chris-wilson.co.uk
Thu Jul 25 23:52:17 UTC 2019


Quoting Bloomfield, Jon (2019-07-26 00:41:49)
> > -----Original Message-----
> > From: Chris Wilson <chris at chris-wilson.co.uk>
> > Sent: Thursday, July 25, 2019 4:28 PM
> > To: Bloomfield, Jon <jon.bloomfield at intel.com>; intel-
> > gfx at lists.freedesktop.org
> > Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>; Ursulin, Tvrtko
> > <tvrtko.ursulin at intel.com>
> > Subject: RE: [PATCH] drm/i915: Replace hangcheck by heartbeats
> > 
> > Quoting Bloomfield, Jon (2019-07-26 00:21:47)
> > > > -----Original Message-----
> > > > From: Chris Wilson <chris at chris-wilson.co.uk>
> > > > Sent: Thursday, July 25, 2019 4:17 PM
> > > > To: intel-gfx at lists.freedesktop.org
> > > > Cc: Chris Wilson <chris at chris-wilson.co.uk>; Joonas Lahtinen
> > > > <joonas.lahtinen at linux.intel.com>; Ursulin, Tvrtko
> > <tvrtko.ursulin at intel.com>;
> > > > Bloomfield, Jon <jon.bloomfield at intel.com>
> > > > Subject: [PATCH] drm/i915: Replace hangcheck by heartbeats
> > > >
> > > > Replace sampling the engine state every so often with a periodic
> > > > heartbeat request to measure the health of an engine. This is coupled
> > > > with the forced-preemption to allow long running requests to survive so
> > > > long as they do not block other users.
> > >
> > > Can you explain why we would need this at all if we have forced-preemption?
> > > Forced preemption guarantees that an engine cannot interfere with the
> > timely
> > > execution of other contexts. If it hangs, but nothing else wants to use the
> > engine
> > > then do we care?
> > 
> > We may not have something else waiting to use the engine, but we may
> > have users waiting for the response where we need to detect the GPU hang
> > to prevent an infinite wait / stuck processes and infinite power drain.
> 
> I'm not sure I buy that logic. Being able to pre-empt doesn't imply it will
> ever end. As written a context can sit forever, apparently making progress
> but never actually returning a response to the user. If the user isn't happy
> with the progress they will kill the process. So we haven't solved the
> user responsiveness here. All we've done is eliminated the potential to
> run one class of otherwise valid workload.

Indeed, one of the conditions I have in mind for endless is rlimits. The
user + admin should be able to specify that a context not exceed so much
runtime, and if we ever get a scheduler, we can write that as a budget
(along with deadlines).
 
> Same argument goes for power. Just because it yields when other contexts
> want to run doesn't mean it won't consume lots of power indefinitely. I can
> equally write a CPU program to burn lots of power, forever, and it won't get
> nuked.

I agree, and continue to dislike letting hogs have free reign.

> TDR made sense when it was the only way to ensure contexts could always
> make forward progress. But force-preemption does everything we need to
> ensure that as far as I can tell.

No. Force-preemption (preemption-by-reset) is arbitrarily shooting mostly
innocent contexts, that had the misfortune to not yield quick enough. It
is data loss and a dos (given enough concentration could probably be used
by third parties to shoot down completely innocent clients), and so
should be used as a last resort shotgun and not be confused as being a
scalpel. And given our history and current situation, resets are still a
liability.
-Chris


More information about the Intel-gfx mailing list