[Intel-gfx] [PATCH] drm/i915: Replace hangcheck by heartbeats

Bloomfield, Jon jon.bloomfield at intel.com
Thu Jul 25 23:41:49 UTC 2019


> -----Original Message-----
> From: Chris Wilson <chris at chris-wilson.co.uk>
> Sent: Thursday, July 25, 2019 4:28 PM
> To: Bloomfield, Jon <jon.bloomfield at intel.com>;
> intel-gfx at lists.freedesktop.org
> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>; Ursulin, Tvrtko
> <tvrtko.ursulin at intel.com>
> Subject: RE: [PATCH] drm/i915: Replace hangcheck by heartbeats
> 
> Quoting Bloomfield, Jon (2019-07-26 00:21:47)
> > > -----Original Message-----
> > > From: Chris Wilson <chris at chris-wilson.co.uk>
> > > Sent: Thursday, July 25, 2019 4:17 PM
> > > To: intel-gfx at lists.freedesktop.org
> > > Cc: Chris Wilson <chris at chris-wilson.co.uk>; Joonas Lahtinen
> > > <joonas.lahtinen at linux.intel.com>; Ursulin, Tvrtko
> > > <tvrtko.ursulin at intel.com>; Bloomfield, Jon
> > > <jon.bloomfield at intel.com>
> > > Subject: [PATCH] drm/i915: Replace hangcheck by heartbeats
> > >
> > > Replace sampling the engine state every so often with a periodic
> > > heartbeat request to measure the health of an engine. This is coupled
> > > with the forced-preemption to allow long running requests to survive so
> > > long as they do not block other users.
> >
> > Can you explain why we would need this at all if we have forced-preemption?
> > Forced preemption guarantees that an engine cannot interfere with the timely
> > execution of other contexts. If it hangs, but nothing else wants to use the
> > engine then do we care?
> 
> We may not have something else waiting to use the engine, but we may
> have users waiting for the response where we need to detect the GPU hang
> to prevent an infinite wait / stuck processes and infinite power drain.

I'm not sure I buy that logic. Being able to pre-empt a context doesn't imply
its work will ever end. As written, a context can sit forever, apparently
making progress but never actually returning a response to the user. If the
user isn't happy with the progress they will kill the process. So we haven't
solved user responsiveness here. All we've done is eliminate the potential
to run one class of otherwise valid workload.

Same argument goes for power. Just because it yields when other contexts
want to run doesn't mean it won't consume lots of power indefinitely. I can
equally write a CPU program to burn lots of power, forever, and it won't get
nuked.

TDR made sense when it was the only way to ensure contexts could always
make forward progress. But force-preemption does everything we need to
ensure that as far as I can tell.
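
To make the distinction concrete, here is a toy Python model of how I read
the scheme. It is not i915 code: `Context`, `run_with_heartbeat`, and the
tick values are all made up for illustration, and I'm assuming that a
heartbeat pulse which cannot get onto the engine within `timeout` ticks
triggers an engine reset.

```python
from dataclasses import dataclass


@dataclass
class Context:
    name: str
    work_ticks: float   # total ticks of work; float("inf") means it never ends
    preemptible: bool   # can the engine switch away from it mid-work?


def run_with_heartbeat(ctx, interval=5, timeout=3, max_ticks=100):
    """Run one user context alone on an engine and report its fate.

    Hypothetical model: every `interval` ticks the kernel queues a heartbeat
    pulse. A preemptible context is switched out for it, so the pulse retires
    immediately and the context carries on. A non-preemptible context keeps
    the engine; if the pulse is still pending `timeout` ticks later, the
    engine is reset and the context is killed.
    """
    done = 0
    pulse_pending_since = None
    for tick in range(1, max_ticks + 1):
        done += 1
        if done >= ctx.work_ticks:
            return "completed"
        if pulse_pending_since is None and tick % interval == 0:
            if ctx.preemptible:
                continue  # pulse preempts, retires, context resumes
            pulse_pending_since = tick  # pulse is stuck behind the context
        elif pulse_pending_since is not None and tick - pulse_pending_since >= timeout:
            return "reset"
    return "still running"


spinner = Context("spinner", float("inf"), preemptible=True)
compute = Context("compute", 50, preemptible=False)

print(spinner.name, run_with_heartbeat(spinner))
print(compute.name, run_with_heartbeat(compute))
```

In this model the endless but preemptible context is never touched, while
the finite, non-preemptible compute job is reset long before it would have
finished on its own.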

> 
> There is also the secondary task of flushing idle barriers.
> 
> > Power, obviously. But I'm not sure everything can be pre-empted. If a compute
> > context is running on an engine, and no other contexts require that engine,
> > then is it right to nuke it just because it's busy in a long running thread?
> 
> Yes. Unless you ask that we implement GPU-isolation where not even the
> kernel is allowed to use a particular set of engines.
> -Chris

