[Intel-gfx] [PATCH] drm/i915: Replace hangcheck by heartbeats

Fri Jul 26 15:00:06 UTC 2019

> -----Original Message-----
> From: Chris Wilson <chris at chris-wilson.co.uk>
> Sent: Thursday, July 25, 2019 4:52 PM
> To: Bloomfield, Jon <jon.bloomfield at intel.com>; intel-
> gfx at lists.freedesktop.org
> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>; Ursulin, Tvrtko
> <tvrtko.ursulin at intel.com>
> Subject: RE: [PATCH] drm/i915: Replace hangcheck by heartbeats
> 
> Quoting Bloomfield, Jon (2019-07-26 00:41:49)
> > > -----Original Message-----
> > > From: Chris Wilson <chris at chris-wilson.co.uk>
> > > Sent: Thursday, July 25, 2019 4:28 PM
> > > To: Bloomfield, Jon <jon.bloomfield at intel.com>; intel-
> > > gfx at lists.freedesktop.org
> > > Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>; Ursulin, Tvrtko
> > > <tvrtko.ursulin at intel.com>
> > > Subject: RE: [PATCH] drm/i915: Replace hangcheck by heartbeats
> > >
> > > Quoting Bloomfield, Jon (2019-07-26 00:21:47)
> > > > > -----Original Message-----
> > > > > From: Chris Wilson <chris at chris-wilson.co.uk>
> > > > > Sent: Thursday, July 25, 2019 4:17 PM
> > > > > To: intel-gfx at lists.freedesktop.org
> > > > > Cc: Chris Wilson <chris at chris-wilson.co.uk>; Joonas Lahtinen
> > > > > <joonas.lahtinen at linux.intel.com>; Ursulin, Tvrtko
> > > <tvrtko.ursulin at intel.com>;
> > > > > Bloomfield, Jon <jon.bloomfield at intel.com>
> > > > > Subject: [PATCH] drm/i915: Replace hangcheck by heartbeats
> > > > >
> > > > > Replace sampling the engine state every so often with a periodic
> > > > > heartbeat request to measure the health of an engine. This is coupled
> > > > > with the forced-preemption to allow long running requests to survive so
> > > > > long as they do not block other users.
> > > >
> > > > Can you explain why we would need this at all if we have forced-
> preemption?
> > > > Forced preemption guarantees that an engine cannot interfere with the
> > > timely
> > > > execution of other contexts. If it hangs, but nothing else wants to use the
> > > engine
> > > > then do we care?
> > >
> > > We may not have something else waiting to use the engine, but we may
> > > have users waiting for the response where we need to detect the GPU hang
> > > to prevent an infinite wait / stuck processes and infinite power drain.
> >
> > I'm not sure I buy that logic. Being able to pre-empt doesn't imply it will
> > ever end. As written a context can sit forever, apparently making progress
> > but never actually returning a response to the user. If the user isn't happy
> > with the progress they will kill the process. So we haven't solved the
> > user responsiveness here. All we've done is eliminated the potential to
> > run one class of otherwise valid workload.
> 
> Indeed, one of the conditions I have in mind for endless is rlimits. The
> user + admin should be able to specify that a context not exceed so much
> runtime, and if we ever get a scheduler, we can write that as a budget
> (along with deadlines).

Agreed, an rlimit solution would work better, if there was an option for 'unlimited'. 

The specific reason I poked was to try to find a solution to the
long-running compute workload scenarios. In those cases there is no fwd
progress on the ring/BB, but the kernel will still be making fwd progress. The
challenge here is that it's not easy for compiler to guarantee to fold a user
kernel into something that fit any specific time boundary. It depends on the
user kernel structure. A thread could take several seconds (or possibly
minutes) to complete. That's not automatically bad.

We need a solution to that specific problem, and while I agree that it would be nice to detect errant contexts and kick them out, that relies on an accurate classification of 'errant' and 'safe'. The response needs to be proportional to the problem. 
> 
> > Same argument goes for power. Just because it yields when other contexts
> > want to run doesn't mean it won't consume lots of power indefinitely. I can
> > equally write a CPU program to burn lots of power, forever, and it won't get
> > nuked.
> 
> I agree, and continue to dislike letting hogs have free reign.

But even with this change a BB hog can still have free reign to consume power, as long as it pauses it's unsavoury habit whenever the heartbeat request comes snooping on the engine. What we're effectively saying with this is that it's ok for a context to spin in a BB, but not ok to spin in an EU kernel. Why would we differentiate when both can burn the same power? So we haven't solved this problem, but continue to victimize legitimate code.

> 
> > TDR made sense when it was the only way to ensure contexts could always
> > make forward progress. But force-preemption does everything we need to
> > ensure that as far as I can tell.
> 
> No. Force-preemption (preemption-by-reset) is arbitrarily shooting mostly
> innocent contexts, that had the misfortune to not yield quick enough. It
> is data loss and a dos (given enough concentration could probably be used
> by third parties to shoot down completely innocent clients), and so
> should be used as a last resort shotgun and not be confused as being a
> scalpel. And given our history and current situation, resets are still a
> liability.

Doesn't the heartbeat rely on forced-preemption to send the bailiffs in when the context refuses to go? If so, that actually makes it more likely to evict an innocent context. So you're using that 'last resort' even more often.

I do like the rlimit suggestion, but until we have that, just disabling TDR feels like a better stop-gap than nuking innocent compute contexts just in case they might block something.

> -Chris