[Intel-gfx] [PATCH] drm/i915: Replace hangcheck by heartbeats

Chris Wilson chris at chris-wilson.co.uk
Fri Jul 26 20:10:52 UTC 2019


Quoting Bloomfield, Jon (2019-07-26 16:00:06)
> > -----Original Message-----
> > From: Chris Wilson <chris at chris-wilson.co.uk>
> > Sent: Thursday, July 25, 2019 4:52 PM
> > To: Bloomfield, Jon <jon.bloomfield at intel.com>;
> > intel-gfx at lists.freedesktop.org
> > Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>; Ursulin, Tvrtko
> > <tvrtko.ursulin at intel.com>
> > Subject: RE: [PATCH] drm/i915: Replace hangcheck by heartbeats
> > 
> > Quoting Bloomfield, Jon (2019-07-26 00:41:49)
> > > >
> > > > Quoting Bloomfield, Jon (2019-07-26 00:21:47)
> > > > > >
> > > > > > Replace sampling the engine state every so often with a periodic
> > > > > > heartbeat request to measure the health of an engine. This is coupled
> > > > > > with the forced-preemption to allow long running requests to survive so
> > > > > > long as they do not block other users.
> > > > >
> > > > > Can you explain why we would need this at all if we have
> > > > > forced-preemption? Forced preemption guarantees that an engine
> > > > > cannot interfere with the timely execution of other contexts. If
> > > > > it hangs, but nothing else wants to use the engine, then do we
> > > > > care?
> > > >
> > > > We may not have something else waiting to use the engine, but we may
> > > > have users waiting for the response where we need to detect the GPU hang
> > > > to prevent an infinite wait / stuck processes and infinite power drain.
> > >
> > > I'm not sure I buy that logic. Being able to pre-empt doesn't imply it will
> > > ever end. As written a context can sit forever, apparently making progress
> > > but never actually returning a response to the user. If the user isn't happy
> > > with the progress they will kill the process. So we haven't solved the
> > > user responsiveness problem here. All we've done is eliminate the
> > > potential to run one class of otherwise valid workload.
> > 
> > Indeed, one of the conditions I have in mind for endless is rlimits. The
> > user + admin should be able to specify that a context not exceed so much
> > runtime, and if we ever get a scheduler, we can write that as a budget
> > (along with deadlines).
> 
> Agreed, an rlimit solution would work better, if there were an option for 'unlimited'.
> 
> The specific reason I poked was to try to find a solution to the
> long-running compute workload scenarios. In those cases there is no forward
> progress on the ring/BB, but the compute kernel will still be making forward
> progress. The challenge here is that it's not easy for the compiler to
> guarantee to fold a user kernel into something that fits any specific time
> boundary. It depends on the user kernel's structure. A thread could take
> several seconds (or possibly minutes) to complete. That's not automatically bad.
> 
> We need a solution to that specific problem, and while I agree that it
> would be nice to detect errant contexts and kick them out, doing so
> relies on an accurate classification of 'errant' versus 'safe'. The
> response needs to be proportional to the problem.

Unless I am missing something, we have no remedy short of an engine reset.

The timers employed here are all per-engine, and could be exposed via
sysfs with the same granularity.
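
For illustration only, here is a rough userspace sketch of the escalation
those per-engine timers drive (hypothetical names, not the actual i915
implementation): each missed heartbeat interval raises the outstanding
heartbeat request's priority one step, and only a heartbeat that is still
blocked at maximum priority triggers an engine reset.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch, not the driver's code: every interval a heartbeat
 * request is submitted to the engine.  If the previous heartbeat has not
 * retired by the next tick, its priority is raised one step; only when
 * it is already at maximum priority and *still* blocked is the engine
 * reset.  A context therefore gets several intervals, plus a
 * forced-preemption attempt, before any reset fires. */

enum prio { PRIO_MIN, PRIO_NORMAL, PRIO_BARRIER, PRIO_MAX };

struct engine {
	enum prio heartbeat_prio; /* priority of the outstanding heartbeat */
	bool heartbeat_done;      /* did the last heartbeat retire in time? */
	int resets;               /* engine resets issued so far */
};

/* One heartbeat interval elapsing; returns true iff a reset was issued. */
static bool heartbeat_tick(struct engine *e)
{
	if (e->heartbeat_done) {
		/* Engine is alive: restart the next heartbeat at low priority. */
		e->heartbeat_prio = PRIO_MIN;
		return false;
	}
	if (e->heartbeat_prio < PRIO_MAX) {
		/* Possibly just busy: escalate and give it another interval. */
		e->heartbeat_prio++;
		return false;
	}
	/* Even a maximum-priority request could not preempt: reset. */
	e->resets++;
	e->heartbeat_prio = PRIO_MIN;
	return true;
}
```

The point of the staged escalation is that a merely busy engine gets
progressively harder nudges before anything destructive happens.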

However, I also think there is a large overlap between the scenarios you
describe and the ones used for CPU-isolation.

> > > Same argument goes for power. Just because it yields when other contexts
> > > want to run doesn't mean it won't consume lots of power indefinitely. I can
> > > equally write a CPU program to burn lots of power, forever, and it won't get
> > > nuked.
> > 
> > I agree, and continue to dislike letting hogs have free rein.
> 
> But even with this change a BB hog can still have free rein to consume
> power, as long as it pauses its unsavoury habit whenever the heartbeat
> request comes snooping on the engine. What we're effectively saying with
> this is that it's ok for a context to spin in a BB, but not ok to spin
> in an EU kernel. Why would we differentiate when both can burn the same
> power? So we haven't solved this problem, but continue to victimize
> legitimate code.

Yes, that is exactly the point of this change. Along with it also
providing the heartbeat for housekeeping purposes.

> > > TDR made sense when it was the only way to ensure contexts could always
> > > make forward progress. But force-preemption does everything we need to
> > > ensure that as far as I can tell.
> > 
> > No. Force-preemption (preemption-by-reset) is arbitrarily shooting mostly
> > innocent contexts that had the misfortune not to yield quickly enough. It
> > is data loss and a DoS (with enough concentration it could probably be
> > used by third parties to shoot down completely innocent clients), and so
> > should be used as a last-resort shotgun and not be confused with a
> > scalpel. And given our history and current situation, resets are still a
> > liability.
> 
> Doesn't the heartbeat rely on forced-preemption to send the bailiffs in
> when the context refuses to go? If so, that actually makes it more
> likely to evict an innocent context. So you're using that 'last resort'
> even more often.

It's no more often than before: the context has to fail to advance
within an interval, and then deny the preemption request.
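
Spelled out as an illustrative helper (again hypothetical, not the
driver's code), the two conditions are conjunctive: a reset fires only
when the context both failed to advance within the heartbeat interval
and then refused to yield to the forced-preemption request.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative decision helper: either condition alone is not enough
 * to trigger a reset. */
static bool should_reset(bool advanced_within_interval,
			 bool yielded_to_preemption)
{
	if (advanced_within_interval)
		return false;	/* making progress: leave it alone */
	if (yielded_to_preemption)
		return false;	/* slow but well-behaved: no reset */
	return true;		/* stuck and unpreemptible: reset */
}
```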
 
> I do like the rlimit suggestion, but until we have that, just disabling
> TDR feels like a better stop-gap than nuking innocent compute contexts
> just in case they might block something.

They're not innocent though :-p

Disabling hangcheck (no idea why you confuse that with the recovery
procedure) makes igt unhappy, but it is already possible today with the
modparam. This patch makes it easier to move that to an engine
parameter, but to disable it entirely you still need to be able to reset
the GPU on demand (suspend, eviction, oom). Without hangcheck we need to
evaluate all MAX_SCHEDULE_TIMEOUT waits and supply a reset-on-timeout
along critical paths.
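
Very loosely, and in made-up userspace terms rather than the kernel's
wait primitives, "reset-on-timeout along critical paths" means replacing
an unbounded wait with a bounded one that falls back to a reset, so that
suspend/eviction/oom can always complete:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: a request that retires after a known number of
 * time slices (negative means hung forever). */
struct request {
	int ticks_to_retire;
};

static bool request_completed(const struct request *rq)
{
	return rq->ticks_to_retire == 0;
}

static void request_tick(struct request *rq)
{
	if (rq->ticks_to_retire > 0)
		rq->ticks_to_retire--;
}

/* Wait in bounded slices; returns true if the request retired, false if
 * the budget ran out and we had to reset instead of hanging forever. */
static bool wait_or_reset(struct request *rq, int budget, int *resets)
{
	while (budget-- > 0) {
		if (request_completed(rq))
			return true;
		request_tick(rq);
	}
	if (request_completed(rq))
		return true;
	(*resets)++;	/* timed out: reset rather than wait unboundedly */
	return false;
}
```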
-Chris

