[Intel-gfx] [PATCH] drm/i915: Replace hangcheck by heartbeats

Fri Jul 26 20:58:38 UTC 2019

> -----Original Message-----
> From: Chris Wilson <chris at chris-wilson.co.uk>
> Sent: Friday, July 26, 2019 1:11 PM
> To: Bloomfield, Jon <jon.bloomfield at intel.com>; intel-
> gfx at lists.freedesktop.org
> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>; Ursulin, Tvrtko
> <tvrtko.ursulin at intel.com>
> Subject: RE: [PATCH] drm/i915: Replace hangcheck by heartbeats
> 
> Quoting Bloomfield, Jon (2019-07-26 16:00:06)
> > > -----Original Message-----
> > > From: Chris Wilson <chris at chris-wilson.co.uk>
> > > Sent: Thursday, July 25, 2019 4:52 PM
> > > To: Bloomfield, Jon <jon.bloomfield at intel.com>; intel-
> > > gfx at lists.freedesktop.org
> > > Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>; Ursulin, Tvrtko
> > > <tvrtko.ursulin at intel.com>
> > > Subject: RE: [PATCH] drm/i915: Replace hangcheck by heartbeats
> > >
> > > Quoting Bloomfield, Jon (2019-07-26 00:41:49)
> > > > > -----Original Message-----
> > > > > From: Chris Wilson <chris at chris-wilson.co.uk>
> > > > > Sent: Thursday, July 25, 2019 4:28 PM
> > > > > To: Bloomfield, Jon <jon.bloomfield at intel.com>; intel-
> > > > > gfx at lists.freedesktop.org
> > > > > Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>; Ursulin, Tvrtko
> > > > > <tvrtko.ursulin at intel.com>
> > > > > Subject: RE: [PATCH] drm/i915: Replace hangcheck by heartbeats
> > > > >
> > > > > Quoting Bloomfield, Jon (2019-07-26 00:21:47)
> > > > > > > -----Original Message-----
> > > > > > > From: Chris Wilson <chris at chris-wilson.co.uk>
> > > > > > > Sent: Thursday, July 25, 2019 4:17 PM
> > > > > > > To: intel-gfx at lists.freedesktop.org
> > > > > > > Cc: Chris Wilson <chris at chris-wilson.co.uk>; Joonas Lahtinen
> > > > > > > <joonas.lahtinen at linux.intel.com>; Ursulin, Tvrtko
> > > > > <tvrtko.ursulin at intel.com>;
> > > > > > > Bloomfield, Jon <jon.bloomfield at intel.com>
> > > > > > > Subject: [PATCH] drm/i915: Replace hangcheck by heartbeats
> > > > > > >
> > > > > > > Replace sampling the engine state every so often with a periodic
> > > > > > > heartbeat request to measure the health of an engine. This is
> coupled
> > > > > > > with the forced-preemption to allow long running requests to
> survive so
> > > > > > > long as they do not block other users.
> > > > > >
> > > > > > Can you explain why we would need this at all if we have forced-
> > > preemption?
> > > > > > Forced preemption guarantees that an engine cannot interfere with
> the
> > > > > timely
> > > > > > execution of other contexts. If it hangs, but nothing else wants to use
> the
> > > > > engine
> > > > > > then do we care?
> > > > >
> > > > > We may not have something else waiting to use the engine, but we may
> > > > > have users waiting for the response where we need to detect the GPU
> hang
> > > > > to prevent an infinite wait / stuck processes and infinite power drain.
> > > >
> > > > I'm not sure I buy that logic. Being able to pre-empt doesn't imply it will
> > > > ever end. As written a context can sit forever, apparently making progress
> > > > but never actually returning a response to the user. If the user isn't happy
> > > > with the progress they will kill the process. So we haven't solved the
> > > > user responsiveness here. All we've done is eliminated the potential to
> > > > run one class of otherwise valid workload.
> > >
> > > Indeed, one of the conditions I have in mind for endless is rlimits. The
> > > user + admin should be able to specify that a context not exceed so much
> > > runtime, and if we ever get a scheduler, we can write that as a budget
> > > (along with deadlines).
> >
> > Agreed, an rlimit solution would work better, if there was an option for
> 'unlimited'.
> >
> > The specific reason I poked was to try to find a solution to the
> > long-running compute workload scenarios. In those cases there is no fwd
> > progress on the ring/BB, but the kernel will still be making fwd progress. The
> > challenge here is that it's not easy for compiler to guarantee to fold a user
> > kernel into something that fit any specific time boundary. It depends on the
> > user kernel structure. A thread could take several seconds (or possibly
> > minutes) to complete. That's not automatically bad.
> >
> > We need a solution to that specific problem, and while I agree that it would
> be nice to detect errant contexts and kick them out, that relies on an accurate
> classification of 'errant' and 'safe'. The response needs to be proportional to
> the problem.
> 
> Unless I am missing something we have nothing less than an engine reset.
> 
> The timers employed here are all per-engine, and could be exposed via
> sysfs with the same granularity.
> 
> However, I also think there is a large overlap between the scenarios you
> describe and the ones used for CPU-isolation.
> 
> > > > Same argument goes for power. Just because it yields when other
> contexts
> > > > want to run doesn't mean it won't consume lots of power indefinitely. I
> can
> > > > equally write a CPU program to burn lots of power, forever, and it won't
> get
> > > > nuked.
> > >
> > > I agree, and continue to dislike letting hogs have free reign.
> >
> > But even with this change a BB hog can still have free reign to consume
> power, as long as it pauses it's unsavoury habit whenever the heartbeat request
> comes snooping on the engine. What we're effectively saying with this is that
> it's ok for a context to spin in a BB, but not ok to spin in an EU kernel. Why
> would we differentiate when both can burn the same power? So we haven't
> solved this problem, but continue to victimize legitimate code.
> 
> Yes, that is exactly the point of this change. Along with it also
> providing the heartbeat for housekeeping purposes.
> 
> > > > TDR made sense when it was the only way to ensure contexts could
> always
> > > > make forward progress. But force-preemption does everything we need
> to
> > > > ensure that as far as I can tell.
> > >
> > > No. Force-preemption (preemption-by-reset) is arbitrarily shooting mostly
> > > innocent contexts, that had the misfortune to not yield quick enough. It
> > > is data loss and a dos (given enough concentration could probably be used
> > > by third parties to shoot down completely innocent clients), and so
> > > should be used as a last resort shotgun and not be confused as being a
> > > scalpel. And given our history and current situation, resets are still a
> > > liability.
> >
> > Doesn't the heartbeat rely on forced-preemption to send the bailiffs in when
> the context refuses to go? If so, that actually makes it more likely to evict an
> innocent context. So you're using that 'last resort' even more often.
> 
> It's no more often than before, you have to fail to advance within an
> interval, and then deny preemption request.

It's entrapment. You are creating an artificial workload for the context to impede. Before that artificial workload was injected, the context would have run to completion, the world would be at peace (and the user would have her new bitcoin). Instead she stays poor because the DoS police launched an illegal sting on her otherwise genius operation.

> 
> > I do like the rlimit suggestion, but until we have that, just disabling TDR feels
> like a better stop-gap than nuking innocent compute contexts just in case they
> might block something.
> 
> They're not innocent though :-p

They are innocent until proven guilty :-)

> 
> Disabling hangcheck (no idea why you confuse that with the recovery
> procedure) makes igt unhappy, but they are able to do that today with
> the modparam. This patch makes it easier to move that to an engine
> parameter, but to disable it entirely you still need to be able to reset
> the GPU on demand (suspend, eviction, oom). Without hangcheck we need to
> evaluate all MAX_SCHEDULE_TIMEOUT waits and supply a reset-on-timeout
> along critical paths.
> -Chris

I don't think I'm confusing hang-check with the recovery. I've talked about TDR, which to me is a periodic hangcheck, combined with a recovery by engine reset. I don't argue against being able to reset, just against the blunt classification that hangcheck itself provides.

TDR was originally looking for regressive workloads that were not making forward progress to protect against DoS. But it was always a very blunt tool. It's never been able to differentiate long running, but otherwise valid, compute workloads from genuine BB hangs, but that was fine at the time, and as you say users could always switch the modparam.

Now we have more emphasis on compute we need a solution that doesn't involve a modparam. This was specifically requested by the compute team - they know that they can flip the tdr switch, but that means their workload will only run if user modifies the system. That's hardly ideal.

Without the rlimit concept I don't think we can't prevent power hogs whatever we do, any more than the core kernel can prevent CPU power hogs. So, if we can prevent a workload from blocking other contexts, then it is unhelpful to continue either with the blunt tool that TDR is, or the similarly blunt heartbeat. If we have neither of these, but can guarantee forward progress when we need to, then we can still protect the system against DoS, without condemning any contexts before a crime has been committed.