[Intel-gfx] [PATCH] drm/i915: Replace hangcheck by heartbeats

Fri Jul 26 22:19:38 UTC 2019

Hmmn. We're still on orthogonal perspectives as far as our previous arguments stand. But it doesn't matter because while thinking through your replies, I realized there is one argument in favour, which trumps all my previous arguments against this patch - it makes things deterministic. Without this patch (or hangcheck), whether a context gets nuked depends on what else is running. And that's a recipe for confused support emails.

So I retract my other arguments, thanks for staying with me :-)

BTW: TDR==Timeout-Detection and Reset. Essentially hangcheck and recovery.

> -----Original Message-----
> From: Chris Wilson <chris at chris-wilson.co.uk>
> Sent: Friday, July 26, 2019 2:30 PM
> To: Bloomfield, Jon <jon.bloomfield at intel.com>; intel-
> gfx at lists.freedesktop.org
> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>; Ursulin, Tvrtko
> <tvrtko.ursulin at intel.com>
> Subject: RE: [PATCH] drm/i915: Replace hangcheck by heartbeats
> 
> Quoting Bloomfield, Jon (2019-07-26 21:58:38)
> > > From: Chris Wilson <chris at chris-wilson.co.uk>
> > > It's no more often than before, you have to fail to advance within an
> > > interval, and then deny preemption request.
> >
> > It's entrapment. You are creating an artificial workload for the context to
> impede. Before that artificial workload was injected, the context would have
> run to completion, the world would be at peace (and the user would have her
> new bitcoin). Instead she stays poor because the DoS police launched an illegal
> sting on her otherwise genius operation.
> 
> It's housekeeping; it's the cost of keeping the engine alive. This
> argument is why CPU-isolation came about (aiui).
> 
> > > > I do like the rlimit suggestion, but until we have that, just disabling TDR
> feels
> > > like a better stop-gap than nuking innocent compute contexts just in case
> they
> > > might block something.
> > >
> > > They're not innocent though :-p
> >
> > They are innocent until proven guilty :-)
> >
> > >
> > > Disabling hangcheck (no idea why you confuse that with the recovery
> > > procedure) makes igt unhappy, but they are able to do that today with
> > > the modparam. This patch makes it easier to move that to an engine
> > > parameter, but to disable it entirely you still need to be able to reset
> > > the GPU on demand (suspend, eviction, oom). Without hangcheck we need
> to
> > > evaluate all MAX_SCHEDULE_TIMEOUT waits and supply a reset-on-
> timeout
> > > along critical paths.
> > > -Chris
> >
> > I don't think I'm confusing hang-check with the recovery. I've talked about
> TDR, which to me is a periodic hangcheck, combined with a recovery by engine
> reset. I don't argue against being able to reset, just against the blunt
> classification that hangcheck itself provides.
> >
> > TDR was originally looking for regressive workloads that were not making
> forward progress to protect against DoS. But it was always a very blunt tool. It's
> never been able to differentiate long running, but otherwise valid, compute
> workloads from genuine BB hangs, but that was fine at the time, and as you say
> users could always switch the modparam.
> 
> To be honest, I still have no idea what TDR is. But I take it that you
> agree that we're only talking about hangcheck :) What I think you are
> missing out on is that we have some more or less essential (itself
> depending on client behaviour) housekeeping that goes along side it.
> 
> My claim is that without a guarantee of isolation, anything else that
> wants to use that engine will need the background housekeeping. (And if
> we don't go as far as performing complete isolation, I expect userspace
> is going to need the kernel to cleanup as they go along, as they are
> unlikely to be prepared to do the system maintenance themselves.)
> 
> > Now we have more emphasis on compute we need a solution that doesn't
> involve a modparam. This was specifically requested by the compute team -
> they know that they can flip the tdr switch, but that means their workload will
> only run if user modifies the system. That's hardly ideal.
> 
> It means they can adjust things to their admins' hearts' content, and
> it's a udev rule away from setting permissions to allow the compute group
> to freely reconfigure the settings.
> 
> > Without the rlimit concept I don't think we can't prevent power hogs
> whatever we do, any more than the core kernel can prevent CPU power hogs.
> So, if we can prevent a workload from blocking other contexts, then it is
> unhelpful to continue either with the blunt tool that TDR is, or the similarly
> blunt heartbeat. If we have neither of these, but can guarantee forward
> progress when we need to, then we can still protect the system against DoS,
> without condemning any contexts before a crime has been committed.
> 
> rlimits is orthogonal to the problem of preventing an isolated compute
> task from turning into a system-wide dos. For endless, we need
> suspend-to-idle at a very minimum before even considering the dilemma of
> hangcheck. For eviction/oom, suspend-to-idle is not enough, and we need
> to have the same sort of concept as oomkiller; kill a task to save the
> system (or not depending on admin selection).
> -Chris