[Intel-gfx] [PATCH] drm/i915: Replace hangcheck by heartbeats

Fri Jul 26 21:29:39 UTC 2019

Quoting Bloomfield, Jon (2019-07-26 21:58:38)
> > From: Chris Wilson <chris at chris-wilson.co.uk>
> > It's no more often than before, you have to fail to advance within an
> > interval, and then deny preemption request.
> 
> It's entrapment. You are creating an artificial workload for the context to impede. Before that artificial workload was injected, the context would have run to completion, the world would be at peace (and the user would have her new bitcoin). Instead she stays poor because the DoS police launched an illegal sting on her otherwise genius operation.

It's housekeeping; it's the cost of keeping the engine alive. This
argument is why CPU-isolation came about (aiui).

> > > I do like the rlimit suggestion, but until we have that, just disabling TDR feels
> > like a better stop-gap than nuking innocent compute contexts just in case they
> > might block something.
> > 
> > They're not innocent though :-p
> 
> They are innocent until proven guilty :-)
> 
> > 
> > Disabling hangcheck (no idea why you confuse that with the recovery
> > procedure) makes igt unhappy, but they are able to do that today with
> > the modparam. This patch makes it easier to move that to an engine
> > parameter, but to disable it entirely you still need to be able to reset
> > the GPU on demand (suspend, eviction, oom). Without hangcheck we need to
> > evaluate all MAX_SCHEDULE_TIMEOUT waits and supply a reset-on-timeout
> > along critical paths.
> > -Chris
> 
> I don't think I'm confusing hang-check with the recovery. I've talked about TDR, which to me is a periodic hangcheck, combined with a recovery by engine reset. I don't argue against being able to reset, just against the blunt classification that hangcheck itself provides.
> 
> TDR was originally looking for regressive workloads that were not making forward progress to protect against DoS. But it was always a very blunt tool. It's never been able to differentiate long running, but otherwise valid, compute workloads from genuine BB hangs, but that was fine at the time, and as you say users could always switch the modparam.

To be honest, I still have no idea what TDR is. But I take it that you
agree that we're only talking about hangcheck :) What I think you are
missing out on is that we have some more or less essential (itself
depending on client behaviour) housekeeping that goes along side it.

My claim is that without a guarantee of isolation, anything else that
wants to use that engine will need the background housekeeping. (And if
we don't go as far as performing complete isolation, I expect userspace
is going to need the kernel to cleanup as they go along, as they are
unlikely to be prepared to do the system maintenance themselves.)

> Now we have more emphasis on compute we need a solution that doesn't involve a modparam. This was specifically requested by the compute team - they know that they can flip the tdr switch, but that means their workload will only run if user modifies the system. That's hardly ideal.

It means they can adjust things to their admins' hearts' content, and
it's a udev rule away from setting permissions to allow the compute group
to freely reconfigure the settings.

> Without the rlimit concept I don't think we can't prevent power hogs whatever we do, any more than the core kernel can prevent CPU power hogs. So, if we can prevent a workload from blocking other contexts, then it is unhelpful to continue either with the blunt tool that TDR is, or the similarly blunt heartbeat. If we have neither of these, but can guarantee forward progress when we need to, then we can still protect the system against DoS, without condemning any contexts before a crime has been committed. 

rlimits is orthogonal to the problem of preventing an isolated compute
task from turning into a system-wide dos. For endless, we need
suspend-to-idle at a very minimum before even considering the dilemma of
hangcheck. For eviction/oom, suspend-to-idle is not enough, and we need
to have the same sort of concept as oomkiller; kill a task to save the
system (or not depending on admin selection).
-Chris