[Intel-gfx] [PATCH] drm/i915: Replace hangcheck by heartbeats

Mon Jul 29 09:45:52 UTC 2019

Quoting Joonas Lahtinen (2019-07-29 10:26:47)
> Quoting Chris Wilson (2019-07-27 01:27:02)
> > Quoting Bloomfield, Jon (2019-07-26 23:19:38)
> > > Hmmn. We're still on orthogonal perspectives as far as our previous arguments stand. But it doesn't matter because while thinking through your replies, I realized there is one argument in favour, which trumps all my previous arguments against this patch - it makes things deterministic. Without this patch (or hangcheck), whether a context gets nuked depends on what else is running. And that's a recipe for confused support emails.
> > > 
> > > So I retract my other arguments, thanks for staying with me :-)
> > 
> > No worries, it's been really useful, especially realising a few more
> > areas we can improve our resilience. You will get your way eventually.
> > (But what did it cost? Everything.)
> 
> Ok, so just confirming here. The plan is still to have userspace set a
> per context (or per request) time limit for expected completion of a
> request. This will be useful for the media workloads that consume a
> deterministic amount of time for a correct bitstream. And the userspace
> wants to be notified much quicker than the generic hangcheck time if
> the operation failed due to a corrupt bitstream.
> 
> This time limit can be set to infinite by compute workloads.

That only provides a cap on the context itself. We also have the
criterion that if something else has been selected to run on the GPU,
you have to allow preemption within a certain period or else you will
be shot.

> Then, in parallel to that, we have cgroups or system wide configuration
> for maximum allowed timeslice per process/context. That means that a
> long-running workload must pre-empt at that granularity.

Not quite. It must preempt within a few ms of being asked; that is a
different problem to the timeslice granularity (which is when we ask it
to switch, unless a high priority request makes us ask earlier). It's a
QoS issue for the other context. Setting that timeout is hard: we can
allow a context to select its own timeout, or define it via
sysfs/cgroups, but because it depends on the running context, it causes
another context to fail in non-trivial ways. The GPU is simply not as
preemptible as one would like.
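
To make the distinction between the two knobs concrete, here is a toy
model in Python. It is not i915 code: EnginePolicy, the function names
and the millisecond values are all made up for illustration, and it
assumes another context is already waiting for the engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EnginePolicy:
    """Hypothetical per-engine policy; the field names are illustrative only."""
    timeslice_ms: float        # how long a context runs before we ask it to yield
    preempt_timeout_ms: float  # how long it has to yield once it has been asked


def preempt_request_time(policy: EnginePolicy,
                         high_prio_arrival_ms: Optional[float] = None) -> float:
    """Time (ms after the context started running) at which we ask it to yield.

    Normally that is the end of its timeslice, but a higher priority
    request arriving earlier makes us ask earlier.
    """
    if high_prio_arrival_ms is not None and high_prio_arrival_ms < policy.timeslice_ms:
        return high_prio_arrival_ms
    return policy.timeslice_ms


def is_reset(policy: EnginePolicy, yield_latency_ms: float) -> bool:
    """True if the context fails to yield within the preempt timeout.

    Only the latency after being asked matters here; how long the
    context was allowed to run beforehand is a separate question.
    """
    return yield_latency_ms > policy.preempt_timeout_ms
```

In this model a context that takes 50ms to reach a preemption point is
reset under a 5ms preempt timeout whether it was asked to yield at the
end of its timeslice or earlier because of a high priority request.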

Fwiw, I was thinking the next step would be to put per-engine controls
in sysfs, then cross the cgroups bridge. I'm not sure my previous plan
of exposing per-context parameters for timeslice/preemption is that
usable.

> That pre-emption/heartbeat should happen regardless of whether other
> contexts are requesting the hardware, because it is better to start
> recovery of a hung task as soon as it misbehaves.

I concur, but Jon would like the opposite to allow for uncooperative
compute kernels that simply block preemption forever. I think for the
extreme Jon wants, something like CPU-isolation fits better, where the
important client owns an engine all to itself and the kernel is not even
allowed to do housekeeping on that engine. (We would turn off time-
slicing, preemption timers, etc on that engine and basically run it in
submission order.)
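
As a rough sketch of what "run it in submission order" would mean for
an isolated engine, compared with the usual priority ordering (a toy
Python model; execution_order and the (name, priority) tuples are
hypothetical and deliberately ignore timeslicing and preemption of a
request that is already running):

```python
from typing import List, Tuple

Request = Tuple[str, int]  # (name, priority); higher priority runs first


def execution_order(requests: List[Request], isolated: bool) -> List[str]:
    """Order in which queued requests would run on one engine.

    `requests` is given in submission order. On an isolated engine the
    scheduler does not reorder anything; otherwise higher priority goes
    first, with submission order preserved among equal priorities.
    """
    if isolated:
        return [name for name, _ in requests]
    # sorted() is stable, so equal priorities keep their submission order.
    return [name for name, _ in sorted(requests, key=lambda r: -r[1])]
```

With requests a (prio 0), b (prio 5) and c (prio 0) submitted in that
order, the isolated engine runs a, b, c, while a normally scheduled
engine runs b, a, c.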
-Chris