[Intel-gfx] [PATCH] drm/i915: Replace hangcheck by heartbeats

Mon Jul 29 09:26:47 UTC 2019

Quoting Chris Wilson (2019-07-27 01:27:02)
> Quoting Bloomfield, Jon (2019-07-26 23:19:38)
> > Hmmn. We're still on orthogonal perspectives as far as our previous arguments stand. But it doesn't matter because while thinking through your replies, I realized there is one argument in favour, which trumps all my previous arguments against this patch - it makes things deterministic. Without this patch (or hangcheck), whether a context gets nuked depends on what else is running. And that's a recipe for confused support emails.
> > 
> > So I retract my other arguments, thanks for staying with me :-)
> 
> No worries, it's been really useful, especially realising a few more
> areas we can improve our resilience. You will get your way eventually.
> (But what did it cost? Everything.)

Ok, so just confirming here. The plan is still to have userspace set a
per-context (or per-request) time limit for the expected completion of a
request. This will be useful for media workloads, which consume a
deterministic amount of time for a correct bitstream. There, userspace
wants to be notified much more quickly than the generic hangcheck
timeout if the operation failed due to a corrupt bitstream.

This time limit can be set to infinite by compute workloads.

Then, in parallel to that, we have cgroups or system-wide configuration
for the maximum allowed timeslice per process/context. That means that a
long-running workload must be pre-emptible at that granularity.

That pre-emption/heartbeat should happen regardless of whether other
contexts are requesting the hardware or not, because it is better to
start recovery of a hung task as soon as it misbehaves.
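
To make the above concrete, here is a rough sketch of the decision a
heartbeat tick would take under that plan. This is not i915 code: every
name in it (heartbeat_tick, ctx_policy, sys_policy, LIMIT_INFINITE) is
made up for illustration, and the grace period before a missed
pre-emption is treated as a hang is my own assumption, not something
agreed in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* All names below are hypothetical, for illustration only. */
#define LIMIT_INFINITE UINT64_MAX

enum hb_action {
	HB_NONE,    /* leave the context running */
	HB_PREEMPT, /* timeslice used up: ask the context to yield */
	HB_RESET,   /* limit exceeded or no yield: start recovery */
};

/* Set by userspace per context (or per request). */
struct ctx_policy {
	uint64_t completion_limit_ns; /* LIMIT_INFINITE for compute */
};

/* Set by the administrator, e.g. via cgroups or a system-wide knob. */
struct sys_policy {
	uint64_t max_timeslice_ns; /* longest run without a pre-emption point */
	uint64_t preempt_grace_ns; /* assumed: time allowed to honour a pre-emption */
};

/*
 * Decide what one heartbeat tick does to the running request.
 * There is deliberately no "is anyone else waiting?" argument: the
 * outcome depends only on the context's own behaviour.
 */
static enum hb_action heartbeat_tick(const struct sys_policy *sys,
				     const struct ctx_policy *ctx,
				     uint64_t runtime_ns,
				     uint64_t since_yield_ns)
{
	assert(sys && ctx);

	/* Userspace's own expectation was missed, e.g. a corrupt bitstream. */
	if (ctx->completion_limit_ns != LIMIT_INFINITE &&
	    runtime_ns > ctx->completion_limit_ns)
		return HB_RESET;

	if (since_yield_ns > sys->max_timeslice_ns) {
		uint64_t overrun_ns = since_yield_ns - sys->max_timeslice_ns;

		/* Pre-emption was due but never happened: treat as hung. */
		if (overrun_ns > sys->preempt_grace_ns)
			return HB_RESET;
		return HB_PREEMPT;
	}

	return HB_NONE;
}
```

With a 1000ns timeslice and a 100ns grace period, a compute context with
an infinite completion limit gets pre-empted once it has run 1050ns
without yielding and reset at 1200ns, whether or not anything else is
queued. A media context with a 500ns completion limit is reset as soon
as its request has run for longer than that.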

Regards, Joonas