[Intel-gfx] [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close

Wed Aug 7 14:33:51 UTC 2019

> -----Original Message-----
> From: Chris Wilson <chris at chris-wilson.co.uk>
> Sent: Wednesday, August 7, 2019 7:14 AM
> To: Bloomfield, Jon <jon.bloomfield at intel.com>; intel-
> gfx at lists.freedesktop.org
> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>; Winiarski, Michal
> <michal.winiarski at intel.com>
> Subject: RE: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
> 
> Quoting Bloomfield, Jon (2019-08-07 15:04:16)
> > Ok, so your concern is supporting non-persistence on older non-preempting,
> engine-reset capable, hardware. Why is that strictly required? Can't we simply
> make it dependent on the features needed to do it well, and if your hardware
> cannot, then the advice is not to disable hangcheck? I'm doubtful that anyone
> would attempt an HPC-type workload on an IVB.
> 
> Our advice was not to disable hangcheck :)
> 
> With the cat out of the bag, my concern is dotting the Is and crossing
> the Ts. Fixing up the error handling path to the reset isn't all that
> bad. But I'm not going to advertise the persistence-parameter support
> unless we have a clean solution, and we can advise that compute
> workloads are better handled with new hardware.
> 
> > I'm not sure I understand your "survives hangcheck" idea. You mean instead
> of simply disabling hangcheck, just opt in to persistence and have that also
> prevent hangcheck? Isn't that the wrong way around, since persistence is what
> is happening today?
> 
> Persistence is the clear-and-present danger. I'm just trying to sketch a
> path for endless support, trying to ask myself questions such as: Is the
> persistence parameter still required? What other parameters make sense?
> Does anything less than CPU-esque isolation make sense? :)
> -Chris

I personally liked your persistence idea :-)

Isolation doesn't really solve the problem in this case. Say a customer enables isolation for an HPC workload. That workload hangs, and the user hits ctrl-C. We still have a hung workload, and the next job in the queue still can't run.

Also, isolation is kind of meaningless when only one engine is capable of running your workload. On Gen9, pretty much every type of workload requires some RCS involvement, and RCS is where the compute workloads need to run. So isolation doesn't help us.

I'd settle for the UMD opting in to non-persistence, without the automatic association with hangcheck. That at least ensures well-behaved UMDs don't kill the system.

We didn't explore the idea of terminating orphaned contexts, though (those where none of their resources are referenced by any other context). Is there a reason why this is not feasible? In the case of compute (certainly HPC) workloads, there would be no compositor taking the output, so this might be a solution.
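
To make the "orphaned" test concrete, here is a rough sketch of the check I have in mind. It is a toy model only, not i915 code: struct toy_context, toy_references() and toy_is_orphaned() are names made up for illustration, and I am assuming that only contexts which are still open count as "other contexts" holding a reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_OBJECTS 8

/* Hypothetical toy model, not the i915 data structures. */
struct toy_context {
	bool closed;                  /* owner has closed it (exit, ctrl-C, ...) */
	size_t nr_objects;            /* number of valid entries in objects[] */
	int objects[TOY_MAX_OBJECTS]; /* ids of the buffer objects it references */
};

/* Does this context hold a reference to the given object id? */
bool toy_references(const struct toy_context *ctx, int object)
{
	assert(ctx->nr_objects <= TOY_MAX_OBJECTS);

	for (size_t i = 0; i < ctx->nr_objects; i++)
		if (ctx->objects[i] == object)
			return true;
	return false;
}

/*
 * A context is treated as orphaned if it has been closed and none of its
 * objects is referenced by any other context that is still open.
 * Assumption: references held by other closed contexts are ignored.
 */
bool toy_is_orphaned(const struct toy_context *ctxs, size_t nr, size_t idx)
{
	const struct toy_context *self = &ctxs[idx];

	if (!self->closed)
		return false;

	for (size_t other = 0; other < nr; other++) {
		if (other == idx || ctxs[other].closed)
			continue;
		for (size_t i = 0; i < self->nr_objects; i++)
			if (toy_references(&ctxs[other], self->objects[i]))
				return false;
	}
	return true;
}
```

In this model a closed compute context with private buffers would be reported as orphaned and could be cancelled, while a closed context whose output buffer is still held by an open context (the compositor case) would be left alone. The real check would obviously have to deal with locking and object lifetimes, which the sketch ignores.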

Jon