[Intel-gfx] [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close

Wed Aug 7 14:04:16 UTC 2019

> -----Original Message-----
> From: Chris Wilson <chris at chris-wilson.co.uk>
> Sent: Wednesday, August 7, 2019 6:23 AM
> To: intel-gfx at lists.freedesktop.org
> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>; Winiarski, Michal
> <michal.winiarski at intel.com>; Bloomfield, Jon <jon.bloomfield at intel.com>
> Subject: Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
> 
> Quoting Chris Wilson (2019-08-06 14:47:25)
> > @@ -433,6 +482,8 @@ __create_context(struct drm_i915_private *i915)
> >
> >         i915_gem_context_set_bannable(ctx);
> >         i915_gem_context_set_recoverable(ctx);
> > +       if (i915_modparams.enable_hangcheck)
> > +               i915_gem_context_set_persistence(ctx);
> 
> I am not fond of this, but from a pragmatic point of view, this does
> prevent the issue Jon raised: HW resources being pinned indefinitely
> past process termination.
> 
> I don't like it because we cannot perform the operation cleanly
> everywhere, it requires preemption first and foremost (with a cooperating
> submission backend) and per-engine reset. The alternative is to try and
> do a full GPU reset if the context is still active. For the sake of the
> issue raised, I think that (full reset on older HW) is required.
> 
> That we are baking in a change of ABI due to an unsafe modparam is meh.
> There are a few more corner cases to deal with before endless just
> works. But since it is being used in the wild, I'm not sure we can wait
> for userspace to opt-in, or wait for cgroups. However, since users are
> being encouraged to disable hangcheck, should we extend the concept of
> persistence to also mean "survives hangcheck"? -- though it should be a
> separate parameter, and I'm not sure how exactly to protect it from the
> heartbeat reset without giving gross privileges to the context. (CPU
> isolation is nicer from the pov where we can just give up and not even
> worry about the engine. If userspace can request isolation, it has the
> privilege to screw up.)
> -Chris

Ok, so your concern is supporting non-persistence on older non-preempting, engine-reset capable, hardware. Why is that strictly required? Can't we simply make it dependent on the features needed to do it well, and if your hardware cannot, then the advice is not to disable hangcheck? I'm doubtful that anyone would attempt a HPC type workload on n IVB.

I'm not sure I understand your "survives hangcheck" idea. You mean instead of simply disabling hangcheck, just opt in to persistence and have that also prevent hangcheck? Isn't that the wrong way around, since persistence is what is happening today?