[Intel-gfx] [PATCH 20/30] drm/i915: Cancel non-persistent contexts on close

Chris Wilson chris at chris-wilson.co.uk
Wed Oct 2 14:05:04 UTC 2019


Quoting Bloomfield, Jon (2019-10-02 14:52:32)
> 
> 
> > -----Original Message-----
> > From: Chris Wilson <chris at chris-wilson.co.uk>
> > Sent: Wednesday, October 2, 2019 4:20 AM
> > To: intel-gfx at lists.freedesktop.org
> > Cc: Chris Wilson <chris at chris-wilson.co.uk>; Joonas Lahtinen
> > <joonas.lahtinen at linux.intel.com>; Winiarski, Michal
> > <michal.winiarski at intel.com>; Bloomfield, Jon <jon.bloomfield at intel.com>
> > Subject: [PATCH 20/30] drm/i915: Cancel non-persistent contexts on close
> > 
> > Normally, we rely on our hangcheck to prevent persistent batches from
> > hogging the GPU. However, if the user disables hangcheck, this mechanism
> > breaks down. Despite our insistence that this is unsafe, the users are
> > equally insistent that they want to use endless batches and will disable
> > the hangcheck mechanism. We are looking at perhaps replacing hangcheck
> > with a softer mechanism, that sends a pulse down the engine to check if
> > it is well. We can use the same preemptive pulse to flush an active
> > persistent context off the GPU upon context close, preventing resources
> > being lost and unkillable requests remaining on the GPU after process
> > termination. To avoid changing the ABI and accidentally breaking
> > existing userspace, we make the persistence of a context explicit and
> > enable it by default (matching current ABI). Userspace can opt out of
> > persistent mode (forcing requests to be cancelled when the context is
> > closed by process termination or explicitly) by a context parameter. To
> > facilitate existing use-cases of disabling hangcheck, if the modparam is
> > disabled (i915.enable_hangcheck=0), we disable persistence mode by
> > default.  (Note, one of the outcomes for supporting endless mode will be
> > the removal of hangchecking, at which point opting into persistent mode
> > will be mandatory, or maybe the default perhaps controlled by cgroups.)
> > 
> > Testcase: igt/gem_ctx_persistence
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> > Cc: MichaƂ Winiarski <michal.winiarski at intel.com>
> > Cc: Jon Bloomfield <jon.bloomfield at intel.com>
> > Reviewed-by: Jon Bloomfield <jon.bloomfield at intel.com>
> > ---
> >  drivers/gpu/drm/i915/Makefile                 |   1 +
> >  drivers/gpu/drm/i915/gem/i915_gem_context.c   | 122 ++++++++++++++++++
> >  drivers/gpu/drm/i915/gem/i915_gem_context.h   |  15 +++
> >  .../gpu/drm/i915/gem/i915_gem_context_types.h |   1 +
> >  .../gpu/drm/i915/gem/selftests/mock_context.c |   2 +
> >  .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |  56 ++++++++
> >  .../gpu/drm/i915/gt/intel_engine_heartbeat.h  |  14 ++
> >  drivers/gpu/drm/i915/gt/intel_engine_pm.c     |   2 +-
> >  drivers/gpu/drm/i915/i915_priolist_types.h    |   1 +
> >  include/uapi/drm/i915_drm.h                   |  15 +++
> >  10 files changed, 228 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> >  create mode 100644 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.h
> > 
> > diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> > index 1f87c70a2842..561fa2bb3006 100644
> > --- a/drivers/gpu/drm/i915/Makefile
> > +++ b/drivers/gpu/drm/i915/Makefile
> > @@ -76,6 +76,7 @@ gt-y += \
> >       gt/intel_breadcrumbs.o \
> >       gt/intel_context.o \
> >       gt/intel_engine_cs.o \
> > +     gt/intel_engine_heartbeat.o \
> >       gt/intel_engine_pm.o \
> >       gt/intel_engine_pool.o \
> >       gt/intel_engine_sysfs.o \
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > index 0ab416887fc2..e832238a72e5 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> 
> <SNIP>
> 
> > +static int
> > +set_persistence(struct i915_gem_context *ctx,
> > +             const struct drm_i915_gem_context_param *args)
> > +{
> > +     if (args->size)
> > +             return -EINVAL;
> > +
> > +     if (args->value) {
> > +             i915_gem_context_set_persistence(ctx);
> > +             return 0;
> > +     }
> > +
> > +     /* To cancel a context we use "preempt-to-idle" */
> > +     if (!(ctx->i915->caps.scheduler &
> > I915_SCHEDULER_CAP_PREEMPTION))
> > +             return -ENODEV;
> > +
> > +     i915_gem_context_clear_persistence(ctx);
> > +     return 0;
> > +}
> 
> Hmmn. Given that disabling hangcheck is an explicit operation, and we already change the default setting, can't we make it a hard requirement that persistence requires hangcheck? You should not really be able to opt back in to persistence if hangcheck is disabled. In fact you could just test for hangcheck when deciding whether to kill the context, and force-kill if it is off - that way if hangcheck is disabled after a context starts it will get cleaned up.

I hear the toctou argument, but I really, really want to avoid coupling
in the modparam as much possible (I'd rather kill off the modparam
entirely for being too coarse).

But certainly having a check here saying you cannot re-enable
persistence without hangcheck seems valid. Let's see how that looks.
(That takes a bit of lalala to ignore the toctou.)
-Chris


More information about the Intel-gfx mailing list