[Intel-gfx] [CI 1/4] drm/i915/gt: Try to more gracefully quiesce the system before resets

Chris Wilson chris at chris-wilson.co.uk
Wed Oct 23 13:28:12 UTC 2019


Quoting Mika Kuoppala (2019-10-23 14:21:01)
> Chris Wilson <chris at chris-wilson.co.uk> writes:
> 
> > If we are doing a normal GPU reset triggered after detecting a long
> > period of stalled work, we can take our time and allow the engines to
> > quiesce. Since we've stopped submission to the engine, and if we wait
> > long enough an innocent context should complete, leaving the engine idle.
> > So by waiting a short amount of time, we should prevent clobbering other
> > users when resetting a stuck context.
> >
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> > Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> > ---
> >  drivers/gpu/drm/i915/Kconfig.profile         | 11 +++++++++++
> >  drivers/gpu/drm/i915/gt/intel_engine_cs.c    | 20 +++++++++++++++++++-
> >  drivers/gpu/drm/i915/gt/intel_engine_types.h |  4 ++++
> >  3 files changed, 34 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> > index 48df8889a88a..97f01bfeda41 100644
> > --- a/drivers/gpu/drm/i915/Kconfig.profile
> > +++ b/drivers/gpu/drm/i915/Kconfig.profile
> > @@ -25,3 +25,14 @@ config DRM_I915_SPIN_REQUEST
> >         May be 0 to disable the initial spin. In practice, we estimate
> >         the cost of enabling the interrupt (if currently disabled) to be
> >         a few microseconds.
> > +
> > +config DRM_I915_STOP_TIMEOUT
> > +     int "How long to wait for an engine to quiesce gracefully before reset (ms)"
> > +     default 100 # milliseconds
> > +     help
> > +       By stopping submission and sleeping for a short time before resetting
> > +       the GPU, we allow the innocent contexts also on the system to quiesce.
> > +       It is then less likely for a hanging context to cause collateral
> > +       damage as the system is reset in order to recover. The colorary is
> 
> s/colorary/corollary/
> 
> I am not claiming that I would know a better value for this tunable.
> 
> But at least currently, with the hangcheck periods we have, I think
> there is room to spend more time on the actual reset processing.
> 
> We could go as far as we start to idle the other engines
> in parallel, when one shows symptoms. But well perhaps
> the effect is the same as shortening the detection cycle.

True, the other idea I think I may experiment with is pushing the
stalled flag down. There's no point waiting for the engine if we've
declared it hung already, and that should eliminate the need for the if
(in_atomic). I think the essence of the patch stands -- we can reset
more gracefully if we wait.
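
As a rough illustration of the idea discussed above (wait a bounded time
for the engine to idle, but skip the wait when the engine has already been
declared hung), here is a minimal userspace sketch. It is not the i915
code: struct fake_engine, quiesce_before_reset() and the simulated 1 ms
polling step are all hypothetical names and assumptions made for this
example. Only the 100 ms default comes from the proposed
DRM_I915_STOP_TIMEOUT option.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an engine; not the i915 structures. */
struct fake_engine {
	unsigned int busy_ms;	/* simulated time until in-flight work completes */
	bool stalled;		/* engine has already been declared hung */
};

/*
 * Wait up to timeout_ms for the engine to go idle before a reset.
 * Time is simulated: each loop iteration stands for 1 ms passing.
 * Returns true if the engine idled within the timeout, false if it was
 * already declared hung or was still busy when the timeout expired.
 * *waited_ms reports how long we (pretended to) wait.
 */
static bool quiesce_before_reset(struct fake_engine *e,
				 unsigned int timeout_ms,
				 unsigned int *waited_ms)
{
	assert(e && waited_ms);

	*waited_ms = 0;

	/* No point waiting for an engine we have already declared hung. */
	if (e->stalled)
		return false;

	while (e->busy_ms > 0 && *waited_ms < timeout_ms) {
		e->busy_ms--;		/* simulate 1 ms of progress */
		(*waited_ms)++;
	}

	return e->busy_ms == 0;
}
```

With a 100 ms timeout, an innocent context needing 30 ms completes and
the call returns true after 30 simulated ms. A context needing 500 ms
times out after 100 ms. An engine flagged as stalled returns immediately
without waiting.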

I probably should make it a Suggested-by Joonas & Jon.
-Chris

