[Intel-gfx] [PATCH] drm/i915: Optionally disable automatic recovery after a GPU reset

Chris Wilson chris at chris-wilson.co.uk
Wed Oct 3 07:22:34 UTC 2018


Quoting Mika Kuoppala (2018-10-03 07:22:13)
> Chris Wilson <chris at chris-wilson.co.uk> writes:
> 
> > Some clients, such as mesa, may only emit minimal incremental batches
> > that rely on the logical context state from previous batches. They know
> > that recovery is impossible after a hang as their required GPU state is
> > lost, and that each in flight and subsequent batch will hang (resetting
> > the context image back to default perpetuating the problem).
> >
> > To avoid getting into the state in the first place, we can allow clients
> > to opt out of automatic recovery and elect to ban any guilty context
> > following a hang. This prevents the continual stream of hangs and allows
> > the client to recreate their context and rebuild the state from scratch.
> >
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Kenneth Graunke <kenneth at whitecape.org>
> > Cc: Mika Kuoppala <mika.kuoppala at intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_gem.c         |  3 ++-
> >  drivers/gpu/drm/i915/i915_gem_context.c | 12 ++++++++++++
> >  drivers/gpu/drm/i915/i915_gem_context.h | 16 ++++++++++++++++
> >  include/uapi/drm/i915_drm.h             | 20 ++++++++++++++++++++
> >  4 files changed, 50 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index 7d45e71100bc..eee06d90d460 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -3018,7 +3018,8 @@ static void i915_gem_context_mark_guilty(struct i915_gem_context *ctx)
> >  
> >       bannable = i915_gem_context_is_bannable(ctx);
> >       score = atomic_add_return(CONTEXT_SCORE_GUILTY, &ctx->ban_score);
> > -     banned = score >= CONTEXT_SCORE_BAN_THRESHOLD;
> > +     banned = (i915_gem_context_is_unrecoverable(ctx) ||
> > +               score >= CONTEXT_SCORE_BAN_THRESHOLD);
> 
> We treat banned contexts rather aggressively on client level
> banning scoring.
> 
> Should we give some leeway if a client tells it don't
> need recovery, instead of being more harsh on them?
> 
> As with this, third hang would lead to client ban.

Yes, no. I was thinking of the same problem, but ultimately I decided
that it was not worth pursuing atm. It still takes 10s (or thereabouts)
for us to detect a hang, so a client can lock the machine up for 30s
even with the "quicker" recovery. As stated, the client will be banned
anyway as it cannot recover without taking action itself (just after a
few broken batches). So from that perspective, it doesn't look any worse
than before. If it becomes a more noticeable problem (WebGL) we can open
a dialogue as to how to proceed.

> >       /* Cool contexts don't accumulate client ban score */
> >       if (!bannable)
> > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> > index 8cbe58070561..2d5e4119786a 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> > @@ -878,6 +878,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> >       case I915_CONTEXT_PARAM_BANNABLE:
> >               args->value = i915_gem_context_is_bannable(ctx);
> >               break;
> > +     case I915_CONTEXT_PARAM_RECOVERABLE:
> > +             args->value = !i915_gem_context_is_unrecoverable(ctx);
> 
> Pondering here why not the internal coding be also
> i915_gem_context_is_recoverable(ctx). Atleast in here
> would read better.

6 of one, half a dozen of the other. Improve one place, makes the other
equally less readable. I'd buy the argument that we should prefer the
default state to be expressed in the name.
-Chris


More information about the Intel-gfx mailing list