[Intel-gfx] [PATCH 2/2] drm/i915: Use time based guilty context banning

Chris Wilson chris at chris-wilson.co.uk
Tue Feb 19 11:36:14 UTC 2019


Quoting Mika Kuoppala (2019-02-19 11:22:32)
> > diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
> > index 1911e00d2581..bae88a4ea924 100644
> > --- a/drivers/gpu/drm/i915/i915_reset.c
> > +++ b/drivers/gpu/drm/i915/i915_reset.c
> > @@ -59,24 +59,29 @@ static void client_mark_guilty(struct drm_i915_file_private *file_priv,
> >  
> >  static bool context_mark_guilty(struct i915_gem_context *ctx)
> >  {
> > -     unsigned int score;
> > -     bool banned, bannable;
> > +     unsigned long prev_hang;
> > +     bool banned;
> > +     int i;
> >  
> >       atomic_inc(&ctx->guilty_count);
> >  
> > -     bannable = i915_gem_context_is_bannable(ctx);
> > -     score = atomic_add_return(CONTEXT_SCORE_GUILTY, &ctx->ban_score);
> > -     banned = (!i915_gem_context_is_recoverable(ctx) ||
> > -               score >= CONTEXT_SCORE_BAN_THRESHOLD);
> > -
> >       /* Cool contexts don't accumulate client ban score */
> 
> This comment becomes misleading and can be removed.

/* Cool contexts are too cool to be banned */

> 
> > -     if (!bannable)
> > +     if (!i915_gem_context_is_bannable(ctx))
> >               return false;
> >  
> > +     /* Record the timestamp for the last N hangs */
> > +     prev_hang = ctx->hang_timestamp[0];
> > +     for (i = 0; i < ARRAY_SIZE(ctx->hang_timestamp) - 1; i++)
> > +             ctx->hang_timestamp[i] = ctx->hang_timestamp[i + 1];
> > +     ctx->hang_timestamp[i] = jiffies;
> > +
> > +     /* If we have hung N+1 times in rapid succession, we ban the context! */
> > +     banned = !i915_gem_context_is_recoverable(ctx);
> > +     if (time_before(jiffies, prev_hang + CONTEXT_FAST_HANG_JIFFIES))
> > +             banned = true;
> 
> Ok, the initialization to jiffies - CONTEXT_FAST_HANG_JIFFIES guarantees
> that it can't be banned even if it manages to hang immediately. Which
> is in itself a difficult feat to accomplish.
> 
> >       if (banned) {
> > -             DRM_DEBUG_DRIVER("context %s: guilty %d, score %u, banned\n",
> > -                              ctx->name, atomic_read(&ctx->guilty_count),
> > -                              score);
> > +             DRM_DEBUG_DRIVER("context %s: guilty %d, banned\n",
> > +                              ctx->name, atomic_read(&ctx->guilty_count));
> 
> Now that we know when it previously hung, we could improve the
> logging a bit by saying how long ago. But I'm not sure
> whether it is worth the trouble.

~o~

Not sure yet if that will be helpful. The secondary hangs do not get
captured in error-state, so for post-mortem debugging we don't know if
the context is banned (yet). For clients trying to run, the first sign
of being banned is -EIO from gem_execbuf. Looking back at dmesg, there's
just a string of GPU resets; by then it is already too late to add
drm.debug=0x2 and see why.

-EIO means either that the client is banned or that the driver is
wedged. In either case, the first course of action is to debug the
initial hang. (Or, in the case of the mesa recovery patches, to recreate
the context and try again.) From the perspective of the current patches,
we are teaching clients how to handle such events without "disturbing"
the user; that leaves even less incentive to pour a large amount of
debugging into the secondary banning. Fix the primary hang!

Yeah, I'm thinking that we don't need much more than some driver
debugging to clarify that the context was banned, everything else should
be discoverable through the uapi; and the real debugging is always the
first hang.
-Chris

