[Intel-gfx] [PATCH 27/28] drm/i915: Cancel non-persistent contexts on close

Mon Aug 26 16:50:44 UTC 2019

Quoting Bloomfield, Jon (2019-08-26 14:39:55)
> > -----Original Message-----
> > From: Chris Wilson <chris at chris-wilson.co.uk>
> > +     /*
> > +      * Send a "high priority pulse" down the engine to cause the
> > +      * current request to be momentarily preempted. (If it fails to
> > +      * be preempted, it will be reset). As we have marked our context
> > +      * as banned, any incomplete request, including any running, will
> > +      * be skipped following the preemption.
> > +      */
> > +     reset = 0;
> > +     for_each_engine_masked(engine, gt->i915, active, tmp)
> > +             if (intel_engine_pulse(engine))
> > +                     reset |= engine->mask;
> > +
> > +     /*
> > +      * If we are unable to send a preemptive pulse to bump
> > +      * the context from the GPU, we have to resort to a full
> > +      * reset. We hope the collateral damage is worth it.
> > +      */
> > +     if (reset)
> > +             intel_gt_handle_error(gt, reset, 0,
> > +                                   "context closure in %s", ctx->name);
> 
> This seems inconsistent with the policy not to allow non-persistence without pre-emption, since if we can't pre-empt we nuke anyway.

So, the only way to get here is if i915.enable_hangcheck=0 on older
hardware; the user has forced ourselves into a situation we do not like.

Having recognised that i915.enable_hangcheck=0 is a trivial way to
accidentally dos (as opposed to the deliberate dos that is the expected
behaviour), this is our mitigation.

> But this feels unsafe to me - How does intel_gt_handle_error prevent us from nuking a following context, instead of the target? Ideally we would:
>         1) Unqueue any context currently behind the target context
>         2) Reset engine only if the target context is running (it could complete during 1)
>         3) Requeue other contexts
> 
> If the above is viable (?) we don't even need to attempt pre-emption.

That's exactly what the safe strategy attempts before we fallback
to the reset path. (With a bit of handwaving over gen8, it could do with
a bit of refinement as it has kernel preemption, but not user
preemption.)

> > +static int
> > +set_persistence(struct i915_gem_context *ctx,
> > +             const struct drm_i915_gem_context_param *args)
> > +{
> > +     if (args->size)
> > +             return -EINVAL;
> > +
> > +     if (args->value) {
> > +             i915_gem_context_set_persistence(ctx);
> > +             return 0;
> > +     }
> > +
> > +     /* To cancel a context we use "preempt-to-idle" */
> > +     if (!(ctx->i915->caps.scheduler &
> 
> Why do we need to give up on older devices? If you fail to preempt you reset the context anyway, so can't we just use the reset fallback path?

The fallback path is to reset the entire gpu with no regard as to what
is actually running. So we only allow the context parameter if we can
safely kill the context on closure.
-Chris