[Intel-gfx] [PATCH 09/11] drm/i915/gem: Consolidate ctx->engines[] release

Wed Feb 26 17:06:37 UTC 2020

Quoting Tvrtko Ursulin (2020-02-26 16:41:03)
> 
> On 25/02/2020 08:22, Chris Wilson wrote:
> > Use the same engine_idle_release() routine for cleaning all old
> > ctx->engine[] state, closing any potential races with concurrent execbuf
> > submission.
> > 
> > Closes: https://gitlab.freedesktop.org/drm/intel/issues/1241
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > ---
> > Reorder set-closed/engine_idle_release to avoid premature killing
> > Take a reference to prevent racing context free with engine cleanup
> > ---
> >   drivers/gpu/drm/i915/gem/i915_gem_context.c | 199 +++++++++++---------
> >   drivers/gpu/drm/i915/gem/i915_gem_context.h |   1 -
> >   2 files changed, 108 insertions(+), 92 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > index adcebf22a3d3..0862a77d81ed 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > @@ -243,7 +243,6 @@ static void __free_engines(struct i915_gem_engines *e, unsigned int count)
> >               if (!e->engines[count])
> >                       continue;
> >   
> > -             RCU_INIT_POINTER(e->engines[count]->gem_context, NULL);
> >               intel_context_put(e->engines[count]);
> >       }
> >       kfree(e);
> > @@ -256,7 +255,11 @@ static void free_engines(struct i915_gem_engines *e)
> >   
> >   static void free_engines_rcu(struct rcu_head *rcu)
> >   {
> > -     free_engines(container_of(rcu, struct i915_gem_engines, rcu));
> > +     struct i915_gem_engines *engines =
> > +             container_of(rcu, struct i915_gem_engines, rcu);
> > +
> > +     i915_sw_fence_fini(&engines->fence);
> 
> This was missing so far?

Yes. Completely missed it until throwing it in a loop long enough for
kmalloc recycling to catch up. And having ODEBUG enabled helps!

> > +static int engines_notify(struct i915_sw_fence *fence,
> > +                       enum i915_sw_fence_notify state)
> > +{
> > +     struct i915_gem_engines *engines =
> > +             container_of(fence, typeof(*engines), fence);
> > +
> > +     switch (state) {
> > +     case FENCE_COMPLETE:
> > +             if (!list_empty(&engines->link)) {
> > +                     struct i915_gem_context *ctx = engines->ctx;
> > +                     unsigned long flags;
> > +
> > +                     spin_lock_irqsave(&ctx->stale.lock, flags);
> > +                     list_del(&engines->link);
> > +                     spin_unlock_irqrestore(&ctx->stale.lock, flags);
> > +             }
> > +             break;
> > +
> > +     case FENCE_FREE:
> > +             i915_gem_context_put(engines->ctx);
> 
> This put can go under FENCE_COMPLETE?

Yes. Either works, I thought it was more of a release operation. But if
you would rather FENCE_FREE == just call_rcu(free_engines_rcu), I can see
the elegance in that.

> > +             init_rcu_head(&engines->rcu);
> > +             call_rcu(&engines->rcu, free_engines_rcu);
> > +             break;
> > +     }
> > +
> > +     return NOTIFY_DONE;
> > +}
> > +
> > +static void engines_idle_release(struct i915_gem_context *ctx,
> > +                              struct i915_gem_engines *engines)
> > +{
> > +     struct i915_gem_engines_iter it;
> > +     struct intel_context *ce;
> > +
> > +     i915_sw_fence_init(&engines->fence, engines_notify);
> > +     INIT_LIST_HEAD(&engines->link);
> > +
> > +     engines->ctx = i915_gem_context_get(ctx);
> > +
> > +     for_each_gem_engine(ce, engines, it) {
> > +             int err = 0;
> > +
> > +             RCU_INIT_POINTER(ce->gem_context, NULL);
> > +
> > +             if (!ce->timeline) { /* XXX serialisation with execbuf? */
> > +                     intel_context_set_banned(ce);
> 
> What is banned for?

Banned is how we prevent further execution. The problem here is making
sure we catch concurrent execbuf allocating/pinning the context. This
does not and leaves a window in which between the !ce->timline and
set_banned the other thread could see in with the hanging batch :|

On the other hand, we don't want to mark the context as banned too
early. So we unfortunately can't mark it unconditionally.

> > +                     continue;
> > +             }
> > +
> > +             mutex_lock(&ce->timeline->mutex);
> > +             if (!list_empty(&ce->timeline->requests)) {
> > +                     struct i915_request *rq;
> > +
> > +                     rq = list_last_entry(&ce->timeline->requests,
> > +                                          typeof(*rq),
> > +                                          link);
> 
> Why no more i915_active_fence_get?

I was looking for something concrete with which we can serialise with
execbuf, the timeline mutex is one and we can check for a late ban
inside execbuf.

But there's still the tiny window above.

Hmm. Actually the ce->pin_mutex might work^Whelp for execbuf serialisation.
Not by itself it won't though. But it should be able to close the
!ce->timeline hole...
-Chris