[Intel-gfx] [PATCH 1/3] drm/i915/gem: Don't leak non-persistent requests on changing engines
Chris Wilson
chris at chris-wilson.co.uk
Fri Feb 7 16:55:35 UTC 2020
Quoting Tvrtko Ursulin (2020-02-07 16:46:55)
>
> If you want quick & dirty feedback, read below; if you want something
> smarter, wait some more. :)
>
> On 07/02/2020 11:11, Chris Wilson wrote:
> > +static void kill_stale_engines(struct i915_gem_context *ctx)
> > +{
> > +	struct i915_gem_engines *pos, *next;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&ctx->stale.lock, flags);
> > +	list_for_each_entry_safe(pos, next, &ctx->stale.engines, link) {
> > +		if (!i915_sw_fence_await(&pos->fence))
> > +			continue;
>
> When is this path hit?
Race with the interrupt callback.
> > +
> > +		spin_unlock_irqrestore(&ctx->stale.lock, flags);
> > +
> > +		kill_engines(pos);
> > +
> > +		spin_lock_irqsave(&ctx->stale.lock, flags);
> > +		list_safe_reset_next(pos, next, link);
> > +		list_del_init(&pos->link);
> > +
> > +		i915_sw_fence_complete(&pos->fence);
>
> This will trigger FENCE_FREE below?
Yes, the final completion sends both notifications.
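To make the locking dance above easier to follow, here is a minimal userspace sketch of the same walk (toy list helpers and names, not the kernel API): the lock is dropped around the potentially sleeping kill step, so after re-taking it the `next` cursor must be refreshed from `pos` before unlinking, which is what `list_safe_reset_next()` does in the kernel.

```c
#include <pthread.h>
#include <stdbool.h>

/* Toy doubly linked list mirroring the kernel's list.h idioms. */
struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h) { h->prev = h->next = h; }

static void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

struct engines {
	struct list_head link;	/* must stay the first member for the cast below */
	bool killed;
};

static pthread_mutex_t stale_lock = PTHREAD_MUTEX_INITIALIZER;
static struct list_head stale_list;

/*
 * Sketch of the kill_stale_engines() walk. In the real code an extra
 * fence reference (i915_sw_fence_await) keeps 'pos' alive while the
 * lock is dropped; here nothing else mutates the list concurrently.
 */
static void kill_stale(void)
{
	struct list_head *pos, *next;

	pthread_mutex_lock(&stale_lock);
	for (pos = stale_list.next; pos != &stale_list; pos = next) {
		struct engines *e = (struct engines *)pos;

		pthread_mutex_unlock(&stale_lock);
		e->killed = true;	/* stands in for kill_engines(pos) */
		pthread_mutex_lock(&stale_lock);

		next = pos->next;	/* list_safe_reset_next(pos, next, link) */
		list_del_init(pos);
	}
	pthread_mutex_unlock(&stale_lock);
}
```

The point of re-reading `pos->next` only after the lock is re-taken is that the list may have been modified while unlocked; a `next` cached before unlocking could point at a freed or unlinked node.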
> > +static int engines_notify(struct i915_sw_fence *fence,
> > +			  enum i915_sw_fence_notify state)
> > +{
> > +	struct i915_gem_engines *engines =
> > +		container_of(fence, typeof(*engines), fence);
> > +
> > +	switch (state) {
> > +	case FENCE_COMPLETE:
> > +		if (!list_empty(&engines->link)) {
>
> Why it is safe to look at the state of engines->link outside the lock?
> We can have a race between context close and completion event on a stale
> engine, right?
There is no race :)
It's just coordination with kill_stale_engines().
> > +static void engines_idle_release(struct i915_gem_engines *engines)
> > +{
> > +	struct i915_gem_engines_iter it;
> > +	struct intel_context *ce;
> > +	unsigned long flags;
> > +
> > +	GEM_BUG_ON(!engines);
> > +	i915_sw_fence_init(&engines->fence, engines_notify);
> > +
> > +	spin_lock_irqsave(&engines->ctx->stale.lock, flags);
> > +	list_add(&engines->link, &engines->ctx->stale.engines);
> > +	spin_unlock_irqrestore(&engines->ctx->stale.lock, flags);
> > +
> > +	for_each_gem_engine(ce, engines, it) {
> > +		struct dma_fence *fence;
> > +		int err;
> > +
> > +		if (!ce->timeline)
> > +			continue;
>
> When does this happen?
Replacing the default engines before use. Or any engine set prior to
use.
> > +
> > +		fence = i915_active_fence_get(&ce->timeline->last_request);
> > +		if (!fence)
> > +			continue;
> > +
> > +		err = i915_sw_fence_await_dma_fence(&engines->fence,
> > +						    fence, 0,
> > +						    GFP_KERNEL);
> > +
> > +		dma_fence_put(fence);
> > +		if (err < 0) {
> > +			kill_engines(engines);
> > +			break;
>
> Okay to leave already setup awaits active in this case?
Yes. They will be signaled. It may seem a bit harsh, but we fell into an
unlikely error path and have to do something.
> > -void i915_sw_fence_await(struct i915_sw_fence *fence)
> > +bool i915_sw_fence_await(struct i915_sw_fence *fence)
> >  {
> > -	debug_fence_assert(fence);
> > -	WARN_ON(atomic_inc_return(&fence->pending) <= 1);
> > +	int old, new;
> > +
> > +	new = atomic_read(&fence->pending);
> > +	do {
> > +		if (new < 1)
> > +			return false;
> > +
> > +		old = new++;
> > +	} while ((new = atomic_cmpxchg(&fence->pending, old, new)) != old);
> > +
> > +	return true;
>
> No idea what's happening here. Why was the existing code inadequate and
> what are you changing?
I needed an await_if_busy to handle the race with the interrupts.
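The await-if-busy idea can be sketched in userspace with C11 atomics (the function name here is illustrative, not the i915 API): take an extra reference on `pending` only while it is still at least 1; once the fence has signalled and `pending` has reached 0, refuse rather than resurrect it. That is what closes the race with the interrupt callback completing the fence concurrently.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical userspace analogue of the new i915_sw_fence_await():
 * atomically increment 'pending' only if it is still >= 1, i.e. the
 * fence has outstanding waits and has not yet signalled.
 */
static bool fence_await_if_busy(atomic_int *pending)
{
	int old = atomic_load(pending);

	do {
		if (old < 1)
			return false;	/* already signalled: too late to await */
	} while (!atomic_compare_exchange_weak(pending, &old, old + 1));

	return true;
}
```

On a failed compare-exchange, `old` is reloaded with the current value, so the `< 1` check is re-evaluated each iteration; this mirrors the `atomic_cmpxchg()` loop in the patch, where the returned value feeds the next round.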
-Chris