[Intel-gfx] [PATCH 14/28] drm/i915/gt: Free stale request on destroying the virtual engine
Chris Wilson
chris at chris-wilson.co.uk
Wed Nov 18 12:10:04 UTC 2020
Quoting Tvrtko Ursulin (2020-11-18 11:38:43)
>
> On 18/11/2020 11:24, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2020-11-18 11:05:24)
> >>
> >> On 17/11/2020 11:30, Chris Wilson wrote:
> >>> Since preempt-to-busy, we may unsubmit a request while it is still on
> >>> the HW and completes asynchronously. That means it may be retired and in
> >>> the process destroy the virtual engine (as the user has closed their
> >>> context), but that engine may still be holding onto the unsubmitted
> >>> completed request. Therefore we need to potentially cleanup the old
> >>> request on destroying the virtual engine. We also have to keep the
> >>> virtual_engine alive until after the sibling's execlists_dequeue() have
> >>> finished peeking into the virtual engines, for which we serialise with
> >>> RCU.
> >>>
> >>> v2: Be paranoid and flush the tasklet as well.
> >>> v3: And flush the tasklet before the engines, as the tasklet may
> >>> re-attach an rb_node after our removal from the siblings.
> >>>
> >>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> >>> ---
> >>> drivers/gpu/drm/i915/gt/intel_lrc.c | 61 +++++++++++++++++++++++++----
> >>> 1 file changed, 54 insertions(+), 7 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>> index 17cb7060eb29..c11433884cf6 100644
> >>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>> @@ -182,6 +182,7 @@
> >>> struct virtual_engine {
> >>> struct intel_engine_cs base;
> >>> struct intel_context context;
> >>> + struct rcu_work rcu;
> >>>
> >>> /*
> >>> * We allow only a single request through the virtual engine at a time
> >>> @@ -5470,44 +5471,90 @@ static struct list_head *virtual_queue(struct virtual_engine *ve)
> >>> return &ve->base.execlists.default_priolist.requests[0];
> >>> }
> >>>
> >>> -static void virtual_context_destroy(struct kref *kref)
> >>> +static void rcu_virtual_context_destroy(struct work_struct *wrk)
> >>> {
> >>> struct virtual_engine *ve =
> >>> - container_of(kref, typeof(*ve), context.ref);
> >>> + container_of(wrk, typeof(*ve), rcu.work);
> >>> unsigned int n;
> >>>
> >>> - GEM_BUG_ON(!list_empty(virtual_queue(ve)));
> >>> - GEM_BUG_ON(ve->request);
> >>> GEM_BUG_ON(ve->context.inflight);
> >>>
> >>> + /* Preempt-to-busy may leave a stale request behind. */
> >>> + if (unlikely(ve->request)) {
> >>> + struct i915_request *old;
> >>> +
> >>> + spin_lock_irq(&ve->base.active.lock);
> >>> +
> >>> + old = fetch_and_zero(&ve->request);
> >>> + if (old) {
> >>> + GEM_BUG_ON(!i915_request_completed(old));
> >>> + __i915_request_submit(old);
> >>> + i915_request_put(old);
> >>> + }
> >>> +
> >>> + spin_unlock_irq(&ve->base.active.lock);
> >>> + }
> >>> +
> >>> + /*
> >>> + * Flush the tasklet in case it is still running on another core.
> >>> + *
> >>> + * This needs to be done before we remove ourselves from the siblings'
> >>> + * rbtrees as in the case it is running in parallel, it may reinsert
> >>> + * the rb_node into a sibling.
> >>> + */
> >>> + tasklet_kill(&ve->base.execlists.tasklet);
> >>
> >> Can it still be running after an RCU period?
> >
> > I think there is a window between checking whether the request is
> > completed and kicking the tasklet that is not under the RCU lock,
> > giving an opportunity for the request to be retired and the barrier
> > flushed to drop the context references.
>
> From where would that check come?
The window of opportunity extends all the way from the
i915_request_completed() check during unsubmit until the virtual engine
tasklet is executed. We do not hold a reference to the virtual engine on
behalf of the tasklet, so in the meantime the request may be retired in
the background, and along with it the virtual engine destroyed.
-Chris