[Intel-gfx] [PATCH 14/28] drm/i915/gt: Free stale request on destroying the virtual engine
Chris Wilson
chris at chris-wilson.co.uk
Wed Nov 18 12:10:04 UTC 2020
Quoting Tvrtko Ursulin (2020-11-18 11:38:43)
>
> On 18/11/2020 11:24, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2020-11-18 11:05:24)
> >>
> >> On 17/11/2020 11:30, Chris Wilson wrote:
> >>> Since preempt-to-busy, we may unsubmit a request while it is still on
> >>> the HW and completes asynchronously. That means it may be retired and in
> >>> the process destroy the virtual engine (as the user has closed their
> >>> context), but that engine may still be holding onto the unsubmitted
> >>> completed request. Therefore we need to potentially cleanup the old
> >>> request on destroying the virtual engine. We also have to keep the
> >>> virtual_engine alive until after the sibling's execlists_dequeue() have
> >>> finished peeking into the virtual engines, for which we serialise with
> >>> RCU.
> >>>
> >>> v2: Be paranoid and flush the tasklet as well.
> >>> v3: And flush the tasklet before the engines, as the tasklet may
> >>> re-attach an rb_node after our removal from the siblings.
> >>>
> >>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> >>> ---
> >>> drivers/gpu/drm/i915/gt/intel_lrc.c | 61 +++++++++++++++++++++++++----
> >>> 1 file changed, 54 insertions(+), 7 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>> index 17cb7060eb29..c11433884cf6 100644
> >>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>> @@ -182,6 +182,7 @@
> >>> struct virtual_engine {
> >>> struct intel_engine_cs base;
> >>> struct intel_context context;
> >>> + struct rcu_work rcu;
> >>>
> >>> /*
> >>> * We allow only a single request through the virtual engine at a time
> >>> @@ -5470,44 +5471,90 @@ static struct list_head *virtual_queue(struct virtual_engine *ve)
> >>> return &ve->base.execlists.default_priolist.requests[0];
> >>> }
> >>>
> >>> -static void virtual_context_destroy(struct kref *kref)
> >>> +static void rcu_virtual_context_destroy(struct work_struct *wrk)
> >>> {
> >>> struct virtual_engine *ve =
> >>> - container_of(kref, typeof(*ve), context.ref);
> >>> + container_of(wrk, typeof(*ve), rcu.work);
> >>> unsigned int n;
> >>>
> >>> - GEM_BUG_ON(!list_empty(virtual_queue(ve)));
> >>> - GEM_BUG_ON(ve->request);
> >>> GEM_BUG_ON(ve->context.inflight);
> >>>
> >>> + /* Preempt-to-busy may leave a stale request behind. */
> >>> + if (unlikely(ve->request)) {
> >>> + struct i915_request *old;
> >>> +
> >>> + spin_lock_irq(&ve->base.active.lock);
> >>> +
> >>> + old = fetch_and_zero(&ve->request);
> >>> + if (old) {
> >>> + GEM_BUG_ON(!i915_request_completed(old));
> >>> + __i915_request_submit(old);
> >>> + i915_request_put(old);
> >>> + }
> >>> +
> >>> + spin_unlock_irq(&ve->base.active.lock);
> >>> + }
> >>> +
> >>> + /*
> >>> + * Flush the tasklet in case it is still running on another core.
> >>> + *
> >>> + * This needs to be done before we remove ourselves from the siblings'
> >>> + * rbtrees as in the case it is running in parallel, it may reinsert
> >>> + * the rb_node into a sibling.
> >>> + */
> >>> + tasklet_kill(&ve->base.execlists.tasklet);
> >>
> >> Can it still be running after an RCU period?
> >
> > I think there is a window between checking whether the request is
> > completed and kicking the tasklet that is not under the RCU lock,
> > giving an opportunity for the request to be retired and the barrier
> > flushed to drop the context references.
>
> From where would that check come?
The window of opportunity extends all the way from the
i915_request_completed() check during unsubmit until the virtual engine
tasklet is executed. We do not hold a reference to the virtual engine on
behalf of the tasklet, so in the meantime the request may be retired in
the background, and along with it the virtual engine destroyed.
-Chris