[PATCH 2/9] drm/xe: Stop the TLB fence timer on driver teardown
Summers, Stuart
stuart.summers at intel.com
Fri Aug 15 20:59:37 UTC 2025
On Fri, 2025-08-15 at 13:48 -0700, Matthew Brost wrote:
> On Fri, Aug 15, 2025 at 02:33:43PM -0600, Summers, Stuart wrote:
> > On Wed, 2025-08-13 at 14:24 -0700, Matthew Brost wrote:
> > > On Wed, Aug 13, 2025 at 02:19:20PM -0700, Matthew Brost wrote:
> > > > On Wed, Aug 13, 2025 at 07:47:59PM +0000, stuartsummers wrote:
> > > > > There is a small window of time during driver teardown where
> > > > > the TLB invalidation fence->gt is set to NULL but an
> > > > > invalidation worker is still alive. This won't cause an issue
> > > > > currently because the tlb_invalidation structure itself is
> > > > > still present. However, as part of a future refactor, we are
> > > > > looking to split that structure out. To be safe, go ahead and
> > > > > cancel any outstanding TLB invalidation fence worker thread on
> > > > > _fini(), since at that time we are already tearing down the
> > > > > driver and don't need to continue monitoring this.
> > > > >
> > > > > Signed-off-by: stuartsummers <stuart.summers at intel.com>
> > > > > ---
> > > > > drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 2 ++
> > > > > 1 file changed, 2 insertions(+)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > > > b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > > > index 75854b963d66..08e882433b13 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > > > @@ -44,6 +44,8 @@ static void xe_gt_tlb_invalidation_fence_fini(struct xe_gt_tlb_invalidation_fenc
> > > > >  	if (WARN_ON_ONCE(!fence->gt))
> > > > >  		return;
> > > > >  
> > > > > +	cancel_delayed_work(&fence->gt->tlb_invalidation.fence_tdr);
> > > >
> > > > I think cancel_delayed_work_sync is what we want here.
> > > >
> > > > Matt
> > >
> > > Wait, I was thinking this was the fini handler called from drmm.
> > >
> > > Would it work to add a drmm handler for driver teardown that
> > > calls cancel_delayed_work_sync and signals any pending TLB
> > > invalidation fences?
> >
> > Oops, I'm just realizing I had only replied to you the other day
> > instead of adding Farah and the intel-xe list...
> >
> > Anyway, yes, I agree with the drmm piece. I'll have something
> > updated here shortly, just finishing up a little testing.
> >
> > For the sync vs no sync, I did some testing and it seems like on
> > teardown we're in a race between doing the work cancel here and the
> > actual teardown of the worker structure. So calling sync has a
> > chance of hitting this WARN_ON:
> >
> > static bool __flush_work(struct work_struct *work, bool from_cancel)
> > {
> > 	struct wq_barrier barr;
> >
> > 	if (WARN_ON(!wq_online))
> > 		return false;
> >
> > 	if (WARN_ON(!work->func))
> > 		return false;
> > Also, at this time we don't know whether the GuC is fully torn down
> > and
>
> In case you are not aware of this series [1]: we should be tearing
> down the GuC CT very early in driver unload, which has a side effect
> of immediately signaling any TLB invalidation fences. I need to fully
> review this one, but my initial thought is this is correct.

Oh great, I think I had missed this one. I'll definitely take a look.
>
> If the drmm or devm handler here is run after the one above which
> cancels the worker + signals any outstanding TLB fences, I think
> we'd be good.

Yeah, that's what I was thinking, and it seems to be what's happening
in testing.
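
To make the drmm idea concrete, here's a rough sketch of what I have
in mind — the action name and the exact registration point are
placeholders on my end, not final code from this series:

```c
/* Rough sketch only: xe_gt_tlb_fence_fini() and where it gets
 * registered are illustrative placeholders, not the actual patch. */
static void xe_gt_tlb_fence_fini(struct drm_device *drm, void *arg)
{
	struct xe_gt *gt = arg;

	/* Stop the timeout worker. By the time drmm actions run, the
	 * early GuC CT teardown from [1] should already have signaled
	 * any outstanding TLB invalidation fences. */
	cancel_delayed_work_sync(&gt->tlb_invalidation.fence_tdr);
}

/* Registered during GT init so it runs on driver teardown: */
err = drmm_add_action_or_reset(&gt_to_xe(gt)->drm,
			       xe_gt_tlb_fence_fini, gt);
```

The ordering relative to the CT teardown action would be the thing to
double-check here, per your point above.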
>
> [1] https://patchwork.freedesktop.org/series/152870/
>
> > at least I've been able to observe in testing that active
> > invalidation requests to the GuC can still come back around the
> > time of this cancel delayed work call. Meaning if we try doing an
> > early signal here, we can run into issues where the transaction is
> > already signaled, causing hangs on driver unload. So IMO we should
> > just let the normal GuC teardown and TLB invalidation signaling
> > happen. The only thing missing is to cancel any ongoing timeout
> > handling, basically. So this cancel work call is just making sure
> > we can still exit if the CT communication is held up (like in the
> > error injection case here).
> >
> > I'll have a little inline documentation in the upcoming code
> > update to describe that.
>
> Yes, some documentation would be good around the teardown sequence,
> timeout tracking, and what happens if the backend is disabled. Feel
> free to take a pass at this, do it in a follow-up, or ping me to
> help write this up.
Yeah, this current iteration doesn't do anything in the execlist case,
which I was hoping to add right after. I'll let you know how that
progresses.

As for more detailed teardown documentation, that would definitely be
interesting here. I wasn't planning on doing anything extensive in
this series, but I can look at it as a follow-up.
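
For anyone following along, the distinction between the two cancel
variants that this all hinges on, as I understand it (a sketch of the
semantics, not the actual patch hunk):

```c
/* Non-blocking: just dequeues the pending work if it hasn't started
 * running. Safe late in teardown even if workqueues are already
 * offline. This is what the patch uses. */
cancel_delayed_work(&gt->tlb_invalidation.fence_tdr);

/* Blocking: additionally waits for a running handler via
 * __flush_work(), which WARNs and bails if wq_online is false --
 * the race I was hitting on unload in testing. */
cancel_delayed_work_sync(&gt->tlb_invalidation.fence_tdr);
```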
Thanks,
Stuart
>
> Matt
>
> >
> > Thanks,
> > Stuart
> >
> > >
> > > Matt
> > >
> > > >
> > > > > +
> > > > >  	xe_pm_runtime_put(gt_to_xe(fence->gt));
> > > > >  	fence->gt = NULL; /* fini() should be called once */
> > > > >  }
> > > > > --
> > > > > 2.34.1
> > > > >
> >