[PATCH 2/9] drm/xe: Stop the TLB fence timer on driver teardown

Matthew Brost matthew.brost at intel.com
Fri Aug 15 20:48:40 UTC 2025


On Fri, Aug 15, 2025 at 02:33:43PM -0600, Summers, Stuart wrote:
> On Wed, 2025-08-13 at 14:24 -0700, Matthew Brost wrote:
> > On Wed, Aug 13, 2025 at 02:19:20PM -0700, Matthew Brost wrote:
> > > On Wed, Aug 13, 2025 at 07:47:59PM +0000, stuartsummers wrote:
> > > > There is a small window of time during driver teardown where
> > > > the TLB invalidation fence->gt is set to NULL but an invalidation
> > > > worker is still alive. This won't cause an issue currently because
> > > > the tlb_invalidation structure itself is still present. However, as
> > > > part of a future refactor, we are looking to split that structure
> > > > out. To be safe, cancel any outstanding TLB invalidation fence
> > > > worker in _fini(), since at that point we are already tearing down
> > > > the driver and don't need to continue monitoring this.
> > > > 
> > > > Signed-off-by: stuartsummers <stuart.summers at intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 2 ++
> > > >  1 file changed, 2 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > > b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > > index 75854b963d66..08e882433b13 100644
> > > > --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > > +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > > @@ -44,6 +44,8 @@ static void
> > > > xe_gt_tlb_invalidation_fence_fini(struct
> > > > xe_gt_tlb_invalidation_fenc
> > > >         if (WARN_ON_ONCE(!fence->gt))
> > > >                 return;
> > > >  
> > > > +       cancel_delayed_work(&fence->gt->tlb_invalidation.fence_tdr);
> > > 
> > > I think cancel_delayed_work_sync is what we want here.
> > > 
> > > Matt
> > 
> > Wait, I was thinking this was the fini handler called from drmm.
> > 
> > Would it work to add a drmm handler for driver teardown that
> > calls cancel_delayed_work_sync() and signals any pending TLB
> > invalidation fences?
> 
> Oops, I'm just realizing I had only replied to you the other day
> instead of adding Farah and the intel-xe list...
> 
> Anyway, yes, I agree with the drmm piece. I'll have something updated
> here shortly, just finishing up a little testing.
> 
> For the sync vs no sync question, I did some testing and it seems like
> on teardown we're racing between the work cancel here and the actual
> teardown of the worker structure. So calling the sync variant has a
> chance of hitting this WARN_ON in __flush_work():
> static bool __flush_work(struct work_struct *work, bool from_cancel)
> {
>         struct wq_barrier barr;
> 
>         if (WARN_ON(!wq_online))
>                 return false;
> 
>         if (WARN_ON(!work->func))
>                 return false;
> 
> Also, at this time we don't know whether the GuC is fully torn down and

In case you aren't aware of this series [1]: it tears down the GuC
CT very early in driver unload, which has the side effect of immediately
signaling any TLB invalidation fences. I need to fully review this one,
but my initial thought is that it is correct.

If the drmm or devm handler here runs after the one above, which cancels
the worker and signals any outstanding TLB fences, I think we'd be good.

[1] https://patchwork.freedesktop.org/series/152870/
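
For concreteness, a rough sketch of the kind of drmm action being
discussed. All field and helper names below (tlb_invalidation.fence_tdr,
pending_fences, pending_lock, invalidation_fence_signal()) are
assumptions based on the current xe layout, not the final implementation:

```c
/*
 * Sketch only: a drmm-managed teardown action that stops timeout
 * tracking and signals any fences the GuC will never answer. Field
 * and helper names are assumptions, not the committed code.
 */
static void tlb_invalidation_fini(struct drm_device *drm, void *arg)
{
	struct xe_gt *gt = arg;
	struct xe_gt_tlb_invalidation_fence *fence, *next;

	/* The sync cancel should be safe here because drmm actions run
	 * before the workqueues themselves are torn down. */
	cancel_delayed_work_sync(&gt->tlb_invalidation.fence_tdr);

	/* Signal anything still outstanding so waiters don't hang. */
	spin_lock_irq(&gt->tlb_invalidation.pending_lock);
	list_for_each_entry_safe(fence, next,
				 &gt->tlb_invalidation.pending_fences, link)
		invalidation_fence_signal(gt_to_xe(gt), fence);
	spin_unlock_irq(&gt->tlb_invalidation.pending_lock);
}
```

which would be registered from the GT TLB invalidation init path with
something like drmm_add_action_or_reset(&gt_to_xe(gt)->drm,
tlb_invalidation_fini, gt), so it runs in reverse-registration order
relative to the CT teardown action.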

> at least I've been able to observe in testing that active invalidation
> requests to the GuC can still come back around the time of this cancel
> delayed work call. That means if we try to signal early here, we can
> run into issues where the transaction is already signaled, causing hangs
> on driver unload. So IMO we should just let the normal GuC teardown and
> TLB invalidation signaling happen. Basically, the only thing missing is
> canceling any ongoing timeout handling. This cancel work call just makes
> sure we can exit cleanly if the CT communication is held up (as in the
> error injection case here).
> 
> I'll have a little inline documentation in the upcoming code update to
> describe that.

Yes, some documentation would be good around the teardown sequence,
timeout tracking, and what happens when the backend is disabled. Feel
free to take a pass at this, do it in a follow-up, or ping me to help
write it up.

Matt

> 
> Thanks,
> Stuart
> 
> > 
> > Matt
> > 
> > > 
> > > > +
> > > >         xe_pm_runtime_put(gt_to_xe(fence->gt));
> > > >         fence->gt = NULL; /* fini() should be called once */
> > > >  }
> > > > -- 
> > > > 2.34.1
> > > > 
> 