[PATCH 2/9] drm/xe: Stop the TLB fence timer on driver teardown

Fri Aug 15 20:33:43 UTC 2025

On Wed, 2025-08-13 at 14:24 -0700, Matthew Brost wrote:
> On Wed, Aug 13, 2025 at 02:19:20PM -0700, Matthew Brost wrote:
> > On Wed, Aug 13, 2025 at 07:47:59PM +0000, stuartsummers wrote:
> > > There is a small window of time during the driver teardown where
> > > the TLB invalidation fence->gt is set to NULL but an invalidation
> > > worker is still alive. This won't cause an issue currently
> > > because
> > > the tlb_invalidation structure itself is still present. However
> > > as
> > > part of a future refactor, we are looking to split that structure
> > > out. To be safe, go ahead and cancel any outstanding TLB
> > > invalidation
> > > fence worker thread on _fini() since at that time we are already
> > > tearing down the driverworker thread on _fini() since at that
> > > time we
> > > are already tearing down the driver and don't need to continue
> > > monitoring this.
> > > 
> > > Signed-off-by: stuartsummers <stuart.summers at intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > index 75854b963d66..08e882433b13 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > @@ -44,6 +44,8 @@ static void
> > > xe_gt_tlb_invalidation_fence_fini(struct
> > > xe_gt_tlb_invalidation_fenc
> > >         if (WARN_ON_ONCE(!fence->gt))
> > >                 return;
> > >  
> > > +       cancel_delayed_work(&fence->gt-
> > > >tlb_invalidation.fence_tdr);
> > 
> > I think cancel_delayed_work_sync is what we want here.
> > 
> > Matt
> 
> Wait, I was thinking this was the fini handler called from drmm.
> 
> Would it work add drmm handler for driver teardown that
> calls cancel_delayed_work_sync and signals any pending TLB
> invalidation
> fences?

Oops, I'm just realizing I had only replied to you the other day
instead of adding Farah and the intel-xe list...

Anyway, yes, I agree with the drmm piece. I'll have something updated
here shortly, just finishing up a little testing.

For the sync vs no sync, I did some testing and it seems like on
teardown, we're in a race for doing the work cancel here and the actual
teardown of the worker structure. So calling sync has a chance of
hitting this WARN_ON:
static bool __flush_work(struct work_struct *work, bool from_cancel)
{
        struct wq_barrier barr;

        if (WARN_ON(!wq_online))
                return false;

        if (WARN_ON(!work->func))
                return false;

Also, at this time we don't know whether the GuC is fully torn down and
at least I've been able to observe in testing that active invalidation
requests to GuC can still come back around the time of this cancel
delayed work call. Meaning if we try doing an early signal here, we can
run into issues where the transaction is already signaled causing hangs
on driver unload. So IMO we should just let the normal GuC teardown and
TLB invalidation signaling happen. The only thing missing is to cancel
any ongoing timeout handling, basically. So this cancel work call is
just making sure if we exit while the CT communication is held up (like
on the error injection case here).

I'll have a little inline documentation in the upcoming code update to
describe that.

Thanks,
Stuart

> 
> Matt
> 
> > 
> > > +
> > >         xe_pm_runtime_put(gt_to_xe(fence->gt));
> > >         fence->gt = NULL; /* fini() should be called once */
> > >  }
> > > -- 
> > > 2.34.1
> > >