[PATCH 2/9] drm/xe: Cancel pending TLB inval workers on teardown

Summers, Stuart stuart.summers at intel.com
Fri Aug 22 14:57:37 UTC 2025


On Thu, 2025-08-21 at 19:21 -0700, Matthew Brost wrote:
> On Thu, Aug 21, 2025 at 05:34:53PM -0600, Summers, Stuart wrote:
> > On Thu, 2025-08-21 at 15:13 -0700, Matthew Brost wrote:
> > > On Wed, Aug 20, 2025 at 11:30:50PM +0000, Stuart Summers wrote:
> > > > There is a small window of time during the driver teardown
> > > > where
> > > > the TLB invalidation fence->gt is set to NULL but an
> > > > invalidation
> > > > worker is still alive. This won't cause an issue currently
> > > > because
> > > > the tlb_invalidation structure itself is still present. However
> > > > as
> > > > part of a future refactor, we are looking to split that
> > > > structure
> > > > out. To be safe, go ahead and cancel any outstanding TLB
> > > > invalidation
> > > > fence worker thread on _fini() since at that time we are
> > > > already
> > > > tearing down the driverworker thread on _fini() since at that
> > > > time
> > > > we
> > > > are already tearing down the driver and don't need to continue
> > > > monitoring this.
> > > > 
> > > > Add a new _fini() routine on the GT TLB invalidation
> > > > side to handle this worker cleanup on driver teardown.
> > > > 
> > > > Signed-off-by: Stuart Summers <stuart.summers at intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/xe_gt.c                  |  2 ++
> > > >  drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 21
> > > > +++++++++++++++++++++
> > > >  drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h |  1 +
> > > >  3 files changed, 24 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/xe_gt.c
> > > > b/drivers/gpu/drm/xe/xe_gt.c
> > > > index a3397f04abcc..bbf3f4da7f54 100644
> > > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > > @@ -69,6 +69,8 @@ static void gt_fini(struct drm_device *drm,
> > > > void
> > > > *arg)
> > > >  {
> > > >         struct xe_gt *gt = arg;
> > > >  
> > > > +       xe_gt_tlb_invalidation_fini(gt);
> > > > +
> > > >         destroy_workqueue(gt->ordered_wq);
> > > >  }
> > > >  
> > > > diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > > b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > > index 75854b963d66..170f6c97268f 100644
> > > > --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > > +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > > @@ -188,6 +188,27 @@ void xe_gt_tlb_invalidation_reset(struct
> > > > xe_gt
> > > > *gt)
> > > >         mutex_unlock(&gt->tlb_invalidation.seqno_lock);
> > > >  }
> > > >  
> > > > +/**
> > > > + *
> > > > + * xe_gt_tlb_invalidation_fini - Clean up GT TLB invalidation
> > > > state
> > > > + *
> > > > + * Cancel pending fence workers and clean up any additional
> > > > + * GT TLB invalidation state.
> > > > + */
> > > > +void xe_gt_tlb_invalidation_fini(struct xe_gt *gt)
> > > > +{
> > > > +       /*
> > > > +        * At this point, we expect any outstanding TLB
> > > > transactions
> > > > +        * to still signal through the normal channels of
> > > > interrupts
> > > > +        * or MMIO polling updates, etc. There could be some
> > > > workers
> > > > +        * triggered as a result of timeouts in those channels,
> > > > though,
> > > > +        * that we want to make sure get cleaned up during an
> > > > unbind.
> > > > +        * So here, cancel the delayed work, but let the normal
> > > > fence
> > > > +        * signaling handle the rest.
> > > > +        */
> > > 
> > > You should assert this is true then, e.g., the pending list is
> > > empty.
> > 
> > The problem is this seems a little racey. I don't think we can
> 
> Hmm, I wouldn't think racing here would be possible. Trying to reason
> how that would be possible and can't think of a scenario.

Alright let me work on this and get back.. I have tried this and am
seeing system hangs without a panic (at least I haven't been able to
get the panic over serial or in the syslog, etc). I also observed the
list slowly draining (via printk) over the course of the teardown
naturally without the explicit drain here.

But I agree it would be nicer to have that here... I'll report back
once I have this working.

Thanks,
Stuart

> 
> > guarantee that the responses are back from GuC by the time we hit
> > this
> > _fini() routine.
> > 
> 
> If this truly is racy, we cannot risk fences not signaling as that
> could
> lead to very bad things (e.g., reclaim waiting a on fence which never
> signals). I think we will need to walk all pending fences and signal
> these.
> 
> > > 
> > > I haven't checked but later in the series when we move this to an
> > > embedded struct, is the fini owned by tlb_inval layer (i.e., it
> > > has
> > > its
> > > own drmm or devm fini handler)? That would be a better placement
> > > IMO.
> > 
> > Oh sorry I should have been more clear in the cover letter. Yes I
> > am
> > doing that, but I thought it best to fix this independently first.
> > At
> > the time this series starts, we don't have a good place to inject a
> > new
> > drmm fini handler. I modify this path later in this series to do
> > exactly that (remove the call from gt_fini() and add it to a new
> > tlb_fini() that gets called via the drm layer).
> > 
> 
> +1
> 
> Matt
> 
> > Thanks,
> > Stuart
> > 
> > > 
> > > Matt
> > > 
> > > > +       cancel_delayed_work(&gt->tlb_invalidation.fence_tdr);
> > > > +}
> > > > +
> > > >  static bool tlb_invalidation_seqno_past(struct xe_gt *gt, int
> > > > seqno)
> > > >  {
> > > >         int seqno_recv = READ_ONCE(gt-
> > > > > tlb_invalidation.seqno_recv);
> > > > diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
> > > > b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
> > > > index f7f0f2eaf4b5..3e4cff3922d6 100644
> > > > --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
> > > > +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
> > > > @@ -16,6 +16,7 @@ struct xe_vm;
> > > >  struct xe_vma;
> > > >  
> > > >  int xe_gt_tlb_invalidation_init_early(struct xe_gt *gt);
> > > > +void xe_gt_tlb_invalidation_fini(struct xe_gt *gt);
> > > >  
> > > >  void xe_gt_tlb_invalidation_reset(struct xe_gt *gt);
> > > >  int xe_gt_tlb_invalidation_ggtt(struct xe_gt *gt);
> > > > -- 
> > > > 2.34.1
> > > > 
> > 



More information about the Intel-xe mailing list