[PATCH 2/9] drm/xe: Cancel pending TLB inval workers on teardown

Summers, Stuart stuart.summers at intel.com
Mon Aug 25 18:23:58 UTC 2025


On Mon, 2025-08-25 at 11:20 -0700, Matthew Brost wrote:
> On Mon, Aug 25, 2025 at 12:06:44PM -0600, Summers, Stuart wrote:
> > On Mon, 2025-08-25 at 17:57 +0000, Stuart Summers wrote:
> > > Add a new _fini() routine on the GT TLB invalidation
> > > side to handle this worker cleanup on driver teardown.
> > > 
> > > v2: Move the TLB teardown to the gt fini() routine called during
> > >     gt_init rather than in gt_alloc. This way the GT structure
> > > stays
> > >     alive for while we reset the TLB state.
> > > 
> > > Signed-off-by: Stuart Summers <stuart.summers at intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_gt.c                  |  2 ++
> > >  drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 12 ++++++++++++
> > >  drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h |  1 +
> > >  3 files changed, 15 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_gt.c
> > > b/drivers/gpu/drm/xe/xe_gt.c
> > > index a3397f04abcc..178c4783bbda 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > @@ -603,6 +603,8 @@ static void xe_gt_fini(void *arg)
> > >         struct xe_gt *gt = arg;
> > >         int i;
> > >  
> > > +       xe_gt_tlb_invalidation_fini(gt);
> > > +
> > >         for (i = 0; i < XE_ENGINE_CLASS_MAX; ++i)
> > >                 xe_hw_fence_irq_finish(&gt->fence_irq[i]);
> > >  
> > > diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > index 75854b963d66..db00c5adead9 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > @@ -188,6 +188,18 @@ void xe_gt_tlb_invalidation_reset(struct
> > > xe_gt
> > > *gt)
> > >         mutex_unlock(&gt->tlb_invalidation.seqno_lock);
> > >  }
> > >  
> > > +/**
> > > + *
> > > + * xe_gt_tlb_invalidation_fini - Clean up GT TLB invalidation
> > > state
> > > + *
> > > + * Cancel pending fence workers and clean up any additional
> > > + * GT TLB invalidation state.
> > > + */
> > > +void xe_gt_tlb_invalidation_fini(struct xe_gt *gt)
> > > +{
> > > +       xe_gt_tlb_invalidation_reset(gt);
> > 
> > I've been seeing an issue on fault injection, running in a tight
> > while
> 
> I think fault injection case will be fixed by [1] whenever that
> merges.

Oh excellent. I had seen the series here but hadn't thought to test as
I was worried the injection error was related to mine. I'll check that
out here shortly just to confirm. I do have the basic error injection
change locally too that forces this, so might float that if it does.

> 
> [1] https://patchwork.freedesktop.org/series/152870/
> 
> For general safety though, I think calling tlb_invalidation_reset is
> a
> good idea.
> 
> > loop, where occasionally we see that a couple of sysfs files
> > weren't
> > properly torn down on a prior driver instance, followed by TLB
> > invalidation timeouts. Up until today I was only able to reproduce
> > that
> > with this series, so I wanted to be sure we weren't causing
> > something
> > here, particularly with this _reset() call (one of the reasons I
> > had
> > declined to include this in the original series). Today though,
> > even
> > without the series, I was able to reproduce that behavior (-EEXIST
> > on
> > sysfs create, followed by TLB inval timeout). So I don't think we
> > should block this series on that debug.
> > 
> 
> I agree. The prior CI run LGTM. The failure here [2] should be fixed
> by
> [3] which merged last night.

Great

> 
> [2]
> https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-152022v8/shard-lnl-7/igt@xe_exec_compute_mode@many-execqueues-userptr-rebind.html
> [3] https://patchwork.freedesktop.org/series/153197/ 
> 
> > I see a few things in ci-buglog that could be related, although I
> > don't
> > see any results in those to confirm:
> > https://gfx-ci.igk.intel.com/cibuglog-ng/issue/7175?query_key=a5707a4d3ae2ebb8c04ef6cea0ef747322df4ee1
> > https://gfx-ci.igk.intel.com/cibuglog-ng/issue/10412?query_key=402f2615406c4afa4814a29849b312a0c7b66e9c
> > https://gfx-ci.igk.intel.com/cibuglog-ng/issue/15004?query_key=e0ce601ae69ec76bbdf27293dc2919ba07357de3
> > 
> > Anyway, at least for this series, I think we can ignore that issue.
> > 
> 
> Again agree. I think if this CI run is clean, go ahead and merge.
> 
> With that:
> Reviewed-by: Matthew Brost <matthew.brost at intel.com>

Appreciate the feedback and review Matt!

-Stuart

> 
> > Thanks,
> > Stuart
> > 
> > > +}
> > > +
> > >  static bool tlb_invalidation_seqno_past(struct xe_gt *gt, int
> > > seqno)
> > >  {
> > >         int seqno_recv = READ_ONCE(gt-
> > > >tlb_invalidation.seqno_recv);
> > > diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
> > > b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
> > > index f7f0f2eaf4b5..3e4cff3922d6 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
> > > +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
> > > @@ -16,6 +16,7 @@ struct xe_vm;
> > >  struct xe_vma;
> > >  
> > >  int xe_gt_tlb_invalidation_init_early(struct xe_gt *gt);
> > > +void xe_gt_tlb_invalidation_fini(struct xe_gt *gt);
> > >  
> > >  void xe_gt_tlb_invalidation_reset(struct xe_gt *gt);
> > >  int xe_gt_tlb_invalidation_ggtt(struct xe_gt *gt);
> > 



More information about the Intel-xe mailing list