[PATCH 2/9] drm/xe: Cancel pending TLB inval workers on teardown

Mon Aug 25 18:32:56 UTC 2025

On Mon, 2025-08-25 at 18:23 +0000, Summers, Stuart wrote:
> On Mon, 2025-08-25 at 11:20 -0700, Matthew Brost wrote:
> > On Mon, Aug 25, 2025 at 12:06:44PM -0600, Summers, Stuart wrote:
> > > On Mon, 2025-08-25 at 17:57 +0000, Stuart Summers wrote:
> > > > Add a new _fini() routine on the GT TLB invalidation
> > > > side to handle this worker cleanup on driver teardown.
> > > > 
> > > > v2: Move the TLB teardown to the gt fini() routine called
> > > > during
> > > >     gt_init rather than in gt_alloc. This way the GT structure
> > > > stays
> > > >     alive for while we reset the TLB state.
> > > > 
> > > > Signed-off-by: Stuart Summers <stuart.summers at intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/xe_gt.c                  |  2 ++
> > > >  drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 12 ++++++++++++
> > > >  drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h |  1 +
> > > >  3 files changed, 15 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/xe_gt.c
> > > > b/drivers/gpu/drm/xe/xe_gt.c
> > > > index a3397f04abcc..178c4783bbda 100644
> > > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > > @@ -603,6 +603,8 @@ static void xe_gt_fini(void *arg)
> > > >         struct xe_gt *gt = arg;
> > > >         int i;
> > > >  
> > > > +       xe_gt_tlb_invalidation_fini(gt);
> > > > +
> > > >         for (i = 0; i < XE_ENGINE_CLASS_MAX; ++i)
> > > >                 xe_hw_fence_irq_finish(&gt->fence_irq[i]);
> > > >  
> > > > diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > > b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > > index 75854b963d66..db00c5adead9 100644
> > > > --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > > +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > > @@ -188,6 +188,18 @@ void xe_gt_tlb_invalidation_reset(struct
> > > > xe_gt
> > > > *gt)
> > > >         mutex_unlock(&gt->tlb_invalidation.seqno_lock);
> > > >  }
> > > >  
> > > > +/**
> > > > + *
> > > > + * xe_gt_tlb_invalidation_fini - Clean up GT TLB invalidation
> > > > state
> > > > + *
> > > > + * Cancel pending fence workers and clean up any additional
> > > > + * GT TLB invalidation state.
> > > > + */
> > > > +void xe_gt_tlb_invalidation_fini(struct xe_gt *gt)
> > > > +{
> > > > +       xe_gt_tlb_invalidation_reset(gt);
> > > 
> > > I've been seeing an issue on fault injection, running in a tight
> > > while
> > 
> > I think fault injection case will be fixed by [1] whenever that
> > merges.
> 
> Oh excellent. I had seen the series here but hadn't thought to test
> as
> I was worried the injection error was related to mine. I'll check
> that
> out here shortly just to confirm. I do have the basic error injection
> change locally too that forces this, so might float that if it does.

Yeah no unfortunately it still reproduces. Basically this with the
equivalent function injection point in the driver:
xe_fault_injection --r inject-fault-probe-function-xe_device_sysfs_init

I'll keep poking around to see what I can come up with there.

Thanks,
Stuart

> 
> > 
> > [1] https://patchwork.freedesktop.org/series/152870/
> > 
> > For general safety though, I think calling tlb_invalidation_reset
> > is
> > a
> > good idea.
> > 
> > > loop, where occasionally we see that a couple of sysfs files
> > > weren't
> > > properly torn down on a prior driver instance, followed by TLB
> > > invalidation timeouts. Up until today I was only able to
> > > reproduce
> > > that
> > > with this series, so I wanted to be sure we weren't causing
> > > something
> > > here, particularly with this _reset() call (one of the reasons I
> > > had
> > > declined to include this in the original series). Today though,
> > > even
> > > without the series, I was able to reproduce that behavior (-
> > > EEXIST
> > > on
> > > sysfs create, followed by TLB inval timeout). So I don't think we
> > > should block this series on that debug.
> > > 
> > 
> > I agree. The prior CI run LGTM. The failure here [2] should be
> > fixed
> > by
> > [3] which merged last night.
> 
> Great
> 
> > 
> > [2]
> > https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-152022v8/shard-lnl-7/igt@xe_exec_compute_mode@many-execqueues-userptr-rebind.html
> > [3] https://patchwork.freedesktop.org/series/153197/ 
> > 
> > > I see a few things in ci-buglog that could be related, although I
> > > don't
> > > see any results in those to confirm:
> > > https://gfx-ci.igk.intel.com/cibuglog-ng/issue/7175?query_key=a5707a4d3ae2ebb8c04ef6cea0ef747322df4ee1
> > > https://gfx-ci.igk.intel.com/cibuglog-ng/issue/10412?query_key=402f2615406c4afa4814a29849b312a0c7b66e9c
> > > https://gfx-ci.igk.intel.com/cibuglog-ng/issue/15004?query_key=e0ce601ae69ec76bbdf27293dc2919ba07357de3
> > > 
> > > Anyway, at least for this series, I think we can ignore that
> > > issue.
> > > 
> > 
> > Again agree. I think if this CI run is clean, go ahead and merge.
> > 
> > With that:
> > Reviewed-by: Matthew Brost <matthew.brost at intel.com>
> 
> Appreciate the feedback and review Matt!
> 
> -Stuart
> 
> > 
> > > Thanks,
> > > Stuart
> > > 
> > > > +}
> > > > +
> > > >  static bool tlb_invalidation_seqno_past(struct xe_gt *gt, int
> > > > seqno)
> > > >  {
> > > >         int seqno_recv = READ_ONCE(gt-
> > > > > tlb_invalidation.seqno_recv);
> > > > diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
> > > > b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
> > > > index f7f0f2eaf4b5..3e4cff3922d6 100644
> > > > --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
> > > > +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
> > > > @@ -16,6 +16,7 @@ struct xe_vm;
> > > >  struct xe_vma;
> > > >  
> > > >  int xe_gt_tlb_invalidation_init_early(struct xe_gt *gt);
> > > > +void xe_gt_tlb_invalidation_fini(struct xe_gt *gt);
> > > >  
> > > >  void xe_gt_tlb_invalidation_reset(struct xe_gt *gt);
> > > >  int xe_gt_tlb_invalidation_ggtt(struct xe_gt *gt);
> > > 
>