[PATCH 2/9] drm/xe: Cancel pending TLB inval workers on teardown
Summers, Stuart
stuart.summers at intel.com
Mon Aug 25 18:23:58 UTC 2025
On Mon, 2025-08-25 at 11:20 -0700, Matthew Brost wrote:
> On Mon, Aug 25, 2025 at 12:06:44PM -0600, Summers, Stuart wrote:
> > On Mon, 2025-08-25 at 17:57 +0000, Stuart Summers wrote:
> > > Add a new _fini() routine on the GT TLB invalidation
> > > side to handle this worker cleanup on driver teardown.
> > >
> > > v2: Move the TLB teardown to the gt fini() routine called during
> > > gt_init rather than in gt_alloc. This way the GT structure
> > > stays
> > > alive for while we reset the TLB state.
> > >
> > > Signed-off-by: Stuart Summers <stuart.summers at intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_gt.c | 2 ++
> > > drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 12 ++++++++++++
> > > drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h | 1 +
> > > 3 files changed, 15 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_gt.c
> > > b/drivers/gpu/drm/xe/xe_gt.c
> > > index a3397f04abcc..178c4783bbda 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > @@ -603,6 +603,8 @@ static void xe_gt_fini(void *arg)
> > > struct xe_gt *gt = arg;
> > > int i;
> > >
> > > + xe_gt_tlb_invalidation_fini(gt);
> > > +
> > > for (i = 0; i < XE_ENGINE_CLASS_MAX; ++i)
> > > xe_hw_fence_irq_finish(>->fence_irq[i]);
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > index 75854b963d66..db00c5adead9 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> > > @@ -188,6 +188,18 @@ void xe_gt_tlb_invalidation_reset(struct
> > > xe_gt
> > > *gt)
> > > mutex_unlock(>->tlb_invalidation.seqno_lock);
> > > }
> > >
> > > +/**
> > > + *
> > > + * xe_gt_tlb_invalidation_fini - Clean up GT TLB invalidation
> > > state
> > > + *
> > > + * Cancel pending fence workers and clean up any additional
> > > + * GT TLB invalidation state.
> > > + */
> > > +void xe_gt_tlb_invalidation_fini(struct xe_gt *gt)
> > > +{
> > > + xe_gt_tlb_invalidation_reset(gt);
> >
> > I've been seeing an issue on fault injection, running in a tight
> > while
>
> I think fault injection case will be fixed by [1] whenever that
> merges.
Oh excellent. I had seen the series here but hadn't thought to test as
I was worried the injection error was related to mine. I'll check that
out here shortly just to confirm. I do have the basic error injection
change locally too that forces this, so might float that if it does.
>
> [1] https://patchwork.freedesktop.org/series/152870/
>
> For general safety though, I think calling tlb_invalidation_reset is
> a
> good idea.
>
> > loop, where occasionally we see that a couple of sysfs files
> > weren't
> > properly torn down on a prior driver instance, followed by TLB
> > invalidation timeouts. Up until today I was only able to reproduce
> > that
> > with this series, so I wanted to be sure we weren't causing
> > something
> > here, particularly with this _reset() call (one of the reasons I
> > had
> > declined to include this in the original series). Today though,
> > even
> > without the series, I was able to reproduce that behavior (-EEXIST
> > on
> > sysfs create, followed by TLB inval timeout). So I don't think we
> > should block this series on that debug.
> >
>
> I agree. The prior CI run LGTM. The failure here [2] should be fixed
> by
> [3] which merged last night.
Great
>
> [2]
> https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-152022v8/shard-lnl-7/igt@xe_exec_compute_mode@many-execqueues-userptr-rebind.html
> [3] https://patchwork.freedesktop.org/series/153197/
>
> > I see a few things in ci-buglog that could be related, although I
> > don't
> > see any results in those to confirm:
> > https://gfx-ci.igk.intel.com/cibuglog-ng/issue/7175?query_key=a5707a4d3ae2ebb8c04ef6cea0ef747322df4ee1
> > https://gfx-ci.igk.intel.com/cibuglog-ng/issue/10412?query_key=402f2615406c4afa4814a29849b312a0c7b66e9c
> > https://gfx-ci.igk.intel.com/cibuglog-ng/issue/15004?query_key=e0ce601ae69ec76bbdf27293dc2919ba07357de3
> >
> > Anyway, at least for this series, I think we can ignore that issue.
> >
>
> Again agree. I think if this CI run is clean, go ahead and merge.
>
> With that:
> Reviewed-by: Matthew Brost <matthew.brost at intel.com>
Appreciate the feedback and review Matt!
-Stuart
>
> > Thanks,
> > Stuart
> >
> > > +}
> > > +
> > > static bool tlb_invalidation_seqno_past(struct xe_gt *gt, int
> > > seqno)
> > > {
> > > int seqno_recv = READ_ONCE(gt-
> > > >tlb_invalidation.seqno_recv);
> > > diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
> > > b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
> > > index f7f0f2eaf4b5..3e4cff3922d6 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
> > > +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
> > > @@ -16,6 +16,7 @@ struct xe_vm;
> > > struct xe_vma;
> > >
> > > int xe_gt_tlb_invalidation_init_early(struct xe_gt *gt);
> > > +void xe_gt_tlb_invalidation_fini(struct xe_gt *gt);
> > >
> > > void xe_gt_tlb_invalidation_reset(struct xe_gt *gt);
> > > int xe_gt_tlb_invalidation_ggtt(struct xe_gt *gt);
> >
More information about the Intel-xe
mailing list