[PATCH] lib/xe/xe_spin: Add additional MI_ARB_CHECK to preemptable spin init
Cavitt, Jonathan
jonathan.cavitt at intel.com
Fri Aug 15 16:30:04 UTC 2025
-----Original Message-----
From: Brost, Matthew <matthew.brost at intel.com>
Sent: Thursday, August 14, 2025 1:19 PM
To: Cavitt, Jonathan <jonathan.cavitt at intel.com>
Cc: igt-dev at lists.freedesktop.org; Gupta, saurabhg <saurabhg.gupta at intel.com>; Zuo, Alex <alex.zuo at intel.com>; Kempczynski, Zbigniew <zbigniew.kempczynski at intel.com>; Vishwanathapura, Niranjana <niranjana.vishwanathapura at intel.com>
Subject: Re: [PATCH] lib/xe/xe_spin: Add additional MI_ARB_CHECK to preemptable spin init
>
> On Thu, Aug 14, 2025 at 01:43:01PM -0600, Cavitt, Jonathan wrote:
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost at intel.com>
> > Sent: Thursday, August 14, 2025 12:24 PM
> > To: Cavitt, Jonathan <jonathan.cavitt at intel.com>
> > Cc: igt-dev at lists.freedesktop.org; Gupta, saurabhg <saurabhg.gupta at intel.com>; Zuo, Alex <alex.zuo at intel.com>; Kempczynski, Zbigniew <zbigniew.kempczynski at intel.com>; Vishwanathapura, Niranjana <niranjana.vishwanathapura at intel.com>
> > Subject: Re: [PATCH] lib/xe/xe_spin: Add additional MI_ARB_CHECK to preemptable spin init
> > >
> > > On Thu, Aug 14, 2025 at 01:11:39PM -0600, Cavitt, Jonathan wrote:
> > > > -----Original Message-----
> > > > From: Brost, Matthew <matthew.brost at intel.com>
> > > > Sent: Thursday, August 14, 2025 12:05 PM
> > > > To: Cavitt, Jonathan <jonathan.cavitt at intel.com>
> > > > Cc: igt-dev at lists.freedesktop.org; Gupta, saurabhg <saurabhg.gupta at intel.com>; Zuo, Alex <alex.zuo at intel.com>; Kempczynski, Zbigniew <zbigniew.kempczynski at intel.com>; Vishwanathapura, Niranjana <niranjana.vishwanathapura at intel.com>
> > > > Subject: Re: [PATCH] lib/xe/xe_spin: Add additional MI_ARB_CHECK to preemptable spin init
> > > > >
> > > > > On Thu, Aug 14, 2025 at 06:50:10PM +0000, Jonathan Cavitt wrote:
> > > > > > Add an additional MI_ARB_CHECK to the spin initialization function,
> > > > > > just before the conditional batch buffer end, when preemption is
> > > > > > enabled. The MI_ARB_CHECK forces a preemption in this case, which can
> > > > > > allow additionally scheduled exec queues to make independent progress
> > > > > > earlier than if we relied on the automatic timeout of the exec queue.
> > > > > >
> > > > > > Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5783
> > > > > > Suggested-by: Zbigniew Kempczyński <zbigniew.kempczynski at intel.com>
> > > > > > Signed-off-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
> > > > > > Cc: Matthew Brost <matthew.brost at intel.com>
> > > > > > Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura at intel.com>
> > > > > > ---
> > > > > > lib/xe/xe_spin.c | 4 +++-
> > > > > > 1 file changed, 3 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/lib/xe/xe_spin.c b/lib/xe/xe_spin.c
> > > > > > index 25f7555e27..b3a937fe80 100644
> > > > > > --- a/lib/xe/xe_spin.c
> > > > > > +++ b/lib/xe/xe_spin.c
> > > > > > @@ -30,7 +30,6 @@
> > > > > >
> > > > > > enum { START_TS, NOW_TS };
> > > > > >
> > > > > > -
> > > > > > uint32_t xe_spin_nsec_to_ticks(int fd, int gt_id, uint64_t nsec)
> > > > > > {
> > > > > > uint32_t ticks = xe_nsec_to_ticks(fd, gt_id, nsec);
> > > > > > @@ -167,6 +166,9 @@ void xe_spin_init(struct xe_spin *spin, struct xe_spin_opts *opts)
> > > > > > opts->mem_copy->dst->mocs_index;
> > > > > > }
> > > > > >
> > > > > > + if (opts->preempt)
> > > > > > + spin->batch[b++] = MI_ARB_CHECK;
> > > > >
> > > > > But we already have one these at the top of the loop:
> > > > >
> > > > > 72 loop_addr = opts->addr + b * sizeof(uint32_t);
> > > > > 73
> > > > > 74 spin->batch[b++] = MI_STORE_DWORD_IMM_GEN4;
> > > > > 75 spin->batch[b++] = start_addr;
> > > > > 76 spin->batch[b++] = start_addr >> 32;
> > > > > 77 spin->batch[b++] = 0xc0ffee;
> > > > > 78
> > > > > 79 if (opts->preempt)
> > > > > 80 spin->batch[b++] = MI_ARB_CHECK;
I performed the experiment that you requested, and it seems adding an MI_NOOP
after this MI_ARB_CHECK is sufficient to prevent the observed issue. Any thoughts
on why this might be the case?
-Jonathan Cavitt
> > > > >
> > > > > Line 74 is here the ring loop restarts.
> > > >
> > > > I tried removing the top MI_ARB_CHECK, but that caused the test to start
> > > > failing again, so it looks like both MI_ARB_CHECK calls are necessary.
> > > > -Jonathan Cavitt
> > >
> > > It shouldn’t be, though. A couple of instructions later we should jump
> > > to loop_addr (that is line 73 ring instruction) and then execute the
> > > MI_ARB_CHECK programmed in the ring at line 80.
> > >
> > > I’m all for fixing the tests, but I want to understand what’s going on
> > > here.
> > >
> > > Can you reliably reproduce this issue and confirm that it fails before
> > > this change and works after?
> >
> > Yes.
> >
> > Before the change, the issue occurs with mixed frequency (about 10%), and many
> > runs take around a half-second to complete. After the change, the test consistently
> > passes and takes around 10 ms to complete.
> >
> > The primary cause of the issue seems to be an engine reset that occurs due to
> > an exec-queue hang. The additional MI_ARB_CHECK seems to force the second
>
> An engine reset occurs if the GuC issues a preemption and the engine
> doesn’t respond within ~0.6 seconds (the default preemption timeout).
>
> > scheduled exec-queue to make forward progress, which unblocks the first and
> > prevents the hang.
> >
> > Perhaps the "real" issue that the MI_BATCH_BUFFER_START loopback causes the
> > execution call stack to fill up, resulting in the system having to clear up multiple
> > layers of execution once the second exec queue completes? I don't think that's
> > particularly solvable, though.
>
> I don’t think that’s the cause. To me, this points to a PTL hardware
> issue—this shouldn’t be necessary, as I’ve noted and works on all
> platforms AFAIK.
>
> Maybe try some other experiments too—for example, move the single
> preemption point to various places in the ring, or insert some NOPs into
> the ring (in case there’s an alignment issue with instruction
> programming, etc.).
>
> We should bring in SV to root-cause any hardware problems if believe
> this the issue.
>
> Matt
>
> > -Jonathan Cavitt
> >
> > >
> > > Matt
> > >
> > > >
> > > > >
> > > > > Matt
> > > > >
> > > > > > +
> > > > > > spin->batch[b++] = MI_COND_BATCH_BUFFER_END | MI_DO_COMPARE | 2;
> > > > > > spin->batch[b++] = 0;
> > > > > > spin->batch[b++] = end_addr;
> > > > > > --
> > > > > > 2.43.0
> > > > > >
> > > > >
> > >
>
More information about the igt-dev
mailing list