DRM scheduler issue with spsc_queue
Matthew Brost
matthew.brost at intel.com
Fri Jun 20 16:45:44 UTC 2025
On Mon, Jun 16, 2025 at 12:42:53PM +0200, Simona Vetter wrote:
> On Mon, Jun 16, 2025 at 09:24:38AM +0200, Christian König wrote:
> > On 6/13/25 21:11, Matthew Brost wrote:
> > > On Fri, Jun 13, 2025 at 07:26:22PM +0200, Christian König wrote:
> > >> On 6/13/25 19:01, Matthew Brost wrote:
> > >>> All,
> > >>>
> > >>> After about six hours of debugging, I found an issue in a fairly
> > >>> aggressive test case involving the DRM scheduler function
> > >>> drm_sched_entity_push_job. The problem is that spsc_queue_push does not
> > >>> correctly report that a pushed job is the first one in the queue, so the
> > >>> queue fails to run even though it is ready.
> > >>>
> > >>> I know this sounds a bit insane, but I assure you it’s happening and is
> > >>> quite reproducible. I'm working off a pull of drm-tip from a few days
> > >>> ago plus some local changes to Xe's memory management, with a Kconfig
> > >>> that has no debug options enabled. I'm not sure if there's a bug
> > >>> somewhere in the recent drm-tip related to barriers or atomics. That
> > >>> seems unlikely, but it is just as unlikely that this bug has existed for
> > >>> a while without being triggered until now.
> > >>>
> > >>> I've verified the hang in several ways: using printks, adding a debugfs
> > >>> entry to manually kick the DRM scheduler queue when it's stuck (which
> > >>> gets it unstuck), and replacing the SPSC queue with one guarded by a
> > >>> spinlock (which completely fixes the issue).
> > >>>
> > >>> That last point raises a big question: why are we using a convoluted
> > >>> lockless algorithm here instead of a simple spinlock? This isn't a
> > >>> critical path—and even if it were, how much performance benefit are we
> > >>> actually getting from the lockless design? Probably very little.
> > >>>
> > >>> Any objections to me rewriting this around a spinlock-based design? My
> > >>> head hurts from chasing this bug, and I feel like this is the best way
> > >>> forward rather than wasting more time here.
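> > >>>
> > >>> For illustration, roughly what I have in mind (a minimal sketch, not a
> > >>> finished patch; locked_queue and its members are placeholder names):
> > >>>
> > >>> #include <linux/spinlock.h>
> > >>> #include <linux/list.h>
> > >>>
> > >>> struct locked_queue {
> > >>> 	spinlock_t lock;
> > >>> 	struct list_head jobs;
> > >>> 	unsigned int count;
> > >>> };
> > >>>
> > >>> /* Returns true if the pushed job is the first one, i.e. the caller
> > >>>  * must kick the submission worker. The lock makes "was empty" and
> > >>>  * "now has a job" a single atomic step, which is exactly the window
> > >>>  * the lockless version has to get right with barriers.
> > >>>  */
> > >>> static bool locked_queue_push(struct locked_queue *q, struct list_head *job)
> > >>> {
> > >>> 	bool first;
> > >>>
> > >>> 	spin_lock(&q->lock);
> > >>> 	first = list_empty(&q->jobs);
> > >>> 	list_add_tail(job, &q->jobs);
> > >>> 	q->count++;
> > >>> 	spin_unlock(&q->lock);
> > >>>
> > >>> 	return first;
> > >>> }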
> > >>
> > >> Well, the spsc queue is standard code I've used in previous projects, and
> > >> we have never experienced any issue with it.
> > >>
> > >
> > > I can quite clearly see this not working on my test setup. I can (kinda)
> > > explain it a bit more.
> > >
> > > Look at this code:
> > >
> > > static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
> > > {
> > > 	struct spsc_node **tail;
> > >
> > > 	node->next = NULL;
> > >
> > > 	preempt_disable();
> > >
> > > 	tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
> > > 	WRITE_ONCE(*tail, node);
> > > 	atomic_inc(&queue->job_count);
> > >
> > > 	/*
> > > 	 * In case of first element verify new node will be visible to the consumer
> > > 	 * thread when we ping the kernel thread that there is new work to do.
> > > 	 */
> > > 	smp_wmb();
> > >
> > > 	preempt_enable();
> > >
> > > 	return tail == &queue->head;
> > > }
> > >
> > > Between the execution of atomic_long_xchg and atomic_inc, the submission
> > > worker could dequeue the previous last job, dropping job_count to zero,
> > > run the job, observe job_count == 0 (drm_sched_entity_is_ready), and go
> > > to sleep. Meanwhile this push has exchanged against the previous tail,
> > > so it returns false (i.e., not the first element), and the caller never
> > > requeues the submission worker. The newly pushed job is stranded.
> > >
> > > The race window here is tiny, and I would have thought preempt_disable
> > > would make it impossible (unless preempt_disable is broken in drm-tip
> > > from a few days ago), so I'm still a bit perplexed. But I assure you this
> > > is, in fact, happening on my test setup. My test case is an SVM one,
> > > which involves all sorts of CPU/GPU faults, migrations, etc. Not sure if
> > > that contributes. I can show this race occurring in dmesg if you need
> > > proof.
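> > >
> > > To spell out the interleaving I believe I'm seeing (a sketch; assume
> > > exactly one job is already queued when the push starts):
> > >
> > > /*
> > >  * Producer (spsc_queue_push)        Consumer (submission worker)
> > >  * --------------------------        ----------------------------
> > >  * tail = xchg(&queue->tail, ...)
> > >  *                                   pops the last job (job_count 1 -> 0)
> > >  *                                   runs the job
> > >  *                                   sees job_count == 0, goes to sleep
> > >  * WRITE_ONCE(*tail, node)
> > >  * atomic_inc(&queue->job_count)
> > >  * returns false, so the caller
> > >  * never rekicks the worker
> > >  */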
> > >
> > > The change below fixes the problem. I'm going to post it to unblock
> > > myself.
> > >
> > > diff --git a/include/drm/spsc_queue.h b/include/drm/spsc_queue.h
> > > index 125f096c88cb..ee9df8cc67b7 100644
> > > --- a/include/drm/spsc_queue.h
> > > +++ b/include/drm/spsc_queue.h
> > > @@ -70,9 +70,11 @@ static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
> > >
> > >  	preempt_disable();
> > >
> > > +	atomic_inc(&queue->job_count);
> > > +	smp_mb__after_atomic();
> > > +
> > >  	tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
> > >  	WRITE_ONCE(*tail, node);
> > > -	atomic_inc(&queue->job_count);
> > >
> > >  	/*
> > >  	 * In case of first element verify new node will be visible to the consumer
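> > >
> > > The intent of moving the increment ahead of publishing the node, with a
> > > full barrier in between, is roughly the following (my reading of it; I
> > > am not claiming a formal proof):
> > >
> > > /*
> > >  * Producer                          Consumer
> > >  * atomic_inc(&queue->job_count)
> > >  * smp_mb__after_atomic()
> > >  * xchg(&queue->tail, ...)           pops the previous job; job_count
> > >  * WRITE_ONCE(*tail, node)           is already non-zero, so a re-check
> > >  *                                   before sleeping keeps the worker
> > >  *                                   running instead of sleeping on a
> > >  *                                   non-empty queue
> > >  */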
> >
> > I need to double check the code path once more, but it could certainly be
> > that we got this wrong.
> >
> > The code originated in userspace, where atomics are also full memory
> > barriers. Sima had to point out that in the kernel we have to add
> > smp_mb()s manually to make it work.
>
> Yeah, last time I looked (which I think was years ago) spsc lacked the
> needed barriers, and even more so the comments explaining how the
> barriers pair up (since some are only implied). I kinda decided
> that it's above my skill level to really dig into the entire thing and do
> a formal proof and all that, which is imo what we need for lockless code
> in the kernel or it's just going to be busted in strange ways.
>
Yeah, it took me a while to figure out what the SPSC queue was doing too,
and even after my change I can't confirm it is correct.

> Note that thanks to compilers becoming really good you need barriers even
> on x86.
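
A minimal illustration of that point (a hypothetical sketch, not code
from this thread): x86 hardware keeps stores in order, but the compiler
is still free to reorder plain stores, so a consumer can observe "ready"
before "data" unless you stop it:

int data, ready;

void producer_broken(void)
{
	data = 42;	/* plain stores: the compiler may reorder these */
	ready = 1;
}

void producer_fixed(void)
{
	WRITE_ONCE(data, 42);
	smp_wmb();	/* on x86 this compiles to a pure compiler
			 * barrier, which is exactly what is needed */
	WRITE_ONCE(ready, 1);
}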

Ping on this patch [1]. Any objection to me merging it?
Matt
[1] https://patchwork.freedesktop.org/series/150260/
> -Sima
>
> >
> > >> This is a massively performance-critical code path, and we need to make
> > >> sure we move as few cache lines as possible between the producer and
> > >> consumer sides.
> > >>
> > >
> > > Well, I can't say I buy this argument. If you can show me any real
> > > workload where using a spinlock here vs. going lockless makes a
> > > measurable impact, I'd eat my hat. Also, FWIW, this code seemingly
> > > violates the DRM locking guidelines we're all supposed to follow… But
> > > anyway, I'll go ahead with the fix above.
> >
> > I probably have to take that back, see another comment below.
> >
> > >
> > > Matt
> > >
> > >> That was the reason why we replaced the spinlock with the spsc queue before.
> >
> > Well, we replaced spinlock+kfifo with the spsc queue for the round-robin
> > peeking implementation.
> >
> > Background is that, because of the spinlock, even a "peek" transfers the
> > cache line as a write to the scheduler thread, and once you "peek" even at
> > idle entities, that stops scaling entirely.
> >
> > Since we now have the FIFO implementation and avoid constantly peeking
> > into the job queues of idle entities, that might not hurt as badly any
> > more.
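> >
> > To sketch the cache-line argument (an illustration only; locked_queue and
> > struct job are placeholder names, while spsc_queue is the real type):
> >
> > /* Spinlocked peek: taking the lock is a read-modify-write on the lock
> >  * word, so even a pure reader pulls the cache line over in exclusive
> >  * state and it bounces between producer and consumer CPUs.
> >  */
> > static struct job *locked_queue_peek(struct locked_queue *q)
> > {
> > 	struct job *first;
> >
> > 	spin_lock(&q->lock);
> > 	first = list_first_entry_or_null(&q->jobs, struct job, link);
> > 	spin_unlock(&q->lock);
> > 	return first;
> > }
> >
> > /* Lockless peek: a plain load, so the cache line can stay shared by
> >  * all readers (this mirrors what spsc_queue_peek does).
> >  */
> > static struct spsc_node *lockless_peek(struct spsc_queue *queue)
> > {
> > 	return READ_ONCE(queue->head);
> > }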
> >
> > Regards,
> > Christian.
> >
> > >>
> > >> Regards,
> > >> Christian.
> > >>
> > >>>
> > >>> Matt
> > >>
> >
>
> --
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch