DRM scheduler issue with spsc_queue

Matthew Brost matthew.brost at intel.com
Mon Jun 16 07:42:15 UTC 2025


On Mon, Jun 16, 2025 at 09:24:38AM +0200, Christian König wrote:
> On 6/13/25 21:11, Matthew Brost wrote:
> > On Fri, Jun 13, 2025 at 07:26:22PM +0200, Christian König wrote:
> >> On 6/13/25 19:01, Matthew Brost wrote:
> >>> All,
> >>>
> >>> After about six hours of debugging, I found an issue in a fairly
> >>> aggressive test case involving the DRM scheduler function
> >>> drm_sched_entity_push_job. The problem is that spsc_queue_push does not
> >>> correctly return first on a job push, causing the queue to fail to run
> >>> even though it is ready.
> >>>
> >>> I know this sounds a bit insane, but I assure you it’s happening and is
> >>> quite reproducible. I'm working off a pull of drm-tip from a few days
> >>> ago + some local change to Xe's memory management, with a Kconfig that
> >>> has no debug options enabled. I’m not sure if there’s a bug somewhere in
> >>> the kernel related to barriers or atomics in the recent drm-tip. That
> >>> seems unlikely—but just as unlikely is that this bug has existed for a
> >>> while without being triggered until now.
> >>>
> >>> I've verified the hang in several ways: using printks, adding a debugfs
> >>> entry to manually kick the DRM scheduler queue when it's stuck (which
> >>> gets it unstuck), and replacing the SPSC queue with one guarded by a
> >>> spinlock (which completely fixes the issue).
> >>>
> >>> That last point raises a big question: why are we using a convoluted
> >>> lockless algorithm here instead of a simple spinlock? This isn't a
> >>> critical path—and even if it were, how much performance benefit are we
> >>> actually getting from the lockless design? Probably very little.
> >>>
> >>> Any objections to me rewriting this around a spinlock-based design? My
> >>> head hurts from chasing this bug, and I feel like this is the best way
> >>> forward rather than wasting more time here.
> >>
> >> Well, the spsc queue is some standard code I used in previous projects, and we have never experienced any issue with it.
> >>
> > 
> > I can quite clearly see this not working on my test setup. I can (kinda)
> > explain it a bit more.
> > 
> > Look at this code:
> > 
> > static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
> > {
> >         struct spsc_node **tail;
> > 
> >         node->next = NULL;
> > 
> >         preempt_disable();
> > 
> >         tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
> >         WRITE_ONCE(*tail, node);
> >         atomic_inc(&queue->job_count);
> > 
> >         /*
> >          * In case of first element verify new node will be visible to the consumer
> >          * thread when we ping the kernel thread that there is new work to do.
> >          */
> >         smp_wmb();
> > 
> >         preempt_enable();
> > 
> >         return tail == &queue->head;
> > }
> > 
> > Between the atomic_long_xchg and the atomic_inc, the submission worker
> > can dequeue the previous last job, dropping job_count to zero, run that
> > job, observe job_count == 0 (drm_sched_entity_is_ready), and go to
> > sleep. Meanwhile, this push swapped against the previous tail rather
> > than the head, so it returns false (i.e., not the first job) and the
> > caller never requeues the submission worker.
> > 
> > The race window here is tiny, and I would have thought preempt_disable
> > would make it impossible (or preempt_disable is broken in drm-tip from
> > a few days ago), so I'm still a bit perplexed by it. But again, I
> > assure you this is, in fact, happening on my test setup. My test case
> > is an SVM one, which involves all sorts of CPU/GPU faults, migrations,
> > etc. Not sure if that contributes. I can show this race occurring in
> > dmesg if you need proof.
> > 
> > The change below fixes the problem. I'm going to post it to unblock
> > myself.
> > 
> > diff --git a/include/drm/spsc_queue.h b/include/drm/spsc_queue.h
> > index 125f096c88cb..ee9df8cc67b7 100644
> > --- a/include/drm/spsc_queue.h
> > +++ b/include/drm/spsc_queue.h
> > @@ -70,9 +70,11 @@ static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *n
> > 
> >         preempt_disable();
> > 
> > +       atomic_inc(&queue->job_count);
> > +       smp_mb__after_atomic();
> > +
> >         tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
> >         WRITE_ONCE(*tail, node);
> > -       atomic_inc(&queue->job_count);
> > 
> >         /*
> >          * In case of first element verify new node will be visible to the consumer
> 
> I need to double-check the code path once more, but it could certainly be that we got this wrong.
> 
> The code originated in userspace, where atomics are also full memory barriers. Sima had to point out that we have to manually add smp_mb()s here to make it work in the kernel.
> 
> >> This is a massively performance-critical code path, and we need to make sure that we move as few cache lines as possible between the producer and consumer sides.
> >>
> > 
> > Well, I can't say I buy this argument. If you can show me any real
> > workload where using a spinlock here vs. going lockless makes a
> > measurable impact, I'd eat my hat. Also, FWIW, this code seemingly
> > violates the DRM locking guidelines we're all supposed to follow… But
> > anyway, I'll go ahead with the fix above.
> 
> I probably have to take that back, see another comment below.
> 
> > 
> > Matt
> > 
> >> That was the reason why we replaced the spinlock with the spsc queue before.
> 
> Well, we replaced the spinlock+kfifo with the spsc queue for the round-robin peeking implementation.
> 
> Background: because of the spinlock, even a "peek" transfers the cache line in write mode to the scheduler thread, and when you "peek" even at the idle entities, that stops scaling altogether.
> 
> Since we now have the FIFO implementation and avoid peeking into the job queues of idle entities all the time, that might not hurt as badly any more.
> 

I think if the consumer (scheduler) always tried to peek—e.g., by
checking the SPSC tail rather than the job count—the current SPSC code
would work as is.

Matt
 
> Regards,
> Christian. 
> 
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> Matt
> >>
> 

