DRM scheduler issue with spsc_queue

Simona Vetter simona.vetter at ffwll.ch
Mon Jun 16 10:42:53 UTC 2025


On Mon, Jun 16, 2025 at 09:24:38AM +0200, Christian König wrote:
> On 6/13/25 21:11, Matthew Brost wrote:
> > On Fri, Jun 13, 2025 at 07:26:22PM +0200, Christian König wrote:
> >> On 6/13/25 19:01, Matthew Brost wrote:
> >>> All,
> >>>
> >>> After about six hours of debugging, I found an issue in a fairly
> >>> aggressive test case involving the DRM scheduler function
> >>> drm_sched_entity_push_job. The problem is that spsc_queue_push does not
> >>> correctly report that the pushed job is the first in the queue, causing
> >>> the queue to fail to run even though it is ready.
> >>>
> >>> I know this sounds a bit insane, but I assure you it’s happening and is
> >>> quite reproducible. I'm working off a pull of drm-tip from a few days
> >>> ago plus some local changes to Xe's memory management, with a Kconfig
> >>> that has no debug options enabled. I'm not sure if there's a bug
> >>> somewhere in the kernel related to barriers or atomics in the recent
> >>> drm-tip. That seems unlikely, but just as unlikely is that this bug has
> >>> existed for a while without being triggered until now.
> >>>
> >>> I've verified the hang in several ways: using printks, adding a debugfs
> >>> entry to manually kick the DRM scheduler queue when it's stuck (which
> >>> gets it unstuck), and replacing the SPSC queue with one guarded by a
> >>> spinlock (which completely fixes the issue).
> >>>
> >>> That last point raises a big question: why are we using a convoluted
> >>> lockless algorithm here instead of a simple spinlock? This isn't a
> >>> critical path, and even if it were, how much performance benefit are we
> >>> actually getting from the lockless design? Probably very little.
> >>>
> >>> Any objections to me rewriting this around a spinlock-based design? My
> >>> head hurts from chasing this bug, and I feel like this is the best way
> >>> forward rather than wasting more time here.
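> >>>
> >>> For reference, a rough sketch of the direction I have in mind
> >>> (hypothetical and untested; "locked_queue" and the field names are
> >>> made up for illustration, this is not an actual patch):
> >>>
> >>> struct locked_node {
> >>>         struct list_head entry;
> >>> };
> >>>
> >>> struct locked_queue {
> >>>         spinlock_t lock;        /* guards list and job_count */
> >>>         struct list_head list;
> >>>         unsigned int job_count;
> >>> };
> >>>
> >>> /* Returns true if the pushed node is the first in the queue, so the
> >>>  * caller knows to wake the submission worker. */
> >>> static inline bool locked_queue_push(struct locked_queue *queue,
> >>>                                      struct locked_node *node)
> >>> {
> >>>         bool first;
> >>>
> >>>         spin_lock(&queue->lock);
> >>>         first = list_empty(&queue->list);
> >>>         list_add_tail(&node->entry, &queue->list);
> >>>         queue->job_count++;
> >>>         spin_unlock(&queue->lock);
> >>>
> >>>         return first;
> >>> }
> >>>
> >>> Holding the lock across both the linkage and the count update means
> >>> the worker can never observe one without the other, so there is no
> >>> ordering to reason about at all.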
> >>
> >> Well, the spsc queue is standard code I've used in previous projects, and we have never experienced any issues with it.
> >>
> > 
> > I can quite clearly see this not working on my test setup. I can (kinda)
> > explain it a bit more.
> > 
> > Look at this code:
> > 
> >  65 static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
> >  66 {
> >  67         struct spsc_node **tail;
> >  68
> >  69         node->next = NULL;
> >  70
> >  71         preempt_disable();
> >  72
> >  73         tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
> >  74         WRITE_ONCE(*tail, node);
> >  75         atomic_inc(&queue->job_count);
> >  76
> >  77         /*
> >  78          * In case of first element verify new node will be visible to the consumer
> >  79          * thread when we ping the kernel thread that there is new work to do.
> >  80          */
> >  81         smp_wmb();
> >  82
> >  83         preempt_enable();
> >  84
> >  85         return tail == &queue->head;
> >  86 }
> > 
> > Between the execution of atomic_long_xchg and atomic_inc, the submission
> > worker could dequeue the previous last job, reducing job_count to zero,
> > run the job, observe that job_count == 0 (drm_sched_entity_is_ready),
> > and then go to sleep. Meanwhile this function has swapped in the new
> > tail and received the previous one, so it returns false (i.e., not the
> > first element), and the caller does not requeue the submit worker.
> > 
> > The race window here is tiny, and I would have thought preempt_disable
> > would make it impossible (or preempt_disable is broken in drm-tip from a
> > few days ago), so I'm still a bit perplexed by it. But again, I assure
> > you this is, in fact, happening on my test setup. My test case is an SVM
> > one, which involves all sorts of CPU/GPU faults, migrations, etc. Not
> > sure if that contributes. I can show this race occurring in dmesg if you
> > need proof.
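> > 
> > To spell out the interleaving I believe I'm seeing (an illustrative
> > timeline; the dequeue is the scheduler worker calling spsc_queue_pop):
> > 
> >   producer: spsc_queue_push()      consumer: submission worker
> >   ---------------------------      ---------------------------
> >   xchg(tail, &node->next)
> >   WRITE_ONCE(*tail, node)
> >                                    pops previous last job
> >                                    atomic_dec(job_count) -> 0
> >                                    runs the job
> >                                    drm_sched_entity_is_ready() -> false
> >                                    goes to sleep
> >   atomic_inc(job_count) -> 1
> >   returns tail == &queue->head -> false, so nobody wakes the worker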
> > 
> > The change below fixes the problem. I'm going to post it to unblock
> > myself.
> > 
> > diff --git a/include/drm/spsc_queue.h b/include/drm/spsc_queue.h
> > index 125f096c88cb..ee9df8cc67b7 100644
> > --- a/include/drm/spsc_queue.h
> > +++ b/include/drm/spsc_queue.h
> > @@ -70,9 +70,11 @@ static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
> > 
> >         preempt_disable();
> > 
> > +       atomic_inc(&queue->job_count);
> > +       smp_mb__after_atomic();
> > +
> >         tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
> >         WRITE_ONCE(*tail, node);
> > -       atomic_inc(&queue->job_count);
> > 
> >         /*
> >          * In case of first element verify new node will be visible to the consumer
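> > 
> > My reading of why this ordering closes the window (worth double-checking
> > by someone with fresher memory-model eyes): smp_mb__after_atomic() makes
> > the increment visible before the xchg publishes the new tail, so
> > job_count is already >= 1 by the time the worker can dequeue our
> > predecessor. A worker that pops the previous last job and re-checks
> > readiness then sees the pending job and requeues itself, even though
> > push returned false.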
> 
> I need to double-check the code path once more, but it could certainly be that we got this wrong.
> 
> The code originated in userspace, where atomics are also full memory
> barriers. Sima had to point out that we have to manually add smp_mb()s
> here to make it work.

Yeah, last time I looked (which I think was years ago) spsc lacked the
needed barriers, and even more it lacked the comments to explain how the
barriers pair up (since some of them are only implied). I kinda decided
that it's above my skill level to really dig into the entire thing and do
a formal proof and all that, which is imo what we need for lockless code
in the kernel, or it's just going to be busted in strange ways.

Note that thanks to compilers becoming really good at reordering, you
need barriers even on x86.
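
A concrete illustration of the gap Christian mentions (a sketch, not from
any patch): userspace C11 atomics default to memory_order_seq_cst, so an
increment doubles as a full barrier, while the kernel's atomic_inc() makes
no ordering guarantees at all and has to be paired with an explicit
barrier:

        /* userspace (C11 <stdatomic.h>, _Atomic int job_count):
         * seq_cst by default, the RMW doubles as a full barrier */
        atomic_fetch_add(&job_count, 1);

        /* kernel: atomic_inc() has no ordering guarantees
         * (Documentation/atomic_t.txt), so order it explicitly */
        atomic_inc(&queue->job_count);
        smp_mb__after_atomic();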
-Sima

> 
> >> This is a massively performance-critical code path, and we need to make sure that we move as few cache lines as possible between the producer and consumer sides.
> >>
> > 
> > Well, I can't say I buy this argument. If you can show me any real
> > workload where using a spinlock here vs. going lockless makes a
> > measurable impact, I'd eat my hat. Also, FWIW, this code seemingly
> > violates the DRM locking guidelines we're all supposed to follow… But
> > anyway, I'll go ahead with the fix above.
> 
> I probably have to take that back, see another comment below.
> 
> > 
> > Matt
> > 
> >> That was the reason why we replaced the spinlock with the spsc queue before.
> 
> Well, we replaced spinlock+kfifo with spsc for the round-robin peeking implementation.
> 
> Background is that because of the spinlock even a "peek" transfers the cache line as a write to the scheduler thread, and when you do the "peek" even on idle entities, that starts to not scale at all.
> 
> Since we now have the FIFO implementation and avoid peeking into the job queues of idle entities all the time, that might not hurt as badly any more.
> 
> Regards,
> Christian. 
> 
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> Matt
> >>
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

