DRM scheduler issue with spsc_queue

Matthew Brost matthew.brost at intel.com
Fri Jun 13 19:11:10 UTC 2025


On Fri, Jun 13, 2025 at 07:26:22PM +0200, Christian König wrote:
> On 6/13/25 19:01, Matthew Brost wrote:
> > All,
> > 
> > After about six hours of debugging, I found an issue in a fairly
> > aggressive test case involving the DRM scheduler function
> > drm_sched_entity_push_job. The problem is that spsc_queue_push does not
> > correctly return first on a job push, causing the queue to fail to run
> > even though it is ready.
> > 
> > I know this sounds a bit insane, but I assure you it’s happening and is
> > quite reproducible. I'm working off a pull of drm-tip from a few days
> > ago + some local change to Xe's memory management, with a Kconfig that
> > has no debug options enabled. I’m not sure if there’s a bug somewhere in
> > the kernel related to barriers or atomics in the recent drm-tip. That
> > seems unlikely—but just as unlikely is that this bug has existed for a
> > while without being triggered until now.
> > 
> > I've verified the hang in several ways: using printks, adding a debugfs
> > entry to manually kick the DRM scheduler queue when it's stuck (which
> > gets it unstuck), and replacing the SPSC queue with one guarded by a
> > spinlock (which completely fixes the issue).
> > 
> > That last point raises a big question: why are we using a convoluted
> > lockless algorithm here instead of a simple spinlock? This isn't a
> > critical path—and even if it were, how much performance benefit are we
> > actually getting from the lockless design? Probably very little.
> > 
> > Any objections to me rewriting this around a spinlock-based design? My
> > head hurts from chasing this bug, and I feel like this is the best way
> > forward rather than wasting more time here.
> 
> Well the spsc queue is some standard code I used in previous projects and we have never experienced any issue with that.
> 

I can quite clearly see this not working on my test setup, and I can at
least partly explain why.

Look at this code:

 65 static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
 66 {
 67         struct spsc_node **tail;
 68
 69         node->next = NULL;
 70
 71         preempt_disable();
 72
 73         tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
 74         WRITE_ONCE(*tail, node);
 75         atomic_inc(&queue->job_count);
 76
 77         /*
 78          * In case of first element verify new node will be visible to the consumer
 79          * thread when we ping the kernel thread that there is new work to do.
 80          */
 81         smp_wmb();
 82
 83         preempt_enable();
 84
 85         return tail == &queue->head;
 86 }

Between the execution of atomic_long_xchg and atomic_inc, the submission
worker can dequeue the previous last job, dropping job_count to zero,
run that job, observe job_count == 0 (drm_sched_entity_is_ready), and go
to sleep. Meanwhile, this push has swapped in the new tail and read back
the previous one, so it returns false (i.e., not the first element), and
the caller therefore does not requeue the submission worker. The new job
is queued, but nothing ever wakes the worker.

The race window here is tiny, and I would have thought preempt_disable
makes it impossible (unless preempt_disable is broken in drm-tip from a
few days ago), so I'm still a bit perplexed by it. But again, I assure
you this is, in fact, happening on my test setup. My test case is an SVM
one, which involves all sorts of CPU/GPU faults, migrations, etc. Not
sure if that contributes. I can show this race occurring in dmesg if you
need proof.

The change below fixes the problem. I'm going to post it to unblock
myself.

diff --git a/include/drm/spsc_queue.h b/include/drm/spsc_queue.h
index 125f096c88cb..ee9df8cc67b7 100644
--- a/include/drm/spsc_queue.h
+++ b/include/drm/spsc_queue.h
@@ -70,9 +70,11 @@ static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *n

        preempt_disable();

+       atomic_inc(&queue->job_count);
+       smp_mb__after_atomic();
+
        tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
        WRITE_ONCE(*tail, node);
-       atomic_inc(&queue->job_count);

        /*
         * In case of first element verify new node will be visible to the consumer

> This is a massively performance critical code path and we need to make sure that we move as few cache lines as possible between the producer and consumer side.
> 

Well, I can't say I buy this argument. If you can show me any real
workload where using a spinlock here vs. going lockless makes a
measurable impact, I'd eat my hat. Also, FWIW, this code seemingly
violates the DRM locking guidelines we're all supposed to follow… But
anyway, I'll go ahead with the fix above.

Matt

> That was the reason why we replaced the spinlock with the spsc queue before.
> 
> Regards,
> Christian.
> 
> > 
> > Matt
> 

