DRM scheduler issue with spsc_queue

Fri Jun 13 17:01:42 UTC 2025

All,

After about six hours of debugging, I found an issue in a fairly
aggressive test case involving the DRM scheduler function
drm_sched_entity_push_job. The problem is that spsc_queue_push does not
reliably return true ("first") when a job is pushed onto an empty queue,
so the queue never gets run even though it has a job ready.
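
As a rough illustration of why one wrong return value is enough to wedge
the queue, here is a toy userspace model in C. It assumes the caller
only kicks the scheduler when the push reports "first". The names
toy_sched, toy_push_job, toy_run and toy_demo are made up for this
sketch and are not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: every name here is hypothetical, not kernel API. */
struct toy_sched {
	size_t queued;	/* jobs sitting in the entity's queue */
	size_t ran;	/* jobs the scheduler has processed */
	bool awake;	/* scheduler has been kicked */
};

/*
 * Producer side. push_said_first stands in for the value returned by the
 * queue push; the scheduler is only kicked when it is true.
 */
static void toy_push_job(struct toy_sched *s, bool push_said_first)
{
	s->queued++;
	if (push_said_first)
		s->awake = true;
}

/* Consumer side: does nothing unless kicked, then drains the queue. */
static void toy_run(struct toy_sched *s)
{
	if (!s->awake)
		return;
	while (s->queued) {
		s->queued--;
		s->ran++;
	}
	s->awake = false;
}

/* Walk through the stuck case and then a manual kick. */
static void toy_demo(void)
{
	struct toy_sched s = { 0 };

	toy_push_job(&s, false);	/* queue was empty, push misreports */
	toy_run(&s);
	assert(s.queued == 1 && s.ran == 0);	/* job ready, nothing runs */

	s.awake = true;			/* manual kick */
	toy_run(&s);
	assert(s.queued == 0 && s.ran == 1);	/* queue drains */
}
```

In this model, every later push lands on a non-empty queue and so would
not report "first" either, which is why the queue stays stuck until
something kicks it from outside.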

I know this sounds a bit insane, but I assure you it's happening and is
quite reproducible. I'm working off a pull of drm-tip from a few days
ago plus some local changes to Xe's memory management, with a Kconfig
that has no debug options enabled. I'm not sure whether there's a bug
somewhere in the kernel related to barriers or atomics in recent
drm-tip. That seems unlikely, but it seems just as unlikely that this
bug has existed for a while without being triggered until now.

I've verified the hang in several ways: using printks, adding a debugfs
entry to manually kick the DRM scheduler queue when it's stuck (which
gets it unstuck), and replacing the SPSC queue with one guarded by a
spinlock (which completely fixes the issue).
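
For reference, a spinlock-guarded queue with the same push/pop shape
could look roughly like the userspace sketch below. This is only a
sketch, not the replacement used in the test above: the locked_queue
names are hypothetical, and a C11 atomic_flag stands in for a kernel
spinlock_t. The point is that "first" is computed under the same lock
that links the node, so it cannot disagree with the queue state.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; a userspace stand-in, not the kernel's spsc_queue. */
struct locked_node {
	struct locked_node *next;
};

struct locked_queue {
	atomic_flag lock;		/* stand-in for a kernel spinlock_t */
	struct locked_node *head;
	struct locked_node **tail;	/* last node's next, or &head if empty */
	unsigned int count;
};

static void locked_queue_init(struct locked_queue *q)
{
	atomic_flag_clear(&q->lock);
	q->head = NULL;
	q->tail = &q->head;
	q->count = 0;
}

static void locked_queue_lock(struct locked_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void locked_queue_unlock(struct locked_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Returns true if the queue was empty before this push. */
static bool locked_queue_push(struct locked_queue *q,
			      struct locked_node *node)
{
	bool first;

	assert(node != NULL);
	node->next = NULL;

	locked_queue_lock(q);
	first = (q->head == NULL);
	*q->tail = node;
	q->tail = &node->next;
	q->count++;
	locked_queue_unlock(q);

	return first;
}

/* Returns the oldest node, or NULL if the queue is empty. */
static struct locked_node *locked_queue_pop(struct locked_queue *q)
{
	struct locked_node *node;

	locked_queue_lock(q);
	node = q->head;
	if (node) {
		q->head = node->next;
		if (!q->head)
			q->tail = &q->head;
		q->count--;
	}
	locked_queue_unlock(q);

	return node;
}
```

A caller would embed struct locked_node in its job structure and kick
the consumer whenever locked_queue_push returns true.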

That last point raises a big question: why are we using a convoluted
lockless algorithm here instead of a simple spinlock? This isn't a
critical path, and even if it were, how much performance benefit are we
actually getting from the lockless design? Probably very little.

Any objections to me rewriting this around a spinlock-based design? My
head hurts from chasing this bug, and I feel like this is the best way
forward rather than wasting more time here.

Matt