[PATCH] drm/sched: Prevent stopped entities from being added to the run queue.
Matthew Brost
matthew.brost at intel.com
Thu Jul 24 04:13:34 UTC 2025
On Wed, Jul 23, 2025 at 08:56:01AM +0200, Philipp Stanner wrote:
> On Tue, 2025-07-22 at 01:45 -0700, Matthew Brost wrote:
> > On Tue, Jul 22, 2025 at 01:07:29AM -0700, Matthew Brost wrote:
> > > On Tue, Jul 22, 2025 at 09:37:11AM +0200, Philipp Stanner wrote:
> > > > On Mon, 2025-07-21 at 11:07 -0700, Matthew Brost wrote:
> > > > > On Mon, Jul 21, 2025 at 12:14:31PM +0200, Danilo Krummrich wrote:
> > > > > > On Mon Jul 21, 2025 at 10:16 AM CEST, Philipp Stanner wrote:
> > > > > > > On Mon, 2025-07-21 at 09:52 +0200, Philipp Stanner wrote:
> > > > > > > > On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote:
> > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > index bfea608a7106..997a2cc1a635 100644
> > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > @@ -172,8 +172,10 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
> > > > > > > > >
> > > > > > > > > entity->oldest_job_waiting = ts;
> > > > > > > > >
> > > > > > > > > - rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root,
> > > > > > > > > - drm_sched_entity_compare_before);
> > > > > > > > > + if (!entity->stopped) {
> > > > > > > > > + rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root,
> > > > > > > > > + drm_sched_entity_compare_before);
> > > > > > > > > + }
> > > > > > > >
> > > > > > > > If this is a race, then this patch here is broken, too, because you're
> > > > > > > > checking the 'stopped' boolean as the callers of that function do, too
> > > > > > > > – just later. :O
> > > > > > > >
> > > > > > > > Could still race, just less likely.
> > > > > > > >
> > > > > > > > The proper way to fix it would then be to address the issue where the
> > > > > > > > locking is supposed to happen. Let's look at, for example,
> > > > > > > > drm_sched_entity_push_job():
> > > > > > > >
> > > > > > > >
> > > > > > > > void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > > > > > > > {
> > > > > > > > (Bla bla bla)
> > > > > > > >
> > > > > > > > …………
> > > > > > > >
> > > > > > > > /* first job wakes up scheduler */
> > > > > > > > if (first) {
> > > > > > > > struct drm_gpu_scheduler *sched;
> > > > > > > > struct drm_sched_rq *rq;
> > > > > > > >
> > > > > > > > /* Add the entity to the run queue */
> > > > > > > > spin_lock(&entity->lock);
> > > > > > > > if (entity->stopped) { <---- Aha!
> > > > > > > > spin_unlock(&entity->lock);
> > > > > > > >
> > > > > > > > DRM_ERROR("Trying to push to a killed entity\n");
> > > > > > > > return;
> > > > > > > > }
> > > > > > > >
> > > > > > > > rq = entity->rq;
> > > > > > > > sched = rq->sched;
> > > > > > > >
> > > > > > > > spin_lock(&rq->lock);
> > > > > > > > drm_sched_rq_add_entity(rq, entity);
> > > > > > > >
> > > > > > > > if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> > > > > > > > drm_sched_rq_update_fifo_locked(entity, rq, submit_ts); <---- bumm!
> > > > > > > >
> > > > > > > > spin_unlock(&rq->lock);
> > > > > > > > spin_unlock(&entity->lock);
> > > > > > > >
> > > > > > > > But the locks are still being held. So that "shouldn't be happening"(tm).
> > > > > > > >
> > > > > > > > Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini()
> > > > > > > > stop entities. The former holds appropriate locks, but drm_sched_fini()
> > > > > > > > doesn't. So that looks like a hot candidate to me. Opinions?
> > > > > > > >
> > > > > > > > On the other hand, aren't drivers prohibited from calling
> > > > > > > > drm_sched_entity_push_job() after calling drm_sched_fini()? If the
> > > > > > > > fuzzer does that, then it's not the scheduler's fault.
> > > > > >
> > > > > > Exactly, this is the first question to ask.
> > > > > >
> > > > > > And I think it's even more restrictive:
> > > > > >
> > > > > > In drm_sched_fini()
> > > > > >
> > > > > > for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> > > > > > struct drm_sched_rq *rq = sched->sched_rq[i];
> > > > > >
> > > > > > spin_lock(&rq->lock);
> > > > > > list_for_each_entry(s_entity, &rq->entities, list)
> > > > > > /*
> > > > > > * Prevents reinsertion and marks job_queue as idle,
> > > > > > * it will be removed from the rq in drm_sched_entity_fini()
> > > > > > * eventually
> > > > > > */
> > > > > > s_entity->stopped = true;
> > > > > > spin_unlock(&rq->lock);
> > > > > > kfree(sched->sched_rq[i]);
> > > > > > }
> > > > > >
> > > > > > In drm_sched_entity_kill()
> > > > > >
> > > > > > static void drm_sched_entity_kill(struct drm_sched_entity *entity)
> > > > > > {
> > > > > > struct drm_sched_job *job;
> > > > > > struct dma_fence *prev;
> > > > > >
> > > > > > if (!entity->rq)
> > > > > > return;
> > > > > >
> > > > > > spin_lock(&entity->lock);
> > > > > > entity->stopped = true;
> > > > > > drm_sched_rq_remove_entity(entity->rq, entity);
> > > > > > spin_unlock(&entity->lock);
> > > > > >
> > > > > > [...]
> > > > > > }
> > > > > >
> > > > > > If this runs concurrently, this is a UAF as well.
> > > > > >
> > > > > > Personally, I have always been working with the assumption that entities have to
> > > > > > be torn down *before* the scheduler, but those lifetimes are not documented
> > > > > > properly.
> > > > >
> > > > > Yes, this is my assumption too. I would even take it further: an entity
> > > > > shouldn't be torn down until all jobs associated with it are freed as
> > > > > well. I think this would solve a lot of issues I've seen on the list
> > > > > related to UAF, teardown, etc.
> > > >
> > > > That's kind of impossible with the new tear down design, because
> > > > drm_sched_fini() ensures that all jobs are freed on teardown. And
> > > > drm_sched_fini() wouldn't be called before all jobs are gone,
> > > > effectively resulting in a chicken-and-egg problem, or rather: the driver
> > > > implementing its own solution for teardown.
> > > >
> > >
> > > I've read this four times and I'm still generally confused.
> > >
> > > "drm_sched_fini ensures that all jobs are freed on teardown" — Yes,
> > > that's how a refcounting-based solution works. drm_sched_fini would
> > > never be called if there were pending jobs.
> > >
> > > "drm_sched_fini() wouldn't be called before all jobs are gone" — See
> > > above.
> > >
> > > "effectively resulting in a chicken-and-egg problem" — A job is created
> > > after the scheduler, and it holds a reference to the scheduler until
> > > it's freed. I don't see how this idiom applies.
> > >
> > > "the driver implementing its own solution for teardown" — It’s just
> > > following the basic lifetime rules I outlined below. Perhaps Xe was
> > > ahead of its time, but the number of DRM scheduler blowups we've had is
> > > zero — maybe a strong indication that this design is correct.
> > >
> >
> > Sorry—self-reply.
> >
> > To expand on this: the reason Xe implemented a refcount-based teardown
> > solution is because the internals of the DRM scheduler during teardown
> > looked wildly scary. A lower layer should not impose its will on upper
> > layers. I think that’s the root cause of all the problems I've listed.
> >
> > In my opinion, we should document the lifetime rules I’ve outlined, fix
> > all drivers accordingly, and assert these rules in the scheduler layer.
>
>
> Everyone had a separate solution for that. Nouveau used a waitqueue.
> That's what happens when there's no centralized mechanism for solving a
> problem.
>
Right, this is essentially my point — I think refcounting on the driver
side is what the long-term solution really needs to be.
To recap the basic rules (a rough sketch follows the list):
- Entities should not be finalized or freed until all jobs associated
with them are freed.
- Schedulers should not be finalized or freed until all associated
entities are finalized.
- Jobs should hold a reference to the entity.
- Entities should hold a reference to the scheduler.
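To make that concrete, here's a rough driver-side sketch of the refcount
model. None of this is existing drm_sched API beyond the fini/cleanup
calls; my_sched, my_entity, my_job and the release helpers are made-up
names, and a real driver would obviously add its own locking and init
paths (kref_init etc.):

    #include <linux/kref.h>
    #include <linux/slab.h>
    #include <drm/gpu_scheduler.h>

    struct my_sched {
        struct drm_gpu_scheduler base;
        struct kref refcount;        /* one reference per entity */
    };

    struct my_entity {
        struct drm_sched_entity base;
        struct kref refcount;        /* one reference per in-flight job */
        struct my_sched *sched;      /* reference held for the entity's lifetime */
    };

    struct my_job {
        struct drm_sched_job base;
        struct my_entity *entity;    /* reference held until free_job */
    };

    static void my_sched_release(struct kref *kref)
    {
        struct my_sched *s = container_of(kref, struct my_sched, refcount);

        /* No entities are left at this point, so drm_sched_fini() is safe. */
        drm_sched_fini(&s->base);
        kfree(s);
    }

    static void my_entity_release(struct kref *kref)
    {
        struct my_entity *e = container_of(kref, struct my_entity, refcount);

        /* No jobs are left at this point, so the entity can be finalized. */
        drm_sched_entity_fini(&e->base);
        kref_put(&e->sched->refcount, my_sched_release);
        kfree(e);
    }

    /* free_job callback: freeing the job drops its entity reference. */
    static void my_free_job(struct drm_sched_job *sched_job)
    {
        struct my_job *job = container_of(sched_job, struct my_job, base);
        struct my_entity *entity = job->entity;

        drm_sched_job_cleanup(sched_job);
        kfree(job);
        kref_put(&entity->refcount, my_entity_release);
    }

With that in place the ordering falls out naturally: the last free_job
drops the last entity reference, and the last entity drops the scheduler
reference, so drm_sched_fini() can never run while jobs or entities are
still alive.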
I understand this won't happen overnight, or perhaps ever, but adopting
this model would solve a lot of problems across the subsystem and remove
a significant amount of complexity from the DRM scheduler. I'll also
acknowledge that part of this is my fault: years ago I worked around the
teardown problems in the scheduler (by implementing the refcount model
above in Xe) rather than proposing a common, unified solution with clear
lifetime rules.
For drivers with a 1:1 entity-to-scheduler relationship, teardown
becomes fairly simple: set the TDR timeout to zero and let the remaining
jobs naturally flush out via the TDR and the timedout_job callback,
which signals the job's fence. free_job is called after that.
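As a sketch of that 1:1 flow (with hypothetical driver names; the only
real scheduler call here is drm_sched_tdr_queue_imm(), which IIRC zeroes
sched->timeout and queues the timeout work immediately):

    #include <drm/gpu_scheduler.h>

    /* Hypothetical 1:1 wrapper: one scheduler, one entity per queue. */
    struct my_queue {
        struct drm_gpu_scheduler sched;
        struct drm_sched_entity entity;
    };

    /* Driver specific and hypothetical: block further submissions. */
    static void my_queue_block_submissions(struct my_queue *q);

    static void my_queue_kill(struct my_queue *q)
    {
        my_queue_block_submissions(q);

        /*
         * Force the TDR to fire for anything left on the pending list:
         * drm_sched_tdr_queue_imm() zeroes sched->timeout and queues the
         * timeout work right away. The timedout_job callback then signals
         * each remaining job's fence, free_job runs for each of them, and
         * the refcounts from the earlier sketch unwind the entity and
         * the scheduler.
         */
        drm_sched_tdr_queue_imm(&q->sched);
    }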
For non-1:1 setups, we could introduce something like
drm_sched_entity_kill, which would move all jobs on the pending list of
a given entity to a kill list. A worker could then process that kill
list, calling timedout_job and signaling the associated fences.
Similarly, any jobs that still had unresolved dependencies could be
added to the kill list immediately. The kill list would also have to be
checked in drm_sched_free_job_work.
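Nothing like this exists today, so take the following as a very rough
sketch of the idea rather than the actual interface I'd propose. The
kill list, the locking choices, and the worker wiring are invented here;
only pending_list, job_list_lock, job->entity and ops->timedout_job are
real fields:

    #include <drm/gpu_scheduler.h>
    #include <linux/list.h>
    #include <linux/spinlock.h>

    /*
     * Hypothetical: move all of an entity's jobs from the scheduler's
     * pending list onto a kill list for later flushing.
     */
    static void my_sched_entity_kill_jobs(struct drm_gpu_scheduler *sched,
                                          struct drm_sched_entity *entity,
                                          struct list_head *kill_list)
    {
        struct drm_sched_job *job, *tmp;

        spin_lock(&sched->job_list_lock);
        list_for_each_entry_safe(job, tmp, &sched->pending_list, list) {
            if (job->entity == entity)
                list_move_tail(&job->list, kill_list);
        }
        spin_unlock(&sched->job_list_lock);
    }

    /*
     * Hypothetical worker body: give up on each killed job, letting the
     * driver signal its fence via timedout_job; free_job then runs via
     * the normal free path (or a kill-list check in
     * drm_sched_free_job_work).
     */
    static void my_sched_flush_kill_list(struct drm_gpu_scheduler *sched,
                                         struct list_head *kill_list)
    {
        struct drm_sched_job *job, *tmp;

        list_for_each_entry_safe(job, tmp, kill_list, list) {
            list_del_init(&job->list);
            sched->ops->timedout_job(job);
        }
    }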
This would ensure that every submitted job goes through the full
lifecycle:
- run_job is called
- free_job is called
- If the fence returned from run_job needs to be artificially signaled,
timedout_job is called
We can add the infrastructure for this and, once all drivers adhere to
this model, clean up the ugliness in the scheduler related to teardown
and all of the races discussed here.
> Did you see the series we recently merged which repairs the memory
> leaks of drm/sched? It had been around for quite some time.
>
> https://lore.kernel.org/dri-devel/20250701132142.76899-3-phasta@kernel.org/
>
I would say this is just hacking around the fundamental issues with the
lifetime of these objects. Do you see anything in Nouveau that would
prevent the approach I described above from working?
Also, what if jobs have dependencies that aren't even on the pending
list yet? This further illustrates the problems with trying to finalize
objects while child objects (entities, jobs) are still around.
Matt
>
> P.
>
> >
> > Matt
> >
> > > Matt
> > >
> > > > P.
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > There are two solutions:
> > > > > >
> > > > > > (1) Strictly require all entities to be torn down before drm_sched_fini(),
> > > > > > i.e. stick to the natural ownership and lifetime rules here (see below).
> > > > > >
> > > > > > (2) Actually protect *any* changes of the relevant fields of the entity
> > > > > > structure with the entity lock.
> > > > > >
> > > > > > While (2) seems rather obvious, we run into lock inversion with this approach,
> > > > > > as you note below as well. And I think drm_sched_fini() should not mess with
> > > > > > entities anyways.
> > > > > >
> > > > > > The ownership here seems obvious:
> > > > > >
> > > > > > The scheduler *owns* a resource that is used by entities. Consequently, entities
> > > > > > are not allowed to out-live the scheduler.
> > > > > >
> > > > > > Surely, the current implementation to just take the resource away from the
> > > > > > entity under the hood can work as well with appropriate locking, but that's a
> > > > > > mess.
> > > > > >
> > > > > > If the resource *really* needs to be shared for some reason (which I don't see),
> > > > > > shared ownership, i.e. reference counting, is much less error prone.
> > > > >
> > > > > Yes, Xe solves all of this via reference counting (jobs refcount the
> > > > > entity). It's a bit easier in Xe since the scheduler and entities are
> > > > > the same object due to their 1:1 relationship. But even in non-1:1
> > > > > relationships, an entity could refcount the scheduler. The teardown
> > > > > sequence would then be: all jobs complete on the entity → teardown the
> > > > > entity → all entities torn down → teardown the scheduler.
> > > > >
> > > > > Matt
> > > >
>