[PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

Fri Sep 20 13:26:07 UTC 2024

On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
> > > Tearing down the scheduler with jobs still on the pending list
> > > can
> > > lead to use after free issues. Add a warning if drivers try to
> > > destroy a scheduler which still has work pushed to the HW.
> > Did you have time yet to look into my proposed waitque-solution?
> 
> I don't remember seeing anything. What have I missed?

https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/

> 
> > 
> > > When there are still entities with jobs the situation is even
> > > worse
> > > since the dma_fences for those jobs can never signal we can just
> > > choose between potentially locking up core memory management and
> > > random memory corruption. When drivers really mess it up that
> > > well
> > > let them run into a BUG_ON().
> > > 
> > > Signed-off-by: Christian König <christian.koenig at amd.com>
> > > ---
> > >   drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
> > >   1 file changed, 18 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > index f093616fe53c..8a46fab5cdc8 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct
> > > drm_gpu_scheduler
> > > *sched)
> > I agree with Sima that it should first be documented in the
> > function's
> > docstring what the user is expected to have done before calling the
> > function.
> 
> Good point, going to update the documentation as well.

Cool thing, thx.
Would be great if everything (not totally trivial) necessary to be done
before _fini() is mentioned.

One could also think about providing a hint at how the driver can do
that. AFAICS the only way for the driver to ensure that is to maintain
its own, separate list of submitted jobs.

P.

> 
> Thanks,
> Christian.
> 
> > 
> > P.
> > 
> > >   
> > >   	drm_sched_wqueue_stop(sched);
> > >   
> > > +	/*
> > > +	 * Tearing down the scheduler wile there are still
> > > unprocessed jobs can
> > > +	 * lead to use after free issues in the scheduler fence.
> > > +	 */
> > > +	WARN_ON(!list_empty(&sched->pending_list));
> > > +
> > >   	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs;
> > > i++)
> > > {
> > >   		struct drm_sched_rq *rq = sched->sched_rq[i];
> > >   
> > >   		spin_lock(&rq->lock);
> > > -		list_for_each_entry(s_entity, &rq->entities,
> > > list)
> > > +		list_for_each_entry(s_entity, &rq->entities,
> > > list) {
> > > +			/*
> > > +			 * The justification for this BUG_ON()
> > > is
> > > that tearing
> > > +			 * down the scheduler while jobs are
> > > pending
> > > leaves
> > > +			 * dma_fences unsignaled. Since we have
> > > dependencies
> > > +			 * from the core memory management to
> > > eventually signal
> > > +			 * dma_fences this can trivially lead to
> > > a
> > > system wide
> > > +			 * stop because of a locked up memory
> > > management.
> > > +			 */
> > > +			BUG_ON(spsc_queue_count(&s_entity-
> > > > job_queue));
> > > +
> > >   			/*
> > >   			 * Prevents reinsertion and marks
> > > job_queue
> > > as idle,
> > >   			 * it will removed from rq in
> > > drm_sched_entity_fini
> > >   			 * eventually
> > >   			 */
> > >   			s_entity->stopped = true;
> > > +		}
> > >   		spin_unlock(&rq->lock);
> > >   		kfree(sched->sched_rq[i]);
> > >   	}
>