[PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

Fri Sep 20 08:57:52 UTC 2024

On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
> Tearing down the scheduler with jobs still on the pending list can
> lead to use after free issues. Add a warning if drivers try to
> destroy a scheduler which still has work pushed to the HW.

Did you have time yet to look into my proposed waitque-solution?

> 
> When there are still entities with jobs the situation is even worse
> since the dma_fences for those jobs can never signal we can just
> choose between potentially locking up core memory management and
> random memory corruption. When drivers really mess it up that well
> let them run into a BUG_ON().
> 
> Signed-off-by: Christian König <christian.koenig at amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index f093616fe53c..8a46fab5cdc8 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler
> *sched)

I agree with Sima that it should first be documented in the function's
docstring what the user is expected to have done before calling the
function.

P.

>  
>  	drm_sched_wqueue_stop(sched);
>  
> +	/*
> +	 * Tearing down the scheduler wile there are still
> unprocessed jobs can
> +	 * lead to use after free issues in the scheduler fence.
> +	 */
> +	WARN_ON(!list_empty(&sched->pending_list));
> +
>  	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++)
> {
>  		struct drm_sched_rq *rq = sched->sched_rq[i];
>  
>  		spin_lock(&rq->lock);
> -		list_for_each_entry(s_entity, &rq->entities, list)
> +		list_for_each_entry(s_entity, &rq->entities, list) {
> +			/*
> +			 * The justification for this BUG_ON() is
> that tearing
> +			 * down the scheduler while jobs are pending
> leaves
> +			 * dma_fences unsignaled. Since we have
> dependencies
> +			 * from the core memory management to
> eventually signal
> +			 * dma_fences this can trivially lead to a
> system wide
> +			 * stop because of a locked up memory
> management.
> +			 */
> +			BUG_ON(spsc_queue_count(&s_entity-
> >job_queue));
> +
>  			/*
>  			 * Prevents reinsertion and marks job_queue
> as idle,
>  			 * it will removed from rq in
> drm_sched_entity_fini
>  			 * eventually
>  			 */
>  			s_entity->stopped = true;
> +		}
>  		spin_unlock(&rq->lock);
>  		kfree(sched->sched_rq[i]);
>  	}