[PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

Christian König ckoenig.leichtzumerken at gmail.com
Fri Sep 20 10:33:54 UTC 2024


Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
>> Tearing down the scheduler with jobs still on the pending list can
>> lead to use after free issues. Add a warning if drivers try to
>> destroy a scheduler which still has work pushed to the HW.
> Did you have time yet to look into my proposed waitqueue solution?

I don't remember seeing anything. What have I missed?

>
>> When there are still entities with queued jobs the situation is even
>> worse, since the dma_fences for those jobs can then never signal and
>> we can only choose between potentially locking up core memory
>> management and random memory corruption. When drivers mess it up that
>> badly, let them run into a BUG_ON().
>>
>> Signed-off-by: Christian König <christian.koenig at amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>>   1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index f093616fe53c..8a46fab5cdc8 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
> I agree with Sima that it should first be documented in the function's
> docstring what the user is expected to have done before calling the
> function.

Good point, going to update the documentation as well.
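
Roughly something along these lines as a first sketch (final wording
still to be worked out):

/**
 * drm_sched_fini - Destroy a gpu scheduler
 *
 * @sched: scheduler instance
 *
 * Tears down and cleans up the scheduler.
 *
 * Before calling this, drivers are expected to have torn down all
 * entities which were pushing jobs to this scheduler (e.g. with
 * drm_sched_entity_destroy()) and to have waited for all jobs already
 * pushed to the HW, so that the pending list is empty and every
 * scheduler fence is signaled.
 */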

Thanks,
Christian.

>
> P.
>
>>   
>>   	drm_sched_wqueue_stop(sched);
>>   
>> +	/*
>> +	 * Tearing down the scheduler while there are still unprocessed jobs can
>> +	 * lead to use after free issues in the scheduler fence.
>> +	 */
>> +	WARN_ON(!list_empty(&sched->pending_list));
>> +
>>   	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>>   		struct drm_sched_rq *rq = sched->sched_rq[i];
>>   
>>   		spin_lock(&rq->lock);
>> -		list_for_each_entry(s_entity, &rq->entities, list)
>> +		list_for_each_entry(s_entity, &rq->entities, list) {
>> +			/*
>> +			 * The justification for this BUG_ON() is that tearing
>> +			 * down the scheduler while jobs are pending leaves
>> +			 * dma_fences unsignaled. Since we have dependencies
>> +			 * from the core memory management to eventually signal
>> +			 * dma_fences this can trivially lead to a system wide
>> +			 * stop because of a locked up memory management.
>> +			 */
>> +			BUG_ON(spsc_queue_count(&s_entity->job_queue));
>> +
>>   			/*
>>   			 * Prevents reinsertion and marks job_queue as idle,
>>   			 * it will removed from rq in drm_sched_entity_fini
>>   			 * eventually
>>   			 */
>>   			s_entity->stopped = true;
>> +		}
>>   		spin_unlock(&rq->lock);
>>   		kfree(sched->sched_rq[i]);
>>   	}
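
For reference, the teardown ordering these checks expect from a driver
would look roughly like the sketch below (made up my_dev and
my_dev_wait_idle(), not code from this patch):

#include <drm/gpu_scheduler.h>

struct my_dev {
	struct drm_gpu_scheduler sched;
	struct drm_sched_entity entity;
};

/* hypothetical driver helper which blocks until the HW is idle */
void my_dev_wait_idle(struct my_dev *mdev);

static void my_driver_fini(struct my_dev *mdev)
{
	/*
	 * Tear down all entities first, this drains or kills whatever
	 * is still sitting in their job_queue, so the BUG_ON() above
	 * can't trigger.
	 */
	drm_sched_entity_destroy(&mdev->entity);

	/*
	 * Wait for the jobs already pushed to the HW to complete so the
	 * pending list is empty and their fences are signaled,
	 * otherwise the WARN_ON() fires.
	 */
	my_dev_wait_idle(mdev);

	/* Only now it is safe to tear down the scheduler itself. */
	drm_sched_fini(&mdev->sched);
}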

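To make the dependency on core memory management concrete: reclaim and
eviction paths block on the fences attached to a buffer's reservation
object, so a scheduler fence which can never signal stalls them
indefinitely. A minimal illustration of such a wait (hypothetical
helper, not code from this patch):

#include <drm/drm_gem.h>
#include <linux/dma-resv.h>
#include <linux/sched.h>

/*
 * Eviction/reclaim style wait, similar to what TTM does before moving
 * a buffer: block until every fence attached to the BO has signaled.
 * If the scheduler which should have signaled one of those fences was
 * destroyed with the job still queued, this wait never returns and
 * memory management locks up system wide.
 */
static void evict_wait_sketch(struct drm_gem_object *obj)
{
	dma_resv_wait_timeout(obj->resv, DMA_RESV_USAGE_BOOKKEEP,
			      false, MAX_SCHEDULE_TIMEOUT);
}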

