[PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini
Christian König
ckoenig.leichtzumerken at gmail.com
Wed Sep 18 14:56:15 UTC 2024
On 18.09.24 at 16:41, Jani Nikula wrote:
> On Wed, 18 Sep 2024, "Christian König" <ckoenig.leichtzumerken at gmail.com> wrote:
>> Tearing down the scheduler with jobs still on the pending list can
>> lead to use after free issues. Add a warning if drivers try to
>> destroy a scheduler which still has work pushed to the HW.
>>
>> When there are still entities with jobs the situation is even worse,
>> since the dma_fences for those jobs can never signal; we can then only
>> choose between potentially locking up core memory management and
>> random memory corruption. When drivers really mess it up that badly,
>> let them run into a BUG_ON().
>>
>> Signed-off-by: Christian König <christian.koenig at amd.com>
>> ---
>> drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>> 1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index f093616fe53c..8a46fab5cdc8 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>
>> drm_sched_wqueue_stop(sched);
>>
>> + /*
>> +	 * Tearing down the scheduler while there are still unprocessed jobs can
>> + * lead to use after free issues in the scheduler fence.
>> + */
>> + WARN_ON(!list_empty(&sched->pending_list));
> drm_WARN_ON(sched->dev, ...) would identify the device, which I presume
> would be helpful in multi-GPU systems.
Good point, going to fix that.
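Applied to the hunk above, that suggestion would look roughly like
this (only a sketch of Jani's proposal; the exact argument depends on
which device pointer the scheduler ends up exposing for drm_WARN_ON()):

	/*
	 * Tearing down the scheduler while there are still unprocessed
	 * jobs can lead to use after free issues in the scheduler fence.
	 */
	drm_WARN_ON(sched->dev, !list_empty(&sched->pending_list));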
Regards,
Christian.
>
> BR,
> Jani.
>
>> +
>> for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>> struct drm_sched_rq *rq = sched->sched_rq[i];
>>
>> spin_lock(&rq->lock);
>> - list_for_each_entry(s_entity, &rq->entities, list)
>> + list_for_each_entry(s_entity, &rq->entities, list) {
>> + /*
>> + * The justification for this BUG_ON() is that tearing
>> + * down the scheduler while jobs are pending leaves
>> + * dma_fences unsignaled. Since we have dependencies
>> + * from the core memory management to eventually signal
>> + * dma_fences this can trivially lead to a system wide
>> + * stop because of a locked up memory management.
>> + */
>> + BUG_ON(spsc_queue_count(&s_entity->job_queue));
>> +
>> /*
>> * Prevents reinsertion and marks job_queue as idle,
>> * it will be removed from rq in drm_sched_entity_fini
>> * eventually
>> */
>> s_entity->stopped = true;
>> + }
>> spin_unlock(&rq->lock);
>> kfree(sched->sched_rq[i]);
>> }
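As a usage note for drivers: the new checks only trigger when
drm_sched_fini() is called while entities still have queued jobs or the
pending list is not empty, so the teardown order matters. A minimal
sketch of the intended ordering (my_device, my_dev_wait_idle() and the
field names are hypothetical, not part of this patch):

	/* Tear down entities and drain the HW before destroying the
	 * scheduler, so neither the WARN_ON() nor the BUG_ON() fires.
	 */
	static void my_dev_fini(struct my_device *mdev)
	{
		/* Stops the entity and flushes its queued jobs. */
		drm_sched_entity_destroy(&mdev->entity);

		/* Driver specific: wait for the HW to retire everything
		 * still on the pending list, e.g. the last HW fence.
		 */
		my_dev_wait_idle(mdev);

		/* Only now is it safe to tear down the scheduler. */
		drm_sched_fini(&mdev->sched);
	}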