[PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini
Christian König
ckoenig.leichtzumerken at gmail.com
Mon Sep 23 15:24:10 UTC 2024
On 20.09.24 at 15:26, Philipp Stanner wrote:
> On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
>> On 20.09.24 at 10:57, Philipp Stanner wrote:
>>> On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
>>>> Tearing down the scheduler with jobs still on the pending list can
>>>> lead to use after free issues. Add a warning if drivers try to
>>>> destroy a scheduler which still has work pushed to the HW.
>>> Did you have time yet to look into my proposed waitque-solution?
>> I don't remember seeing anything. What have I missed?
> https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/
Mhm, I didn't get that in my inbox for some reason.
Interesting approach, I'm just not sure if we can or should wait in
drm_sched_fini().
Probably better to make that a separate function, something like
drm_sched_flush() or similar.
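Just to illustrate the idea, a minimal sketch of such a helper, assuming it
simply waits for the pending list to drain (the name drm_sched_flush() and
the polling are placeholders only; a real version would rather sleep on a
waitqueue woken from the free_job path):

#include <linux/delay.h>
#include <linux/list.h>
#include <drm/gpu_scheduler.h>

/* Hypothetical helper, not part of this patch. */
static void drm_sched_flush(struct drm_gpu_scheduler *sched)
{
	/*
	 * The driver must have stopped pushing new jobs to the entities
	 * of this scheduler before calling this, otherwise it may never
	 * return.
	 */
	while (!list_empty(&sched->pending_list))
		usleep_range(1000, 2000); /* crude polling, illustration only */
}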
>>>> When there are still entities with jobs the situation is even worse,
>>>> since the dma_fences for those jobs can never signal; we can then only
>>>> choose between potentially locking up core memory management and
>>>> random memory corruption. When drivers really mess it up that badly,
>>>> let them run into a BUG_ON().
>>>>
>>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>>> ---
>>>> drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>>>> 1 file changed, 18 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index f093616fe53c..8a46fab5cdc8 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>> I agree with Sima that it should first be documented in the
>>> function's docstring what the user is expected to have done before
>>> calling the function.
>> Good point, going to update the documentation as well.
> Cool thing, thx.
> It would be great if everything (beyond the totally trivial) that needs
> to be done before _fini() is mentioned.
>
> One could also think about providing a hint at how the driver can do
> that. AFAICS the only way for the driver to ensure that is to maintain
> its own, separate list of submitted jobs.
Even with a duplicated pending list it's currently impossible to do this
fully cleanly.
The problem is that the dma_fence object gives no guarantee about when or
in which context its callbacks are processed; they can run both from
interrupt context and from whichever CPU happens to call
dma_fence_is_signaled().
So even when a driver (or drm_sched_fini) waits for the last submitted
fence, the drm_sched object may still have processing left to do. See the
hack in amdgpu_vm_tlb_seq() for more background on the problem.
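Roughly, the problematic driver-side pattern is the following (just a
sketch; last_fence stands for the finished fence of the driver's last
submitted job and is not an existing symbol):

	/* Naive driver teardown: wait for the last job, then tear down. */
	dma_fence_wait(last_fence, false);
	/*
	 * The wait returns as soon as the fence is signaled, but the
	 * signaling may have happened from interrupt context or from
	 * whichever CPU called dma_fence_is_signaled() first. The
	 * scheduler's free_job work for that job can therefore still be
	 * queued or running, i.e. the pending list is not necessarily
	 * empty yet.
	 */
	drm_sched_fini(sched);	/* can still race with that processing */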
Regards,
Christian.
>
> P.
>
>> Thanks,
>> Christian.
>>
>>> P.
>>>
>>>>
>>>>  	drm_sched_wqueue_stop(sched);
>>>>
>>>> +	/*
>>>> +	 * Tearing down the scheduler while there are still unprocessed jobs can
>>>> +	 * lead to use after free issues in the scheduler fence.
>>>> +	 */
>>>> +	WARN_ON(!list_empty(&sched->pending_list));
>>>> +
>>>>  	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>>>>  		struct drm_sched_rq *rq = sched->sched_rq[i];
>>>>
>>>>  		spin_lock(&rq->lock);
>>>> -		list_for_each_entry(s_entity, &rq->entities, list)
>>>> +		list_for_each_entry(s_entity, &rq->entities, list) {
>>>> +			/*
>>>> +			 * The justification for this BUG_ON() is that tearing
>>>> +			 * down the scheduler while jobs are pending leaves
>>>> +			 * dma_fences unsignaled. Since we have dependencies
>>>> +			 * from the core memory management to eventually signal
>>>> +			 * dma_fences this can trivially lead to a system wide
>>>> +			 * stop because of a locked up memory management.
>>>> +			 */
>>>> +			BUG_ON(spsc_queue_count(&s_entity->job_queue));
>>>> +
>>>>  			/*
>>>>  			 * Prevents reinsertion and marks job_queue as idle,
>>>>  			 * it will removed from rq in drm_sched_entity_fini
>>>>  			 * eventually
>>>>  			 */
>>>>  			s_entity->stopped = true;
>>>> +		}
>>>>  		spin_unlock(&rq->lock);
>>>>  		kfree(sched->sched_rq[i]);
>>>>  	}