[PATCH RFC 11/18] drm/scheduler: Clean up jobs when the scheduler is torn down

Wed Mar 8 10:03:34 UTC 2023

Am 08.03.23 um 10:57 schrieb Maarten Lankhorst:
>
> On 2023-03-07 15:25, Asahi Lina wrote:
>> drm_sched_fini() currently leaves any pending jobs dangling, which
>> causes segfaults and other badness when job completion fences are
>> signaled after the scheduler is torn down.
>>
>> Explicitly detach all jobs from their completion callbacks and free
>> them. This makes it possible to write a sensible safe abstraction for
>> drm_sched, without having to externally duplicate the tracking of
>> in-flight jobs.
>>
>> This shouldn't regress any existing drivers, since calling
>> drm_sched_fini() with any pending jobs is broken and this change should
>> be a no-op if there are no pending jobs.
>>
>> Signed-off-by: Asahi Lina <lina at asahilina.net>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 27 
>> +++++++++++++++++++++++++--
>>   1 file changed, 25 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 5c0add2c7546..0aab1e0aebdd 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1119,10 +1119,33 @@ EXPORT_SYMBOL(drm_sched_init);
>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>   {
>>       struct drm_sched_entity *s_entity;
>> +    struct drm_sched_job *s_job, *tmp;
>>       int i;
>>   -    if (sched->thread)
>> -        kthread_stop(sched->thread);
>> +    if (!sched->thread)
>> +        return;
>> +
>> +    /*
>> +     * Stop the scheduler, detaching all jobs from their hardware 
>> callbacks
>> +     * and cleaning up complete jobs.
>> +     */
>> +    drm_sched_stop(sched, NULL);
>> +
>> +    /*
>> +     * Iterate through the pending job list and free all jobs.
>> +     * This assumes the driver has either guaranteed jobs are 
>> already stopped, or that
>> +     * otherwise it is responsible for keeping any necessary data 
>> structures for
>> +     * in-progress jobs alive even when the free_job() callback is 
>> called early (e.g. by
>> +     * putting them in its own queue or doing its own refcounting).
>> +     */
>> +    list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>> +        spin_lock(&sched->job_list_lock);
>> +        list_del_init(&s_job->list);
>> +        spin_unlock(&sched->job_list_lock);
>> +        sched->ops->free_job(s_job);
>> +    }
>
> I would stop the kthread first, then delete all jobs without spinlock 
> since nothing else can race against sched_fini?
>
> If you do need the spinlock, It would need to guard 
> list_for_each_entry too.

Well this case here actually should not happen in the first place.

Jobs depend on their device, so as long as there are jobs there should 
also be a reference to the scheduler.

What could be is that you have allocated a scheduler instance 
dynamically, but even then you should first tear down all entities and 
then the scheduler.

Regards,
Christian.

>
>> +
>> +    kthread_stop(sched->thread);
>>         for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
>> DRM_SCHED_PRIORITY_MIN; i--) {
>>           struct drm_sched_rq *rq = &sched->sched_rq[i];
>>