[PATCH RFC 11/18] drm/scheduler: Clean up jobs when the scheduler is torn down
Christian König
christian.koenig at amd.com
Wed Mar 8 10:03:34 UTC 2023
Am 08.03.23 um 10:57 schrieb Maarten Lankhorst:
>
> On 2023-03-07 15:25, Asahi Lina wrote:
>> drm_sched_fini() currently leaves any pending jobs dangling, which
>> causes segfaults and other badness when job completion fences are
>> signaled after the scheduler is torn down.
>>
>> Explicitly detach all jobs from their completion callbacks and free
>> them. This makes it possible to write a sensible safe abstraction for
>> drm_sched, without having to externally duplicate the tracking of
>> in-flight jobs.
>>
>> This shouldn't regress any existing drivers, since calling
>> drm_sched_fini() with any pending jobs is broken and this change should
>> be a no-op if there are no pending jobs.
>>
>> Signed-off-by: Asahi Lina <lina at asahilina.net>
>> ---
>> drivers/gpu/drm/scheduler/sched_main.c | 27
>> +++++++++++++++++++++++++--
>> 1 file changed, 25 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 5c0add2c7546..0aab1e0aebdd 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1119,10 +1119,33 @@ EXPORT_SYMBOL(drm_sched_init);
>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>> {
>> struct drm_sched_entity *s_entity;
>> + struct drm_sched_job *s_job, *tmp;
>> int i;
>> - if (sched->thread)
>> - kthread_stop(sched->thread);
>> + if (!sched->thread)
>> + return;
>> +
>> + /*
>> + * Stop the scheduler, detaching all jobs from their hardware
>> callbacks
>> + * and cleaning up complete jobs.
>> + */
>> + drm_sched_stop(sched, NULL);
>> +
>> + /*
>> + * Iterate through the pending job list and free all jobs.
>> + * This assumes the driver has either guaranteed jobs are
>> already stopped, or that
>> + * otherwise it is responsible for keeping any necessary data
>> structures for
>> + * in-progress jobs alive even when the free_job() callback is
>> called early (e.g. by
>> + * putting them in its own queue or doing its own refcounting).
>> + */
>> + list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>> + spin_lock(&sched->job_list_lock);
>> + list_del_init(&s_job->list);
>> + spin_unlock(&sched->job_list_lock);
>> + sched->ops->free_job(s_job);
>> + }
>
> I would stop the kthread first, then delete all jobs without spinlock
> since nothing else can race against sched_fini?
>
> If you do need the spinlock, It would need to guard
> list_for_each_entry too.
Well this case here actually should not happen in the first place.
Jobs depend on their device, so as long as there are jobs there should
also be a reference to the scheduler.
What could be is that you have allocated a scheduler instance
dynamically, but even then you should first tear down all entities and
then the scheduler.
Regards,
Christian.
>
>> +
>> + kthread_stop(sched->thread);
>> for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >=
>> DRM_SCHED_PRIORITY_MIN; i--) {
>> struct drm_sched_rq *rq = &sched->sched_rq[i];
>>
More information about the dri-devel
mailing list