[PATCH 1/4] drm/scheduler: Add drm_sched_cancel_all_jobs helper
Christian König
christian.koenig at amd.com
Thu Feb 6 14:01:32 UTC 2025
On 06.02.25 at 14:53, Tvrtko Ursulin wrote:
>
> On 06/02/2025 13:46, Christian König wrote:
>> On 06.02.25 at 14:35, Philipp Stanner wrote:
>>> On Wed, 2025-02-05 at 15:33 +0000, Tvrtko Ursulin wrote:
>>>> The helper copies code from the existing
>>>> amdgpu_job_stop_all_jobs_on_sched with the purpose of reducing the
>>>> amount of driver code which directly touches scheduler internals.
>>>>
>>>> If or when amdgpu manages to change the approach for handling the
>>>> permanently wedged state, this helper can be removed.
>>> Have you checked how many other drivers might need such a helper?
>>>
>>> I have somewhat mixed feelings about this because, AFAICT, in the
>>> past helpers have been added for just 1 driver, such as
>>> drm_sched_wqueue_ready(), and then they have stayed for almost a
>>> decade.
>>>
>>> AFAIU this is just a code move, and it only really "decouples" amdgpu
>>> in the sense of having an official scheduler function that does what
>>> amdgpu used to do.
>>>
>>> So my tendency here would be to continue "allowing" amdgpu to touch
>>> the scheduler internals until amdgpu fixes this "permanently wedged
>>> state". And if that's too difficult, couldn't the helper reside in an
>>> amdgpu/sched_helpers.c or similar?
>>>
>>> I think that's better than adding 1 helper for just 1 driver and then
>>> supposedly removing it again in the future.
>>
>> Yeah, agree to that general approach.
>>
>> What amdgpu does here is kind of nasty and looks unnecessary, but
>> changing it means we need time from Hawkings and his people working
>> on RAS for amdgpu.
>>
>> When we move the code to the scheduler we make it an official
>> scheduler interface for others to replicate, and that is exactly what
>> we should try to avoid.
>>
>> So my suggestion is to add a /* TODO: This is nasty and should be
>> avoided */ to the amdgpu code instead.
>
> So I got a no-go for exporting a low-level queue pop helper, and a
> no-go for moving the whole dodgy code to common (reasonable). Any third
> way to break the status quo? What if I respin with just a change local
> to amdgpu which, instead of duplicating the to_drm_sched_job macro,
> duplicates __drm_sched_entity_queue_pop from 3/4 of this series?
Removing the necessity for the queue to be the first member is still a
good idea.
I would add internal container_of helpers to the scheduler and then use
an explicit container_of in amdgpu. I.e. don't expose the scheduler
helpers, but rather code them up manually on the amdgpu side.
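Something along these lines, as a rough and untested sketch of the amdgpu
side (the function name is made up here and the queue_node member name is
assumed from the current struct drm_sched_job layout):

static struct drm_sched_job *
amdgpu_sched_entity_queue_pop(struct drm_sched_entity *entity)
{
	struct spsc_node *node;

	/* Pop first and check for NULL before doing any pointer math */
	node = spsc_queue_pop(&entity->job_queue);
	if (!node)
		return NULL;

	return container_of(node, struct drm_sched_job, queue_node);
}

With the NULL check done before container_of() the result no longer
depends on queue_node being the first member of struct drm_sched_job.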
Regards,
Christian.
>
> Regards,
>
> Tvrtko
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> P.
>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>>> Cc: Christian König <christian.koenig at amd.com>
>>>> Cc: Danilo Krummrich <dakr at kernel.org>
>>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>>> Cc: Philipp Stanner <phasta at kernel.org>
>>>> ---
>>>>   drivers/gpu/drm/scheduler/sched_main.c | 44 ++++++++++++++++++++++++++
>>>>   include/drm/gpu_scheduler.h            |  1 +
>>>>   2 files changed, 45 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index a48be16ab84f..0363655db22d 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -703,6 +703,50 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
>>>>  }
>>>>  EXPORT_SYMBOL(drm_sched_start);
>>>>
>>>> +/**
>>>> + * drm_sched_cancel_all_jobs - Cancel all queued and scheduled jobs
>>>> + *
>>>> + * @sched: scheduler instance
>>>> + * @errno: error value to set on signaled fences
>>>> + *
>>>> + * Signal all queued and scheduled jobs and set them to error state.
>>>> + *
>>>> + * Scheduler must be stopped before calling this.
>>>> + */
>>>> +void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, int errno)
>>>> +{
>>>> +	struct drm_sched_entity *entity;
>>>> +	struct drm_sched_fence *s_fence;
>>>> +	struct drm_sched_job *job;
>>>> +	enum drm_sched_priority p;
>>>> +
>>>> +	drm_WARN_ON_ONCE(sched, !sched->pause_submit);
>>>> +
>>>> +	/* Signal all jobs not yet scheduled */
>>>> +	for (p = DRM_SCHED_PRIORITY_KERNEL; p < sched->num_rqs; p++) {
>>>> +		struct drm_sched_rq *rq = sched->sched_rq[p];
>>>> +
>>>> +		spin_lock(&rq->lock);
>>>> +		list_for_each_entry(entity, &rq->entities, list) {
>>>> +			while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>>>> +				s_fence = job->s_fence;
>>>> +				dma_fence_signal(&s_fence->scheduled);
>>>> +				dma_fence_set_error(&s_fence->finished, errno);
>>>> +				dma_fence_signal(&s_fence->finished);
>>>> +			}
>>>> +		}
>>>> +		spin_unlock(&rq->lock);
>>>> +	}
>>>> +
>>>> +	/* Signal all jobs already scheduled to HW */
>>>> +	list_for_each_entry(job, &sched->pending_list, list) {
>>>> +		s_fence = job->s_fence;
>>>> +		dma_fence_set_error(&s_fence->finished, errno);
>>>> +		dma_fence_signal(&s_fence->finished);
>>>> +	}
>>>> +}
>>>> +EXPORT_SYMBOL(drm_sched_cancel_all_jobs);
>>>> +
>>>>  /**
>>>>   * drm_sched_resubmit_jobs - Deprecated, don't use in new code!
>>>>   *
>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>> index a0ff08123f07..298513f8c327 100644
>>>> --- a/include/drm/gpu_scheduler.h
>>>> +++ b/include/drm/gpu_scheduler.h
>>>> @@ -579,6 +579,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
>>>>  void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
>>>>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>>>>  void drm_sched_start(struct drm_gpu_scheduler *sched, int errno);
>>>> +void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, int errno);
>>>>  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>>>>  void drm_sched_increase_karma(struct drm_sched_job *bad);
>>>>  void drm_sched_reset_karma(struct drm_sched_job *bad);
>>