[PATCH 1/4] drm/scheduler: Add drm_sched_cancel_all_jobs helper
Tvrtko Ursulin
tvrtko.ursulin at igalia.com
Thu Feb 6 13:53:02 UTC 2025
On 06/02/2025 13:46, Christian König wrote:
> Am 06.02.25 um 14:35 schrieb Philipp Stanner:
>> On Wed, 2025-02-05 at 15:33 +0000, Tvrtko Ursulin wrote:
>>> The helper copies code from the existing amdgpu_job_stop_all_jobs_on_sched
>>> with the purpose of reducing the amount of driver code which directly
>>> touches scheduler internals.
>>>
>>> If or when amdgpu manages to change its approach for handling the
>>> permanently wedged state, this helper can be removed.
>> Have you checked how many other drivers might need such a helper?
>>
>> I have somewhat mixed feelings about this because, AFAICT, in the past
>> helpers have been added for just one driver, such as
>> drm_sched_wqueue_ready(), and then they have stayed for almost a
>> decade.
>>
>> AFAIU this is just a code move, and it only really "decouples" amdgpu in
>> the sense of having an official scheduler function that does what amdgpu
>> used to do.
>>
>> So my tendency here would be to continue "allowing" amdgpu to touch the
>> scheduler internals until amdgpu fixes this "permanently wedged
>> state". And if that's too difficult, couldn't the helper reside in an
>> amdgpu/sched_helpers.c or similar?
>>
>> I think that's better than adding 1 helper for just 1 driver and then
>> supposedly removing it again in the future.
>
> Yeah, agree to that general approach.
>
> What amdgpu does here is kind of nasty and looks unnecessary, but
> changing it means we need time from Hawkings and his people involved in
> RAS for amdgpu.
>
> When we move the code to the scheduler we make it an official scheduler
> interface for others to replicate, and that is exactly what we should try
> to avoid.
>
> So my suggestion is to add a /* TODO: This is nasty and should be
> avoided */ to the amdgpu code instead.
So I got a no-go on exporting a low-level queue pop helper and a no-go on
moving the whole dodgy code to common (reasonable). Is there a third way to
break the status quo? What if I respin with just a change local to amdgpu
which, instead of duplicating the to_drm_sched_job macro, duplicates
__drm_sched_entity_queue_pop from patch 3/4 of this series? Roughly along
the lines of the sketch below.
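(Untested sketch, assuming the 3/4 helper keeps the current pop-via-spsc_queue
semantics; the amdgpu-local name is made up for illustration.)

static struct drm_sched_job *
amdgpu_sched_entity_queue_pop(struct drm_sched_entity *entity)
{
	struct spsc_node *node;

	/*
	 * Pop the next queued job, if any, without relying on the
	 * scheduler's to_drm_sched_job() macro being applied to a
	 * possibly NULL node.
	 */
	node = spsc_queue_pop(&entity->job_queue);
	if (!node)
		return NULL;

	return container_of(node, struct drm_sched_job, queue_node);
}

amdgpu_job_stop_all_jobs_on_sched could then loop on this instead of
open-coding the macro.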
Regards,
Tvrtko
>
> Regards,
> Christian.
>
>>
>> P.
>>
>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>> Cc: Christian König <christian.koenig at amd.com>
>>> Cc: Danilo Krummrich <dakr at kernel.org>
>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>> Cc: Philipp Stanner <phasta at kernel.org>
>>> ---
>>>   drivers/gpu/drm/scheduler/sched_main.c | 44 ++++++++++++++++++++++++++
>>>   include/drm/gpu_scheduler.h            |  1 +
>>>   2 files changed, 45 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index a48be16ab84f..0363655db22d 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -703,6 +703,50 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
>>> }
>>> EXPORT_SYMBOL(drm_sched_start);
>>> +/**
>>> + * drm_sched_cancel_all_jobs - Cancel all queued and scheduled jobs
>>> + *
>>> + * @sched: scheduler instance
>>> + * @errno: error value to set on signaled fences
>>> + *
>>> + * Signal all queued and scheduled jobs and set them to error state.
>>> + *
>>> + * Scheduler must be stopped before calling this.
>>> + */
>>> +void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, int errno)
>>> +{
>>> +	struct drm_sched_entity *entity;
>>> +	struct drm_sched_fence *s_fence;
>>> +	struct drm_sched_job *job;
>>> +	enum drm_sched_priority p;
>>> +
>>> +	drm_WARN_ON_ONCE(sched, !sched->pause_submit);
>>> +
>>> +	/* Signal all jobs not yet scheduled */
>>> +	for (p = DRM_SCHED_PRIORITY_KERNEL; p < sched->num_rqs; p++) {
>>> +		struct drm_sched_rq *rq = sched->sched_rq[p];
>>> +
>>> +		spin_lock(&rq->lock);
>>> +		list_for_each_entry(entity, &rq->entities, list) {
>>> +			while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>>> +				s_fence = job->s_fence;
>>> +				dma_fence_signal(&s_fence->scheduled);
>>> +				dma_fence_set_error(&s_fence->finished, errno);
>>> +				dma_fence_signal(&s_fence->finished);
>>> +			}
>>> +		}
>>> +		spin_unlock(&rq->lock);
>>> +	}
>>> +
>>> +	/* Signal all jobs already scheduled to HW */
>>> +	list_for_each_entry(job, &sched->pending_list, list) {
>>> +		s_fence = job->s_fence;
>>> +		dma_fence_set_error(&s_fence->finished, errno);
>>> +		dma_fence_signal(&s_fence->finished);
>>> +	}
>>> +}
>>> +EXPORT_SYMBOL(drm_sched_cancel_all_jobs);
>>> +
>>> /**
>>> * drm_sched_resubmit_jobs - Deprecated, don't use in new code!
>>> *
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index a0ff08123f07..298513f8c327 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -579,6 +579,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
>>> void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
>>> void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>>> void drm_sched_start(struct drm_gpu_scheduler *sched, int errno);
>>> +void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, int errno);
>>> void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>>> void drm_sched_increase_karma(struct drm_sched_job *bad);
>>> void drm_sched_reset_karma(struct drm_sched_job *bad);
>