[PATCH 1/4] drm/scheduler: Add drm_sched_cancel_all_jobs helper
Tvrtko Ursulin
tvrtko.ursulin at igalia.com
Thu Feb 6 13:42:40 UTC 2025
On 06/02/2025 13:35, Philipp Stanner wrote:
> On Wed, 2025-02-05 at 15:33 +0000, Tvrtko Ursulin wrote:
>> The helper copies code from the existing
>> amdgpu_job_stop_all_jobs_on_sched, with the purpose of reducing the
>> amount of driver code which directly touches scheduler internals.
>>
>> If or when amdgpu manages to change the approach for handling the
>> permanently wedged state, this helper can be removed.
>
> Have you checked how many other drivers might need such a helper?
>
> I have somewhat mixed feelings about this because, AFAICT, helpers have
> in the past been added for just 1 driver, such as
> drm_sched_wqueue_ready(), and then stayed around for almost a
> decade.
>
> AFAIU this is just a code move, and only really "decouples" amdgpu in
> the sense of having an official scheduler function that does what amdgpu
> used to do.
>
> So my tendency here would be to continue "allowing" amdgpu to touch the
> scheduler internals until amdgpu fixes this "permanently wedged
> state". And if that's too difficult, couldn't the helper reside in an
> amdgpu/sched_helpers.c or similar?
>
> I think that's better than adding 1 helper for just 1 driver and then
> supposedly removing it again in the future.
I was 50% nudging Christian into providing a more concrete idea on how
to fix amdgpu ;) and the other 50% wanting to get rid of the three copies
of to_drm_sched_job and remove the hidden "queue node must be first"
dependency.
So let's let it marinate a bit and see if a nicer solution shows up.
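To illustrate the hidden dependency mentioned above, here is a minimal
userspace sketch. It assumes to_drm_sched_job is a container_of()-style
conversion on the job's queue node, applied directly to a pop result that
may be NULL. All struct and function names below are hypothetical
stand-ins, not the scheduler's definitions.

```c
#include <stddef.h>

/* Hypothetical stand-in for a queue node; not the kernel type. */
struct queue_node {
	struct queue_node *next;
};

/* Layout A: the queue node is the first member (offset 0). */
struct job_node_first {
	struct queue_node queue_node;
	int id;
};

/* Layout B: another member precedes the queue node (non-zero offset). */
struct job_node_later {
	int id;
	struct queue_node queue_node;
};

/*
 * A bare container_of()-style conversion subtracts the member offset
 * from the pointer. A NULL "queue is empty" result therefore stays NULL
 * only when that offset is zero. This helper reports whether that holds.
 */
static int null_survives_conversion(size_t member_offset)
{
	return member_offset == 0;
}

/*
 * Conversion with an explicit NULL check (hypothetical helper). It gives
 * the right answer wherever queue_node sits inside the structure.
 */
static struct job_node_later *job_from_node(struct queue_node *node)
{
	if (!node)
		return NULL;
	return (struct job_node_later *)((char *)node -
			offsetof(struct job_node_later, queue_node));
}
```

With layout A a loop of the form "while ((job = convert(pop(queue))))"
terminates on an empty queue; with layout B it would only do so if the
conversion checks for NULL first, as job_from_node() does here.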
Regards,
Tvrtko
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>> Cc: Christian König <christian.koenig at amd.com>
>> Cc: Danilo Krummrich <dakr at kernel.org>
>> Cc: Matthew Brost <matthew.brost at intel.com>
>> Cc: Philipp Stanner <phasta at kernel.org>
>> ---
>> drivers/gpu/drm/scheduler/sched_main.c | 44 ++++++++++++++++++++++++++
>> include/drm/gpu_scheduler.h | 1 +
>> 2 files changed, 45 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index a48be16ab84f..0363655db22d 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -703,6 +703,50 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
>> }
>> EXPORT_SYMBOL(drm_sched_start);
>>
>> +/**
>> + * drm_sched_cancel_all_jobs - Cancel all queued and scheduled jobs
>> + *
>> + * @sched: scheduler instance
>> + * @errno: error value to set on signaled fences
>> + *
>> + * Signal all queued and scheduled jobs and set them to error state.
>> + *
>> + * Scheduler must be stopped before calling this.
>> + */
>> +void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, int errno)
>> +{
>> + struct drm_sched_entity *entity;
>> + struct drm_sched_fence *s_fence;
>> + struct drm_sched_job *job;
>> + enum drm_sched_priority p;
>> +
>> + drm_WARN_ON_ONCE(sched, !sched->pause_submit);
>> +
>> + /* Signal all jobs not yet scheduled */
>> + for (p = DRM_SCHED_PRIORITY_KERNEL; p < sched->num_rqs; p++) {
>> + struct drm_sched_rq *rq = sched->sched_rq[p];
>> +
>> + spin_lock(&rq->lock);
>> + list_for_each_entry(entity, &rq->entities, list) {
>> + while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>> + s_fence = job->s_fence;
>> + dma_fence_signal(&s_fence->scheduled);
>> + dma_fence_set_error(&s_fence->finished, errno);
>> + dma_fence_signal(&s_fence->finished);
>> + }
>> + }
>> + spin_unlock(&rq->lock);
>> + }
>> +
>> + /* Signal all jobs already scheduled to HW */
>> + list_for_each_entry(job, &sched->pending_list, list) {
>> + s_fence = job->s_fence;
>> + dma_fence_set_error(&s_fence->finished, errno);
>> + dma_fence_signal(&s_fence->finished);
>> + }
>> +}
>> +EXPORT_SYMBOL(drm_sched_cancel_all_jobs);
>> +
>> /**
>> * drm_sched_resubmit_jobs - Deprecated, don't use in new code!
>> *
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index a0ff08123f07..298513f8c327 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -579,6 +579,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
>> void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
>> void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>> void drm_sched_start(struct drm_gpu_scheduler *sched, int errno);
>> +void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, int errno);
>> void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>> void drm_sched_increase_karma(struct drm_sched_job *bad);
>> void drm_sched_reset_karma(struct drm_sched_job *bad);
>