[PATCH 1/4] drm/scheduler: Add drm_sched_cancel_all_jobs helper

Thu Feb 6 13:42:40 UTC 2025

On 06/02/2025 13:35, Philipp Stanner wrote:
> On Wed, 2025-02-05 at 15:33 +0000, Tvrtko Ursulin wrote:
>> The helper copies code from the existing
>> amdgpu_job_stop_all_jobs_on_sched with the purpose of reducing the
>> amount of driver code which directly touches scheduler internals.
>>
>> If or when amdgpu manages to change the approach for handling the
>> permanently wedged state this helper can be removed.
> 
> Have you checked how many other drivers might need such a helper?
> 
> I have somewhat mixed feelings about this because, AFAICT, in the
> past helpers have been added for just 1 driver, such as
> drm_sched_wqueue_ready(), and then they have stayed for almost a
> decade.
> 
> AFAIU this is just a code move, and only really "decouples" amdgpu in
> the sense of having an official scheduler function that does what
> amdgpu used to do.
> 
> So my tendency here would be to continue "allowing" amdgpu to touch the
> scheduler internals until amdgpu fixes this "permanently wedged
> state". And if that's too difficult, couldn't the helper reside in an
> amdgpu/sched_helpers.c or similar?
> 
> I think that's better than adding 1 helper for just 1 driver and then
> supposedly removing it again in the future.

Half of the motivation was to nudge Christian into providing a more
concrete idea of how to fix amdgpu ;) and the other half is that I want
to get rid of the three copies of to_drm_sched_job and remove the hidden
"queue node must be first" dependency.
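
To illustrate that dependency: the loop in the quoted patch feeds the
result of spsc_queue_pop() straight into to_drm_sched_job() and relies on
an empty pop (NULL) coming back as a NULL job. Assuming to_drm_sched_job()
is a plain container_of() on the queue node, that only holds while the
node sits at offset zero. A minimal userspace sketch of the effect follows;
struct node, both job_* layouts and all helper names are hypothetical
stand-ins, not the real scheduler types.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's container_of(); integer arithmetic
 * is used so the NULL case below stays well defined in this demo. */
#define container_of(ptr, type, member) \
	((type *)((uintptr_t)(ptr) - offsetof(type, member)))

/* Hypothetical queue node, standing in for the real one. */
struct node {
	struct node *next;
};

/* Layout A: the queue node is the first member (offset 0). */
struct job_node_first {
	struct node queue_node;
	int id;
};

/* Layout B: a field was added in front of the node (offset != 0). */
struct job_node_later {
	int id;
	struct node queue_node;
};

/* An empty queue pop is modelled as NULL in all three helpers. */

/* Returns 1: with the node first, NULL converts to a NULL job. */
int first_member_null_stays_null(void)
{
	struct node *popped = NULL;

	return container_of(popped, struct job_node_first, queue_node) == NULL;
}

/* Returns 0: with the node moved, NULL converts to a bogus non-NULL job. */
int later_member_null_stays_null(void)
{
	struct node *popped = NULL;

	return container_of(popped, struct job_node_later, queue_node) == NULL;
}

/* One possible way to drop the layout assumption: check before converting. */
static struct job_node_later *to_job_checked(struct node *n)
{
	return n ? container_of(n, struct job_node_later, queue_node) : NULL;
}

/* Returns 1: the checked helper is independent of the member offset. */
int checked_null_stays_null(void)
{
	return to_job_checked(NULL) == NULL;
}
```

With layout B a "while ((job = to_job(pop())))" style loop would not
terminate on an empty queue and would dereference a bogus pointer, which
is why the ordering requirement is easy to break by accident.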

So let it marinate a bit and we will see if a nicer solution shows up.

Regards,

Tvrtko

>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>> Cc: Christian König <christian.koenig at amd.com>
>> Cc: Danilo Krummrich <dakr at kernel.org>
>> Cc: Matthew Brost <matthew.brost at intel.com>
>> Cc: Philipp Stanner <phasta at kernel.org>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 44 ++++++++++++++++++++++++++
>>   include/drm/gpu_scheduler.h            |  1 +
>>   2 files changed, 45 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index a48be16ab84f..0363655db22d 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -703,6 +703,50 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
>>   }
>>   EXPORT_SYMBOL(drm_sched_start);
>>   
>> +/**
>> + * drm_sched_cancel_all_jobs - Cancel all queued and scheduled jobs
>> + *
>> + * @sched: scheduler instance
>> + * @errno: error value to set on signaled fences
>> + *
>> + * Signal all queued and scheduled jobs and set them to error state.
>> + *
>> + * Scheduler must be stopped before calling this.
>> + */
>> +void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, int errno)
>> +{
>> +	struct drm_sched_entity *entity;
>> +	struct drm_sched_fence *s_fence;
>> +	struct drm_sched_job *job;
>> +	enum drm_sched_priority p;
>> +
>> +	drm_WARN_ON_ONCE(sched, !sched->pause_submit);
>> +
>> +	/* Signal all jobs not yet scheduled */
>> +	for (p = DRM_SCHED_PRIORITY_KERNEL; p < sched->num_rqs; p++) {
>> +		struct drm_sched_rq *rq = sched->sched_rq[p];
>> +
>> +		spin_lock(&rq->lock);
>> +		list_for_each_entry(entity, &rq->entities, list) {
>> +			while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>> +				s_fence = job->s_fence;
>> +				dma_fence_signal(&s_fence->scheduled);
>> +				dma_fence_set_error(&s_fence->finished, errno);
>> +				dma_fence_signal(&s_fence->finished);
>> +			}
>> +		}
>> +		spin_unlock(&rq->lock);
>> +	}
>> +
>> +	/* Signal all jobs already scheduled to HW */
>> +	list_for_each_entry(job, &sched->pending_list, list) {
>> +		s_fence = job->s_fence;
>> +		dma_fence_set_error(&s_fence->finished, errno);
>> +		dma_fence_signal(&s_fence->finished);
>> +	}
>> +}
>> +EXPORT_SYMBOL(drm_sched_cancel_all_jobs);
>> +
>>   /**
>>    * drm_sched_resubmit_jobs - Deprecated, don't use in new code!
>>    *
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index a0ff08123f07..298513f8c327 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -579,6 +579,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
>>   void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
>>   void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>>   void drm_sched_start(struct drm_gpu_scheduler *sched, int errno);
>> +void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, int errno);
>>   void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>>   void drm_sched_increase_karma(struct drm_sched_job *bad);
>>   void drm_sched_reset_karma(struct drm_sched_job *bad);
>
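
For readers following the quoted helper, its two passes can be modelled
in plain userspace C. This is only a sketch of the ordering visible in
the patch (queued jobs get "scheduled" signalled, then the error set on
"finished", then "finished" signalled; pending jobs get only the last
two steps). The toy_* types and functions are hypothetical, no locking is
modelled, and refusing an error on an already signalled fence is an
assumption of this model rather than a statement about dma_fence.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a fence: a signalled flag and an error code. */
struct toy_fence {
	bool signaled;
	int error;
};

/* Hypothetical stand-in for a job with its two fences. */
struct toy_job {
	struct toy_fence scheduled;
	struct toy_fence finished;
};

/* Model assumption: an error can only be attached before signalling. */
static void toy_fence_set_error(struct toy_fence *f, int error)
{
	if (!f->signaled)
		f->error = error;
}

static void toy_fence_signal(struct toy_fence *f)
{
	f->signaled = true;
}

/* Mirrors the two loops of the quoted helper, without any locking. */
void toy_cancel_all(struct toy_job *queued, size_t nr_queued,
		    struct toy_job *pending, size_t nr_pending, int error)
{
	size_t i;

	/* Jobs not yet scheduled: complete both fences. */
	for (i = 0; i < nr_queued; i++) {
		toy_fence_signal(&queued[i].scheduled);
		toy_fence_set_error(&queued[i].finished, error);
		toy_fence_signal(&queued[i].finished);
	}

	/* Jobs already handed to hardware: only "finished" is outstanding. */
	for (i = 0; i < nr_pending; i++) {
		toy_fence_set_error(&pending[i].finished, error);
		toy_fence_signal(&pending[i].finished);
	}
}
```

A caller in this model would pass an arbitrary negative errno such as
-ECANCELED; which value a real driver should use is outside the scope of
the sketch.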