[PATCH v4 2/8] drm/sched: Allow drivers to skip the reset and keep on running
Philipp Stanner
phasta at mailbox.org
Tue Jul 8 07:02:19 UTC 2025
On Mon, 2025-07-07 at 11:46 -0300, Maíra Canal wrote:
> When the DRM scheduler times out, it's possible that the GPU isn't hung;
> instead, a job may still be running, and there may be no valid reason to
> reset the hardware. This can occur in two situations:
>
>   1. The GPU exposes some mechanism that ensures the GPU is still making
>      progress. By checking this mechanism, the driver can safely skip
I think this should be rephrased, because it reads as if there is a
mechanism with which the GPU can be forced to still make progress even
with a while (1) job or something.
I think what we want probably is:
"When the DRM scheduler times out, it's possible that the GPU isn't
hung; instead, a job just took unusually long (longer than the timeout)
but is still running, and there is, thus, no reason to reset the
hardware. A false-positive timeout can occur in two scenarios:
1. The job took too long, but the driver determined through a GPU-
specific mechanism that the hardware is still making progress. Hence,
the driver would like the scheduler to skip the timeout and treat the
job as still pending from then onward.
>      the reset, re-arm the timeout, and allow the job to continue running
>      until completion. This is the case for v3d, Etnaviv, and Xe.
>   2. Timeout has fired before the free-job worker. Consequently, the
>      scheduler calls `sched->ops->timedout_job()` for a job that isn't
>      timed out.
"2. The job actually did complete from the driver's point of view, but
there was a race with the scheduler's timeout, which determined this
job timed out slightly before the free-job worker could remove it from
the pending_list."
Feel free to adjust the wording to your liking.
>
> These two scenarios are problematic because the job was removed from the
> `sched->pending_list` before calling `sched->ops->timedout_job()`, which
> means that when the job finishes, it won't be freed by the scheduler
> through `sched->ops->free_job()` - leading to a memory leak.
>
> To solve those problems, create a new `drm_gpu_sched_stat`, called
> DRM_GPU_SCHED_STAT_NO_HANG, that allows a driver to skip the reset.
nit:
s/that/which
? Reads a bit clearer IMO
> The new status will indicate that the job must be reinserted into the
> pending list, and the hardware / driver will still complete that job.
I think it would be cool to write it as pending_list consistently
throughout the patch.
>
> Signed-off-by: Maíra Canal <mcanal at igalia.com>
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 43 ++++++++++++++++++++++++++++++++--
> include/drm/gpu_scheduler.h            |  3 +++
> 2 files changed, 44 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 0f32e2cb43d6af294408968a970990f9f5c47bee..d3f48526883cc7ceea0cd1d0c62fb119f7092704 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -374,11 +374,16 @@ static void drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
>  {
>  	struct drm_sched_job *job;
>  
> -	spin_lock(&sched->job_list_lock);
>  	job = list_first_entry_or_null(&sched->pending_list,
>  				       struct drm_sched_job, list);
>  	if (job && dma_fence_is_signaled(&job->s_fence->finished))
>  		__drm_sched_run_free_queue(sched);
> +}
> +
> +static void drm_sched_run_free_queue_unlocked(struct drm_gpu_scheduler *sched)
> +{
> +	spin_lock(&sched->job_list_lock);
> +	drm_sched_run_free_queue(sched);
>  	spin_unlock(&sched->job_list_lock);
>  }
>
> @@ -531,6 +536,31 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
>  	spin_unlock(&sched->job_list_lock);
>  }
>
> +/**
> + * drm_sched_job_reinsert_on_false_timeout - reinsert the job on a false timeout
> + * @sched: scheduler instance
> + * @job: job to be reinserted on the pending list
> + *
> + * In the case of a "false timeout" - when a timeout occurs but the GPU isn't
> + * hung and the job is making progress, the scheduler must reinsert

nit:
s/job/GPU

> + * the job back into the pending list. Otherwise, the job and its resources
> + * won't be freed through the &drm_sched_backend_ops.free_job callback.

Ditto (s/pending list/pending_list).
> + *
> + * Note that after reinserting the job, the scheduler enqueues the free-job
> + * work again if ready. Otherwise, a signaled job could be added to the
> + * pending list, but never freed.
> + *
> + * This function must be used in "false timeout" cases only.
I think the "Note" should be removed, because it reads as if the API
user has to mind about that in any way.
If you want to highlight that, maybe move it down in the function body
as a code comment. I think it's good to know, but not relevant for the
API user, or is it?
> + */
> +static void drm_sched_job_reinsert_on_false_timeout(struct drm_gpu_scheduler *sched,
> +						    struct drm_sched_job *job)
> +{
> +	spin_lock(&sched->job_list_lock);
> +	list_add(&job->list, &sched->pending_list);
> +	drm_sched_run_free_queue(sched);
> +	spin_unlock(&sched->job_list_lock);
> +}
> +
>  static void drm_sched_job_timedout(struct work_struct *work)
>  {
>  	struct drm_gpu_scheduler *sched;
> @@ -564,6 +594,9 @@ static void drm_sched_job_timedout(struct work_struct *work)
>  			job->sched->ops->free_job(job);
>  			sched->free_guilty = false;
>  		}
> +
> +		if (status == DRM_GPU_SCHED_STAT_NO_HANG)
> +			drm_sched_job_reinsert_on_false_timeout(sched, job);
>  	} else {
>  		spin_unlock(&sched->job_list_lock);
>  	}
> @@ -586,6 +619,9 @@ static void drm_sched_job_timedout(struct work_struct *work)
>   * This function is typically used for reset recovery (see the docu of
>   * drm_sched_backend_ops.timedout_job() for details). Do not call it for
>   * scheduler teardown, i.e., before calling drm_sched_fini().
> + *
> + * As it's used for reset recovery, drm_sched_stop() shouldn't be called
> + * if the driver skipped the reset (DRM_GPU_SCHED_STAT_NO_HANG).
I know I suggested these comments; but reading them again I think
they're confusing.
It reads as if drm_sched_stop() should not be called if the driver,
sometime in the past, skipped a reset.
It would be more waterproof like so:
"As it is only used for reset recovery, drivers must not call this
function in their &struct drm_sched_backend_ops.timedout_job callback
if they are skipping the reset through status &enum
drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG."
Note that we should also prohibit, not just discourage, using the function
in that case, i.e., s/shouldn't/must not.
Better safe than sorry..
>   */
>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>  {
> @@ -671,6 +707,9 @@ EXPORT_SYMBOL(drm_sched_stop);
>   * drm_sched_backend_ops.timedout_job() for details). Do not call it for
>   * scheduler startup. The scheduler itself is fully operational after
>   * drm_sched_init() succeeded.
> + *
> + * As it's used for reset recovery, drm_sched_start() shouldn't be called
> + * if the driver skipped the reset (DRM_GPU_SCHED_STAT_NO_HANG).
Same.
Otherwise, great series. Looking forward to having this merged.
P.
>   */
>  void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
>  {
> @@ -1192,7 +1231,7 @@ static void drm_sched_free_job_work(struct work_struct *w)
>  	if (job)
>  		sched->ops->free_job(job);
>  
> -	drm_sched_run_free_queue(sched);
> +	drm_sched_run_free_queue_unlocked(sched);
>  	drm_sched_run_job_queue(sched);
>  }
>
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 83e5c00d8dd9a83ab20547a93d6fc572de97616e..257d21d8d1d2c4f035d6d4882e159de59b263c76 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -393,11 +393,14 @@ struct drm_sched_job {
>   * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
>   * @DRM_GPU_SCHED_STAT_RESET: The GPU hung and successfully reset.
>   * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
> + * @DRM_GPU_SCHED_STAT_NO_HANG: Contrary to scheduler's assumption, the GPU
> + * did not hang and is still running.
>   */
>  enum drm_gpu_sched_stat {
>  	DRM_GPU_SCHED_STAT_NONE,
>  	DRM_GPU_SCHED_STAT_RESET,
>  	DRM_GPU_SCHED_STAT_ENODEV,
> +	DRM_GPU_SCHED_STAT_NO_HANG,
>  };
>
> /**
>