[PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running
Christian König
christian.koenig at amd.com
Mon Jul 14 09:10:16 UTC 2025
On 11.07.25 19:23, Matthew Brost wrote:
> On Fri, Jul 11, 2025 at 05:20:49PM +0200, Christian König wrote:
>> On 11.07.25 15:37, Philipp Stanner wrote:
>>> On Fri, 2025-07-11 at 15:22 +0200, Christian König wrote:
>>>>
>>>>
>>>> On 08.07.25 15:25, Maíra Canal wrote:
>>>>> When the DRM scheduler times out, it's possible that the GPU isn't hung;
>>>>> instead, a job just took unusually long (longer than the timeout) but is
>>>>> still running, and there is, thus, no reason to reset the hardware. This
>>>>> can occur in two scenarios:
>>>>>
>>>>> 1. The job is taking longer than the timeout, but the driver determined
>>>>> through a GPU-specific mechanism that the hardware is still making
>>>>> progress. Hence, the driver would like the scheduler to skip the
>>>>> timeout and treat the job as still pending from then onward. This
>>>>> happens in v3d, Etnaviv, and Xe.
>>>>> 2. The timeout fired before the free-job worker ran. Consequently, the
>>>>> scheduler calls `sched->ops->timedout_job()` for a job that isn't
>>>>> timed out.
>>>>>
>>>>> These two scenarios are problematic because the job was removed from the
>>>>> `sched->pending_list` before calling `sched->ops->timedout_job()`, which
>>>>> means that when the job finishes, it won't be freed by the scheduler
>>>>> through `sched->ops->free_job()` - leading to a memory leak.
>>>>
>>>> Yeah, that is unfortunately intentional.
>>>>
>>>>> To solve these problems, create a new `drm_gpu_sched_stat`, called
>>>>> DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip the reset. The
>>>>> new status will indicate that the job must be reinserted into
>>>>> `sched->pending_list`, and the hardware / driver will still complete that
>>>>> job.
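For reference, a driver's timedout_job callback would use the new status roughly as sketched below. This is an illustration only; the xyz_* names (job type, progress check, reset helper) are made-up placeholders for driver-specific pieces, not real API.

```c
#include <drm/gpu_scheduler.h>

static enum drm_gpu_sched_stat xyz_timedout_job(struct drm_sched_job *sched_job)
{
	struct xyz_job *job = to_xyz_job(sched_job);	/* hypothetical driver job type */

	/*
	 * The GPU is still making progress, so this is a false timeout:
	 * skip the reset and let the scheduler keep the job pending.
	 */
	if (xyz_ring_is_making_progress(job->ring))
		return DRM_GPU_SCHED_STAT_NO_HANG;

	/* Real hang: stop the scheduler, reset the GPU, start it again. */
	drm_sched_stop(sched_job->sched, sched_job);
	xyz_gpu_reset(job->ring);
	drm_sched_start(sched_job->sched, 0);

	return DRM_GPU_SCHED_STAT_RESET;
}
```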
>>>>
>>>> Well, long story short, we have already tried this and the whole approach doesn't work correctly in all cases. See the git history around how we used to destroy the jobs.
>>>>
>>>> The basic problem is that you can always race between timing out and signaling/destroying the job. This is the long-lasting job lifetime issue we have already discussed more than once.
>>>
>>> The scheduler destroys the job, through free_job().
>>> I think we have agreed that for now the scheduler is the party
>>> responsible for the job lifetime.
>>
>> That's what I strongly disagree on. The job is just a state bag between the submission and scheduling state of a submission.
>>
>
> See below. free_job is called after the fence signals, and drivers have
> now built assumptions around this, which would be hard to untangle.
> Sure, the scheduler could have been designed to free the job immediately
> after calling run_job(), but it wasn’t.
That's exactly the point! The scheduler *was* designed around the idea of immediately freeing the job when it starts to run.
The original idea was that the scheduler just works with a void* as the job representation and converts that into the HW fence by calling run_job.
This avoids tons of problems, ranging from the fact that you can't allocate memory for the HW fence while run_job is called, all the way to not having a properly defined job lifetime.
I screamed out quite loudly when people started to change this, because it was absolutely foreseeable that it would blow up, but I just wasn't the maintainer of that stuff at that point.
> Honestly, I kind of agree with
> freeing the job immediately after running it—but that’s not the world we
> live in, and we don’t have a time machine to fix it.
Yeah, that's unfortunately true.
> I’d rather document
> the current rules around job lifetime and look for ways to improve it.
That's exactly what I'm trying to do here.
See, we have spent the last 8 years iterating over the same problem over and over again, and the current solution just barely works. Take a look at the git history to see how often we have tried to get this to work properly.
And yes, it's well known that this leaks the job when the driver doesn't do a reset, and, as Maíra absolutely correctly pointed out, re-inserting the job during the reset is just a completely broken design.
It's just that everybody working on that has given up at some point.
> Changing the job lifetime rules would require broad community buy-in,
> which I doubt everyone would agree to.
Well, changing the parameter of the timedout callback from the job to the fence should be relatively easy to do. That should immediately fix the issue Maíra is working on.
Then potentially changing the job lifetime in the scheduler can be a different topic.
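Purely as a sketch of that direction (nothing like this exists in the tree today, and the xyz_* name is made up), such a callback could take the scheduler fence instead of the job:

```c
#include <linux/dma-fence.h>
#include <drm/gpu_scheduler.h>

/*
 * Hypothetical replacement for timedout_job(): the driver gets the
 * scheduler fence, which has a well-defined lifetime and state, instead
 * of the job.
 */
static enum drm_gpu_sched_stat xyz_timedout_fence(struct drm_sched_fence *s_fence)
{
	/*
	 * The fence can be inspected at any time without racing with job
	 * destruction; if it has already signaled, there is nothing to reset.
	 */
	if (dma_fence_is_signaled(&s_fence->finished))
		return DRM_GPU_SCHED_STAT_NO_HANG;

	/* A driver-specific reset would go here. */
	return DRM_GPU_SCHED_STAT_RESET;
}
```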
>> For the scheduler, control starts when the job is pushed into the entity and ends when run_job is called.
>>
>> The real representation of the submission is the scheduler fence and that object has a perfectly defined lifetime, state and error handling.
>>
>>>>
>>>> If you want to fix this, I think the correct approach is to stop tracking jobs in the scheduler entirely.
>>>
>>> I don't see how this series introduces a problem?
>>>
>>> The fact is that drivers are abusing the API by just firing jobs back
>>> into the scheduler's job list. This series legalizes the abuse by
>>> providing scheduler functionality for that.
>>>
>>> IOW, the series improves the situation but does not add a *new*
>>> problem. Even less so as drivers aren't forced to use the new status
>>> code, but can continue having job completion race with timeout
>>> handlers.
>>
>> Maybe yes, but I'm really not sure about it.
>>
>> Take a look at the git history of job destruction: we already had exactly that approach, removed it, and said that leaking memory is at least better than a use-after-free issue.
>>
>>>>
>>>> Instead we should track the HW fences (or maybe the scheduler fences which point to the HW fence) the scheduler waits for.
>>>>
>>>> This HW fence is then given as a parameter to the driver when we run into a timeout.
>>>>
>>>> This has the clear advantage that dma_fence objects have a well-defined lifetime and the necessary state transitions. E.g. you can check at any time whether the fence is signaled or not.
>>>
>>> I'd say that centralizing job handling in the scheduler is a
>>> prerequisite of what you suggest. You can't clean up anything while
>>> certain drivers (sometimes without even locking!) just willy-nilly
>>> re-add jobs to the pending_list.
>>
>> The point is we should get completely rid of the pending list.
>>
>> Then the whole question of removing or re-adding anything to the pending list doesn't appear in the first place.
>>
>> The issue with that is that we need to change the timeout callback to take the fence and not the job. And that is a rather big change affecting all drivers using the scheduler.
>>
>
> I agree with the idea that the pending list should generally be a list
> of pending fences, but once you start considering the implications, I'm
> not really sure it works—or at least, not easily.
>
> Most drivers these days decouple the fence and the job (i.e., the fence
> isn't embedded in the job), so there's no inherent way to go from a
> fence → job. In Xe, the job’s fence can be one of a variety of different
> fences, depending on the submission type. To handle fence timeouts, we
> don’t necessarily need the job itself, but we do need the driver-side
> software queue. In Xe, this would be the scheduler, given the 1:1
> relationship—so we could likely make it work. For drivers without a 1:1
> relationship, it's unclear how this would be resolved, though perhaps
> the hardware queue would be sufficient in that case.
>
> It also follows that the job would be freed immediately after run_job()
> if we don’t maintain a pending list of jobs, right? That opens a huge
> can of worms, especially considering how Xe depends on the job being
> freed only after the fence signals. Our job reference counts tie into
> power management, hold a reference to the software queue, which in turn
> reference-counts the VM, etc. I suspect other drivers do similar things.
> It seems far simpler to just keep the job around until its fence
> signals.
>
> TL;DR: This seems like quite a lot of trouble.
For a start, my idea is to have a pointer to the job in the scheduler fence and then track the pending fences instead of the pending jobs.
As a next step, give the timedout callback the scheduler fence (which at that point has both the HW fence pointer and the job) instead of the job, and stop messing with the pending list altogether during a timeout.
After that, we could move the job pointer from the scheduler fence into the driver-specific HW fence.
Then we add a DMA-buf utility object which starts a work item when a dma_fence signals (I wanted to do that anyway because it is a rather common pattern).
This utility object would then replace the free_job callback from the scheduler, so that drivers can basically decide themselves when they want their job object freed.
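As a rough sketch of that last piece (the names are made up, nothing like this exists yet): a small object that embeds a dma_fence_cb and a work_struct, arms itself on a fence, and queues the work once the fence signals. A driver could then free its job object from that work item instead of relying on the scheduler's free_job callback.

```c
#include <linux/dma-fence.h>
#include <linux/workqueue.h>

/* Hypothetical utility: run a work item once a dma_fence signals. */
struct xyz_fence_work {
	struct dma_fence_cb cb;
	struct work_struct work;
};

static void xyz_fence_work_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
	struct xyz_fence_work *fw = container_of(cb, struct xyz_fence_work, cb);

	queue_work(system_wq, &fw->work);
}

static int xyz_fence_work_arm(struct xyz_fence_work *fw, struct dma_fence *fence,
			      work_func_t func)
{
	int ret;

	INIT_WORK(&fw->work, func);

	ret = dma_fence_add_callback(fence, &fw->cb, xyz_fence_work_cb);
	if (ret == -ENOENT) {
		/* The fence has already signaled, run the work right away. */
		queue_work(system_wq, &fw->work);
		ret = 0;
	}
	return ret;
}
```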
Regards,
Christian.
>
> Matt
>
>> Regards,
>> Christian.
>>
>>>
>>>
>>> P.
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Signed-off-by: Maíra Canal <mcanal at igalia.com>
>>>>> ---
>>>>> drivers/gpu/drm/scheduler/sched_main.c | 46 ++++++++++++++++++++++++++++++++--
>>>>> include/drm/gpu_scheduler.h | 3 +++
>>>>> 2 files changed, 47 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 0f32e2cb43d6af294408968a970990f9f5c47bee..657846d56dacd4f26fffc954fc3d025c1e6bfc9f 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -374,11 +374,16 @@ static void drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
>>>>> {
>>>>> struct drm_sched_job *job;
>>>>>
>>>>> - spin_lock(&sched->job_list_lock);
>>>>> job = list_first_entry_or_null(&sched->pending_list,
>>>>> struct drm_sched_job, list);
>>>>> if (job && dma_fence_is_signaled(&job->s_fence->finished))
>>>>> __drm_sched_run_free_queue(sched);
>>>>> +}
>>>>> +
>>>>> +static void drm_sched_run_free_queue_unlocked(struct drm_gpu_scheduler *sched)
>>>>> +{
>>>>> + spin_lock(&sched->job_list_lock);
>>>>> + drm_sched_run_free_queue(sched);
>>>>> spin_unlock(&sched->job_list_lock);
>>>>> }
>>>>>
>>>>> @@ -531,6 +536,32 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
>>>>> spin_unlock(&sched->job_list_lock);
>>>>> }
>>>>>
>>>>> +/**
>>>>> + * drm_sched_job_reinsert_on_false_timeout - reinsert the job on a false timeout
>>>>> + * @sched: scheduler instance
>>>>> + * @job: job to be reinserted on the pending list
>>>>> + *
>>>>> + * In the case of a "false timeout" - when a timeout occurs but the GPU isn't
>>>>> + * hung and is making progress, the scheduler must reinsert the job back into
>>>>> + * @sched->pending_list. Otherwise, the job and its resources won't be freed
>>>>> + * through the &struct drm_sched_backend_ops.free_job callback.
>>>>> + *
>>>>> + * This function must be used in "false timeout" cases only.
>>>>> + */
>>>>> +static void drm_sched_job_reinsert_on_false_timeout(struct drm_gpu_scheduler *sched,
>>>>> + struct drm_sched_job *job)
>>>>> +{
>>>>> + spin_lock(&sched->job_list_lock);
>>>>> + list_add(&job->list, &sched->pending_list);
>>>>> +
>>>>> + /* After reinserting the job, the scheduler enqueues the free-job work
>>>>> + * again if ready. Otherwise, a signaled job could be added to the
>>>>> + * pending list, but never freed.
>>>>> + */
>>>>> + drm_sched_run_free_queue(sched);
>>>>> + spin_unlock(&sched->job_list_lock);
>>>>> +}
>>>>> +
>>>>> static void drm_sched_job_timedout(struct work_struct *work)
>>>>> {
>>>>> struct drm_gpu_scheduler *sched;
>>>>> @@ -564,6 +595,9 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>> job->sched->ops->free_job(job);
>>>>> sched->free_guilty = false;
>>>>> }
>>>>> +
>>>>> + if (status == DRM_GPU_SCHED_STAT_NO_HANG)
>>>>> + drm_sched_job_reinsert_on_false_timeout(sched, job);
>>>>> } else {
>>>>> spin_unlock(&sched->job_list_lock);
>>>>> }
>>>>> @@ -586,6 +620,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>> * This function is typically used for reset recovery (see the docu of
>>>>> * drm_sched_backend_ops.timedout_job() for details). Do not call it for
>>>>> * scheduler teardown, i.e., before calling drm_sched_fini().
>>>>> + *
>>>>> + * As it's only used for reset recovery, drivers must not call this function
>>>>> + * in their &struct drm_sched_backend_ops.timedout_job callback when they
>>>>> + * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
>>>>> */
>>>>> void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>> {
>>>>> @@ -671,6 +709,10 @@ EXPORT_SYMBOL(drm_sched_stop);
>>>>> * drm_sched_backend_ops.timedout_job() for details). Do not call it for
>>>>> * scheduler startup. The scheduler itself is fully operational after
>>>>> * drm_sched_init() succeeded.
>>>>> + *
>>>>> + * As it's only used for reset recovery, drivers must not call this function
>>>>> + * in their &struct drm_sched_backend_ops.timedout_job callback when they
>>>>> + * skip a reset using &enum drm_gpu_sched_stat.DRM_GPU_SCHED_STAT_NO_HANG.
>>>>> */
>>>>> void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
>>>>> {
>>>>> @@ -1192,7 +1234,7 @@ static void drm_sched_free_job_work(struct work_struct *w)
>>>>> if (job)
>>>>> sched->ops->free_job(job);
>>>>>
>>>>> - drm_sched_run_free_queue(sched);
>>>>> + drm_sched_run_free_queue_unlocked(sched);
>>>>> drm_sched_run_job_queue(sched);
>>>>> }
>>>>>
>>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>>> index 83e5c00d8dd9a83ab20547a93d6fc572de97616e..257d21d8d1d2c4f035d6d4882e159de59b263c76 100644
>>>>> --- a/include/drm/gpu_scheduler.h
>>>>> +++ b/include/drm/gpu_scheduler.h
>>>>> @@ -393,11 +393,14 @@ struct drm_sched_job {
>>>>> * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
>>>>> * @DRM_GPU_SCHED_STAT_RESET: The GPU hung and successfully reset.
>>>>> * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
>>>>> + * @DRM_GPU_SCHED_STAT_NO_HANG: Contrary to scheduler's assumption, the GPU
>>>>> + * did not hang and is still running.
>>>>> */
>>>>> enum drm_gpu_sched_stat {
>>>>> DRM_GPU_SCHED_STAT_NONE,
>>>>> DRM_GPU_SCHED_STAT_RESET,
>>>>> DRM_GPU_SCHED_STAT_ENODEV,
>>>>> + DRM_GPU_SCHED_STAT_NO_HANG,
>>>>> };
>>>>>
>>>>> /**
>>>>>
>>>>
>>>
>>