[PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running
Christian König
christian.koenig at amd.com
Mon Jul 14 11:46:30 UTC 2025
On 14.07.25 12:16, Philipp Stanner wrote:
> On Mon, 2025-07-14 at 11:23 +0200, Christian König wrote:
>> On 13.07.25 21:03, Maíra Canal wrote:
>>> Hi Christian,
>>>
>>> On 11/07/25 12:20, Christian König wrote:
>>>> On 11.07.25 15:37, Philipp Stanner wrote:
>>>>> On Fri, 2025-07-11 at 15:22 +0200, Christian König wrote:
>>>>>>
>>>>>>
>>>>>> On 08.07.25 15:25, Maíra Canal wrote:
>>>>>>> When the DRM scheduler times out, it's possible that the GPU isn't
>>>>>>> hung; instead, a job just took unusually long (longer than the
>>>>>>> timeout) but is still running, and there is, thus, no reason to
>>>>>>> reset the hardware. This can occur in two scenarios:
>>>>>>>
>>>>>>> 1. The job is taking longer than the timeout, but the driver
>>>>>>>    determined through a GPU-specific mechanism that the hardware
>>>>>>>    is still making progress. Hence, the driver would like the
>>>>>>>    scheduler to skip the timeout and treat the job as still
>>>>>>>    pending from then onward. This happens in v3d, Etnaviv, and Xe.
>>>>>>> 2. The timeout has fired before the free-job worker. Consequently,
>>>>>>>    the scheduler calls `sched->ops->timedout_job()` for a job
>>>>>>>    that isn't timed out.
>>>>>>>
>>>>>>> These two scenarios are problematic because the job was removed
>>>>>>> from the `sched->pending_list` before calling
>>>>>>> `sched->ops->timedout_job()`, which means that when the job
>>>>>>> finishes, it won't be freed by the scheduler through
>>>>>>> `sched->ops->free_job()` - leading to a memory leak.
>>>>>>
>>>>>> Yeah, that is unfortunately intentional.
>>>>>>
>>>>>>> To solve these problems, create a new `drm_gpu_sched_stat`, called
>>>>>>> DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip the
>>>>>>> reset. The new status will indicate that the job must be
>>>>>>> reinserted into `sched->pending_list`, and the hardware / driver
>>>>>>> will still complete that job.
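A minimal sketch of the driver-side usage being proposed here; the
progress check and reset helpers below are hypothetical stand-ins for a
GPU-specific mechanism like the ones v3d, Etnaviv, and Xe already have:

    static enum drm_gpu_sched_stat
    my_timedout_job(struct drm_sched_job *sched_job)
    {
            struct my_job *job = to_my_job(sched_job);

            /* Hypothetical GPU-specific progress check. */
            if (my_hw_still_making_progress(job))
                    /*
                     * Not a hang: ask the scheduler to reinsert the job
                     * into sched->pending_list and keep waiting for it,
                     * so free_job() still runs once it completes.
                     */
                    return DRM_GPU_SCHED_STAT_NO_HANG;

            /* A real hang: reset and report the usual status. */
            my_gpu_reset(job);
            return DRM_GPU_SCHED_STAT_NOMINAL;
    }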
>>>>>>
>>>>>> Well, long story short: we have already tried this, and the whole
>>>>>> approach doesn't work correctly in all cases. See the git history
>>>>>> around how we used to destroy the jobs.
>>>>>>
>>>>>> The basic problem is that you can always race between timing out
>>>>>> and signaling/destroying the job. This is the long-standing job
>>>>>> lifetime issue we have already discussed more than once.
>>>>>
>>>>> The scheduler destroys the job through free_job().
>>>>> I think we have agreed that, for now, the scheduler is the party
>>>>> responsible for the job's lifetime.
>>>>
>>>> That's what I strongly disagree on. The job is just a state bag
>>>> between the submission and the scheduling state of a submission.
>>>>
>>>> For the scheduler, control starts when the job is pushed into the
>>>> entity and ends when run_job is called.
>>>>
>>>> The real representation of the submission is the scheduler fence,
>>>> and that object has a perfectly defined lifetime, state, and error
>>>> handling.
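For reference, the object being pointed at here: the scheduler fence
pairs two dma_fences with well-defined signaling points, independent of
the job struct (a trimmed excerpt of the mainline structure in
include/drm/gpu_scheduler.h):

    struct drm_sched_fence {
            struct dma_fence scheduled; /* signaled when run_job() starts  */
            struct dma_fence finished;  /* signaled when the job completes */
            /* ... further members trimmed ... */
    };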
>>>>
>>>>>>
>>>>>> If you want to fix this, I think the correct approach is to drop
>>>>>> job tracking from the scheduler entirely.
>>>>>
>>>>> I don't see how this series introduces a problem?
>>>>>
>>>>> The fact is that drivers are abusing the API by just firing jobs
>>>>> back into the scheduler's job list. This series legalizes the
>>>>> abuse by providing scheduler functionality for it.
>>>>>
>>>>> IOW, the series improves the situation but does not add a *new*
>>>>> problem. Even less so as drivers aren't forced to use the new
>>>>> status code, but can continue having job completion race with
>>>>> timeout handlers.
>>>>
>>>> Maybe yes, but I'm really not sure about it.
>>>>
>>>> Take a look at the git history of job destruction: we already had
>>>> exactly that approach, removed it, and said that leaking memory is
>>>> at least better than a use-after-free issue.
>>>>
>>>
>>> If the job was removed from the pending list at the beginning of
>>> the timeout, and drm_sched_get_finished_job() fetches jobs from the
>>> pending list, how can we end up with a use-after-free issue?
>>
>> By adding it back into the list after the timeout handling has
>> completed.
>>
>> We had exactly that before we came up with the horrible design of
>> adding it back to the pending list in drm_sched_stop() and the
>> free_guilty flag.
>>
>> Could be that your approach now works better, but I'm really not sure
>> about that.
>
> This isn't Maíra's approach; this is an approach that is already
> established practice in DRM. Have you looked at the code? There isn't
> that much magic happening there.
>
> The job gets added back to pending_list. For the scheduler, this is
> like a completely new job, with the timeout handler running again and
> the other work items picking the job up when appropriate.
>
> Maíra ports several drivers which have been doing exactly that
> (illegally, outside our API), and it worked for them.
Oh, good point! I completely missed that.
> Now there's a centralized solution; one which is only active for the
> drivers which choose to use it. Other drivers aren't affected.
>
> IOW, my main argument still stands: this series doesn't add a new bug,
> but improves the overall situation in DRM.
Yeah, if it's just a cleanup then please go ahead with it. Certainly
better to have it centralized in the scheduler than every driver doing
its own thing.
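For readers skimming the archive, the scheduler-side handling under
discussion amounts to roughly the following. This is a simplified
sketch, not a quote of the patch; locking details and error paths are
elided:

    static void drm_sched_job_timedout(struct work_struct *work)
    {
            /*
             * sched/job are resolved from the timeout work item; the
             * job was already unlinked from pending_list at this point.
             */
            status = sched->ops->timedout_job(job);

            if (status == DRM_GPU_SCHED_STAT_NO_HANG) {
                    /*
                     * False timeout: put the job back on the pending
                     * list so free_job() still picks it up when it
                     * eventually finishes, and re-arm the timeout as
                     * if it were a fresh job.
                     */
                    spin_lock(&sched->job_list_lock);
                    list_add(&job->list, &sched->pending_list);
                    spin_unlock(&sched->job_list_lock);
                    drm_sched_start_timeout(sched);
            }
    }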
Regards,
Christian.
>
>
> P.
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Best Regards,
>>> - Maíra
>>>
>>
>