[PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running

Tvrtko Ursulin tvrtko.ursulin at igalia.com
Wed May 7 12:50:33 UTC 2025


On 07/05/2025 13:33, Maíra Canal wrote:
> Hi Tvrtko,
> 
> Thanks for the review!
> 
> On 06/05/25 08:32, Tvrtko Ursulin wrote:
>>
>> On 03/05/2025 21:59, Maíra Canal wrote:
>>> When the DRM scheduler times out, it's possible that the GPU isn't hung;
>>> instead, a job may still be running, and there may be no valid reason to
>>> reset the hardware. This can occur in two situations:
>>>
>>>    1. The GPU exposes some mechanism that ensures the GPU is still
>>>       making progress. By checking this mechanism, we can safely skip
>>>       the reset, rearm the timeout, and allow the job to continue
>>>       running until completion. This is the case for v3d and Etnaviv.
>>>    2. TDR has fired before the IRQ that signals the fence. Consequently,
>>>       the job actually finishes, but it triggers a timeout before
>>>       signaling the completion fence.
>>>
>>> These two scenarios are problematic because we remove the job from the
>>> `sched->pending_list` before calling `sched->ops->timedout_job()`. This
>>> means that when the job finally signals completion (e.g. in the IRQ
>>> handler), the scheduler won't call `sched->ops->free_job()`. As a
>>> result, the job and its resources won't be freed, leading to a memory
>>> leak.
>>>
>>> To resolve this issue, we create a new `drm_gpu_sched_stat` that
>>> allows a driver to skip the reset. This new status will indicate that
>>> the job should be reinserted into the pending list, and the driver
>>> will still signal its completion.
>>
> 
> [...]
> 
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index 
>>> 1a7e377d4cbb4fc12ed93c548b236970217945e8..fe9043b6d43141bee831b5fc16b927202a507d51 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -389,11 +389,13 @@ struct drm_sched_job {
>>>    * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
>>>    * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
>>>    * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
>>> + * @DRM_GPU_SCHED_STAT_RUNNING: GPU is still running, so skip the reset.
>>
>> s/GPU/job/ ?
>>
>>>    */
>>>   enum drm_gpu_sched_stat {
>>>       DRM_GPU_SCHED_STAT_NONE,
>>>       DRM_GPU_SCHED_STAT_NOMINAL,
>>>       DRM_GPU_SCHED_STAT_ENODEV,
>>> +    DRM_GPU_SCHED_STAT_RUNNING,
>>
>> I am wondering if we could make it more obvious what is the difference
>> between "nominal" and "running" and from whose point of view those
>> statuses should be considered.
>>
>> So far we have "nominal", which means the scheduler/hardware is working
>> fine but the job may or may not have been cancelled. With "running" we
>> kind of split it into two sub-statuses, and it would be nice for that
>> to be intuitively visible from the naming. But I struggle to suggest an
>> elegant name while preserving "nominal" as is.
> 
> I was thinking: how about changing DRM_GPU_SCHED_STAT_NOMINAL to
> DRM_GPU_SCHED_STAT_RESET (the hardware is fine, but we reset it)?
> 
> Then, when we skip the reset, we would have DRM_GPU_SCHED_STAT_NOMINAL
> (which means the hardware is fine and we didn't reset it).
> 
> I'm open to other suggestions.

DRM_GPU_SCHED_STAT_RESET sounds like a good name and seems to paint a 
consistent story between running - reset - enodev.

>> Thinking out loud here - perhaps that is pointing towards an 
>> alternative that instead of a new status, a new helper to re-insert 
>> the single job (like drm_sched_resubmit_job(sched, job)) would fit 
>> better? Although it would be more churn.
>>
> 
> Although your solution might be more elegant, I'm worried that such a
> function could be used improperly by new users (e.g. being called in
> contexts other than `timedout_job()`).

We could call it drm_sched_untimedout_job(). </humour>

> I'd prefer to have a new status as it'll be used solely for
> `timedout_job()` (making it harder for users to use it inappropriately).
> With the addition of Matthew's feedback (calling
> `drm_sched_run_free_queue()` after adding the job to the pending list),
> I think it makes even more sense to keep it inside the timeout function.
> 
> I hope others can chime in and give their opinions about your idea.

Yeah - Philipp - Danilo - what do you prefer? A third enum value or a new 
helper?

Regards,

Tvrtko



More information about the Intel-xe mailing list