[PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running
Tvrtko Ursulin
tvrtko.ursulin at igalia.com
Wed May 7 12:50:33 UTC 2025
On 07/05/2025 13:33, Maíra Canal wrote:
> Hi Tvrtko,
>
> Thanks for the review!
>
> On 06/05/25 08:32, Tvrtko Ursulin wrote:
>>
>> On 03/05/2025 21:59, Maíra Canal wrote:
>>> When the DRM scheduler times out, it's possible that the GPU isn't hung;
>>> instead, a job may still be running, and there may be no valid reason to
>>> reset the hardware. This can occur in two situations:
>>>
>>> 1. The GPU exposes some mechanism that ensures the GPU is still
>>> making
>>> progress. By checking this mechanism, we can safely skip the
>>> reset,
>>> rearm the timeout, and allow the job to continue running until
>>> completion. This is the case for v3d and Etnaviv.
>>> 2. TDR has fired before the IRQ that signals the fence. Consequently,
>>> the job actually finishes, but it triggers a timeout before
>>> signaling
>>> the completion fence.
>>>
>>> These two scenarios are problematic because we remove the job from the
>>> `sched->pending_list` before calling `sched->ops->timedout_job()`. This
>>> means that when the job finally signals completion (e.g. in the IRQ
>>> handler), the scheduler won't call `sched->ops->free_job()`. As a
>>> result,
>>> the job and its resources won't be freed, leading to a memory leak.
>>>
>>> To resolve this issue, we create a new `drm_gpu_sched_stat` that
>>> allows a
>>> driver to skip the reset. This new status will indicate that the job
>>> should be reinserted into the pending list, and the driver will still
>>> signal its completion.
>>
>
> [...]
>
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index
>>> 1a7e377d4cbb4fc12ed93c548b236970217945e8..fe9043b6d43141bee831b5fc16b927202a507d51 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -389,11 +389,13 @@ struct drm_sched_job {
>>> * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
>>> * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
>>> * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
>>> + * @DRM_GPU_SCHED_STAT_RUNNING: GPU is still running, so skip the
>>> reset.
>>
>> s/GPU/job/ ?
>>
>>> */
>>> enum drm_gpu_sched_stat {
>>> DRM_GPU_SCHED_STAT_NONE,
>>> DRM_GPU_SCHED_STAT_NOMINAL,
>>> DRM_GPU_SCHED_STAT_ENODEV,
>>> + DRM_GPU_SCHED_STAT_RUNNING,
>>
>> I am wondering if we could make it more obvious what is the difference
>> between "nominal" and "running" and from whose point of view should
>> those statuses be considered.
> > > So far we have "nominal" which means scheduler/hardware is working
> fine
>> but the job may or may have not been cancelled. With "running" we kind
>> of split it into two sub-statuses and it would be nice for that to be
>> intuitively visible from the naming. But I struggle to suggest an
>> elegant name while preserving nominal as is.
>
> I was thinking: how about changing DRM_GPU_SCHED_STAT_NOMINAL to
> DRM_GPU_SCHED_STAT_RESET (the hardware is fine, but we reset it)?
>
> Then, when we skip the reset, we would have DRM_GPU_SCHED_STAT_NOMINAL
> (which means the hardware is fine and we didn't reset it).
>
> I'm open to other suggestions.
DRM_GPU_SCHED_STAT_RESET sounds like a good name and seems to paint a
consistent story between running - reset - enodev.
>> Thinking out loud here - perhaps that is pointing towards an
>> alternative that instead of a new status, a new helper to re-insert
>> the single job (like drm_sched_resubmit_job(sched, job)) would fit
>> better? Although it would be more churn.
>>
>
> Although your solution might be more elegant, I'm worried that such a
> function could be used improperly by new users (e.g. being called in
> contexts other than `timedout_job()`).
We could call it drm_sched_untimedout_job(). </humour>
> I'd prefer to have a new status as it'll be use solely for
> `timedout_job()` (making it harder for users to use it inappropriately).
> With the addition of Matthew's feedback (calling
> `drm_sched_run_free_queue()` after adding the job to the pending list),
> I think it makes even more sense to keep it inside the timeout function.
>
> I hope others can chime in and give their opinions about your idea.
Yeah - Philipp - Danilo - what do you prefer? Third enum with or a new
helper?
Regards,
Tvrtko
More information about the etnaviv
mailing list