[PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running

Maíra Canal mcanal at igalia.com
Mon May 12 14:04:50 UTC 2025


Hi Philipp,

On 12/05/25 08:13, Philipp Stanner wrote:
> On Tue, 2025-05-06 at 07:32 -0700, Matthew Brost wrote:
>> On Mon, May 05, 2025 at 07:41:09PM -0700, Matthew Brost wrote:
>>> On Sat, May 03, 2025 at 05:59:52PM -0300, Maíra Canal wrote:
>>>> When the DRM scheduler times out, it's possible that the GPU isn't
>>>> hung; instead, a job may still be running, and there may be no valid
>>>> reason to reset the hardware. This can occur in two situations:
>>>>
>>>>    1. The GPU exposes some mechanism that ensures the GPU is still
>>>>       making progress. By checking this mechanism, we can safely skip
>>>>       the reset, rearm the timeout, and allow the job to continue
>>>>       running until completion. This is the case for v3d and Etnaviv.
>>>>    2. TDR has fired before the IRQ that signals the fence.
>>>>       Consequently, the job actually finishes, but it triggers a
>>>>       timeout before signaling the completion fence.
>>>>
>>>
>>> We have both of these cases in Xe too. We implement the requeuing in
>>> Xe via a driver-side function, xe_sched_add_pending_job, but this
>>> looks better and we will make use of this.
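
For reference, this is roughly what I'd expect a driver-side
timedout_job() callback to look like with the new status (sketch only;
the my_* names and the progress-check helper are made up, the real check
is whatever mechanism the hardware exposes):

static enum drm_gpu_sched_stat
my_sched_timedout_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = to_my_job(sched_job);

	/*
	 * Hypothetical progress check: if the hardware is still
	 * advancing, skip the reset. The scheduler reinserts the job
	 * into the pending list and rearms the timeout, and the job's
	 * fence is signaled by the usual IRQ path.
	 */
	if (my_hw_job_is_making_progress(job))
		return DRM_GPU_SCHED_STAT_RUNNING;

	/* Otherwise the GPU really hung; do the usual reset dance. */
	my_gpu_reset(job->dev);

	return DRM_GPU_SCHED_STAT_NOMINAL;
}
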
>>>
>>>> These two scenarios are problematic because we remove the job from
>>>> the `sched->pending_list` before calling `sched->ops->timedout_job()`.
>>>> This means that when the job finally signals completion (e.g. in the
>>>> IRQ handler), the scheduler won't call `sched->ops->free_job()`. As a
>>>> result, the job and its resources won't be freed, leading to a memory
>>>> leak.
>>>>
>>>> To resolve this issue, we create a new `drm_gpu_sched_stat` that
>>>> allows a driver to skip the reset. This new status will indicate that
>>>> the job should be reinserted into the pending list, and the driver
>>>> will still signal its completion.
>>>>
>>>> Signed-off-by: Maíra Canal <mcanal at igalia.com>
>>>
>>> Reviewed-by: Matthew Brost <matthew.brost at intel.com>
>>>
>>
>> Wait - nevermind I think one issue is below.
>>
>>>> ---
>>>>   drivers/gpu/drm/scheduler/sched_main.c | 14 ++++++++++++++
>>>>   include/drm/gpu_scheduler.h            |  2 ++
>>>>   2 files changed, 16 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 829579c41c6b5d8b2abce5ad373c7017469b7680..68ca827d77e32187a034309f881135dbc639a9b4 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -568,6 +568,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>  			job->sched->ops->free_job(job);
>>>>  			sched->free_guilty = false;
>>>>  		}
>>>> +
>>>> +		/*
>>>> +		 * If the driver indicated that the GPU is still running and wants
>>>> +		 * to skip the reset, reinsert the job back into the pending list
>>>> +		 * and rearm the timeout.
>>>> +		 */
>>>> +		if (status == DRM_GPU_SCHED_STAT_RUNNING) {
>>>> +			spin_lock(&sched->job_list_lock);
>>>> +			list_add(&job->list, &sched->pending_list);
>>>> +			spin_unlock(&sched->job_list_lock);
>>>> +		}
>>
>> I think you need to requeue free_job wq here. It is possible the
>> free_job wq ran, didn't find a job, goes to sleep, then we add a
>> signaled job here which will never get freed.
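
Good catch. If we stay with the requeue approach, I think it would need
something along the lines of the sketch below right after the
reinsertion (untested; it assumes the drm_sched_run_free_queue() helper
in sched_main.c can be reused at this point, which only kicks the
free_job worker if the first pending job has already signaled):

		if (status == DRM_GPU_SCHED_STAT_RUNNING) {
			spin_lock(&sched->job_list_lock);
			list_add(&job->list, &sched->pending_list);
			spin_unlock(&sched->job_list_lock);

			/*
			 * The job may have signaled while it was off the
			 * pending list; make sure the free_job worker runs
			 * again so the job isn't left unfreed.
			 */
			drm_sched_run_free_queue(sched);
		}
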
> 
> I wonder if that could be solved by holding job_list_lock a bit longer.
> free_job_work will try to check the list for the next signaled job, but
> will wait for the lock.
> 
> If that works, we could completely rely on the standard mechanism
> without requeuing, which would be neat.

I believe it works. However, the tradeoff would be holding the lock for
the entire GPU reset (in the cases where the GPU actually hung), which
seems like a long time to hold a spinlock.

Do you think it's reasonable to do so?
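
Just so we're talking about the same thing, this is roughly how I read
the suggestion for drm_sched_job_timedout() (sketch only, with the
guilty/free_job handling omitted):

	spin_lock(&sched->job_list_lock);

	job = list_first_entry_or_null(&sched->pending_list,
				       struct drm_sched_job, list);
	if (job) {
		list_del_init(&job->list);

		/*
		 * Keep job_list_lock held across the driver callback so
		 * the free_job worker blocks on it and can't scan an
		 * empty pending list before the job is reinserted.
		 */
		status = job->sched->ops->timedout_job(job);

		if (status == DRM_GPU_SCHED_STAT_RUNNING)
			list_add(&job->list, &sched->pending_list);
	}

	spin_unlock(&sched->job_list_lock);

And it's the timedout_job() call under the lock, which may perform a
full reset, that worries me above.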

Best Regards,
- Maíra

> 
> P.
> 
>>
>> Matt
>>



