[PATCH v4 2/8] drm/sched: Allow drivers to skip the reset and keep on running
Maíra Canal
mcanal at igalia.com
Tue Jul 8 12:38:24 UTC 2025
Hi Philipp,
On 08/07/25 04:02, Philipp Stanner wrote:
> On Mon, 2025-07-07 at 11:46 -0300, Maíra Canal wrote:
>> When the DRM scheduler times out, it's possible that the GPU isn't hung;
>> instead, a job may still be running, and there may be no valid reason to
>> reset the hardware. This can occur in two situations:
>>
>> 1. The GPU exposes some mechanism that ensures the GPU is still making
>> progress. By checking this mechanism, the driver can safely skip
>
> I think this should be rephrased, because it reads as if there is a
> mechanism with which the GPU can be forced to still make progress even
> with a while (1) job or something.
>
> I think what we want probably is:
>
> "When the DRM scheduler times out, it's possible that the GPU isn't
> hung; instead, a job just took unusually long (longer than the timeout)
> but is still running, and there is, thus, no reason to reset the
> hardware. A false-positive timeout can occur in two scenarios:
>
> 1. The job took too long, but the driver determined through a GPU-
> specific mechanism that the hardware is still making progress. Hence,
> the driver would like the scheduler to skip the timeout and treat the
> job as still pending from then onward.
>
Applied it locally.
>> the reset, re-arm the timeout, and allow the job to continue running until
>> completion. This is the case for v3d, Etnaviv, and Xe.
>> 2. Timeout has fired before the free-job worker. Consequently, the
>> scheduler calls `sched->ops->timedout_job()` for a job that isn't
>> timed out.
>
>
> "2. The job actually did complete from the driver's point of view, but
> there was a race with the scheduler's timeout, which determined this
> job timed out slightly before the free-job worker could remove it from
> the pending_list."
>
Actually, for this second point, I prefer my wording. It's more to the
point and easier to understand when you read the code, so I'd prefer to
keep the second point as it is.
All other comments have been applied. Thanks for your feedback!
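For illustration only, the two false-positive scenarios discussed above can
be modeled in a small user-space C sketch. Every name in it
(`sketch_timedout_job()`, `struct sketch_job`, the `SKETCH_STAT_*` values) is
hypothetical and is not the DRM scheduler API. The progress counter stands in
for whatever GPU-specific mechanism a driver might use, and re-arming is
reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status values returned by the timeout handler. */
enum sketch_timeout_stat {
	SKETCH_STAT_RESET,   /* job was hung, hardware was reset */
	SKETCH_STAT_NO_HANG, /* false-positive timeout, reset skipped */
};

/* Hypothetical job state, standing in for driver and scheduler data. */
struct sketch_job {
	bool fence_signaled;         /* job already completed */
	unsigned long progress;      /* progress counter read from the "GPU" */
	unsigned long last_progress; /* counter value at the previous timeout */
	bool timeout_rearmed;        /* set when the timeout is re-armed */
	bool hw_reset;               /* set when the hardware is reset */
};

/* Model of a timeout handler that can skip the reset. */
enum sketch_timeout_stat sketch_timedout_job(struct sketch_job *job)
{
	assert(job != NULL);

	/*
	 * Scenario 2: the timeout fired before the free-job worker ran,
	 * so the job is not actually timed out. Nothing to reset.
	 */
	if (job->fence_signaled)
		return SKETCH_STAT_NO_HANG;

	/*
	 * Scenario 1: the job took longer than the timeout, but the
	 * progress counter moved since the last check. Skip the reset,
	 * re-arm the timeout, and let the job keep running.
	 */
	if (job->progress != job->last_progress) {
		job->last_progress = job->progress;
		job->timeout_rearmed = true;
		return SKETCH_STAT_NO_HANG;
	}

	/* No progress since the last timeout: treat it as a real hang. */
	job->hw_reset = true;
	return SKETCH_STAT_RESET;
}
```

In this model, a job whose counter keeps moving is never reset, while a job
that shows no progress between two consecutive timeouts is. Whether the
driver or the scheduler re-arms the timer is a design detail the sketch
leaves out.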
Best Regards,
- Maíra