[PATCH v5 2/8] drm/sched: Allow drivers to skip the reset and keep on running

Mon Jul 14 10:16:07 UTC 2025

On Mon, 2025-07-14 at 11:23 +0200, Christian König wrote:
> On 13.07.25 21:03, Maíra Canal wrote:
> > Hi Christian,
> > 
> > On 11/07/25 12:20, Christian König wrote:
> > > On 11.07.25 15:37, Philipp Stanner wrote:
> > > > On Fri, 2025-07-11 at 15:22 +0200, Christian König wrote:
> > > > > 
> > > > > 
> > > > > On 08.07.25 15:25, Maíra Canal wrote:
> > > > > > When the DRM scheduler times out, it's possible that the
> > > > > > GPU isn't hung;
> > > > > > instead, a job just took unusually long (longer than the
> > > > > > timeout) but is
> > > > > > still running, and there is, thus, no reason to reset the
> > > > > > hardware. This
> > > > > > can occur in two scenarios:
> > > > > > 
> > > > > >    1. The job is taking longer than the timeout, but the
> > > > > > driver determined
> > > > > >       through a GPU-specific mechanism that the hardware is
> > > > > > still making
> > > > > >       progress. Hence, the driver would like the scheduler
> > > > > > to skip the
> > > > > >       timeout and treat the job as still pending from then
> > > > > > onward. This
> > > > > >       happens in v3d, Etnaviv, and Xe.
> > > > > >    2. Timeout has fired before the free-job worker.
> > > > > > Consequently, the
> > > > > >       scheduler calls `sched->ops->timedout_job()` for a
> > > > > > job that isn't
> > > > > >       timed out.
> > > > > > 
> > > > > > These two scenarios are problematic because the job was
> > > > > > removed from the
> > > > > > `sched->pending_list` before calling `sched->ops-
> > > > > > >timedout_job()`, which
> > > > > > means that when the job finishes, it won't be freed by the
> > > > > > scheduler
> > > > > > though `sched->ops->free_job()` - leading to a memory leak.
> > > > > 
> > > > > Yeah, that is unfortunately intentional.
> > > > > 
> > > > > > To solve these problems, create a new `drm_gpu_sched_stat`,
> > > > > > called
> > > > > > DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip
> > > > > > the reset. The
> > > > > > new status will indicate that the job must be reinserted
> > > > > > into
> > > > > > `sched->pending_list`, and the hardware / driver will still
> > > > > > complete that
> > > > > > job.
> > > > > 
> > > > > Well long story short we have already tried this and the
> > > > > whole approach doesn't work correctly in all cases. See the
> > > > > git history around how we used to destroy the jobs.
> > > > > 
> > > > > The basic problem is that you can always race between timing
> > > > > out and Signaling/destroying the job. This is the long
> > > > > lasting job lifetime issue we already discussed more than
> > > > > once.
> > > > 
> > > > The scheduler destroys the job, through free_job().
> > > > I think we have agreed that for now the scheduler is the party
> > > > responsible for the job lifetime.
> > > 
> > > That's what I strongly disagree on. The job is just a state bag
> > > between the submission and scheduling state of a submission.
> > > 
> > > For the scheduler the control starts when it is pushed into the
> > > entity and ends when run_job is called.
> > > 
> > > The real representation of the submission is the scheduler fence
> > > and that object has a perfectly defined lifetime, state and error
> > > handling.
> > > 
> > > > > 
> > > > > If you want to fix this I think the correct approach is to
> > > > > completely drop tracking jobs in the scheduler at all.
> > > > 
> > > > I don't see how this series introduces a problem?
> > > > 
> > > > The fact is that drivers are abusing the API by just firing
> > > > jobs back
> > > > into the scheduler's job list. This series legalizes the abuse
> > > > by
> > > > providing scheduler functionality for that.
> > > > 
> > > > IOW, the series improves the situation but does not add a *new*
> > > > problem. Even less so as driver's aren't forced to use the new
> > > > status
> > > > code, but can continue having job completion race with timeout
> > > > handlers.
> > > 
> > > Maybe yes, but I'm really not sure about it.
> > > 
> > > Take a look at the git history or job destruction, we already had
> > > exactly that approach, removed it and said that leaking memory is
> > > at least better than an use after free issue.
> > > 
> > 
> > If the job was removed from the pending list in the beginning of
> > the
> > timeout and drm_sched_get_finished_job() fetches jobs from the
> > pending
> > list, how can we end up with an use-after-free issue?
> 
> By adding it back into the list after the timeout handling completed.
> 
> We had exactly that before we came up with the horrible design to add
> it back to the pending list in drm_sched_stop() and the free_guilty
> flag.
> 
> Could be that your approach now works better, but I'm really not sure
> about that.

This isn't Maíra's approach; this is an approach that is already
established practice in DRM. Have you looked at the code? There isn't
that much magic happening there.

The job gets added back to pending_list. For the scheduler, this is
like a completely new job, with the timeout handler running again and
the other work items picking the job up when appropriate.

Maíra ports several drivers which have been (illegally, outside our
API) doing that and it worked for them.

Now there's a centralized solution; one which is only active for the
drivers which choose to use it. Other driver's aren't affected.

IOW, my main argument still stands: this series doesn't add a new bug,
but improves the overall situation in DRM.

P.

> 
> Regards,
> Christian.
> 
> > 
> > Best Regards,
> > - Maíra
> > 
>