[PATCH 3/3] drm/sched: Update timedout_job()'s documentation
Philipp Stanner
pstanner at redhat.com
Mon Jan 20 10:07:44 UTC 2025
On Mon, 2025-01-13 at 12:46 +0100, Danilo Krummrich wrote:
> On Thu, Jan 09, 2025 at 02:37:12PM +0100, Philipp Stanner wrote:
> > drm_sched_backend_ops.timedout_job()'s documentation is outdated. It
> > mentions the deprecated function drm_sched_resubmit_job(). Furthermore,
> > it does not point out the important distinction between hardware and
> > firmware schedulers.
> >
> > Since firmware schedulers typically only use one entity per scheduler,
> > timeout handling is significantly simpler because the entity the
> > faulted job came from can just be killed without affecting innocent
> > processes.
> >
> > Update the documentation with that distinction and other details.
> >
> > Reformat the docstring to work to a unified style with the other
> > handles.
> >
> > Signed-off-by: Philipp Stanner <phasta at kernel.org>
> > ---
> > include/drm/gpu_scheduler.h | 83 +++++++++++++++++++++++--------------
> > 1 file changed, 52 insertions(+), 31 deletions(-)
> >
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index c4e65f9f7f22..380b8840c591 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -445,43 +445,64 @@ struct drm_sched_backend_ops {
> > * @timedout_job: Called when a job has taken too long to execute,
> > * to trigger GPU recovery.
> > *
> > - * This method is called in a workqueue context.
> > + * @sched_job: The job that has timed out
> > *
> > - * Drivers typically issue a reset to recover from GPU hangs, and this
> > - * procedure usually follows the following workflow:
> > + * Returns:
> > + * - DRM_GPU_SCHED_STAT_NOMINAL, on success, i.e., if the underlying
> > + * driver has started or completed recovery.
> > + * - DRM_GPU_SCHED_STAT_ENODEV, if the device is no longer
> > + * available, i.e., has been unplugged.
>
> Maybe it'd be better to document this at the enum level and add a link.
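Good point. I could move the return value documentation over to the enum
and link to &enum drm_gpu_sched_stat from here. Roughly sketched (kernel-doc
wording still TBD):

/**
 * enum drm_gpu_sched_stat - the scheduler's status
 *
 * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
 * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded, i.e., the driver has
 *	started or completed recovery.
 * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore,
 *	i.e., it has been unplugged.
 */
enum drm_gpu_sched_stat {
	DRM_GPU_SCHED_STAT_NONE,
	DRM_GPU_SCHED_STAT_NOMINAL,
	DRM_GPU_SCHED_STAT_ENODEV,
};

timedout_job()'s "Returns:" section would then just point to that enum.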
>
> > *
> > - * 1. Stop the scheduler using drm_sched_stop(). This will park the
> > - * scheduler thread and cancel the timeout work, guaranteeing that
> > - * nothing is queued while we reset the hardware queue
> > - * 2. Try to gracefully stop non-faulty jobs (optional)
> > - * 3. Issue a GPU reset (driver-specific)
> > - * 4. Re-submit jobs using drm_sched_resubmit_jobs()
> > + * Drivers typically issue a reset to recover from GPU hangs.
> > + * This procedure looks very different depending on whether a firmware
> > + * or a hardware scheduler is being used.
> > + *
> > + * For a FIRMWARE SCHEDULER, each (pseudo-)ring has one scheduler, and
> > + * each scheduler (typically) has one entity. Hence, you typically
>
> I think you can remove the first "typically" here. I don't think that for
> a firmware scheduler we ever have anything other than a 1:1 relation;
> besides, anything other than a 1:1 relation (whatever that would be) would
> likely contradict the statement above.
>
> > + * follow those steps:
> > + *
> > + * 1. Stop the scheduler using drm_sched_stop(). This will pause the
> > + * scheduler workqueues and cancel the timeout work, guaranteeing
> > + * that nothing is queued while we reset the hardware queue.
>
> Rather reset / remove the software queue / ring.
>
> (Besides: I think we should also define a distinct terminology; sometimes
> "queue" refers to the ring buffer, sometimes it refers to the entity, etc.
> At least we should be consistent within this commit, and then fix the
> rest.)
Very good idea!
How about we, from now on, always just call it "ring" or "hardware ring"?
I think queue is very generic and, as you point out, often used for the
entities and stuff like that.
>
> > + * 2. Try to gracefully stop non-faulty jobs (optional).
> > + * TODO: RFC ^ Folks, should we remove this step? What does it even mean
> > + * precisely to "stop" those jobs? Is that even helpful to userspace in
> > + * any way?
>
> I think this means to prevent unrelated queues / jobs from being impacted
> by the subsequent GPU reset.
>
> So, this is likely not applicable here, see below.
>
> > + * 3. Issue a GPU reset (driver-specific).
>
> I'm not entirely sure it really applies to all GPUs that feature a FW
> scheduler, but I'd expect that the FW takes care of resetting the
> corresponding HW (including preventing impact on other FW rings) if the
> faulty FW ring is removed by the driver.
@Christian: Agree? AMD still uses a pure HW scheduler afaik.
>
> > + * 4. Kill the entity the faulted job stems from, and the associated
> > + * scheduler.
> > * 5. Restart the scheduler using drm_sched_start(). At that point, new
> > - * jobs can be queued, and the scheduler thread is unblocked
> > + * jobs can be queued, and the scheduler workqueues are awake again.
>
> How can we start the scheduler again after we've killed it? I think the
> most likely scenario is that the FW ring (including the scheduler
> structures) is removed entirely and simply re-created. So, I think we can
> probably remove 5.
ACK
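For v2, the firmware-scheduler section would then boil down to a handler
roughly like this sketch (hypothetical foo_* driver with the 1:1
scheduler:entity relation assumed; all foo_* names are made up):

static enum drm_gpu_sched_stat
foo_timedout_job(struct drm_sched_job *sched_job)
{
	struct foo_queue *q = to_foo_queue(sched_job->sched);

	if (foo_device_unplugged(q->fdev))
		return DRM_GPU_SCHED_STAT_ENODEV;

	/* 1. Park submission and cancel the timeout work. */
	drm_sched_stop(&q->sched, sched_job);

	/* 2./3. Have the firmware remove the faulty ring; the FW is
	 * expected to contain the reset to that ring. */
	foo_fw_remove_ring(q);

	/* 4. Kill the entity and, since it is the only one, tear down
	 * the whole scheduler instance with it. It gets re-created on
	 * the next submission, so there is no drm_sched_start() here. */
	foo_kill_queue(q);

	return DRM_GPU_SCHED_STAT_NOMINAL;
}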
>
> > + *
> > + * For a HARDWARE SCHEDULER, each ring also has one scheduler, but each
> > + * scheduler typically has many attached entities. This implies that you
>
> Maybe better "associated". For the second sentence, I'd rephrase it, e.g.
> "This implies that all entities associated with the affected scheduler
> can't be torn down, because [...]".
>
> > + * cannot tear down all entities associated with the affected scheduler,
> > + * because this would effectively also kill innocent userspace processes
> > + * which did not submit faulty jobs (for example).
> > + *
> > + * Consequently, the procedure to recover with a hardware scheduler
> > + * should look like this:
> > + *
> > + * 1. Stop all schedulers impacted by the reset using drm_sched_stop().
> > + * 2. Figure out to which entity the faulted job belongs.
>
> "belongs to"
>
> > + * 3. Try to gracefully stop non-faulty jobs (optional).
> > + * TODO: RFC ^ Folks, should we remove this step? What does it even mean
> > + * precisely to "stop" those jobs? Is that even helpful to userspace in
> > + * any way?
>
> See above.
Agree with all. Will fix those in v2.
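
For the hardware-scheduler part, the flow I have in mind corresponds roughly
to the following sketch (again a hypothetical foo_* driver; note that
drm_sched_start()'s signature differs between kernel versions, and the
re-submit step is driver-specific since drm_sched_resubmit_jobs() is
deprecated):

static enum drm_gpu_sched_stat
foo_timedout_job(struct drm_sched_job *sched_job)
{
	struct foo_device *fdev = foo_job_to_device(sched_job);
	unsigned int i;

	/* 1. Stop all schedulers impacted by the reset. */
	for (i = 0; i < fdev->num_rings; i++)
		drm_sched_stop(&fdev->rings[i].sched, sched_job);

	/* 2. + 4. Figure out which entity the faulted job belongs to
	 * and kill (ban) only that one. */
	foo_ban_guilty_entity(fdev, sched_job);

	/* 5. Reset the faulty rings (driver-specific). */
	foo_gpu_reset(fdev);

	/* 6. Re-submit pending jobs to the entities that are still
	 * alive (driver-specific). */
	foo_resubmit_pending_jobs(fdev);

	/* 7. Restart the schedulers stopped in step 1. */
	for (i = 0; i < fdev->num_rings; i++)
		drm_sched_start(&fdev->rings[i].sched, 0);

	return DRM_GPU_SCHED_STAT_NOMINAL;
}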
Danke,
P.
>
> > + * 4. Kill that entity.
> > + * 5. Issue a GPU reset on all faulty rings (driver-specific).
> > + * 6. Re-submit jobs on all schedulers impacted by re-submitting them to
> > + * the entities which are still alive.
> > + * 7. Restart all schedulers that were stopped in step #1 using
> > + * drm_sched_start().
> > *
> > * Note that some GPUs have distinct hardware queues but need to reset
> > * the GPU globally, which requires extra synchronization between the
> > - * timeout handler of the different &drm_gpu_scheduler. One way to
> > - * achieve this synchronization is to create an ordered workqueue
> > - * (using alloc_ordered_workqueue()) at the driver level, and pass this
> > - * queue to drm_sched_init(), to guarantee that timeout handlers are
> > - * executed sequentially. The above workflow needs to be slightly
> > - * adjusted in that case:
> > - *
> > - * 1. Stop all schedulers impacted by the reset using drm_sched_stop()
> > - * 2. Try to gracefully stop non-faulty jobs on all queues impacted by
> > - * the reset (optional)
> > - * 3. Issue a GPU reset on all faulty queues (driver-specific)
> > - * 4. Re-submit jobs on all schedulers impacted by the reset using
> > - * drm_sched_resubmit_jobs()
> > - * 5. Restart all schedulers that were stopped in step #1 using
> > - * drm_sched_start()
> > - *
> > - * Return DRM_GPU_SCHED_STAT_NOMINAL, when all is normal,
> > - * and the underlying driver has started or completed recovery.
> > - *
> > - * Return DRM_GPU_SCHED_STAT_ENODEV, if the device is no longer
> > - * available, i.e. has been unplugged.
> > + * timeout handlers of different schedulers. One way to achieve this
> > + * synchronization is to create an ordered workqueue (using
> > + * alloc_ordered_workqueue()) at the driver level, and pass this queue
> > + * as drm_sched_init()'s @timeout_wq parameter. This will guarantee
> > + * that timeout handlers are executed sequentially.
> > */
> > enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
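
Regarding the @timeout_wq paragraph above: the driver-side setup I have in
mind looks roughly like this sketch (hypothetical foo_* driver; the exact
drm_sched_init() parameter list depends on the kernel version):

static int foo_create_tdr_wq(struct foo_device *fdev)
{
	/* One ordered workqueue, shared by all schedulers of this
	 * device, serializes their timeout handlers. */
	fdev->tdr_wq = alloc_ordered_workqueue("foo-tdr", 0);
	if (!fdev->tdr_wq)
		return -ENOMEM;

	/* Pass fdev->tdr_wq as drm_sched_init()'s @timeout_wq argument
	 * for every ring's scheduler, so that no two timedout_job()
	 * callbacks of this device ever run concurrently. */
	return 0;
}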
> >
> > --
> > 2.47.1
> >
>