[RFC PATCH 2/5] drm/sched: Prevent teardown waitqueue from blocking too long
Philipp Stanner
phasta at kernel.org
Mon Mar 24 18:57:26 UTC 2025
The waitqueue that ensures that drm_sched_fini() blocks until the
pending_list has become empty could theoretically cause that function to
block for a very long time. That, ultimately, could block userspace
processes and prevent them from being killable through SIGKILL.
When a driver calls drm_sched_fini(), it is safe to assume that all
still pending jobs are no longer needed. Thus, they can be cancelled,
which ensures that drm_sched_fini() will return relatively quickly.
Implement a new helper to stop all work items / submission except for
drm_sched_backend_ops.free_job(), so that the pending jobs can still be
freed.
Implement a driver callback, kill_fence_context(), that instructs the
driver to kill the fence context associated with this scheduler, thereby
causing all pending hardware fences to be signalled.
Call those new routines in drm_sched_fini() and ensure backwards
compatibility if the new callback is not implemented.
Suggested-by: Danilo Krummrich <dakr at redhat.com>
Signed-off-by: Philipp Stanner <phasta at kernel.org>
---
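Note for reviewers (not part of the commit message): below is a rough
sketch of what a driver-side implementation of the new callback could
look like. All driver-specific names here (foo_sched, foo_fence, the
pending_fences list, fence_lock) are hypothetical and only illustrate
the idea: the driver signals every hardware fence still in flight,
which lets the scheduler's free_job work drain the pending_list.

static void foo_kill_fence_context(struct drm_gpu_scheduler *sched)
{
	struct foo_sched *fsched = container_of(sched, struct foo_sched, base);
	struct foo_fence *ffence, *tmp;

	/* Locking details depend on how the driver tracks its fences. */
	spin_lock(&fsched->fence_lock);
	list_for_each_entry_safe(ffence, tmp, &fsched->pending_fences, node) {
		/* Record the cancellation before signalling the fence. */
		dma_fence_set_error(&ffence->base, -ECANCELED);
		dma_fence_signal(&ffence->base);
		list_del_init(&ffence->node);
	}
	spin_unlock(&fsched->fence_lock);
}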
drivers/gpu/drm/scheduler/sched_main.c | 47 +++++++++++++++++---------
include/drm/gpu_scheduler.h | 11 ++++++
2 files changed, 42 insertions(+), 16 deletions(-)
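A driver would then opt in by wiring the callback into its ops table;
drivers that leave .kill_fence_context NULL keep the old behavior
(again, all foo_* handlers are hypothetical):

static const struct drm_sched_backend_ops foo_sched_ops = {
	.run_job	    = foo_run_job,
	.timedout_job	    = foo_timedout_job,
	.free_job	    = foo_free_job,
	.kill_fence_context = foo_kill_fence_context,
};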
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index c1ab48ef61cf..87acbcb1a821 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1388,31 +1388,46 @@ drm_sched_no_jobs_pending(struct drm_gpu_scheduler *sched)
return empty;
}
+/**
+ * drm_sched_cancel_jobs_and_wait - trigger freeing of all pending jobs
+ * @sched: scheduler instance
+ *
+ * Must only be called if &struct drm_sched_backend_ops.kill_fence_context is
+ * implemented.
+ *
+ * Instructs the driver to kill the fence context associated with this scheduler,
+ * thereby signalling all pending fences. This, in turn, will trigger
+ * &struct drm_sched_backend_ops.free_job to be called for all pending jobs.
+ * The function then blocks until all pending jobs have been freed.
+ */
+static inline void
+drm_sched_cancel_jobs_and_wait(struct drm_gpu_scheduler *sched)
+{
+ sched->ops->kill_fence_context(sched);
+ wait_event(sched->pending_list_waitque, drm_sched_no_jobs_pending(sched));
+}
+
/**
* drm_sched_fini - Destroy a gpu scheduler
*
* @sched: scheduler instance
*
- * Tears down and cleans up the scheduler.
- *
- * Note that this function blocks until all the fences returned by
- * &struct drm_sched_backend_ops.run_job have been signalled.
+ * Tears down and cleans up the scheduler. Might leak memory if
+ * &struct drm_sched_backend_ops.kill_fence_context is not implemented.
*/
void drm_sched_fini(struct drm_gpu_scheduler *sched)
{
struct drm_sched_entity *s_entity;
int i;
- /*
- * Jobs that have neither been scheduled or which have timed out are
- * gone by now, but jobs that have been submitted through
- * backend_ops.run_job() and have not yet terminated are still pending.
- *
- * Wait for the pending_list to become empty to avoid leaking those jobs.
- */
- drm_sched_submission_and_timeout_stop(sched);
- wait_event(sched->pending_list_waitque, drm_sched_no_jobs_pending(sched));
- drm_sched_free_stop(sched);
+ if (sched->ops->kill_fence_context) {
+ drm_sched_submission_and_timeout_stop(sched);
+ drm_sched_cancel_jobs_and_wait(sched);
+ drm_sched_free_stop(sched);
+ } else {
+ /* We're in "legacy free-mode" and ignore potential mem leaks */
+ drm_sched_wqueue_stop(sched);
+ }
for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
struct drm_sched_rq *rq = sched->sched_rq[i];
@@ -1500,7 +1515,7 @@ bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched)
EXPORT_SYMBOL(drm_sched_wqueue_ready);
/**
- * drm_sched_wqueue_stop - stop scheduler submission
+ * drm_sched_wqueue_stop - stop scheduler submission and freeing
* @sched: scheduler instance
*
* Stops the scheduler from pulling new jobs from entities. It also stops
@@ -1516,7 +1531,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched)
EXPORT_SYMBOL(drm_sched_wqueue_stop);
/**
- * drm_sched_wqueue_start - start scheduler submission
+ * drm_sched_wqueue_start - start scheduler submission and freeing
* @sched: scheduler instance
*
* Restarts the scheduler after drm_sched_wqueue_stop() has stopped it.
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index badcf9ea4501..e13eb7e4c93a 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -509,6 +509,17 @@ struct drm_sched_backend_ops {
* and it's time to clean it up.
*/
void (*free_job)(struct drm_sched_job *sched_job);
+
+ /**
+ * @kill_fence_context: kill the fence context belonging to this scheduler
+ *
+ * Needed to cleanly tear the scheduler down in drm_sched_fini(). This
+ * callback will cause all hardware fences to be signalled by the driver,
+ * which, ultimately, ensures that all jobs get freed before teardown.
+ *
+ * This callback is optional, but it is highly recommended to implement it.
+ */
+ void (*kill_fence_context)(struct drm_gpu_scheduler *sched);
};
/**
--
2.48.1