[PATCH v2 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_NO_HANG

Maíra Canal mcanal at igalia.com
Fri May 30 14:01:29 UTC 2025


When the DRM scheduler times out, it's possible that the GPU isn't hung;
instead, a job may still be running, and there may be no valid reason to
reset the hardware. This can occur in two situations:

  1. The GPU exposes some mechanism that ensures the GPU is still making
     progress. By checking this mechanism, the driver can safely skip the
     reset, re-arm the timeout, and allow the job to continue running until
     completion. This is the case for v3d, Etnaviv, and Xe.

  2. Timeout has fired before the free-job worker. Consequently, the
     scheduler calls `timedout_job()` for a job that isn't timed out.

These two scenarios are problematic because the job was removed from the
`sched->pending_list` before calling `sched->ops->timedout_job()`, which
means that when the job finishes, it won't be freed by the scheduler
though `sched->ops->free_job()`. As a result, the job and its resources
won't be freed, leading to a memory leak.

For v3d specifically, we have observed that these memory leaks can be
significant in certain scenarios, as reported by users in [1][2]. To
address this situation, I submitted a patch similar to commit 704d3d60fec4
("drm/etnaviv: don't block scheduler when GPU is still active") for v3d [3].
This patch has already landed in drm-misc-fixes and successfully resolved
the users' issues.

However, as I mentioned in [3], exposing the scheduler's internals within
the drivers isn't ideal and I believe this specific situation can be
addressed within the DRM scheduler framework.

This series aims to resolve this issue by adding a new DRM sched status
that allows a driver to skip the reset. This new status will indicate that
the job should be reinserted into the pending list, and the driver will
still signal its completion.

[1] https://gitlab.freedesktop.org/mesa/mesa/-/issues/12227
[2] https://github.com/raspberrypi/linux/issues/6817
[3] https://lore.kernel.org/dri-devel/20250430210643.57924-1-mcanal@igalia.com/T/

Best Regards,
- Maíra

---
v1 -> v2:

- Fix several grammar nits across the documentation and commit messages.
- Drop "drm/sched: Always free the job after the timeout" (Tvrtko Ursulin)
- [1/8] NEW PATCH: Rename DRM_GPU_SCHED_STAT_NOMINAL to a more semantic
	name (Tvrtko Ursulin, Philipp Stanner)
- [2/8] Rename DRM_GPU_SCHED_STAT_RUNNING to DRM_GPU_SCHED_STAT_NO_HANG (Tvrtko Ursulin, Philipp Stanner)
- [2/8] Requeue free-job work after reinserting the job to the pending list (Matthew Brost)
- [2/8] Create a helper function to reinsert the job (Philipp Stanner)
- [2/8] Rewrite the commit message (Philipp Stanner)
- [2/8] Add a comment to `drm_sched_start()` documentation, similar to what
	was commented in `drm_sched_stop()` (Philipp Stanner)
- [3/8] Keep HZ as timeout for `drm_mock_sched_job_wait_scheduled()` (Tvrtko Ursulin)
- [4/8] Use a job flag to indicate that `timedout_job()` should skip the
	reset (Tvrtko Ursulin)
- [7/8] Use DRM_GPU_SCHED_STAT_NO_HANG to re-arm the timer in other cases
	as well (Matthew Brost)
- Link to v1: https://lore.kernel.org/r/20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com

---
Maíra Canal (8):
      drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET
      drm/sched: Allow drivers to skip the reset and keep on running
      drm/sched: Reduce scheduler's timeout for timeout tests
      drm/sched: Add new test for DRM_GPU_SCHED_STAT_NO_HANG
      drm/v3d: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
      drm/etnaviv: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
      drm/xe: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset
      drm/panfrost: Use DRM_GPU_SCHED_STAT_NO_HANG to skip the reset

 drivers/accel/amdxdna/aie2_ctx.c                 |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c          |  2 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c          | 13 +++---
 drivers/gpu/drm/imagination/pvr_queue.c          |  4 +-
 drivers/gpu/drm/lima/lima_sched.c                |  6 +--
 drivers/gpu/drm/nouveau/nouveau_exec.c           |  2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c          |  2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c          | 10 ++---
 drivers/gpu/drm/panthor/panthor_mmu.c            |  2 +-
 drivers/gpu/drm/panthor/panthor_sched.c          |  2 +-
 drivers/gpu/drm/scheduler/sched_main.c           | 51 ++++++++++++++++++++++--
 drivers/gpu/drm/scheduler/tests/mock_scheduler.c |  7 +++-
 drivers/gpu/drm/scheduler/tests/sched_tests.h    |  1 +
 drivers/gpu/drm/scheduler/tests/tests_basic.c    | 51 ++++++++++++++++++++++--
 drivers/gpu/drm/v3d/v3d_sched.c                  |  6 +--
 drivers/gpu/drm/xe/xe_guc_submit.c               | 14 ++-----
 include/drm/gpu_scheduler.h                      |  7 +++-
 17 files changed, 136 insertions(+), 46 deletions(-)
---
base-commit: df1c3093aee3daee4d56a5cb8cdba75c1ef6962f
change-id: 20250502-sched-skip-reset-bf7c163233da



More information about the etnaviv mailing list