[PATCH v3 2/5] drm/sched/tests: Port tests to new cleanup method
Philipp Stanner
phasta at mailbox.org
Thu May 22 14:59:35 UTC 2025
On Thu, 2025-05-22 at 15:06 +0100, Tvrtko Ursulin wrote:
>
> On 22/05/2025 09:27, Philipp Stanner wrote:
> > The drm_gpu_scheduler now supports a callback to help drm_sched_fini()
> > avoid memory leaks. This callback instructs the driver to signal all
> > pending hardware fences.
> >
> > Implement the new callback drm_sched_backend_ops.cancel_pending_fences().
> >
> > Have the callback use drm_mock_sched_job_complete() with a new error
> > field for the fence error.
> >
> > Keep the job status as DRM_MOCK_SCHED_JOB_DONE for now, since there is
> > no party for which checking for a CANCELED status would be useful
> > currently.
> >
> > Signed-off-by: Philipp Stanner <phasta at kernel.org>
> > ---
> >  .../gpu/drm/scheduler/tests/mock_scheduler.c  | 67 +++++++------------
> >  drivers/gpu/drm/scheduler/tests/sched_tests.h |  4 +-
> >  2 files changed, 25 insertions(+), 46 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > index f999c8859cf7..eca47f0395bc 100644
> > --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> > @@ -55,7 +55,7 @@ void drm_mock_sched_entity_free(struct drm_mock_sched_entity *entity)
> >  	drm_sched_entity_destroy(&entity->base);
> >  }
> >  
> > -static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
> > +static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job, int err)
> >  {
> >  	struct drm_mock_scheduler *sched =
> >  		drm_sched_to_mock_sched(job->base.sched);
> > @@ -63,7 +63,9 @@ static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
> >  	lockdep_assert_held(&sched->lock);
> >  
> >  	job->flags |= DRM_MOCK_SCHED_JOB_DONE;
> > -	list_move_tail(&job->link, &sched->done_list);
> > +	list_del(&job->link);
> > +	if (err)
> > +		dma_fence_set_error(&job->hw_fence, err);
> >  	dma_fence_signal(&job->hw_fence);
> >  	complete(&job->done);
> >  }
> > @@ -89,7 +91,7 @@ drm_mock_sched_job_signal_timer(struct hrtimer *hrtimer)
> >  			break;
> >  
> >  		sched->hw_timeline.cur_seqno = job->hw_fence.seqno;
> > -		drm_mock_sched_job_complete(job);
> > +		drm_mock_sched_job_complete(job, 0);
> >  	}
> >  	spin_unlock_irqrestore(&sched->lock, flags);
> >  
> > @@ -212,26 +214,33 @@ mock_sched_timedout_job(struct drm_sched_job *sched_job)
> >  
> >  static void mock_sched_free_job(struct drm_sched_job *sched_job)
> >  {
> > -	struct drm_mock_scheduler *sched =
> > -		drm_sched_to_mock_sched(sched_job->sched);
> >  	struct drm_mock_sched_job *job =
> >  		drm_sched_job_to_mock_job(sched_job);
> > -	unsigned long flags;
> >  
> > -	/* Remove from the scheduler done list. */
> > -	spin_lock_irqsave(&sched->lock, flags);
> > -	list_del(&job->link);
> > -	spin_unlock_irqrestore(&sched->lock, flags);
> >  	dma_fence_put(&job->hw_fence);
> > -
> >  	drm_sched_job_cleanup(sched_job);
> >  
> >  	/* Mock job itself is freed by the kunit framework. */
> >  }
> >  
> > +static void mock_sched_cancel_pending_fences(struct drm_gpu_scheduler *gsched)
>
> "gsched" feels like a first time invention. Maybe drm_sched?
Alright
>
> > +{
> > +	struct drm_mock_sched_job *job, *next;
> > +	struct drm_mock_scheduler *sched;
> > +	unsigned long flags;
> > +
> > +	sched = container_of(gsched, struct drm_mock_scheduler, base);
> > +
> > +	spin_lock_irqsave(&sched->lock, flags);
> > +	list_for_each_entry_safe(job, next, &sched->job_list, link)
> > +		drm_mock_sched_job_complete(job, -ECANCELED);
> > +	spin_unlock_irqrestore(&sched->lock, flags);
>
> Canceling of the timers belongs in this callback I think. Otherwise
> jobs are not fully cancelled.
I wouldn't say so – the timers represent things like hardware
interrupts, and those must be deactivated by the driver itself.

One big reason why I like my approach is that it makes the contract
between driver and scheduler very simple:

"Driver, signal all fences that you ever returned through run_job() to
this scheduler!"

That always works, and the driver always has all those fences. It's
based on the most fundamental agreement regarding dma_fences: they must
all be signaled.
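
To illustrate what that contract means for a driver, here's a minimal
sketch. All the "foo" structures and names below are invented for the
example, they're not real API:

/*
 * Hypothetical driver-side implementation of the new callback. The
 * driver walks its own bookkeeping of fences it handed out through
 * run_job() and signals whatever is still pending, with an error.
 */
static void foo_sched_cancel_pending_fences(struct drm_gpu_scheduler *sched)
{
        struct foo_device *fdev = container_of(sched, struct foo_device, sched);
        struct foo_job *job;
        unsigned long flags;

        spin_lock_irqsave(&fdev->fence_lock, flags);
        list_for_each_entry(job, &fdev->in_flight, node) {
                /* Mark the fence as canceled, then signal it. */
                dma_fence_set_error(&job->hw_fence, -ECANCELED);
                dma_fence_signal(&job->hw_fence);
        }
        spin_unlock_irqrestore(&fdev->fence_lock, flags);
}

(In a real driver the interrupt sources would of course have to be
quiesced first so nothing signals concurrently, but that's exactly the
part I'd leave to the driver, not to the scheduler.)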
>
> Hm, I also think, conceptually, the order of first canceling the timer
> and then signaling the fence should be kept.
That's the case here, no?

It must indeed be kept, otherwise the timers could fire after
everything is torn down -> use-after-free.
>
> At the moment it does not matter hugely, since the timer does not
> signal the jobs directly and will not find unlinked jobs, but if that
> changes in the future, the reversed order could cause double signaling.
> So if you keep it in the correct logical order that potential gotcha is
> avoided. Basically just keep the two pass approach verbatim, as is in
> the current drm_mock_sched_fini.
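
Just to make sure I'm reading you correctly – the variant you describe,
with both passes inside the callback, would then look roughly like this,
I assume (untested sketch, using the "drm_sched" parameter name from
your other comment):

static void mock_sched_cancel_pending_fences(struct drm_gpu_scheduler *drm_sched)
{
        struct drm_mock_scheduler *sched =
                container_of(drm_sched, struct drm_mock_scheduler, base);
        struct drm_mock_sched_job *job, *next;
        unsigned long flags;

        /* Pass 1: make sure no timer can signal a job anymore. */
        list_for_each_entry(job, &sched->job_list, link)
                hrtimer_cancel(&job->timer);

        /* Pass 2: cancel whatever is still pending on the mock hardware. */
        spin_lock_irqsave(&sched->lock, flags);
        list_for_each_entry_safe(job, next, &sched->job_list, link)
                drm_mock_sched_job_complete(job, -ECANCELED);
        spin_unlock_irqrestore(&sched->lock, flags);
}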
>
> The rest of the conversion is I think good.
:)
>
> Only a slight uncertainty after I cross-referenced with my version
> (->cancel_job()) around why I needed to add signaling to
> mock_sched_timedout_job() and manual job cleanup to the timeout test.
> It was more than a month ago that I wrote it so I can't remember right
> now.
> You checked for memory leaks and the usual stuff?
Hm, it seems I indeed ran into the leak in the timeout tests that you
fixed (in addition to the other stuff) in your RFC.

We should probably fix that in a separate patch.
P.
>
> Regards,
>
> Tvrtko
>
> > +}
> > +
> >  static const struct drm_sched_backend_ops drm_mock_scheduler_ops = {
> >  	.run_job = mock_sched_run_job,
> >  	.timedout_job = mock_sched_timedout_job,
> > -	.free_job = mock_sched_free_job
> > +	.free_job = mock_sched_free_job,
> > +	.cancel_pending_fences = mock_sched_cancel_pending_fences,
> >  };
> >
> >  /**
> > @@ -265,7 +274,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
> >  	sched->hw_timeline.context = dma_fence_context_alloc(1);
> >  	atomic_set(&sched->hw_timeline.next_seqno, 0);
> >  	INIT_LIST_HEAD(&sched->job_list);
> > -	INIT_LIST_HEAD(&sched->done_list);
> >  	spin_lock_init(&sched->lock);
> >  
> >  	return sched;
> > @@ -280,38 +288,11 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
> >   */
> >  void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
> >  {
> > -	struct drm_mock_sched_job *job, *next;
> > -	unsigned long flags;
> > -	LIST_HEAD(list);
> > +	struct drm_mock_sched_job *job;
> >  
> > -	drm_sched_wqueue_stop(&sched->base);
> > -
> > -	/* Force complete all unfinished jobs. */
> > -	spin_lock_irqsave(&sched->lock, flags);
> > -	list_for_each_entry_safe(job, next, &sched->job_list, link)
> > -		list_move_tail(&job->link, &list);
> > -	spin_unlock_irqrestore(&sched->lock, flags);
> > -
> > -	list_for_each_entry(job, &list, link)
> > +	list_for_each_entry(job, &sched->job_list, link)
> >  		hrtimer_cancel(&job->timer);
> >  
> > -	spin_lock_irqsave(&sched->lock, flags);
> > -	list_for_each_entry_safe(job, next, &list, link)
> > -		drm_mock_sched_job_complete(job);
> > -	spin_unlock_irqrestore(&sched->lock, flags);
> > -
> > -	/*
> > -	 * Free completed jobs and jobs not yet processed by the DRM scheduler
> > -	 * free worker.
> > -	 */
> > -	spin_lock_irqsave(&sched->lock, flags);
> > -	list_for_each_entry_safe(job, next, &sched->done_list, link)
> > -		list_move_tail(&job->link, &list);
> > -	spin_unlock_irqrestore(&sched->lock, flags);
> > -
> > -	list_for_each_entry_safe(job, next, &list, link)
> > -		mock_sched_free_job(&job->base);
> > -
> >  	drm_sched_fini(&sched->base);
> >  }
> >  
> > @@ -346,7 +327,7 @@ unsigned int drm_mock_sched_advance(struct drm_mock_scheduler *sched,
> >  		if (sched->hw_timeline.cur_seqno < job->hw_fence.seqno)
> >  			break;
> >  
> > -		drm_mock_sched_job_complete(job);
> > +		drm_mock_sched_job_complete(job, 0);
> >  		found++;
> >  	}
> >  unlock:
> > diff --git a/drivers/gpu/drm/scheduler/tests/sched_tests.h b/drivers/gpu/drm/scheduler/tests/sched_tests.h
> > index 27caf8285fb7..22e530d87791 100644
> > --- a/drivers/gpu/drm/scheduler/tests/sched_tests.h
> > +++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h
> > @@ -32,9 +32,8 @@
> >   *
> >   * @base: DRM scheduler base class
> >   * @test: Backpointer to owning the kunit test case
> > - * @lock: Lock to protect the simulated @hw_timeline, @job_list and @done_list
> > + * @lock: Lock to protect the simulated @hw_timeline and @job_list
> >   * @job_list: List of jobs submitted to the mock GPU
> > - * @done_list: List of jobs completed by the mock GPU
> >   * @hw_timeline: Simulated hardware timeline has a @context, @next_seqno and
> >   *               @cur_seqno for implementing a struct dma_fence signaling the
> >   *               simulated job completion.
> > @@ -49,7 +48,6 @@ struct drm_mock_scheduler {
> >  
> >  	spinlock_t lock;
> >  	struct list_head job_list;
> > -	struct list_head done_list;
> >  
> >  	struct {
> >  		u64 context;
> > u64 context;
>