[PATCH v6 14/15] drm/sched: Queue all free credits in one worker invocation

Matthew Brost matthew.brost at intel.com
Tue Jul 8 18:59:01 UTC 2025


On Tue, Jul 08, 2025 at 04:31:31PM +0100, Tvrtko Ursulin wrote:
> 
> On 08/07/2025 14:02, Christian König wrote:
> > On 08.07.25 14:54, Tvrtko Ursulin wrote:
> > > 
> > > On 08/07/2025 13:37, Christian König wrote:
> > > > On 08.07.25 11:51, Tvrtko Ursulin wrote:
> > > > > There is no reason to queue just a single job if the scheduler can take
> > > > > more and re-queue the worker to queue more.
> > > > 
> > > > That's not correct. This was intentionally avoided.
> > > > 
> > > > If more than just the scheduler is using the single-threaded workqueue, other workers, especially the timeout worker, can jump in and execute first.
> > > > 
> > > > We explicitly removed submitting more than one job in each worker run.
> > > 
> > > I wanted to ask why, but then I had a look to see if anyone actually does this, and I did not find any driver sharing a single-threaded workqueue between submit and timeout.
> > > 
> > > The only driver which even passes in the same workqueue for both is PVR, but it is not a single-threaded one.
> > > 
> > > Or perhaps I misunderstood what you said. Could you please clarify either way?
> > 
> > You correctly understood that.
> > 
> > The argument was that submitting more than one job in a worker is simply not beneficial and other work items can jump in and execute.
> > 
> > I have no idea if that is actually used or not. You would need to dig up the discussion when we switched from a kernel thread to work items for the full background.
> > 

I think Christian is capturing the gist of the discussion. I originally
had it coded the way Tvrtko did, but got pushback and switched to the
requeue approach. If I recall correctly, at the time the default
workqueue was a system WQ, which we definitely didn’t want to hog.
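
For reference, the difference is roughly the following. This is just a
simplified sketch, not the actual scheduler code; submit_one_ready_job() is a
made-up helper standing in for the usual select-entity / pop-job / run_job
sequence:

	/*
	 * Requeue approach: each invocation submits at most one job and
	 * then re-queues itself, so other work items on the same wq get
	 * a chance to run in between.
	 */
	static void run_job_work(struct work_struct *w)
	{
		struct drm_gpu_scheduler *sched =
			container_of(w, struct drm_gpu_scheduler, work_run_job);

		if (submit_one_ready_job(sched))	/* made-up helper */
			drm_sched_run_job_queue(sched);	/* go back through the wq */
	}

Whereas the patch below loops inside a single invocation until it runs out of
ready jobs or credits, without bouncing through the workqueue.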

Now that the default is a dedicated worker, this is less of an issue.
However, technically a system workqueue could still be passed in, though it
shouldn't be, since the submit WQ should be marked with WQ_MEM_RECLAIM.
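
If a driver does want its own submit queue, it can allocate an ordered,
reclaim-safe workqueue up front and hand that to the scheduler at init time.
A minimal sketch, with a made-up name and the drm_sched_init() plumbing left
out:

	struct workqueue_struct *submit_wq;

	/*
	 * Ordered (one work item at a time) and usable on the dma-fence
	 * signalling path because of WQ_MEM_RECLAIM.
	 */
	submit_wq = alloc_ordered_workqueue("my-sched-submit", WQ_MEM_RECLAIM);
	if (!submit_wq)
		return -ENOMEM;

	/* Pass submit_wq to drm_sched_init() as the submit workqueue. */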

I don’t have a strong opinion either way, so I’m going to stay out of
this one.

Matt

> > But in general, doing as little work as possible in each worker and then re-submitting is usually a good idea.
> 
> From the point of view that the single work item invocation shouldn't hog
> the worker, if the worker is shared, I agree. But what we also want is to
> feed the GPU as fast as possible, i.e. put the CPU to sleep as quickly as
> possible.
> 
> If we consider drivers with dedicated workqueues per hardware engine, or
> even per userspace context, then especially in those cases I don't see the
> benefit of playing the wq re-queue games.
> 
> Anyway, I can park this patch for now. I *think* it will be easy to drop and
> will just need to rebase 15/15 to cope.
> 
> In the meantime I have collected some stats while running the Cyberpunk 2077
> benchmark on amdgpu, just to remind myself that it does happen that more than
> one job can be ready to be passed on to the GPU. Stats of the number of
> submitted jobs per worker invocation (with this patch):
> 
>                   1      2      3      4      5
> gfx_0.0.0     21315    541   9849    171      0
> comp_1.3.0     3093      9      2      0      0
> comp_1.1.0     3501     46      2      1      0
> comp_1.0.1     3451     46      2      0      0
> sdma0          4400    746    279    481      7
> 
> This is for userspace contexts only. A good number of worker invocations got
> to submit three jobs in one go.
> 
> Kernel sdma appears to favour deeper queues even more, but I forgot to log
> counts above two jobs per worker invocation:
> 
>                 1     2
> sdma0        8009  1913
> 
> I can try to measure the latencies of the worker re-queue approach. Another
> interesting thing would be C-state residencies and CPU power. But when the
> scheduler went from a kthread to a wq and lost the ability to queue more than
> one job, I don't think anyone measured this back then? In which case I suspect
> we don't even know whether some latency or efficiency was lost.
> 
> Regards,
> 
> Tvrtko
> 
> > > > > We can simply feed the hardware
> > > > > with as much as it can take in one go and hopefully win some latency.
> > > > > 
> > > > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
> > > > > Cc: Christian König <christian.koenig at amd.com>
> > > > > Cc: Danilo Krummrich <dakr at kernel.org>
> > > > > Cc: Matthew Brost <matthew.brost at intel.com>
> > > > > Cc: Philipp Stanner <phasta at kernel.org>
> > > > > ---
> > > > >    drivers/gpu/drm/scheduler/sched_internal.h |   2 -
> > > > >    drivers/gpu/drm/scheduler/sched_main.c     | 132 ++++++++++-----------
> > > > >    drivers/gpu/drm/scheduler/sched_rq.c       |  12 +-
> > > > >    3 files changed, 64 insertions(+), 82 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
> > > > > index 15d78abc48df..1a5c2f255223 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_internal.h
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_internal.h
> > > > > @@ -22,8 +22,6 @@ struct drm_sched_entity_stats {
> > > > >        u64        vruntime;
> > > > >    };
> > > > >    -bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
> > > > > -             struct drm_sched_entity *entity);
> > > > >    void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
> > > > >      void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > index 35025edea669..1fb3f1da4821 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > @@ -95,35 +95,6 @@ static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
> > > > >        return credits;
> > > > >    }
> > > > >    -/**
> > > > > - * drm_sched_can_queue -- Can we queue more to the hardware?
> > > > > - * @sched: scheduler instance
> > > > > - * @entity: the scheduler entity
> > > > > - *
> > > > > - * Return true if we can push at least one more job from @entity, false
> > > > > - * otherwise.
> > > > > - */
> > > > > -bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
> > > > > -             struct drm_sched_entity *entity)
> > > > > -{
> > > > > -    struct drm_sched_job *s_job;
> > > > > -
> > > > > -    s_job = drm_sched_entity_queue_peek(entity);
> > > > > -    if (!s_job)
> > > > > -        return false;
> > > > > -
> > > > > -    /* If a job exceeds the credit limit, truncate it to the credit limit
> > > > > -     * itself to guarantee forward progress.
> > > > > -     */
> > > > > -    if (s_job->credits > sched->credit_limit) {
> > > > > -        dev_WARN(sched->dev,
> > > > > -             "Jobs may not exceed the credit limit, truncate.\n");
> > > > > -        s_job->credits = sched->credit_limit;
> > > > > -    }
> > > > > -
> > > > > -    return drm_sched_available_credits(sched) >= s_job->credits;
> > > > > -}
> > > > > -
> > > > >    /**
> > > > >     * drm_sched_run_job_queue - enqueue run-job work
> > > > >     * @sched: scheduler instance
> > > > > @@ -940,54 +911,77 @@ static void drm_sched_run_job_work(struct work_struct *w)
> > > > >    {
> > > > >        struct drm_gpu_scheduler *sched =
> > > > >            container_of(w, struct drm_gpu_scheduler, work_run_job);
> > > > > +    u32 job_credits, submitted_credits = 0;
> > > > >        struct drm_sched_entity *entity;
> > > > > -    struct dma_fence *fence;
> > > > >        struct drm_sched_fence *s_fence;
> > > > >        struct drm_sched_job *sched_job;
> > > > > -    int r;
> > > > > +    struct dma_fence *fence;
> > > > >    -    /* Find entity with a ready job */
> > > > > -    entity = drm_sched_rq_select_entity(sched, sched->rq);
> > > > > -    if (IS_ERR_OR_NULL(entity))
> > > > > -        return;    /* No more work */
> > > > > +    while (!READ_ONCE(sched->pause_submit)) {
> > > > > +        /* Find entity with a ready job */
> > > > > +        entity = drm_sched_rq_select_entity(sched, sched->rq);
> > > > > +        if (!entity)
> > > > > +            break;    /* No more work */
> > > > > +
> > > > > +        sched_job = drm_sched_entity_queue_peek(entity);
> > > > > +        if (!sched_job) {
> > > > > +            complete_all(&entity->entity_idle);
> > > > > +            continue;
> > > > > +        }
> > > > > +
> > > > > +        job_credits = sched_job->credits;
> > > > > +        /*
> > > > > +         * If a job exceeds the credit limit truncate it to guarantee
> > > > > +         * forward progress.
> > > > > +         */
> > > > > +        if (dev_WARN_ONCE(sched->dev, job_credits > sched->credit_limit,
> > > > > +                  "Jobs may not exceed the credit limit, truncating.\n"))
> > > > > +            job_credits = sched_job->credits = sched->credit_limit;
> > > > > +
> > > > > +        if (job_credits > drm_sched_available_credits(sched)) {
> > > > > +            complete_all(&entity->entity_idle);
> > > > > +            break;
> > > > > +        }
> > > > > +
> > > > > +        sched_job = drm_sched_entity_pop_job(entity);
> > > > > +        if (!sched_job) {
> > > > > +            /* Top entity is not yet runnable after all */
> > > > > +            complete_all(&entity->entity_idle);
> > > > > +            continue;
> > > > > +        }
> > > > > +
> > > > > +        s_fence = sched_job->s_fence;
> > > > > +        drm_sched_job_begin(sched_job);
> > > > > +        trace_drm_sched_job_run(sched_job, entity);
> > > > > +        submitted_credits += job_credits;
> > > > > +        atomic_add(job_credits, &sched->credit_count);
> > > > > +
> > > > > +        fence = sched->ops->run_job(sched_job);
> > > > > +        drm_sched_fence_scheduled(s_fence, fence);
> > > > > +
> > > > > +        if (!IS_ERR_OR_NULL(fence)) {
> > > > > +            int r;
> > > > > +
> > > > > +            /* Drop for original kref_init of the fence */
> > > > > +            dma_fence_put(fence);
> > > > > +
> > > > > +            r = dma_fence_add_callback(fence, &sched_job->cb,
> > > > > +                           drm_sched_job_done_cb);
> > > > > +            if (r == -ENOENT)
> > > > > +                drm_sched_job_done(sched_job, fence->error);
> > > > > +            else if (r)
> > > > > +                DRM_DEV_ERROR(sched->dev,
> > > > > +                          "fence add callback failed (%d)\n", r);
> > > > > +        } else {
> > > > > +            drm_sched_job_done(sched_job, IS_ERR(fence) ?
> > > > > +                              PTR_ERR(fence) : 0);
> > > > > +        }
> > > > >    -    sched_job = drm_sched_entity_pop_job(entity);
> > > > > -    if (!sched_job) {
> > > > >            complete_all(&entity->entity_idle);
> > > > > -        drm_sched_run_job_queue(sched);
> > > > > -        return;
> > > > >        }
> > > > >    -    s_fence = sched_job->s_fence;
> > > > > -
> > > > > -    atomic_add(sched_job->credits, &sched->credit_count);
> > > > > -    drm_sched_job_begin(sched_job);
> > > > > -
> > > > > -    trace_drm_sched_job_run(sched_job, entity);
> > > > > -    /*
> > > > > -     * The run_job() callback must by definition return a fence whose
> > > > > -     * refcount has been incremented for the scheduler already.
> > > > > -     */
> > > > > -    fence = sched->ops->run_job(sched_job);
> > > > > -    complete_all(&entity->entity_idle);
> > > > > -    drm_sched_fence_scheduled(s_fence, fence);
> > > > > -
> > > > > -    if (!IS_ERR_OR_NULL(fence)) {
> > > > > -        r = dma_fence_add_callback(fence, &sched_job->cb,
> > > > > -                       drm_sched_job_done_cb);
> > > > > -        if (r == -ENOENT)
> > > > > -            drm_sched_job_done(sched_job, fence->error);
> > > > > -        else if (r)
> > > > > -            DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n", r);
> > > > > -
> > > > > -        dma_fence_put(fence);
> > > > > -    } else {
> > > > > -        drm_sched_job_done(sched_job, IS_ERR(fence) ?
> > > > > -                   PTR_ERR(fence) : 0);
> > > > > -    }
> > > > > -
> > > > > -    wake_up(&sched->job_scheduled);
> > > > > -    drm_sched_run_job_queue(sched);
> > > > > +    if (submitted_credits)
> > > > > +        wake_up(&sched->job_scheduled);
> > > > >    }
> > > > >      static struct workqueue_struct *drm_sched_alloc_wq(const char *name)
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
> > > > > index e22f9ff88822..f0afdc0bd417 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_rq.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_rq.c
> > > > > @@ -197,9 +197,7 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
> > > > >     *
> > > > >     * Find oldest waiting ready entity.
> > > > >     *
> > > > > - * Return an entity if one is found; return an error-pointer (!NULL) if an
> > > > > - * entity was ready, but the scheduler had insufficient credits to accommodate
> > > > > - * its job; return NULL, if no ready entity was found.
> > > > > + * Return an entity if one is found or NULL if no ready entity was found.
> > > > >     */
> > > > >    struct drm_sched_entity *
> > > > >    drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
> > > > > @@ -213,14 +211,6 @@ drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
> > > > >              entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> > > > >            if (drm_sched_entity_is_ready(entity)) {
> > > > > -            /* If we can't queue yet, preserve the current entity in
> > > > > -             * terms of fairness.
> > > > > -             */
> > > > > -            if (!drm_sched_can_queue(sched, entity)) {
> > > > > -                spin_unlock(&rq->lock);
> > > > > -                return ERR_PTR(-ENOSPC);
> > > > > -            }
> > > > > -
> > > > >                reinit_completion(&entity->entity_idle);
> > > > >                break;
> > > > >            }
> > > > 
> > > 
> > 
> 

