[RFC PATCH 01/10] drm/sched: Convert drm scheduler to use a work queue rather than kthread
Matthew Brost
matthew.brost at intel.com
Mon Jul 31 00:56:31 UTC 2023
On Fri, Jun 09, 2023 at 08:58:39AM +0200, Boris Brezillon wrote:
> Hi Matthew,
>
> On Mon, 3 Apr 2023 17:22:02 -0700
> Matthew Brost <matthew.brost at intel.com> wrote:
>
> > -static int drm_sched_main(void *param)
> > +static void drm_sched_main(struct work_struct *w)
> > {
> > - struct drm_gpu_scheduler *sched = (struct drm_gpu_scheduler *)param;
> > + struct drm_gpu_scheduler *sched =
> > + container_of(w, struct drm_gpu_scheduler, work_run);
> > int r;
> >
> > - sched_set_fifo_low(current);
> > -
> > - while (!kthread_should_stop()) {
> > - struct drm_sched_entity *entity = NULL;
> > + while (!READ_ONCE(sched->pause_run_wq)) {
>
> During an informal discussion on IRC I mentioned that this loop might
> become problematic if all the 1:1 entities share the same wq
> (especially if it's an ordered wq), and one of them is getting passed a
> lot of requests. Just wanted to tell you that we've hit that case in
> PowerVR:
>
> Geometry and fragment queues get passed X requests respectively, each
> pair of requests corresponding to a rendering operation. Because we're
> using an ordered wq (which I know we shouldn't do, and I intend to
> fix that, but I think it shows the problem exists by making it more
> visible), all geometry requests get submitted first, then come the
> fragment requests. It turns out the submission time is non-negligible
> compared to the geometry job execution time, and geometry jobs end up
> generating data for the fragment jobs that are not consumed fast enough
> by the fragment job to allow the following geom jobs to re-use the same
> portion of memory, leading to on-demand allocation of extra memory
> chunks which wouldn't happen if submissions were interleaved.
>
> I know you were not fundamentally opposed to killing this loop and doing
> one iteration at a time (you even provided a patch doing that), just
> wanted to share my findings to prove this is not just a theoretical
> issue, and the lack of fairness in the submission path can cause trouble
> in practice.
>
> Best Regards,
>
> Boris
>
Thanks for the info, Boris. I'm about to revive this series in non-RFC form.
This loop seems controversial, so let me drop it. I'm going to cook up a
patch for the Xe branch and get it merged so CI / UMD benchmarks can absorb
it and we can see whether there are any noticeable differences.
Also, be on the lookout for a new rev of this series, hopefully this week.
Matt
> > + struct drm_sched_entity *entity;
> > struct drm_sched_fence *s_fence;
> > struct drm_sched_job *sched_job;
> > struct dma_fence *fence;
> > - struct drm_sched_job *cleanup_job = NULL;
> > + struct drm_sched_job *cleanup_job;
> >
> > - wait_event_interruptible(sched->wake_up_worker,
> > - (cleanup_job = drm_sched_get_cleanup_job(sched)) ||
> > - (!drm_sched_blocked(sched) &&
> > - (entity = drm_sched_select_entity(sched))) ||
> > - kthread_should_stop());
> > + cleanup_job = drm_sched_get_cleanup_job(sched);
> > + entity = drm_sched_select_entity(sched);
> >
> > if (cleanup_job)
> > sched->ops->free_job(cleanup_job);
> >
> > - if (!entity)
> > + if (!entity) {
> > + if (!cleanup_job)
> > + break;
> > continue;
> > + }
> >
> > sched_job = drm_sched_entity_pop_job(entity);
> >
> > if (!sched_job) {
> > complete_all(&entity->entity_idle);
> > + if (!cleanup_job)
> > + break;
> > continue;
> > }
> >
> > @@ -1055,14 +1083,14 @@ static int drm_sched_main(void *param)
> > r);
> > } else {
> > if (IS_ERR(fence))
> > - dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
> > + dma_fence_set_error(&s_fence->finished,
> > + PTR_ERR(fence));
> >
> > drm_sched_job_done(sched_job);
> > }
> >
> > wake_up(&sched->job_scheduled);
> > }
> > - return 0;
> > }