[PATCH] drm/sched: Drain all entities in DRM sched run job worker

Matthew Brost matthew.brost at intel.com
Thu Jan 25 17:30:17 UTC 2024


On Thu, Jan 25, 2024 at 04:12:58PM +0100, Christian König wrote:
> 
> 
> Am 24.01.24 um 22:08 schrieb Matthew Brost:
> > All entities must be drained in the DRM scheduler run job worker to
> > avoid the following case: an entity is found that is ready, no job is
> > found ready on that entity, and the run job worker goes idle while
> > other entities and jobs are ready. Draining all ready entities (i.e.
> > looping over all ready entities) in the run job worker ensures all
> > jobs that are ready will be scheduled.
> 
> That doesn't make sense. drm_sched_select_entity() only returns entities
> which are "ready", e.g. have a job to run.
> 

That is what I thought too, hence my original design, but it is not
exactly true. Let me explain.

drm_sched_select_entity() returns an entity with a non-empty spsc queue
(a job in the queue) and no *currently* known dependencies [1].
Dependencies for an entity can be added when drm_sched_entity_pop_job()
is called [2][3], in which case it returns a NULL job. Thus we can get
into a scenario where two entities A and B both have jobs and no current
dependencies: A's job is waiting on B's job, entity A gets selected
first, a dependency gets installed in drm_sched_entity_pop_job(), the
run work goes idle, and now we deadlock.

The proper solution is to loop over all ready entities until one with a
job is found via drm_sched_entity_pop_job() and then requeue the run
job worker. Alternatively, loop over all entities until
drm_sched_select_entity() returns NULL and then let the run job worker
go idle. This is what the old threaded design did too [4]. Hope this
clears everything up.

Matt

[1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L144
[2] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L464
[3] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L397
[4] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_main.c#L1011

> If that's not the case any more then you have broken something else.
> 
> Regards,
> Christian.
> 
> > 
> > Cc: Thorsten Leemhuis <regressions at leemhuis.info>
> > Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov at gmail.com>
> > Closes: https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuOWtoeeE+q26zE+Q@mail.gmail.com/
> > Reported-and-tested-by: Mario Limonciello <mario.limonciello at amd.com>
> > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124
> > Link: https://lore.kernel.org/all/20240123021155.2775-1-mario.limonciello@amd.com/
> > Reported-by: Vlastimil Babka <vbabka at suse.cz>
> > Closes: https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd159a8@amd.com/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a
> > Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> > ---
> >   drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++--------
> >   1 file changed, 7 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 550492a7a031..85f082396d42 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1178,21 +1178,20 @@ static void drm_sched_run_job_work(struct work_struct *w)
> >   	struct drm_sched_entity *entity;
> >   	struct dma_fence *fence;
> >   	struct drm_sched_fence *s_fence;
> > -	struct drm_sched_job *sched_job;
> > +	struct drm_sched_job *sched_job = NULL;
> >   	int r;
> >   	if (READ_ONCE(sched->pause_submit))
> >   		return;
> > -	entity = drm_sched_select_entity(sched);
> > +	/* Find entity with a ready job */
> > +	while (!sched_job && (entity = drm_sched_select_entity(sched))) {
> > +		sched_job = drm_sched_entity_pop_job(entity);
> > +		if (!sched_job)
> > +			complete_all(&entity->entity_idle);
> > +	}
> >   	if (!entity)
> > -		return;
> > -
> > -	sched_job = drm_sched_entity_pop_job(entity);
> > -	if (!sched_job) {
> > -		complete_all(&entity->entity_idle);
> >   		return;	/* No more work */
> > -	}
> >   	s_fence = sched_job->s_fence;
> 
