[PATCH] drm/xe: Grab pending job list lock before resubmit

Thu Jun 19 12:46:21 UTC 2025

> -----Original Message-----
> From: Brost, Matthew <matthew.brost at intel.com>
> Sent: 17 June 2025 20:49
> To: Auld, Matthew <matthew.auld at intel.com>
> Cc: Upadhyay, Tejas <tejas.upadhyay at intel.com>; intel-
> xe at lists.freedesktop.org
> Subject: Re: [PATCH] drm/xe: Grab pending job list lock before resubmit
> 
> On Tue, Jun 17, 2025 at 04:14:05PM +0100, Matthew Auld wrote:
> > On 17/06/2025 15:47, Tejas Upadhyay wrote:
> > > Pending job list is protected by pending job lock to avoid any
> > > racing thus grab pending job list lock before resubmitting jobs.
> > >
> > > Fixes: 38fafa9f392f ("drm/xe/sched: stop re-submitting signalled
> > > jobs")
> > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
> > > ---
> > >   drivers/gpu/drm/xe/xe_gpu_scheduler.h | 2 ++
> > >   1 file changed, 2 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> > > b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> > > index 308061f0cf37..dfdb409da6ba 100644
> > > --- a/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> > > +++ b/drivers/gpu/drm/xe/xe_gpu_scheduler.h
> > > @@ -53,6 +53,7 @@ static inline void xe_sched_resubmit_jobs(struct
> xe_gpu_scheduler *sched)
> > >   {
> > >   	struct drm_sched_job *s_job;
> > > +	spin_lock(&sched->base.job_list_lock);
> > >   	list_for_each_entry(s_job, &sched->base.pending_list, list) {
> > >   		struct drm_sched_fence *s_fence = s_job->s_fence;
> > >   		struct dma_fence *hw_fence = s_fence->parent; @@ -60,6
> +61,7 @@
> > > static inline void xe_sched_resubmit_jobs(struct xe_gpu_scheduler *sched)
> > >   		if (hw_fence && !dma_fence_is_signaled(hw_fence))
> > >   			sched->base.ops->run_job(s_job);
> >
> > Is this fixing a real issue? There is a sleeping lock and maybe other
> > nasties in run_job() AFAICT, so I don't think this will work as-is,
> > but I don't believe we need this, since sched should be stopped at
> > this point, so nothing else should be modifying pending_list until we
> > start it back up, which is only later?
> >
> 
> Matt Auld is right - the scheduler is stopped here so AFIAK it should work
> without &sched->base.job_list_lock lock. This patch itself won't work as
> run_job can grab the CT mutex which lock would not be happy about (i.e., you
> cannot hold an non-sleeping lock and then take a sleeping lock).
> 
> I'd vote for just dropping this patch.

Alright. Did not think it was scheduler stopped here. Thanks for comments. Dropping the patch.

Tejas
> 
> Matt
> 
> > >   	}
> > > +	spin_unlock(&sched->base.job_list_lock);
> > >   }
> > >   static inline bool
> >