[RFC 1/4] drm/sched: Add locking to drm_sched_entity_modify_sched

Philipp Stanner pstanner at redhat.com
Mon Sep 9 12:13:50 UTC 2024


On Mon, 2024-09-09 at 13:29 +0200, Christian König wrote:
> Am 09.09.24 um 11:44 schrieb Philipp Stanner:
> > On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
> > > From: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
> > > 
> > > Without the locking amdgpu currently can race
> > > amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(),
> > I would explicitly say "amdgpu's amdgpu_ctx_set_entity_priority()
> > races
> > through drm_sched_entity_modify_sched() with drm_sched_job_arm()".
> > 
> > The actual issue then seems to be drm_sched_job_arm() calling
> > drm_sched_entity_select_rq(). I would mention that, too.
> > 
> > 
> > > leading to the
> > > latter accessing a potentially inconsistent entity->sched_list
> > > and entity->num_sched_list pair.
> > > 
> > > The comment on drm_sched_entity_modify_sched() however says:
> > > 
> > > """
> > >   * Note that this must be called under the same common lock for
> > > @entity as
> > >   * drm_sched_job_arm() and drm_sched_entity_push_job(), or the
> > > driver
> > > needs to
> > >   * guarantee through some other means that this is never called
> > > while
> > > new jobs
> > >   * can be pushed to @entity.
> > > """
> > > 
> > > It is unclear if that is referring to this race or something
> > > else.
> > That comment is indeed a bit awkward. Both
> > drm_sched_entity_push_job()
> > and drm_sched_job_arm() take rq_lock. But
> > drm_sched_entity_modify_sched() doesn't.
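
(Spelling the racing path out, roughly from memory -- details not
double-checked against the current tree:

drm_sched_job_arm()
  -> drm_sched_entity_select_rq()
	spin_lock(&entity->rq_lock);
	sched = drm_sched_pick_best(entity->sched_list,
				    entity->num_sched_list);
	...
	spin_unlock(&entity->rq_lock);

while drm_sched_entity_modify_sched() so far writes sched_list and
num_sched_list without taking any lock at all.)
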
> > 
> > The comment was written in 981b04d968561. Interestingly, in
> > drm_sched_entity_push_job(), this "common lock" is mentioned only
> > with the soft requirement "should", and apparently it is more about
> > keeping sequence numbers in order when inserting.
> > 
> > I tend to think that the issue you discovered is unrelated to that
> > comment. But if no one can make sense of the comment, should it
> > maybe be removed? A confusing comment is arguably worse than no
> > comment.
> 
> Agreed, in 981b04d968561 we probably mixed up the requirement that
> submission needs a common lock with the requirement that rq/priority
> need to be protected by the rq_lock.
> 
> There is also the big FIXME in the drm_sched_entity documentation 
> pointing out that this is most likely not implemented correctly.
> 
> I suggest moving the rq, priority and rq_lock fields together in the
> drm_sched_entity structure and documenting that rq_lock protects the
> other two.
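
Sounds good to me. A rough sketch of that grouping (the kernel-doc
wording is just a draft on my side):

struct drm_sched_entity {
	...

	/**
	 * @rq_lock:
	 *
	 * Protects @rq and @priority. Must be held whenever either of
	 * the two is read or written.
	 */
	spinlock_t			rq_lock;

	/**
	 * @rq: Runqueue on which this entity is currently scheduled.
	 * Protected by @rq_lock.
	 */
	struct drm_sched_rq		*rq;

	/**
	 * @priority: Priority of the entity. Protected by @rq_lock.
	 */
	enum drm_sched_priority		priority;

	...
};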

That could also be a great opportunity for improving the lock naming:

void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
{
	/*
	 * Both locks need to be grabbed: one to protect against entity->rq
	 * changing underneath us from a concurrent drm_sched_entity_select_rq(),
	 * and the other to protect the rb tree while it is updated.
	 */
	spin_lock(&entity->rq_lock);
	spin_lock(&entity->rq->lock);

[...]
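
For example (just a sketch, the final name is of course bikeshed
material):

	/* hypothetical rename: entity->rq_lock -> entity->lock, so it
	 * no longer reads almost identically to entity->rq->lock */
	spin_lock(&entity->lock);
	spin_lock(&entity->rq->lock);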


P.


> 
> Then audit the code to check whether all users of rq and priority
> actually hold the correct locks while reading and writing them.
> 
> Regards,
> Christian.
> 
> > 
> > P.
> > 
> > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
> > > Fixes: b37aced31eb0 ("drm/scheduler: implement a function to modify sched list")
> > > Cc: Christian König <christian.koenig at amd.com>
> > > Cc: Alex Deucher <alexander.deucher at amd.com>
> > > Cc: Luben Tuikov <ltuikov89 at gmail.com>
> > > Cc: Matthew Brost <matthew.brost at intel.com>
> > > Cc: David Airlie <airlied at gmail.com>
> > > Cc: Daniel Vetter <daniel at ffwll.ch>
> > > Cc: dri-devel at lists.freedesktop.org
> > > Cc: <stable at vger.kernel.org> # v5.7+
> > > ---
> > >   drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
> > >   1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > > index 58c8161289fe..ae8be30472cd 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > @@ -133,8 +133,10 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
> > >   {
> > >   	WARN_ON(!num_sched_list || !sched_list);
> > >   
> > > +	spin_lock(&entity->rq_lock);
> > >   	entity->sched_list = sched_list;
> > >   	entity->num_sched_list = num_sched_list;
> > > +	spin_unlock(&entity->rq_lock);
> > >   }
> > >   EXPORT_SYMBOL(drm_sched_entity_modify_sched);
> > >   
> 


