[RFC 1/4] drm/sched: Add locking to drm_sched_entity_modify_sched

Tvrtko Ursulin tvrtko.ursulin at igalia.com
Mon Sep 9 13:27:15 UTC 2024


On 09/09/2024 13:46, Philipp Stanner wrote:
> On Mon, 2024-09-09 at 13:37 +0100, Tvrtko Ursulin wrote:
>>
>> On 09/09/2024 13:18, Christian König wrote:
>>> Am 09.09.24 um 14:13 schrieb Philipp Stanner:
>>>> On Mon, 2024-09-09 at 13:29 +0200, Christian König wrote:
>>>>> Am 09.09.24 um 11:44 schrieb Philipp Stanner:
>>>>>> On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
>>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>>>>>>
>>>>>>> Without the locking amdgpu currently can race
>>>>>>> amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(),
>>>>>> I would explicitly say "amdgpu's amdgpu_ctx_set_entity_priority()
>>>>>> races through drm_sched_entity_modify_sched() with drm_sched_job_arm()".
>>>>>>
>>>>>> The actual issue then seems to be drm_sched_job_arm() calling
>>>>>> drm_sched_entity_select_rq(). I would mention that, too.
>>>>>>
>>>>>>
>>>>>>> leading to the latter accessing a potentially inconsistent
>>>>>>> entity->sched_list and entity->num_sched_list pair.
>>>>>>>
>>>>>>> The comment on drm_sched_entity_modify_sched() however says:
>>>>>>>
>>>>>>> """
>>>>>>>  * Note that this must be called under the same common lock for @entity as
>>>>>>>  * drm_sched_job_arm() and drm_sched_entity_push_job(), or the driver needs to
>>>>>>>  * guarantee through some other means that this is never called while new jobs
>>>>>>>  * can be pushed to @entity.
>>>>>>> """
>>>>>>>
>>>>>>> It is unclear if that is referring to this race or something else.
>>>>>> That comment is indeed a bit awkward. Both
>>>>>> drm_sched_entity_push_job()
>>>>>> and drm_sched_job_arm() take rq_lock. But
>>>>>> drm_sched_entity_modify_sched() doesn't.
>>>>>>
>>>>>> The comment was written in 981b04d968561. Interestingly, in
>>>>>> drm_sched_entity_push_job(), this "common lock" is mentioned with the
>>>>>> soft requirement word "should" and apparently is more about keeping
>>>>>> sequence numbers in order when inserting.
>>>>>>
>>>>>> I tend to think that the issue discovered by you is unrelated to that
>>>>>> comment. But if no one can make sense of the comment, should it maybe
>>>>>> be removed? A confusing comment is arguably worse than no comment.
>>>>> Agree, we probably mixed up in 981b04d968561 that submission needs a
>>>>> common lock and that rq/priority needs to be protected by the rq_lock.
>>>>>
>>>>> There is also the big FIXME in the drm_sched_entity documentation
>>>>> pointing out that this is most likely not implemented correctly.
>>>>>
>>>>> I suggest moving the rq, priority and rq_lock fields together in the
>>>>> drm_sched_entity structure and documenting that rq_lock protects the
>>>>> other two.
>>>> That could also be a great opportunity for improving the lock naming:
>>>
>>> Well, that comment made me laugh because I pointed out the same when the
>>> scheduler came out ~8 years ago and nobody has cared about it since then.
>>>
>>> But yeah, completely agree :)
>>
>> Maybe, but we need to keep in sight the fact that some of these fixes may
>> be good to backport, in which case re-naming exercises are best left to
>> follow-up patches.
> 
> That is basically my argument. It's good if fixes and other improvements
> are separated, in general, unless there is a practical / good reason not
> to.

Ah cool, I am happy to add follow-up patches after the fixes.

>> Also..
>>
>>>> void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
>>>> {
>>>> 	/*
>>>> 	 * Both locks need to be grabbed, one to protect from entity->rq change
>>>> 	 * for entity from within concurrent drm_sched_entity_select_rq and the
>>>> 	 * other to update the rb tree structure.
>>>> 	 */
>>>> 	spin_lock(&entity->rq_lock);
>>>> 	spin_lock(&entity->rq->lock);
>>
>> .. I agree this is quite unreadable and my initial reaction was a similar
>> ugh. However, what names would you guys suggest, and for what, to make
>> this better without lessening the logic of naming each lock individually?
> 
> According to the documentation, drm_sched_rq.lock does not protect the
> entire runqueue, but "@lock: to modify the entities list."
> 
> So I would keep drm_sched_entity.rq_lock as it is, because it indeed
> protects the entire runqueue.

Agreed on entity->rq_lock.
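
To spell out what it guards against, the race from the patch is roughly the
following interleaving (a sketch of the problematic ordering, not verbatim
code):

/*
 *  CPU0: drm_sched_entity_modify_sched()   CPU1: drm_sched_job_arm()
 *                                            -> drm_sched_entity_select_rq()
 *    entity->sched_list = new_list;
 *                                               reads entity->sched_list (new)
 *                                               reads entity->num_sched_list (old)
 *    entity->num_sched_list = new_num;
 *
 * CPU1 can observe a mismatched sched_list/num_sched_list pair, hence both
 * the stores and the reader need to be under entity->rq_lock.
 */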

> And drm_sched_rq.lock could be named "entities_lock" or
> "entities_list_lock" or something. That's debatable, but it should be
> something that highlights that this lock is not for locking the entire
> runqueue as the one in the entity apparently is.

AFAICT it also protects rq->current_entity and rq->rb_tree_root, in which
case it is a bit more tricky. Only rq->sched is outside its scope. Hm.
Maybe just re-arrange the struct to be like:

struct drm_sched_rq {
	struct drm_gpu_scheduler	*sched;

	spinlock_t			lock;
	/* The following members are protected by @lock: */
	struct list_head		entities;
	struct drm_sched_entity		*current_entity;
	struct rb_root_cached		rb_tree_root;
};

I have no ideas for better naming. But this would be in line with
Christian's suggestion for tidying the order in drm_sched_entity.
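
Mirroring that on the entity side could then look something like this (just
a sketch, with the surrounding members elided):

struct drm_sched_entity {
	/* ... */

	spinlock_t			rq_lock;
	/* The following members are protected by @rq_lock: */
	struct drm_sched_rq		*rq;
	enum drm_sched_priority		priority;

	/* ... */
};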

I am also not sure what the point is of setting rq->current_entity in
drm_sched_rq_select_entity_fifo().
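
From memory the FIFO path does roughly the following (a simplified sketch
from my recollection, not the verbatim kernel code):

static struct drm_sched_entity *
drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
{
	struct rb_node *rb;

	spin_lock(&rq->lock);
	for (rb = rb_first_cached(&rq->rb_tree_root); rb; rb = rb_next(rb)) {
		struct drm_sched_entity *entity;

		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
		if (drm_sched_entity_is_ready(entity)) {
			/* Stored here, but nothing in the FIFO path reads it. */
			rq->current_entity = entity;
			break;
		}
	}
	spin_unlock(&rq->lock);

	return rb ? rb_entry(rb, struct drm_sched_entity, rb_tree_node) : NULL;
}

Unlike the round-robin variant, which uses rq->current_entity as a cursor
to continue the scan from where it left off, the FIFO variant never appears
to read it back, so the store looks vestigial to me.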

Regards,

Tvrtko

> 
> 
> Cheers,
> P.
> 
>>
>> Regards,
>>
>> Tvrtko
>>
>>>> [...]
>>>>
>>>>
>>>> P.
>>>>
>>>>
>>>>> Then audit the code to check whether all users of rq and priority
>>>>> actually hold the correct locks while reading and writing them.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> P.
>>>>>>
>>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>>>>>> Fixes: b37aced31eb0 ("drm/scheduler: implement a function to modify
>>>>>>> sched list")
>>>>>>> Cc: Christian König <christian.koenig at amd.com>
>>>>>>> Cc: Alex Deucher <alexander.deucher at amd.com>
>>>>>>> Cc: Luben Tuikov <ltuikov89 at gmail.com>
>>>>>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>>>>>> Cc: David Airlie <airlied at gmail.com>
>>>>>>> Cc: Daniel Vetter <daniel at ffwll.ch>
>>>>>>> Cc: dri-devel at lists.freedesktop.org
>>>>>>> Cc: <stable at vger.kernel.org> # v5.7+
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
>>>>>>>     1 file changed, 2 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> index 58c8161289fe..ae8be30472cd 100644
>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> @@ -133,8 +133,10 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>>>>>>>  {
>>>>>>>  	WARN_ON(!num_sched_list || !sched_list);
>>>>>>>
>>>>>>> +	spin_lock(&entity->rq_lock);
>>>>>>>  	entity->sched_list = sched_list;
>>>>>>>  	entity->num_sched_list = num_sched_list;
>>>>>>> +	spin_unlock(&entity->rq_lock);
>>>>>>>  }
>>>>>>>  EXPORT_SYMBOL(drm_sched_entity_modify_sched);
>>>
>>
> 

