[RFC 1/4] drm/sched: Add locking to drm_sched_entity_modify_sched
Tvrtko Ursulin
tvrtko.ursulin at igalia.com
Mon Sep 9 12:37:10 UTC 2024
On 09/09/2024 13:18, Christian König wrote:
> Am 09.09.24 um 14:13 schrieb Philipp Stanner:
>> On Mon, 2024-09-09 at 13:29 +0200, Christian König wrote:
>>> Am 09.09.24 um 11:44 schrieb Philipp Stanner:
>>>> On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
>>>>> From: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>>>>
>>>>> Without the locking amdgpu currently can race
>>>>> amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(),
>>>> I would explicitly say "amdgpu's amdgpu_ctx_set_entity_priority()
>>>> races
>>>> through drm_sched_entity_modify_sched() with drm_sched_job_arm()".
>>>>
>>>> The actual issue then seems to be drm_sched_job_arm() calling
>>>> drm_sched_entity_select_rq(). I would mention that, too.
>>>>
>>>>
>>>>> leading to the
>>>>> latter accesing potentially inconsitent entity->sched_list and
>>>>> entity->num_sched_list pair.
>>>>>
>>>>> The comment on drm_sched_entity_modify_sched() however says:
>>>>>
>>>>> """
>>>>> * Note that this must be called under the same common lock for
>>>>> @entity as
>>>>> * drm_sched_job_arm() and drm_sched_entity_push_job(), or the
>>>>> driver
>>>>> needs to
>>>>> * guarantee through some other means that this is never called
>>>>> while
>>>>> new jobs
>>>>> * can be pushed to @entity.
>>>>> """
>>>>>
>>>>> It is unclear if that is referring to this race or something
>>>>> else.
>>>> That comment is indeed a bit awkward. Both
>>>> drm_sched_entity_push_job()
>>>> and drm_sched_job_arm() take rq_lock. But
>>>> drm_sched_entity_modify_sched() doesn't.
>>>>
>>>> The comment was written in 981b04d968561. Interestingly, in
>>>> drm_sched_entity_push_job(), this "common lock" is mentioned with
>>>> the
>>>> soft requirement word "should" and apparently is more about keeping
>>>> sequence numbers in order when inserting.
>>>>
>>>> I tend to think that the issue discovered by you is unrelated to
>>>> that
>>>> comment. But if no one can make sense of the comment, should it
>>>> maybe
>>>> be removed? Confusing comment is arguably worse than no comment
>>> Agree, we probably mixed up in 981b04d968561 that submission needs a
>>> common lock and that rq/priority needs to be protected by the
>>> rq_lock.
>>>
>>> There is also the big FIXME in the drm_sched_entity documentation
>>> pointing out that this is most likely not implemented correctly.
>>>
>>> I suggest to move the rq, priority and rq_lock fields together in the
>>> drm_sched_entity structure and document that rq_lock is protecting
>>> the two.
>> That could also be a great opportunity for improving the lock naming:
>
> Well that comment made me laugh because I point out the same when the
> scheduler came out ~8years ago and nobody cared about it since then.
>
> But yeah completely agree :)
Maybe, but we need to keep in sight the fact some of these fixes may be
good to backport. In which case re-naming exercises are best left to follow.
Also..
>> void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t
>> ts)
>> {
>> /*
>> * Both locks need to be grabbed, one to protect from entity->rq
>> change
>> * for entity from within concurrent drm_sched_entity_select_rq
>> and the
>> * other to update the rb tree structure.
>> */
>> spin_lock(&entity->rq_lock);
>> spin_lock(&entity->rq->lock);
.. I agree this is quite unredable and my initial reaction was a similar
ugh. However.. What names would you guys suggest and for what to make
this better and not lessen the logic of naming each individually?
Regards,
Tvrtko
>> [...]
>>
>>
>> P.
>>
>>
>>> Then audit the code if all users of rq and priority actually hold the
>>> correct locks while reading and writing them.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> P.
>>>>
>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>>>> Fixes: b37aced31eb0 ("drm/scheduler: implement a function to
>>>>> modify
>>>>> sched list")
>>>>> Cc: Christian König <christian.koenig at amd.com>
>>>>> Cc: Alex Deucher <alexander.deucher at amd.com>
>>>>> Cc: Luben Tuikov <ltuikov89 at gmail.com>
>>>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>>>> Cc: David Airlie <airlied at gmail.com>
>>>>> Cc: Daniel Vetter <daniel at ffwll.ch>
>>>>> Cc: dri-devel at lists.freedesktop.org
>>>>> Cc: <stable at vger.kernel.org> # v5.7+
>>>>> ---
>>>>> drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
>>>>> 1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> index 58c8161289fe..ae8be30472cd 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> @@ -133,8 +133,10 @@ void drm_sched_entity_modify_sched(struct
>>>>> drm_sched_entity *entity,
>>>>> {
>>>>> WARN_ON(!num_sched_list || !sched_list);
>>>>> + spin_lock(&entity->rq_lock);
>>>>> entity->sched_list = sched_list;
>>>>> entity->num_sched_list = num_sched_list;
>>>>> + spin_unlock(&entity->rq_lock);
>>>>> }
>>>>> EXPORT_SYMBOL(drm_sched_entity_modify_sched);
>
More information about the amd-gfx
mailing list