[RFC 1/4] drm/sched: Add locking to drm_sched_entity_modify_sched
Christian König
christian.koenig at amd.com
Mon Sep 9 12:18:49 UTC 2024
Am 09.09.24 um 14:13 schrieb Philipp Stanner:
> On Mon, 2024-09-09 at 13:29 +0200, Christian König wrote:
>> Am 09.09.24 um 11:44 schrieb Philipp Stanner:
>>> On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
>>>> From: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>>>
>>>> Without the locking amdgpu currently can race
>>>> amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(),
>>> I would explicitly say "amdgpu's amdgpu_ctx_set_entity_priority()
>>> races
>>> through drm_sched_entity_modify_sched() with drm_sched_job_arm()".
>>>
>>> The actual issue then seems to be drm_sched_job_arm() calling
>>> drm_sched_entity_select_rq(). I would mention that, too.
>>>
>>>
>>>> leading to the
>>>> latter accesing potentially inconsitent entity->sched_list and
>>>> entity->num_sched_list pair.
>>>>
>>>> The comment on drm_sched_entity_modify_sched() however says:
>>>>
>>>> """
>>>> * Note that this must be called under the same common lock for
>>>> @entity as
>>>> * drm_sched_job_arm() and drm_sched_entity_push_job(), or the
>>>> driver
>>>> needs to
>>>> * guarantee through some other means that this is never called
>>>> while
>>>> new jobs
>>>> * can be pushed to @entity.
>>>> """
>>>>
>>>> It is unclear if that is referring to this race or something
>>>> else.
>>> That comment is indeed a bit awkward. Both
>>> drm_sched_entity_push_job()
>>> and drm_sched_job_arm() take rq_lock. But
>>> drm_sched_entity_modify_sched() doesn't.
>>>
>>> The comment was written in 981b04d968561. Interestingly, in
>>> drm_sched_entity_push_job(), this "common lock" is mentioned with
>>> the
>>> soft requirement word "should" and apparently is more about keeping
>>> sequence numbers in order when inserting.
>>>
>>> I tend to think that the issue discovered by you is unrelated to
>>> that
>>> comment. But if no one can make sense of the comment, should it
>>> maybe
>>> be removed? Confusing comment is arguably worse than no comment
>> Agree, we probably mixed up in 981b04d968561 that submission needs a
>> common lock and that rq/priority needs to be protected by the
>> rq_lock.
>>
>> There is also the big FIXME in the drm_sched_entity documentation
>> pointing out that this is most likely not implemented correctly.
>>
>> I suggest to move the rq, priority and rq_lock fields together in the
>> drm_sched_entity structure and document that rq_lock is protecting
>> the two.
> That could also be a great opportunity for improving the lock naming:
Well that comment made me laugh because I point out the same when the
scheduler came out ~8years ago and nobody cared about it since then.
But yeah completely agree :)
Christian.
>
> void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
> {
> /*
> * Both locks need to be grabbed, one to protect from entity->rq change
> * for entity from within concurrent drm_sched_entity_select_rq and the
> * other to update the rb tree structure.
> */
> spin_lock(&entity->rq_lock);
> spin_lock(&entity->rq->lock);
>
> [...]
>
>
> P.
>
>
>> Then audit the code if all users of rq and priority actually hold the
>> correct locks while reading and writing them.
>>
>> Regards,
>> Christian.
>>
>>> P.
>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>>> Fixes: b37aced31eb0 ("drm/scheduler: implement a function to
>>>> modify
>>>> sched list")
>>>> Cc: Christian König <christian.koenig at amd.com>
>>>> Cc: Alex Deucher <alexander.deucher at amd.com>
>>>> Cc: Luben Tuikov <ltuikov89 at gmail.com>
>>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>>> Cc: David Airlie <airlied at gmail.com>
>>>> Cc: Daniel Vetter <daniel at ffwll.ch>
>>>> Cc: dri-devel at lists.freedesktop.org
>>>> Cc: <stable at vger.kernel.org> # v5.7+
>>>> ---
>>>> drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
>>>> 1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> index 58c8161289fe..ae8be30472cd 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> @@ -133,8 +133,10 @@ void drm_sched_entity_modify_sched(struct
>>>> drm_sched_entity *entity,
>>>> {
>>>> WARN_ON(!num_sched_list || !sched_list);
>>>>
>>>> + spin_lock(&entity->rq_lock);
>>>> entity->sched_list = sched_list;
>>>> entity->num_sched_list = num_sched_list;
>>>> + spin_unlock(&entity->rq_lock);
>>>> }
>>>> EXPORT_SYMBOL(drm_sched_entity_modify_sched);
>>>>
More information about the amd-gfx
mailing list