[RFC 1/4] drm/sched: Add locking to drm_sched_entity_modify_sched

Christian König christian.koenig at amd.com
Mon Sep 9 11:29:27 UTC 2024


Am 09.09.24 um 11:44 schrieb Philipp Stanner:
> On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>
>> Without the locking amdgpu currently can race
>> amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(),
> I would explicitly say "amdgpu's amdgpu_ctx_set_entity_priority() races
> through drm_sched_entity_modify_sched() with drm_sched_job_arm()".
>
> The actual issue then seems to be drm_sched_job_arm() calling
> drm_sched_entity_select_rq(). I would mention that, too.
>
>
>> leading to the
>> latter accesing potentially inconsitent entity->sched_list and
>> entity->num_sched_list pair.
>>
>> The comment on drm_sched_entity_modify_sched() however says:
>>
>> """
>>   * Note that this must be called under the same common lock for
>> @entity as
>>   * drm_sched_job_arm() and drm_sched_entity_push_job(), or the driver
>> needs to
>>   * guarantee through some other means that this is never called while
>> new jobs
>>   * can be pushed to @entity.
>> """
>>
>> It is unclear if that is referring to this race or something else.
> That comment is indeed a bit awkward. Both drm_sched_entity_push_job()
> and drm_sched_job_arm() take rq_lock. But
> drm_sched_entity_modify_sched() doesn't.
>
> The comment was written in 981b04d968561. Interestingly, in
> drm_sched_entity_push_job(), this "common lock" is mentioned with the
> soft requirement word "should" and apparently is more about keeping
> sequence numbers in order when inserting.
>
> I tend to think that the issue discovered by you is unrelated to that
> comment. But if no one can make sense of the comment, should it maybe
> be removed? Confusing comment is arguably worse than no comment

Agree, we probably mixed up in 981b04d968561 that submission needs a 
common lock and that rq/priority needs to be protected by the rq_lock.

There is also the big FIXME in the drm_sched_entity documentation 
pointing out that this is most likely not implemented correctly.

I suggest to move the rq, priority and rq_lock fields together in the 
drm_sched_entity structure and document that rq_lock is protecting the two.

Then audit the code if all users of rq and priority actually hold the 
correct locks while reading and writing them.

Regards,
Christian.

>
> P.
>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>> Fixes: b37aced31eb0 ("drm/scheduler: implement a function to modify
>> sched list")
>> Cc: Christian König <christian.koenig at amd.com>
>> Cc: Alex Deucher <alexander.deucher at amd.com>
>> Cc: Luben Tuikov <ltuikov89 at gmail.com>
>> Cc: Matthew Brost <matthew.brost at intel.com>
>> Cc: David Airlie <airlied at gmail.com>
>> Cc: Daniel Vetter <daniel at ffwll.ch>
>> Cc: dri-devel at lists.freedesktop.org
>> Cc: <stable at vger.kernel.org> # v5.7+
>> ---
>>   drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>> b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 58c8161289fe..ae8be30472cd 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -133,8 +133,10 @@ void drm_sched_entity_modify_sched(struct
>> drm_sched_entity *entity,
>>   {
>>   	WARN_ON(!num_sched_list || !sched_list);
>>   
>> +	spin_lock(&entity->rq_lock);
>>   	entity->sched_list = sched_list;
>>   	entity->num_sched_list = num_sched_list;
>> +	spin_unlock(&entity->rq_lock);
>>   }
>>   EXPORT_SYMBOL(drm_sched_entity_modify_sched);
>>   



More information about the amd-gfx mailing list