[PATCH] drm/scheduler: fix race condition in load balancer
Nirmoy
nirmodas at amd.com
Tue Jan 14 16:27:44 UTC 2020
On 1/14/20 5:23 PM, Christian König wrote:
> On 14.01.20 at 17:20, Nirmoy wrote:
>>
>> On 1/14/20 5:01 PM, Christian König wrote:
>>> On 14.01.20 at 16:43, Nirmoy Das wrote:
>>>> Jobs submitted to an entity should execute in the order in which
>>>> they are submitted. We ensure that by checking entity->job_queue in
>>>> drm_sched_entity_select_rq(), so that we do not load-balance jobs
>>>> within an entity.
>>>>
>>>> But because entity->job_queue is only updated later, in
>>>> drm_sched_entity_push_job(), there remains an open window during
>>>> which entity->rq can still be updated by
>>>> drm_sched_entity_select_rq(), which should not be allowed.
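To spell out the window the commit message means, here is the relevant
path, condensed from memory of drm_sched_entity.c and the job init path
(so helper names like spsc_queue_count()/spsc_queue_push() and the exact
fields are what I recall and may differ slightly from the actual code):

    /* drm_sched_job_init() starts by (re)selecting a run queue: */
    void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
    {
            /* Skip load balancing while jobs are still queued on the
             * entity, so queued jobs keep their submission order. */
            if (spsc_queue_count(&entity->job_queue))
                    return;

            /* ... otherwise this may pick a new rq and write entity->rq. */
    }

    /* But entity->job_queue only grows later, in push_job: */
    void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
                                   struct drm_sched_entity *entity)
    {
            bool first = spsc_queue_push(&entity->job_queue,
                                         &sched_job->queue_node);
            /* the first job on an empty entity wakes up the scheduler */
    }

So between the spsc_queue_count() check for one submission and the
spsc_queue_push() of that job, nothing inside the scheduler itself stops
a second drm_sched_job_init() from re-running
drm_sched_entity_select_rq() and moving the entity to a different rq.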
>>>
>>> NAK, concurrent calls to
>>> drm_sched_job_init()/drm_sched_entity_push_job() are not allowed in
>>> the first place; otherwise we mess up the fence sequence order and
>>> risk memory corruption.
>> If I am not missing something, I don't see any lock serializing the
>> drm_sched_job_init()/drm_sched_entity_push_job() calls in
>> amdgpu_cs_submit().
>
> See one step up in the call chain, function amdgpu_cs_ioctl().
>
> This locks the page tables, which also makes access to the
> context and entities mutually exclusive:
>> r = amdgpu_cs_parser_bos(&parser, data);
> ...
>> r = amdgpu_cs_submit(&parser, cs);
>>
>> out:
>
> And here the page tables are unlocked again:
>> amdgpu_cs_parser_fini(&parser, r, reserved_buffers);
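If I follow the amdgpu side correctly, the flow you mean is roughly the
following (heavily condensed and from memory, so the locals and error
handling here are only approximate):

    int amdgpu_cs_ioctl(struct drm_device *dev, void *data,
                        struct drm_file *filp)
    {
            union drm_amdgpu_cs *cs = data;
            struct amdgpu_cs_parser parser = {};
            bool reserved_buffers = false;
            int r;

            /* Reserves the BOs, including the page tables ... */
            r = amdgpu_cs_parser_bos(&parser, data);
            ...
            /* ... so drm_sched_job_init()/drm_sched_entity_push_job()
             * inside amdgpu_cs_submit() run under that reservation. */
            r = amdgpu_cs_submit(&parser, cs);

    out:
            /* ... and the reservation is only dropped here. */
            amdgpu_cs_parser_fini(&parser, r, reserved_buffers);
            return r;
    }

That would indeed serialize two submissions to the same context before
they ever reach the scheduler.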
Okay, then something else is going on. Let me dig further.
Thanks,
Nirmoy
>
> Regards,
> Christian.
>
>>
>>
>> Regards,
>>
>> Nirmoy
>>
>