[PATCH] drm/scheduler: fix race condition in load balancer
Christian König
christian.koenig at amd.com
Tue Jan 14 16:23:26 UTC 2020
On 14.01.20 at 17:20, Nirmoy wrote:
>
> On 1/14/20 5:01 PM, Christian König wrote:
>>> On 14.01.20 at 16:43, Nirmoy Das wrote:
>>> Jobs submitted to an entity should execute in the order those jobs
>>> are submitted. We make sure of that by checking entity->job_queue in
>>> drm_sched_entity_select_rq(), so that we don't load-balance jobs
>>> within an entity.
>>>
>>> But because we update entity->job_queue later, in
>>> drm_sched_entity_push_job(), there remains an open window during
>>> which entity->rq might get updated by drm_sched_entity_select_rq(),
>>> which should not be allowed.
>>
>> NAK, concurrent calls to
>> drm_sched_job_init()/drm_sched_entity_push_job() are not allowed in
>> the first place or otherwise we mess up the fence sequence order and
>> risk memory corruption.
> If I am not missing something, I don't see any lock protecting the
> drm_sched_job_init()/drm_sched_entity_push_job() calls in
> amdgpu_cs_submit().
See one step up in the call chain, in amdgpu_cs_ioctl().
This is what locks the page tables, which also makes access to the
context and entities mutually exclusive:
> r = amdgpu_cs_parser_bos(&parser, data);
...
> r = amdgpu_cs_submit(&parser, cs);
>
> out:
And here the page tables are unlocked again:
> amdgpu_cs_parser_fini(&parser, r, reserved_buffers);
Regards,
Christian.
>
>
> Regards,
>
> Nirmoy
>
More information about the amd-gfx mailing list