[PATCH] drm/scheduler: fix race condition in load balancer
Christian König
christian.koenig at amd.com
Wed Jan 15 12:52:00 UTC 2020
Hi Nirmoy,
Am 15.01.20 um 12:04 schrieb Nirmoy:
> Hi Christian,
>
> On 1/14/20 5:01 PM, Christian König wrote:
>>
>>> Before this patch:
>>>
>>> sched_name num of many times it got scheduled
>>> ========= ==================================
>>> sdma0 314
>>> sdma1 32
>>> comp_1.0.0 56
>>> comp_1.1.0 0
>>> comp_1.1.1 0
>>> comp_1.2.0 0
>>> comp_1.2.1 0
>>> comp_1.3.0 0
>>> comp_1.3.1 0
>>>
>>> After this patch:
>>>
>>> sched_name num of many times it got scheduled
>>> ========= ==================================
>>> sdma1 243
>>> sdma0 164
>>> comp_1.0.1 14
>>> comp_1.1.0 11
>>> comp_1.1.1 10
>>> comp_1.2.0 15
>>> comp_1.2.1 14
>>> comp_1.3.0 10
>>> comp_1.3.1 10
>>
>> Well that is still rather nice to have, why does that happen?
>
> I think I know why it happens. At init all entity's rq gets assigned
> to sched_list[0]. I put some prints to check what we compare in
> drm_sched_entity_get_free_sched.
>
> It turns out most of the time it compares zero values(num_jobs(0) <
> min_jobs(0)) so most of the time 1st rq(sdma0, comp_1.0.0) was picked
> by drm_sched_entity_get_free_sched.
Well that is expected because the unit tests always does
submission,wait,submission,wait,submission,wait.... So the number of
jobs in the scheduler becomes zero in between.
> This patch was not correct , had an extra atomic_inc(num_jobs) in
> drm_sched_job_init. This probably added bit of randomness I think,
> which helped in better job distribution.
Mhm, that might not be a bad idea after all. We could rename num_jobs
into something like like score and do a +1 in drm_sched_rq_add_entity()
and a -1 in drm_sched_rq_remove_entity().
That should have pretty much the effect we want to have.
> I've updated my previous RFC patch which uses time consumed by each
> sched for load balance with a twist of ignoring previously scheduled
> sched/rq. Let me know what do you think.
I didn't had time yet to wrap my head around that in detail, but at
least of hand Luben is right that the locking looks really awkward.
And I would rather like to avoid a larger change like this for a nice to
have for testing feature.
Regards,
Christian.
>
>
> Regards,
>
> Nirmoy
>
>>
>> Christian.
>>
>>
More information about the amd-gfx
mailing list