[PATCH] drm/scheduler: fix race condition in load balancer
Nirmoy
nirmodas at amd.com
Wed Jan 15 11:04:04 UTC 2020
Hi Christian,
On 1/14/20 5:01 PM, Christian König wrote:
>
>> Before this patch:
>>
>> sched_name     number of times it got scheduled
>> ==========     ================================
>> sdma0 314
>> sdma1 32
>> comp_1.0.0 56
>> comp_1.1.0 0
>> comp_1.1.1 0
>> comp_1.2.0 0
>> comp_1.2.1 0
>> comp_1.3.0 0
>> comp_1.3.1 0
>>
>> After this patch:
>>
>> sched_name     number of times it got scheduled
>> ==========     ================================
>> sdma1 243
>> sdma0 164
>> comp_1.0.1 14
>> comp_1.1.0 11
>> comp_1.1.1 10
>> comp_1.2.0 15
>> comp_1.2.1 14
>> comp_1.3.0 10
>> comp_1.3.1 10
>
> Well that is still rather nice to have, why does that happen?
I think I know why it happens. At init, every entity's rq gets assigned
to sched_list[0]. I put some prints in to check what we compare in
drm_sched_entity_get_free_sched. It turns out that most of the time it
compares zero values (num_jobs(0) < min_jobs(0)), so most of the time
the 1st rq (sdma0, comp_1.0.0) was picked by
drm_sched_entity_get_free_sched.
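Roughly, the selection loop looks like this (a simplified sketch of
drm_sched_entity_get_free_sched, not the exact upstream code):

static struct drm_sched_rq *
drm_sched_entity_get_free_sched(struct drm_sched_entity *entity)
{
        struct drm_sched_rq *rq = NULL;
        unsigned int min_jobs = UINT_MAX, num_jobs;
        int i;

        for (i = 0; i < entity->num_sched_list; ++i) {
                /* At init every sched's num_jobs is still 0, so after
                 * the first iteration nothing compares smaller and the
                 * first scheduler in sched_list[] wins almost always.
                 */
                num_jobs = atomic_read(&entity->sched_list[i]->num_jobs);
                if (num_jobs < min_jobs) {
                        min_jobs = num_jobs;
                        rq = &entity->sched_list[i]->sched_rq[entity->priority];
                }
        }

        return rq;
}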
This patch was not correct: it had an extra atomic_inc(num_jobs) in
drm_sched_job_init. I think that added a bit of randomness, which
helped the job distribution.
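For illustration, the stray hunk amounted to something like this
(reconstructed, not the actual diff):

int drm_sched_job_init(struct drm_sched_job *job,
                       struct drm_sched_entity *entity,
                       void *owner)
{
        /* ... normal init ... */

        /* The stray increment: num_jobs is already bumped on the push
         * path, so counting here as well inflates the job count of
         * whichever sched the entity currently points at and perturbs
         * the num_jobs comparison above.
         */
        atomic_inc(&job->sched->num_jobs);

        return 0;
}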
I've updated my previous RFC patch, which uses the time consumed by
each sched for load balancing, with a twist: it now ignores the
previously scheduled sched/rq. Let me know what you think.
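Sketch of the idea (drm_sched_entity_get_idle_sched and the elapsed_ns
counter are illustrative names only, not what the patch actually adds):

static struct drm_sched_rq *
drm_sched_entity_get_idle_sched(struct drm_sched_entity *entity)
{
        struct drm_sched_rq *rq = NULL;
        u64 min_time = U64_MAX, t;
        int i;

        for (i = 0; i < entity->num_sched_list; ++i) {
                struct drm_gpu_scheduler *sched = entity->sched_list[i];

                /* The twist: skip the sched/rq this entity used last
                 * time so consecutive jobs spread out instead of
                 * re-picking the same scheduler.
                 */
                if (entity->rq && entity->rq->sched == sched)
                        continue;

                /* elapsed_ns: illustrative per-sched counter of total
                 * time consumed by jobs on this scheduler.
                 */
                t = atomic64_read(&sched->elapsed_ns);
                if (t < min_time) {
                        min_time = t;
                        rq = &sched->sched_rq[entity->priority];
                }
        }

        return rq;
}

Even when all the time counters are equal (e.g. right after init),
skipping the previously used rq already breaks the tie that made
sched_list[0] win every time.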
Regards,
Nirmoy
>
> Christian.
>
>