[PATCH] drm/scheduler: fix race condition in load balancer

Nirmoy nirmodas at amd.com
Wed Jan 15 11:04:04 UTC 2020


Hi Christian,

On 1/14/20 5:01 PM, Christian König wrote:
>
>> Before this patch:
>>
>> sched_name     number of times it got scheduled
>> =========      ==================================
>> sdma0          314
>> sdma1          32
>> comp_1.0.0     56
>> comp_1.1.0     0
>> comp_1.1.1     0
>> comp_1.2.0     0
>> comp_1.2.1     0
>> comp_1.3.0     0
>> comp_1.3.1     0
>>
>> After this patch:
>>
>> sched_name     number of times it got scheduled
>> =========      ==================================
>> sdma1          243
>> sdma0          164
>> comp_1.0.1     14
>> comp_1.1.0     11
>> comp_1.1.1     10
>> comp_1.2.0     15
>> comp_1.2.1     14
>> comp_1.3.0     10
>> comp_1.3.1     10
>
> Well that is still rather nice to have, why does that happen?

I think I know why it happens. At init, every entity's rq gets assigned to 
sched_list[0]. I added some prints to check what we actually compare in 
drm_sched_entity_get_free_sched.

It turns out that most of the time we compare zero values (num_jobs(0) < 
min_jobs(0)), so most of the time the 1st rq (sdma0, comp_1.0.0) was picked 
by drm_sched_entity_get_free_sched.
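
Roughly, the selection loop looks like this (a simplified sketch, not the
exact kernel code; it assumes the per-scheduler num_jobs counter that
drm_sched_entity_get_free_sched reads, and the helper name is made up):

/*
 * Simplified sketch of the selection loop in
 * drm_sched_entity_get_free_sched() -- not the exact kernel code.
 * While every num_jobs counter is still zero, the strict "<" never
 * replaces the first candidate, so the first scheduler in sched_list
 * (sdma0 / comp_1.0.0) wins every time.
 */
static struct drm_gpu_scheduler *
pick_least_loaded(struct drm_gpu_scheduler **sched_list,
		  unsigned int num_sched_list)
{
	struct drm_gpu_scheduler *picked = NULL;
	unsigned int min_jobs = UINT_MAX;
	unsigned int i;

	for (i = 0; i < num_sched_list; i++) {
		unsigned int num_jobs = atomic_read(&sched_list[i]->num_jobs);

		if (num_jobs < min_jobs) {	/* 0 < 0 is false, first entry sticks */
			min_jobs = num_jobs;
			picked = sched_list[i];
		}
	}

	return picked;
}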


This patch was not correct; it had an extra atomic_inc(num_jobs) in 
drm_sched_job_init. That probably added a bit of randomness, which 
helped with better job distribution.

I've updated my previous RFC patch, which uses the time consumed by each 
sched for load balancing, with a twist of ignoring the previously 
scheduled sched/rq. Let me know what you think.
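
The idea is roughly the following (just a sketch to illustrate, not the
actual patch; the helper and the accumulated job_time field are
hypothetical names I'm using here):

/*
 * Sketch only: pick the scheduler that has consumed the least GPU time
 * so far, but skip the scheduler the entity is currently assigned to so
 * consecutive picks don't pile onto the same ring.
 */
static struct drm_gpu_scheduler *
pick_by_consumed_time(struct drm_sched_entity *entity)
{
	struct drm_gpu_scheduler *picked = NULL;
	ktime_t min_time = KTIME_MAX;
	unsigned int i;

	for (i = 0; i < entity->num_sched_list; i++) {
		struct drm_gpu_scheduler *sched = entity->sched_list[i];

		/* ignore the previously scheduled sched/rq */
		if (entity->rq && sched == entity->rq->sched)
			continue;

		/* sched->job_time: hypothetical accumulated job time */
		if (ktime_before(sched->job_time, min_time)) {
			min_time = sched->job_time;
			picked = sched;
		}
	}

	return picked;
}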


Regards,

Nirmoy

>
> Christian.
>
>

