[PATCH] drm/scheduler: fix race condition in load balancer
Nirmoy
nirmodas at amd.com
Wed Jan 15 13:24:06 UTC 2020
>> I think I know why it happens. At init all entity's rq gets assigned
>> to sched_list[0]. I put some prints to check what we compare in
>> drm_sched_entity_get_free_sched.
>>
>> It turns out most of the time it compares zero values(num_jobs(0) <
>> min_jobs(0)) so most of the time 1st rq(sdma0, comp_1.0.0) was picked
>> by drm_sched_entity_get_free_sched.
>
> Well that is expected because the unit tests always does
> submission,wait,submission,wait,submission,wait.... So the number of
> jobs in the scheduler becomes zero in between.
Even with multiple parallel instances of amdgpu_test, I haven't seen any
improvement in the load balance.
>
>> This patch was not correct , had an extra atomic_inc(num_jobs) in
>> drm_sched_job_init. This probably added bit of randomness I think,
>> which helped in better job distribution.
>
> Mhm, that might not be a bad idea after all. We could rename num_jobs
> into something like like score and do a +1 in
> drm_sched_rq_add_entity() and a -1 in drm_sched_rq_remove_entity().
>
> That should have pretty much the effect we want to have.
That's sounds good as well. I will create a patch.
>
>> I've updated my previous RFC patch which uses time consumed by each
>> sched for load balance with a twist of ignoring previously scheduled
>> sched/rq. Let me know what do you think.
>
> I didn't had time yet to wrap my head around that in detail, but at
> least of hand Luben is right that the locking looks really awkward.
I was unable to find a better way to do the locking part. My mail client
might've missed Luben's review, can't find it :/
Regards,
Nirmoy
More information about the amd-gfx
mailing list