[PATCH 1/1] drm/amdgpu: disable gpu_sched load balancer for vcn jobs
Nirmoy
nirmodas at amd.com
Thu Mar 12 10:56:42 UTC 2020
On 3/12/20 9:50 AM, Christian König wrote:
> Am 11.03.20 um 21:55 schrieb Nirmoy:
>>
>> On 3/11/20 9:35 PM, Andrey Grodzovsky wrote:
>>>
>>> On 3/11/20 4:32 PM, Nirmoy wrote:
>>>>
>>>> On 3/11/20 9:02 PM, Andrey Grodzovsky wrote:
>>>>>
>>>>> On 3/11/20 4:00 PM, Andrey Grodzovsky wrote:
>>>>>>
>>>>>> On 3/11/20 4:00 PM, Nirmoy Das wrote:
>>>>>>> [SNIP]
>>>>>>> @@ -1257,6 +1258,9 @@ static int amdgpu_cs_submit(struct
>>>>>>> amdgpu_cs_parser *p,
>>>>>>> priority = job->base.s_priority;
>>>>>>> drm_sched_entity_push_job(&job->base, entity);
>>>>>>> + if (ring->funcs->no_gpu_sched_loadbalance)
>>>>>>> + amdgpu_ctx_disable_gpu_sched_load_balance(entity);
>>>>>>> +
>>>>>>
>>>>>>
>>>>>> Why does this need to be done each time a job is submitted, rather
>>>>>> than once in drm_sched_entity_init (same for amdgpu_job_submit below)?
>>>>>>
>>>>>> Andrey
>>>>>
>>>>>
>>>>> My bad - not in drm_sched_entity_init but in relevant amdgpu code.
>>>>
>>>>
>>>> Hi Andrey,
>>>>
>>>> Do you mean drm_sched_job_init() or after creating VCN entities?
>>>>
>>>>
>>>> Nirmoy
>>>
>>>
>>> I guess after creating the VCN entities (it has to be amdgpu-specific
>>> code) - I just don't get why it needs to be done each time a job is
>>> submitted. I mean - since you set .no_gpu_sched_loadbalance = true
>>> anyway, this is always true, so shouldn't you just initialize the
>>> VCN entity with a scheduler list consisting of one scheduler and
>>> that's it?
>>
>>
>> Assumption: if I understand correctly, we shouldn't be doing load
>> balancing among VCN jobs in the same context. Christian, James and Leo
>> can clarify if I am wrong.
>>
>> But we can still load-balance VCN jobs among multiple contexts. That
>> load-balancing decision happens in drm_sched_entity_init(). If we
>> initialize the VCN entity with one scheduler, then all entities,
>> irrespective of context, get that one scheduler, which means we are
>> not utilizing the extra VCN instances.
>
> Andrey has a very good point here. So far we only looked at this from
> the hardware requirement side that we can't change the ring after the
> first submission any more.
>
> But it is certainly valuable to keep the extra overhead out of the hot
> path during command submission.
>
>> Ideally we should be calling
>> amdgpu_ctx_disable_gpu_sched_load_balance() only once, after the first
>> call of drm_sched_entity_init() for a VCN job. I am not sure how to do
>> that efficiently.
>>
>> Another option might be to copy the logic of
>> drm_sched_entity_get_free_sched() and choose a suitable VCN sched
>> at/after VCN entity creation.
>
> Yes, but we should not copy the logic but rather refactor it :)
>
> Basically we need a drm_sched_pick_best() function which gets an array
> of drm_gpu_scheduler structures and returns the one with the least
> load on it.
>
> This function can then be used by VCN to pick one instance before
> initializing the entity as well as a replacement for
> drm_sched_entity_get_free_sched() to change the scheduler for load
> balancing.
This sounds like the optimal solution here.

Thanks Andrey and Christian. I will resend with the suggested changes.
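Roughly something along these lines - a minimal, compile-tested user-space
sketch of the drm_sched_pick_best() helper Christian describes, not the
actual kernel code. The struct here is a stand-in: the real
drm_gpu_scheduler tracks its load in an atomic counter, which a plain
unsigned int `score` field approximates for illustration.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct drm_gpu_scheduler; only the load
 * score matters for this sketch. */
struct drm_gpu_scheduler {
	unsigned int score;	/* approximates the atomic load counter */
};

/*
 * Scan an array of schedulers and return the one with the least load.
 * Usable both at VCN entity creation (pick one instance up front) and
 * as the backend of drm_sched_entity_get_free_sched() for entities
 * that still load-balance.
 */
static struct drm_gpu_scheduler *
drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
		    unsigned int num_sched_list)
{
	struct drm_gpu_scheduler *best = NULL;
	unsigned int min_score = ~0u;
	unsigned int i;

	for (i = 0; i < num_sched_list; i++) {
		struct drm_gpu_scheduler *sched = sched_list[i];

		if (sched && sched->score < min_score) {
			min_score = sched->score;
			best = sched;
		}
	}
	return best;
}
```

With this, the VCN entity would be initialized with a one-element
scheduler list chosen once, keeping the per-submission hot path free of
the extra check.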
>
> Regards,
> Christian.
>
>>
>>
>> Regards,
>>
>> Nirmoy
>>
>
More information about the amd-gfx mailing list