[PATCH] drm/amd/amdgpu: vm entities should have kernel priority

Mon Jul 19 11:10:00 UTC 2021

On 19.07.21 at 11:42, Liu, Monk wrote:
> [AMD Official Use Only]
>
> Besides, I think our current KMD has three types of kernel sdma jobs:
> 1) adev->mman.entity, it is already a KERNEL priority entity
> 2) vm->immediate
> 3) vm->delayed
>
> Do you mean that vm->immediate or vm->delayed are now used for move jobs instead of mman.entity?

No, that's exactly the point. vm->immediate and vm->delayed are not for
kernel paging jobs.

Those are used for userspace page table updates.

I agree that those should probably not be considered guilty, but modifying
their priority is not the right way of doing that.
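
To make the coupling under discussion concrete, here is a minimal,
self-contained sketch. It assumes, as the patch description says, that a
timed-out job is only marked guilty when its entity does not have KERNEL
priority. All names (sketch_priority, sketch_entity, sketch_handle_timeout,
sketch_resubmits) are hypothetical stand-ins and not the real DRM scheduler
API.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the scheduler priority levels. */
enum sketch_priority {
	SKETCH_PRIORITY_NORMAL,
	SKETCH_PRIORITY_KERNEL,
};

/* Hypothetical, simplified model of a scheduler entity. */
struct sketch_entity {
	const char *name;
	enum sketch_priority priority;
	bool guilty;
};

/*
 * Assumed rule: on a job timeout the entity is blamed unless it runs
 * at KERNEL priority. Priority therefore controls two things at once,
 * scheduling order and guilt.
 */
static void sketch_handle_timeout(struct sketch_entity *e)
{
	if (e->priority != SKETCH_PRIORITY_KERNEL)
		e->guilty = true;
}

/* Assumed rule: only jobs of entities that are not guilty are resubmitted. */
static bool sketch_resubmits(const struct sketch_entity *e)
{
	return !e->guilty;
}
```

In this model, an entity created with SKETCH_PRIORITY_NORMAL (like
vm->immediate today) is marked guilty by sketch_handle_timeout() and is not
resubmitted, while one created with SKETCH_PRIORITY_KERNEL (like
adev->mman.entity) is left alone. Raising the priority avoids the guilt
marking, but it also changes how the page table updates are scheduled
relative to the move jobs, which is the side effect objected to here.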

Regards,
Christian.

>
> Thanks
>
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
>
> -----Original Message-----
> From: Liu, Monk
> Sent: Monday, July 19, 2021 5:40 PM
> To: 'Christian König' <ckoenig.leichtzumerken at gmail.com>; Chen, JingWen <JingWen.Chen2 at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Chen, Horace <Horace.Chen at amd.com>
> Subject: RE: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority
>
> If there are move jobs clashing there, we probably need to fix the bugs in those move jobs.
>
> I believe you also remember that we previously agreed to always trust kernel jobs, especially paging jobs.
>
> Without setting the paging jobs' priority to KERNEL level, how can we keep that protocol? Do you have a better idea?
>
> Thanks
>
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken at gmail.com>
> Sent: Monday, July 19, 2021 4:25 PM
> To: Chen, JingWen <JingWen.Chen2 at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Chen, Horace <Horace.Chen at amd.com>; Liu, Monk <Monk.Liu at amd.com>
> Subject: Re: [PATCH] drm/amd/amdgpu: vm entities should have kernel priority
>
> On 19.07.21 at 07:57, Jingwen Chen wrote:
>> [Why]
>> Current vm_pte entities have NORMAL priority. In the SRIOV multi-VF use
>> case, the VF FLR happens first and the job timeout is detected afterwards.
>> Several jobs can time out within a very small time slice.
>> If the timeout of an innocent sdma job is detected before that of the
>> real bad job, the innocent sdma job will be marked guilty because it
>> only has NORMAL priority. This leads to a page fault after job
>> resubmission.
>>
>> [How]
>> sdma should always have KERNEL priority. The kernel job will always be
>> resubmitted.
> I'm not sure if that is a good idea. We intentionally didn't give the page table updates kernel priority to avoid clashing with the move jobs.
>
> Christian.
>
>> Signed-off-by: Jingwen Chen <Jingwen.Chen2 at amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++--
>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index 358316d6a38c..f7526b67cc5d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -2923,13 +2923,13 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>>    	INIT_LIST_HEAD(&vm->done);
>>    
>>    	/* create scheduler entities for page table updates */
>> -	r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL,
>> +	r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_KERNEL,
>>    				  adev->vm_manager.vm_pte_scheds,
>>    				  adev->vm_manager.vm_pte_num_scheds, NULL);
>>    	if (r)
>>    		return r;
>>    
>> -	r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL,
>> +	r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_KERNEL,
>>    				  adev->vm_manager.vm_pte_scheds,
>>    				  adev->vm_manager.vm_pte_num_scheds, NULL);
>>    	if (r)