[PATCH 2/2] drm/amd/amdgpu: force flush resubmit job

Thu Feb 25 09:07:44 UTC 2021

[AMD Official Use Only - Internal Distribution Only]

Yeah, that sounds better than original fix 

Thanks Christian

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Koenig, Christian <Christian.Koenig at amd.com> 
Sent: Thursday, February 25, 2021 4:08 PM
To: Chen, JingWen <JingWen.Chen2 at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Liu, Monk <Monk.Liu at amd.com>
Subject: Re: [PATCH 2/2] drm/amd/amdgpu: force flush resubmit job

Good catch, but the approach for the fix is incorrect.

The device reset count can only be incremented after taking the reset lock and stopping the scheduler, otherwise a whole bunch of different race conditions can happen.

Christian.

Am 25.02.21 um 08:56 schrieb Chen, JingWen:
> [AMD Official Use Only - Internal Distribution Only]
>
> Consider this sequence:
> 1. GPU reset begin
> 2. device reset count + 1
> 3. job id 1 scheduled with vm_need_flush=false 4. When handling this 
> job in vm_flush, amdgpu_vmid_had_gpu_reset will return true, thus this 
> job will be flush and the vmid_reset_count will be set equals to 
> device_reset_count 5. stop drm scheduler 6. GPU do real reset 7. 
> resubmit job id 1 with vm_need_flush=false 8. Job id 1 is the first 
> job to resubmit after reset. This time amdgpu_vmid_had_gpu_reset 
> returns false and the vm_need_flush==false
>
> Then no one will flush pd_addr and vmid for jobs after resubmit. In this sequence amdgpu_vmid_had_gpu_reset doesn't work.
>
>
> Best Regards,
> JingWen Chen
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig at amd.com>
> Sent: Thursday, February 25, 2021 3:46 PM
> To: Chen, JingWen <JingWen.Chen2 at amd.com>; 
> amd-gfx at lists.freedesktop.org
> Cc: Liu, Monk <Monk.Liu at amd.com>
> Subject: Re: [PATCH 2/2] drm/amd/amdgpu: force flush resubmit job
>
>
>
> Am 25.02.21 um 06:27 schrieb Jingwen Chen:
>> [Why]
>> when a job is scheduled during TDR(after device reset count increase 
>> and before drm_sched_stop), this job won't do vm_flush when resubmit 
>> itself after GPU reset done. This can lead to a page fault.
>>
>> [How]
>> Always do vm_flush for resubmit job.
> NAK, what do you think amdgpu_vmid_had_gpu_reset already does?
>
> Christian.
>
>> Signed-off-by: Jingwen Chen <Jingwen.Chen2 at amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 3 ++-
>>    1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index fdbe7d4e8b8b..4af2c5d15950 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -1088,7 +1088,8 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job,
>>    if (update_spm_vmid_needed && adev->gfx.rlc.funcs->update_spm_vmid)
>>    adev->gfx.rlc.funcs->update_spm_vmid(adev, job->vmid);
>>
>> -if (amdgpu_vmid_had_gpu_reset(adev, id)) {
>> +if (amdgpu_vmid_had_gpu_reset(adev, id)|| (job->base.flags & 
>> +DRM_FLAG_RESUBMIT_JOB)) {
>>    gds_switch_needed = true;
>>    vm_flush_needed = true;
>>    pasid_mapping_needed = true;