[PATCH 2/2] drm/amd/amdgpu: force flush resubmit job

Thu Feb 25 08:07:43 UTC 2021

Good catch, but the approach for the fix is incorrect.

The device reset count can only be incremented after taking the reset 
lock and stopping the scheduler, otherwise a whole bunch of different 
race conditions can happen.

Christian.

Am 25.02.21 um 08:56 schrieb Chen, JingWen:
> [AMD Official Use Only - Internal Distribution Only]
>
> Consider this sequence:
> 1. GPU reset begin
> 2. device reset count + 1
> 3. job id 1 scheduled with vm_need_flush=false
> 4. When handling this job in vm_flush, amdgpu_vmid_had_gpu_reset will return true, thus this job will be flush and the vmid_reset_count will be set equals to device_reset_count
> 5. stop drm scheduler
> 6. GPU do real reset
> 7. resubmit job id 1 with vm_need_flush=false
> 8. Job id 1 is the first job to resubmit after reset. This time amdgpu_vmid_had_gpu_reset returns false and the vm_need_flush==false
>
> Then no one will flush pd_addr and vmid for jobs after resubmit. In this sequence amdgpu_vmid_had_gpu_reset doesn't work.
>
>
> Best Regards,
> JingWen Chen
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig at amd.com>
> Sent: Thursday, February 25, 2021 3:46 PM
> To: Chen, JingWen <JingWen.Chen2 at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Liu, Monk <Monk.Liu at amd.com>
> Subject: Re: [PATCH 2/2] drm/amd/amdgpu: force flush resubmit job
>
>
>
> Am 25.02.21 um 06:27 schrieb Jingwen Chen:
>> [Why]
>> when a job is scheduled during TDR(after device reset count increase
>> and before drm_sched_stop), this job won't do vm_flush when resubmit
>> itself after GPU reset done. This can lead to a page fault.
>>
>> [How]
>> Always do vm_flush for resubmit job.
> NAK, what do you think amdgpu_vmid_had_gpu_reset already does?
>
> Christian.
>
>> Signed-off-by: Jingwen Chen <Jingwen.Chen2 at amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 3 ++-
>>    1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index fdbe7d4e8b8b..4af2c5d15950 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -1088,7 +1088,8 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job,
>>    if (update_spm_vmid_needed && adev->gfx.rlc.funcs->update_spm_vmid)
>>    adev->gfx.rlc.funcs->update_spm_vmid(adev, job->vmid);
>>
>> -if (amdgpu_vmid_had_gpu_reset(adev, id)) {
>> +if (amdgpu_vmid_had_gpu_reset(adev, id)||
>> +(job->base.flags & DRM_FLAG_RESUBMIT_JOB)) {
>>    gds_switch_needed = true;
>>    vm_flush_needed = true;
>>    pasid_mapping_needed = true;