[PATCH 2/3] amd/amdgpu: wait no process running in kfd before resuming device

Felix Kuehling felix.kuehling at amd.com
Tue Mar 26 15:01:32 UTC 2024


On 2024-03-26 10:53, Philip Yang wrote:
>
>
> On 2024-03-25 14:45, Felix Kuehling wrote:
>> On 2024-03-22 15:57, Zhigang Luo wrote:
>>> It will cause a page fault after the device is recovered if there 
>>> is a process running.
>>>
>>> Signed-off-by: Zhigang Luo <Zhigang.Luo at amd.com>
>>> Change-Id: Ib1eddb56b69ecd41fe703abd169944154f48b0cd
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
>>>   1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 70261eb9b0bb..2867e9186e44 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4974,6 +4974,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>>>   retry:
>>>       amdgpu_amdkfd_pre_reset(adev);
>>> +     amdgpu_amdkfd_wait_no_process_running(adev);
>>> +
>>
>> This waits for the processes to be terminated. What would cause the 
>> processes to be terminated? Why do the processes need to be 
>> terminated? Isn't it enough if the processes are removed from the 
>> runlist in pre-reset, so they can no longer execute on the GPU?
>
> Mode 1 reset on SRIOV is much faster than on BM. kgd2kfd_pre_reset 
> sends the GPU reset event to user space but doesn't remove queues from 
> the runlist, so after the mode 1 reset is done there are queues still 
> running, and they generate VM faults because the GPU page table is gone.
>
I think seeing a page fault during the reset is not a problem. Seeing a 
page fault after the reset would be a bug. The process should not be on 
the runlist after the reset is done.

Waiting for the process to terminate first looks like a workaround, when 
the real bug may be that we're not updating the process state correctly 
in pre-reset. All currently running processes should be put into the 
evicted state, so they are not put back on the runlist after the reset.
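
To illustrate the idea, here is a minimal user-space sketch in C. It is
not kernel code and does not reflect the actual KFD data structures:
struct toy_process, toy_pre_reset() and toy_post_reset() are hypothetical
names made up for this example. It only models the state handling I would
expect: pre-reset marks every process evicted and takes it off the
runlist, and the post-reset runlist rebuild skips evicted processes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a compute process; not the real KFD structure. */
struct toy_process {
	int pid;
	bool evicted;     /* set in pre-reset, blocks re-mapping after reset */
	bool on_runlist;  /* true while its queues can execute on the GPU */
};

/* Pre-reset: take every process off the runlist and mark it evicted. */
void toy_pre_reset(struct toy_process *procs, size_t n)
{
	assert(procs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		procs[i].on_runlist = false;
		procs[i].evicted = true;
	}
}

/*
 * Post-reset: rebuild the runlist, skipping evicted processes.
 * Returns the number of processes put back on the runlist.
 */
size_t toy_post_reset(struct toy_process *procs, size_t n)
{
	size_t mapped = 0;

	assert(procs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (procs[i].evicted)
			continue;
		procs[i].on_runlist = true;
		mapped++;
	}
	return mapped;
}
```

With this state handling, calling toy_pre_reset() followed by
toy_post_reset() on any set of running processes puts none of them back
on the runlist, so nothing can fault on the missing page tables after the
reset, and no wait for process termination is needed.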

Regards,
   Felix


> Regards,
>
> Philip
>
>>
>> Regards,
>>   Felix
>>
>>
>>>       amdgpu_device_stop_pending_resets(adev);
>>>       if (from_hypervisor)

