[PATCH 2/3] amd/amdgpu: wait no process running in kfd before resuming device
Felix Kuehling
felix.kuehling at amd.com
Tue Mar 26 15:01:32 UTC 2024
On 2024-03-26 10:53, Philip Yang wrote:
>
>
> On 2024-03-25 14:45, Felix Kuehling wrote:
>> On 2024-03-22 15:57, Zhigang Luo wrote:
>>> A page fault occurs after the device is recovered if a process is
>>> still running.
>>>
>>> Signed-off-by: Zhigang Luo <Zhigang.Luo at amd.com>
>>> Change-Id: Ib1eddb56b69ecd41fe703abd169944154f48b0cd
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
>>> 1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 70261eb9b0bb..2867e9186e44 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4974,6 +4974,8 @@ static int amdgpu_device_reset_sriov(struct
>>> amdgpu_device *adev,
>>> retry:
>>> amdgpu_amdkfd_pre_reset(adev);
>>> + amdgpu_amdkfd_wait_no_process_running(adev);
>>> +
>>
>> This waits for the processes to be terminated. What would cause the
>> processes to be terminated? Why do the processes need to be
>> terminated? Isn't it enough if the processes are removed from the
>> runlist in pre-reset, so they can no longer execute on the GPU?
>
> Mode 1 reset on SRIOV is much faster than on bare metal (BM).
> kgd2kfd_pre_reset sends a GPU reset event to user space but doesn't
> remove queues from the runlist, so after the mode 1 reset is done a
> queue is still running and generates a VM fault because the GPU page
> table is gone.
>
I think seeing a page fault during the reset is not a problem. Seeing a
page fault after the reset would be a bug. The process should not be on
the runlist after the reset is done.

Waiting for the process to terminate first looks like a workaround. The
real bug may be that we're not updating the process state correctly in
pre-reset. All currently running processes should be put into evicted
state, so they are not put back on the runlist after the reset.
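
To illustrate the idea, here is a minimal, self-contained toy model of it.
It is only a sketch and not driver code: struct toy_process,
toy_pre_reset() and toy_post_reset() are hypothetical names made up for
this example and do not correspond to the real KFD structures or
functions. It only shows that if pre-reset records every process as
evicted, a post-reset runlist rebuild has nothing to put back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a KFD process; not the real struct kfd_process. */
struct toy_process {
	bool evicted;    /* true: must not be scheduled on the GPU */
	bool on_runlist; /* true: currently mapped on the runlist */
};

/* Pre-reset: take every process off the runlist and record it as evicted. */
static void toy_pre_reset(struct toy_process *procs, size_t n)
{
	assert(procs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		procs[i].on_runlist = false;
		procs[i].evicted = true;
	}
}

/*
 * Post-reset: rebuild the runlist, skipping evicted processes.
 * Returns the number of processes that ended up on the runlist.
 */
static size_t toy_post_reset(struct toy_process *procs, size_t n)
{
	size_t count = 0;

	assert(procs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		procs[i].on_runlist = !procs[i].evicted;
		if (procs[i].on_runlist)
			count++;
	}
	return count;
}
```

In this model no wait for process termination is needed: the evicted
flag alone keeps the old queues off the runlist once the reset completes.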
Regards,
Felix
> Regards,
>
> Philip
>
>>
>> Regards,
>> Felix
>>
>>
>>> amdgpu_device_stop_pending_resets(adev);
>>> if (from_hypervisor)