[RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Wed Jan 5 18:24:42 UTC 2022
On 2022-01-05 2:59 a.m., Christian König wrote:
> Am 05.01.22 um 08:34 schrieb JingWen Chen:
>> On 2022/1/5 上午12:56, Andrey Grodzovsky wrote:
>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>> [AMD Official Use Only]
>>>>>
>>>>>>> See the FLR request from the hypervisor is just another source
>>>>>>> of signaling the need for a reset, similar to each job timeout
>>>>>>> on each queue. Otherwise you have a race condition between the
>>>>>>> hypervisor and the scheduler.
>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF
>>>>> FLR is about to start or was already executed, but host will do
>>>>> FLR anyway without waiting for guest too long
>>>>>
>>>> Then we have a major design issue in the SRIOV protocol and really
>>>> need to question this.
>>>>
>>>> How do you want to prevent a race between the hypervisor resetting
>>>> the hardware and the client trying the same because of a timeout?
>>>>
>>>> As far as I can see the procedure should be:
>>>> 1. We detect that a reset is necessary, either because of a fault a
>>>> timeout or signal from hypervisor.
>>>> 2. For each of those potential reset sources a work item is send to
>>>> the single workqueue.
>>>> 3. One of those work items execute first and prepares the reset.
>>>> 4. We either do the reset our self or notify the hypervisor that we
>>>> are ready for the reset.
>>>> 5. Cleanup after the reset, eventually resubmit jobs etc..
>>>> 6. Cancel work items which might have been scheduled from other
>>>> reset sources.
>>>>
>>>> It does make sense that the hypervisor resets the hardware without
>>>> waiting for the clients for too long, but if we don't follow this
>>>> general steps we will always have a race between the different
>>>> components.
>>>
>>> Monk, just to add to this - if indeed as you say that 'FLR from
>>> hypervisor is just to notify guest the hw VF FLR is about to start
>>> or was already executed, but host will do FLR anyway without waiting
>>> for guest too long'
>>> and there is no strict waiting from the hypervisor for
>>> IDH_READY_TO_RESET to be recived from guest before starting the
>>> reset then setting in_gpu_reset and locking reset_sem from guest
>>> side is not really full proof
>>> protection from MMIO accesses by the guest - it only truly helps if
>>> hypervisor waits for that message before initiation of HW reset.
>>>
>> Hi Andrey, this cannot be done. If somehow guest kernel hangs and
>> never has the chance to send the response back, then other VFs will
>> have to wait it reset. All the vfs will hang in this case. Or
>> sometimes the mailbox has some delay and other VFs will also wait.
>> The user of other VFs will be affected in this case.
>
> Yeah, agree completely with JingWen. The hypervisor is the one in
> charge here, not the guest.
>
> What the hypervisor should do (and it already seems to be designed
> that way) is to send the guest a message that a reset is about to
> happen and give it some time to response appropriately.
>
> The guest on the other hand then tells the hypervisor that all
> processing has stopped and it is ready to restart. If that doesn't
> happen in time the hypervisor should eliminate the guest probably
> trigger even more severe consequences, e.g. restart the whole VM etc...
>
> Christian.
So what's the end conclusion here regarding dropping this particular
patch ? Seems to me we still need to drop it to prevent driver's MMIO access
to the GPU during reset from various places in the code.
Andrey
>
>>> Andrey
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>> [AMD Official Use Only]
>>>>>
>>>>>>> See the FLR request from the hypervisor is just another source
>>>>>>> of signaling the need for a reset, similar to each job timeout
>>>>>>> on each queue. Otherwise you have a race condition between the
>>>>>>> hypervisor and the scheduler.
>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF
>>>>> FLR is about to start or was already executed, but host will do
>>>>> FLR anyway without waiting for guest too long
>>>>>
>>>>>>> In other words I strongly think that the current SRIOV reset
>>>>>>> implementation is severely broken and what Andrey is doing is
>>>>>>> actually fixing it.
>>>>> It makes the code to crash ... how could it be a fix ?
>>>>>
>>>>> I'm afraid the patch is NAK from me, but it is welcome if the
>>>>> cleanup do not ruin the logic, Andry or jingwen can try it if needed.
>>>>>
>>>>> Thanks
>>>>> -------------------------------------------------------------------
>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>> -------------------------------------------------------------------
>>>>> we are hiring software manager for CVS core team
>>>>> -------------------------------------------------------------------
>>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>> To: Chen, JingWen <JingWen.Chen2 at amd.com>; Christian König
>>>>> <ckoenig.leichtzumerken at gmail.com>; Grodzovsky, Andrey
>>>>> <Andrey.Grodzovsky at amd.com>; Deng, Emily <Emily.Deng at amd.com>;
>>>>> Liu, Monk <Monk.Liu at amd.com>; dri-devel at lists.freedesktop.org;
>>>>> amd-gfx at lists.freedesktop.org; Chen, Horace <Horace.Chen at amd.com>;
>>>>> Chen, JingWen <JingWen.Chen2 at amd.com>
>>>>> Cc: daniel at ffwll.ch
>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>> protection for SRIOV
>>>>>
>>>>> Hi Jingwen,
>>>>>
>>>>> well what I mean is that we need to adjust the implementation in
>>>>> amdgpu to actually match the requirements.
>>>>>
>>>>> Could be that the reset sequence is questionable in general, but I
>>>>> doubt so at least for now.
>>>>>
>>>>> See the FLR request from the hypervisor is just another source of
>>>>> signaling the need for a reset, similar to each job timeout on
>>>>> each queue. Otherwise you have a race condition between the
>>>>> hypervisor and the scheduler.
>>>>>
>>>>> Properly setting in_gpu_reset is indeed mandatory, but should
>>>>> happen at a central place and not in the SRIOV specific code.
>>>>>
>>>>> In other words I strongly think that the current SRIOV reset
>>>>> implementation is severely broken and what Andrey is doing is
>>>>> actually fixing it.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>>>>> Hi Christian,
>>>>>> I'm not sure what do you mean by "we need to change SRIOV not the
>>>>>> driver".
>>>>>>
>>>>>> Do you mean we should change the reset sequence in SRIOV? This
>>>>>> will be a huge change for our SRIOV solution.
>>>>>>
>>>>>> From my point of view, we can directly use
>>>>>> amdgpu_device_lock_adev
>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock
>>>>>> since no one will conflict with this thread with reset_domain
>>>>>> introduced.
>>>>>> But we do need the reset_sem and adev->in_gpu_reset to keep
>>>>>> device untouched via user space.
>>>>>>
>>>>>> Best Regards,
>>>>>> Jingwen Chen
>>>>>>
>>>>>> On 2022/1/3 下午6:17, Christian König wrote:
>>>>>>> Please don't. This patch is vital to the cleanup of the reset
>>>>>>> procedure.
>>>>>>>
>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not
>>>>>>> the driver.
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>> I do agree with shaoyun, if the host find the gpu engine hangs
>>>>>>>>> first, and do the flr, guest side thread may not know this and
>>>>>>>>> still try to access HW(e.g. kfd is using a lot of
>>>>>>>>> amdgpu_in_reset and reset_sem to identify the reset status).
>>>>>>>>> And this may lead to very bad result.
>>>>>>>>>
>>>>>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>>>>>> These patches look good to me. JingWen will pull these
>>>>>>>>>> patches and do some basic TDR test on sriov environment, and
>>>>>>>>>> give feedback.
>>>>>>>>>>
>>>>>>>>>> Best wishes
>>>>>>>>>> Emily Deng
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Liu, Monk <Monk.Liu at amd.com>
>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; Grodzovsky,
>>>>>>>>>>> Andrey <Andrey.Grodzovsky at amd.com>;
>>>>>>>>>>> dri-devel at lists.freedesktop.org; amd-
>>>>>>>>>>> gfx at lists.freedesktop.org;
>>>>>>>>>>> Chen, Horace <Horace.Chen at amd.com>; Chen, JingWen
>>>>>>>>>>> <JingWen.Chen2 at amd.com>; Deng, Emily <Emily.Deng at amd.com>
>>>>>>>>>>> Cc: daniel at ffwll.ch
>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU
>>>>>>>>>>> reset
>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>
>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>
>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>
>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; dri-
>>>>>>>>>>> devel at lists.freedesktop.org; amd-gfx at lists.freedesktop.org
>>>>>>>>>>> Cc: daniel at ffwll.ch; Liu, Monk <Monk.Liu at amd.com>; Chen, Horace
>>>>>>>>>>> <Horace.Chen at amd.com>
>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU
>>>>>>>>>>> reset
>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>
>>>>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>>>>> Since now flr work is serialized against GPU resets there
>>>>>>>>>>>> is no
>>>>>>>>>>>> need for this.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>>>>>>> Acked-by: Christian König <christian.koenig at amd.com>
>>>>>>>>>>>
>>>>>>>>>>>> ---
>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>> 2 files changed, 22 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void
>>>>>>>>>>>> xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>> struct amdgpu_device *adev = container_of(virt,
>>>>>>>>>>>> struct
>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>> int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>
>>>>>>>>>>>> - /* block amdgpu_gpu_recover till msg FLR COMPLETE
>>>>>>>>>>>> received,
>>>>>>>>>>>> - * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>> - * the VF FLR.
>>>>>>>>>>>> - */
>>>>>>>>>>>> - if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>> - return;
>>>>>>>>>>>> -
>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>> - atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>
>>>>>>>>>>>> xgpu_ai_mailbox_trans_msg(adev,
>>>>>>>>>>>> IDH_READY_TO_RESET, 0,
>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void
>>>>>>>>>>>> xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>> } while (timeout > 1);
>>>>>>>>>>>>
>>>>>>>>>>>> flr_done:
>>>>>>>>>>>> - atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>> - up_write(&adev->reset_sem);
>>>>>>>>>>>> -
>>>>>>>>>>>> /* Trigger recovery for world switch failure if
>>>>>>>>>>>> no TDR
>>>>>>>>>>>> */
>>>>>>>>>>>> if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>> && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void
>>>>>>>>>>>> xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>> struct amdgpu_device *adev = container_of(virt,
>>>>>>>>>>>> struct
>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>> int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>
>>>>>>>>>>>> - /* block amdgpu_gpu_recover till msg FLR COMPLETE
>>>>>>>>>>>> received,
>>>>>>>>>>>> - * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>> - * the VF FLR.
>>>>>>>>>>>> - */
>>>>>>>>>>>> - if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>> - return;
>>>>>>>>>>>> -
>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>> - atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>
>>>>>>>>>>>> xgpu_nv_mailbox_trans_msg(adev,
>>>>>>>>>>>> IDH_READY_TO_RESET, 0,
>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void
>>>>>>>>>>>> xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>> } while (timeout > 1);
>>>>>>>>>>>>
>>>>>>>>>>>> flr_done:
>>>>>>>>>>>> - atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>> - up_write(&adev->reset_sem);
>>>>>>>>>>>> -
>>>>>>>>>>>> /* Trigger recovery for world switch failure if
>>>>>>>>>>>> no TDR
>>>>>>>>>>>> */
>>>>>>>>>>>> if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>> && (!amdgpu_device_has_job_running(adev) ||
>
More information about the amd-gfx
mailing list