[RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

Wed Jan 5 18:24:42 UTC 2022

On 2022-01-05 2:59 a.m., Christian König wrote:
> Am 05.01.22 um 08:34 schrieb JingWen Chen:
>> On 2022/1/5 上午12:56, Andrey Grodzovsky wrote:
>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>> [AMD Official Use Only]
>>>>>
>>>>>>> See the FLR request from the hypervisor is just another source 
>>>>>>> of signaling the need for a reset, similar to each job timeout 
>>>>>>> on each queue. Otherwise you have a race condition between the 
>>>>>>> hypervisor and the scheduler.
>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF 
>>>>> FLR is about to start or was already executed, but host will do 
>>>>> FLR anyway without waiting for guest too long
>>>>>
>>>> Then we have a major design issue in the SRIOV protocol and really 
>>>> need to question this.
>>>>
>>>> How do you want to prevent a race between the hypervisor resetting 
>>>> the hardware and the client trying the same because of a timeout?
>>>>
>>>> As far as I can see the procedure should be:
>>>> 1. We detect that a reset is necessary, either because of a fault a 
>>>> timeout or signal from hypervisor.
>>>> 2. For each of those potential reset sources a work item is send to 
>>>> the single workqueue.
>>>> 3. One of those work items execute first and prepares the reset.
>>>> 4. We either do the reset our self or notify the hypervisor that we 
>>>> are ready for the reset.
>>>> 5. Cleanup after the reset, eventually resubmit jobs etc..
>>>> 6. Cancel work items which might have been scheduled from other 
>>>> reset sources.
>>>>
>>>> It does make sense that the hypervisor resets the hardware without 
>>>> waiting for the clients for too long, but if we don't follow this 
>>>> general steps we will always have a race between the different 
>>>> components.
>>>
>>> Monk, just to add to this - if indeed as you say that 'FLR from 
>>> hypervisor is just to notify guest the hw VF FLR is about to start 
>>> or was already executed, but host will do FLR anyway without waiting 
>>> for guest too long'
>>> and there is no strict waiting from the hypervisor for 
>>> IDH_READY_TO_RESET to be recived from guest before starting the 
>>> reset then setting in_gpu_reset and locking reset_sem from guest 
>>> side is not really full proof
>>> protection from MMIO accesses by the guest - it only truly helps if 
>>> hypervisor waits for that message before initiation of HW reset.
>>>
>> Hi Andrey, this cannot be done. If somehow guest kernel hangs and 
>> never has the chance to send the response back, then other VFs will 
>> have to wait it reset. All the vfs will hang in this case. Or 
>> sometimes the mailbox has some delay and other VFs will also wait. 
>> The user of other VFs will be affected in this case.
>
> Yeah, agree completely with JingWen. The hypervisor is the one in 
> charge here, not the guest.
>
> What the hypervisor should do (and it already seems to be designed 
> that way) is to send the guest a message that a reset is about to 
> happen and give it some time to response appropriately.
>
> The guest on the other hand then tells the hypervisor that all 
> processing has stopped and it is ready to restart. If that doesn't 
> happen in time the hypervisor should eliminate the guest probably 
> trigger even more severe consequences, e.g. restart the whole VM etc...
>
> Christian.

So what's the end conclusion here regarding dropping this particular 
patch ? Seems to me we still need to drop it to prevent driver's MMIO access
to the GPU during reset from various places in the code.

Andrey

>
>>> Andrey
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>> [AMD Official Use Only]
>>>>>
>>>>>>> See the FLR request from the hypervisor is just another source 
>>>>>>> of signaling the need for a reset, similar to each job timeout 
>>>>>>> on each queue. Otherwise you have a race condition between the 
>>>>>>> hypervisor and the scheduler.
>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF 
>>>>> FLR is about to start or was already executed, but host will do 
>>>>> FLR anyway without waiting for guest too long
>>>>>
>>>>>>> In other words I strongly think that the current SRIOV reset 
>>>>>>> implementation is severely broken and what Andrey is doing is 
>>>>>>> actually fixing it.
>>>>> It makes the code to crash ... how could it be a fix ?
>>>>>
>>>>> I'm afraid the patch is NAK from me,  but it is welcome if the 
>>>>> cleanup do not ruin the logic, Andry or jingwen can try it if needed.
>>>>>
>>>>> Thanks
>>>>> -------------------------------------------------------------------
>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>> -------------------------------------------------------------------
>>>>> we are hiring software manager for CVS core team
>>>>> -------------------------------------------------------------------
>>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>> To: Chen, JingWen <JingWen.Chen2 at amd.com>; Christian König 
>>>>> <ckoenig.leichtzumerken at gmail.com>; Grodzovsky, Andrey 
>>>>> <Andrey.Grodzovsky at amd.com>; Deng, Emily <Emily.Deng at amd.com>; 
>>>>> Liu, Monk <Monk.Liu at amd.com>; dri-devel at lists.freedesktop.org; 
>>>>> amd-gfx at lists.freedesktop.org; Chen, Horace <Horace.Chen at amd.com>; 
>>>>> Chen, JingWen <JingWen.Chen2 at amd.com>
>>>>> Cc: daniel at ffwll.ch
>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
>>>>> protection for SRIOV
>>>>>
>>>>> Hi Jingwen,
>>>>>
>>>>> well what I mean is that we need to adjust the implementation in 
>>>>> amdgpu to actually match the requirements.
>>>>>
>>>>> Could be that the reset sequence is questionable in general, but I 
>>>>> doubt so at least for now.
>>>>>
>>>>> See the FLR request from the hypervisor is just another source of 
>>>>> signaling the need for a reset, similar to each job timeout on 
>>>>> each queue. Otherwise you have a race condition between the 
>>>>> hypervisor and the scheduler.
>>>>>
>>>>> Properly setting in_gpu_reset is indeed mandatory, but should 
>>>>> happen at a central place and not in the SRIOV specific code.
>>>>>
>>>>> In other words I strongly think that the current SRIOV reset 
>>>>> implementation is severely broken and what Andrey is doing is 
>>>>> actually fixing it.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>>>>> Hi Christian,
>>>>>> I'm not sure what do you mean by "we need to change SRIOV not the 
>>>>>> driver".
>>>>>>
>>>>>> Do you mean we should change the reset sequence in SRIOV? This 
>>>>>> will be a huge change for our SRIOV solution.
>>>>>>
>>>>>>    From my point of view, we can directly use 
>>>>>> amdgpu_device_lock_adev
>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock 
>>>>>> since no one will conflict with this thread with reset_domain 
>>>>>> introduced.
>>>>>> But we do need the reset_sem and adev->in_gpu_reset to keep 
>>>>>> device untouched via user space.
>>>>>>
>>>>>> Best Regards,
>>>>>> Jingwen Chen
>>>>>>
>>>>>> On 2022/1/3 下午6:17, Christian König wrote:
>>>>>>> Please don't. This patch is vital to the cleanup of the reset 
>>>>>>> procedure.
>>>>>>>
>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not 
>>>>>>> the driver.
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>> I do agree with shaoyun, if the host find the gpu engine hangs 
>>>>>>>>> first, and do the flr, guest side thread may not know this and 
>>>>>>>>> still try to access HW(e.g. kfd is using a lot of 
>>>>>>>>> amdgpu_in_reset and reset_sem to identify the reset status). 
>>>>>>>>> And this may lead to very bad result.
>>>>>>>>>
>>>>>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>>>>>> These patches look good to me. JingWen will pull these 
>>>>>>>>>> patches and do some basic TDR test on sriov environment, and 
>>>>>>>>>> give feedback.
>>>>>>>>>>
>>>>>>>>>> Best wishes
>>>>>>>>>> Emily Deng
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Liu, Monk <Monk.Liu at amd.com>
>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; Grodzovsky,
>>>>>>>>>>> Andrey <Andrey.Grodzovsky at amd.com>;
>>>>>>>>>>> dri-devel at lists.freedesktop.org; amd- 
>>>>>>>>>>> gfx at lists.freedesktop.org;
>>>>>>>>>>> Chen, Horace <Horace.Chen at amd.com>; Chen, JingWen
>>>>>>>>>>> <JingWen.Chen2 at amd.com>; Deng, Emily <Emily.Deng at amd.com>
>>>>>>>>>>> Cc: daniel at ffwll.ch
>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU 
>>>>>>>>>>> reset
>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>
>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>
>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>
>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> ----------------------------------------------------------------- 
>>>>>>>>>>>
>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>> ----------------------------------------------------------------- 
>>>>>>>>>>>
>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>> ----------------------------------------------------------------- 
>>>>>>>>>>>
>>>>>>>>>>> -- 
>>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; dri-
>>>>>>>>>>> devel at lists.freedesktop.org; amd-gfx at lists.freedesktop.org
>>>>>>>>>>> Cc: daniel at ffwll.ch; Liu, Monk <Monk.Liu at amd.com>; Chen, Horace
>>>>>>>>>>> <Horace.Chen at amd.com>
>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU 
>>>>>>>>>>> reset
>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>
>>>>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>>>>> Since now flr work is serialized against  GPU resets there 
>>>>>>>>>>>> is no
>>>>>>>>>>>> need for this.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>>>>>>> Acked-by: Christian König <christian.koenig at amd.com>
>>>>>>>>>>>
>>>>>>>>>>>> ---
>>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>       2 files changed, 22 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void 
>>>>>>>>>>>> xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, 
>>>>>>>>>>>> struct
>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>           int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>
>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE 
>>>>>>>>>>>> received,
>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>> -     */
>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>> -        return;
>>>>>>>>>>>> -
>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>
>>>>>>>>>>>>           xgpu_ai_mailbox_trans_msg(adev, 
>>>>>>>>>>>> IDH_READY_TO_RESET, 0,
>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void 
>>>>>>>>>>>> xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>>
>>>>>>>>>>>>       flr_done:
>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>> -
>>>>>>>>>>>>           /* Trigger recovery for world switch failure if 
>>>>>>>>>>>> no TDR
>>>>>>>>>>>> */
>>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void 
>>>>>>>>>>>> xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, 
>>>>>>>>>>>> struct
>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>           int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>
>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE 
>>>>>>>>>>>> received,
>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>> -     */
>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>> -        return;
>>>>>>>>>>>> -
>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>
>>>>>>>>>>>>           xgpu_nv_mailbox_trans_msg(adev, 
>>>>>>>>>>>> IDH_READY_TO_RESET, 0,
>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void 
>>>>>>>>>>>> xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>>
>>>>>>>>>>>>       flr_done:
>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>> -
>>>>>>>>>>>>           /* Trigger recovery for world switch failure if 
>>>>>>>>>>>> no TDR
>>>>>>>>>>>> */
>>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) ||
>