[PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Thu Apr 8 08:22:06 UTC 2021

Hi Andrey,

Am 07.04.21 um 21:44 schrieb Andrey Grodzovsky:
>
>
> On 2021-04-07 6:28 a.m., Christian König wrote:
>> Hi Andrey,
>>
>> Am 06.04.21 um 23:22 schrieb Andrey Grodzovsky:
>>>
>>> Hey Christian, Denis, see bellow -
>>>
>>> On 2021-04-06 6:34 a.m., Christian König wrote:
>>>> Hi Andrey,
>>>>
>>>> well good question. My job is to watch over the implementation and 
>>>> design and while I always help I can adjust anybodies schedule.
>>>>
>>>> Is the patch to print a warning when the hardware is accessed 
>>>> without holding the locks merged yet? If not then that would 
>>>> probably be a good starting point.
>>>
>>>
>>> It's merged into amd-staging-drm-next and since I work on 
>>> drm-misc-next I will cherry-pick it into there.
>>>
>>
>> Ok good to know, I haven't tracked that one further.
>>
>>>
>>>>
>>>> Then we would need to unify this with the SRCU to make sure that we 
>>>> have both the reset lock as well as block the hotplug code from 
>>>> reusing the MMIO space.
>>>
>>> In my understanding there is a significant difference between 
>>> handling of GPU reset and unplug - while GPU reset use case requires 
>>> any HW accessing code to block and wait for the reset to finish and 
>>> then proceed, hot-unplug
>>> is permanent and hence no need to wait and proceed but rather abort 
>>> at once.
>>>
>>
>> Yes, absolutely correct.
>>
>>> This why I think that in any place we already check for device reset 
>>> we should also add a check for hot-unplug but the handling would be 
>>> different
>>> in that for hot-unplug we would abort instead of keep waiting.
>>>
>>
>> Yes, that's the rough picture in my head as well.
>>
>> Essentially Daniels patch of having an 
>> amdgpu_device_hwaccess_begin()/_end() was the right approach. You 
>> just can't do it in the top level IOCTL handler, but rather need it 
>> somewhere between front end and backend.
>
>
> Can you point me to what patch was it ? Can't find.
>

What I mean was the approach in patch #3 in this series where he 
replaced the down_read/up_read with amdgpu_read_lock()/amdgpu_read_unlock().

I would just not call it amdgpu_read_lock()/amdgpu_read_unlock(), but 
something more descriptive.

Regards,
Christian.

>
>>
>>> Similar to handling device reset for unplug we obviously also need 
>>> to stop and block any MMIO accesses once device is unplugged and, as 
>>> Daniel Vetter mentioned - we have to do it before finishing 
>>> pci_remove (early device fini)
>>> and not later (when last device reference is dropped from user 
>>> space) in order to prevent reuse of MMIO space we still access by 
>>> other hot plugging devices. As in device reset case we need to 
>>> cancel all delay works, stop drm schedule, complete all unfinished 
>>> fences(both HW and scheduler fences). While you stated strong 
>>> objection to force signalling scheduler fences from GPU reset, quote:
>>>
>>> "you can't signal the dma_fence waiting. Waiting for a dma_fence 
>>> also means you wait for the GPU reset to finish. When we would 
>>> signal the dma_fence during the GPU reset then we would run into 
>>> memory corruption because the hardware jobs running after the GPU 
>>> reset would access memory which is already freed."
>>> To my understating this is a key difference with hot-unplug, the 
>>> device is gone, all those concerns are irrelevant and hence we can 
>>> actually force signal scheduler fences (setting and error to them 
>>> before) to force completion of any
>>> waiting clients such as possibly IOCTLs or async page flips e.t.c.
>>>
>>
>> Yes, absolutely correct. That's what I also mentioned to Daniel. When 
>> we are able to nuke the device and any memory access it might do we 
>> can also signal the fences.
>>
>>> Beyond blocking all delayed works and scheduler threads we also need 
>>> to guarantee no  IOCTL can access MMIO post device unplug OR in 
>>> flight IOCTLs are done before we finish pci_remove 
>>> (amdgpu_pci_remove for us).
>>> For this I suggest we do something like what we worked on with 
>>> Takashi Iwai the ALSA maintainer recently when he helped 
>>> implementing PCI BARs move support for snd_hda_intel. Take a look at
>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=cbaa324799718e2b828a8c7b5b001dd896748497 
>>> and
>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=e36365d9ab5bbc30bdc221ab4b3437de34492440
>>> We also had same issue there, how to prevent MMIO accesses while the 
>>> BARs are migrating. What was done there is a refcount was added to 
>>> count all IOCTLs in flight, for any in flight  IOCTL the BAR 
>>> migration handler would
>>> block for the refcount to drop to 0 before it would proceed, for any 
>>> later IOCTL it stops and wait if device is in migration state. We 
>>> even don't need the wait part, nothing to wait for, we just return 
>>> with -ENODEV for this case.
>>>
>>
>> This is essentially what the DRM SRCU is doing as well.
>>
>> For the hotplug case we could do this in the toplevel since we can 
>> signal the fence and don't need to block memory management.
>
>
> To make SRCU 'wait for' all IOCTLs in flight we would need to wrap 
> every IOCTL ( practically - just drm_ioctl function) with 
> drm_dev_enter/drm_dev_exit - can we do it ?
>
>
>>
>> But I'm not sure, maybe we should handle it the same way as reset or 
>> maybe we should have it at the top level.
>
>
> If by top level you mean checking for device unplugged and bailing out 
> at the entry to IOCTL or right at start of any work_item/timer 
> function we have then seems to me it's better and more clear. Once we 
> flushed all of them in flight there is no reason for them to execute 
> any more when device is unplugged.
>
> Andrey
>
>
>>
>> Regards,
>> Christian.
>>
>>> The above approach should allow us to wait for all the IOCTLs in 
>>> flight, together with stopping scheduler threads and cancelling and 
>>> flushing all in flight work items and timers i think It should give 
>>> as full solution for the hot-unplug case
>>> of preventing any MMIO accesses post device pci_remove.
>>>
>>> Let me know what you think guys.
>>>
>>> Andrey
>>>
>>>
>>>>
>>>> And then testing, testing, testing to see if we have missed something.
>>>>
>>>> Christian.
>>>>
>>>> Am 05.04.21 um 19:58 schrieb Andrey Grodzovsky:
>>>>>
>>>>> Denis, Christian, are there any updates in the plan on how to move 
>>>>> on with this ? As you know I need very similar code for my 
>>>>> up-streaming of device hot-unplug. My latest solution 
>>>>> (https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html) 
>>>>> was not acceptable because of low level guards on the register 
>>>>> accessors level which was hurting performance. Basically I need a 
>>>>> way to prevent any MMIO write accesses from kernel driver after 
>>>>> device is removed (UMD accesses are taken care of by page faulting 
>>>>> dummy page). We are using now hot-unplug code for Freemont program 
>>>>> and so up-streaming became more of a priority then before. This 
>>>>> MMIO access issue is currently my main blocker from up-streaming. 
>>>>> Is there any way I can assist in pushing this on ?
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 2021-03-18 5:51 a.m., Christian König wrote:
>>>>>> Am 18.03.21 um 10:30 schrieb Li, Dennis:
>>>>>>>
>>>>>>> >>> The GPU reset doesn't complete the fences we wait for. It 
>>>>>>> only completes the hardware fences as part of the reset.
>>>>>>>
>>>>>>> >>> So waiting for a fence while holding the reset lock is 
>>>>>>> illegal and needs to be avoided.
>>>>>>>
>>>>>>> I understood your concern. It is more complex for DRM GFX, 
>>>>>>> therefore I abandon adding lock protection for DRM ioctls now. 
>>>>>>> Maybe we can try to add all kernel  dma_fence waiting in a list, 
>>>>>>> and signal all in recovery threads. Do you have same concern for 
>>>>>>> compute cases?
>>>>>>>
>>>>>>
>>>>>> Yes, compute (KFD) is even harder to handle.
>>>>>>
>>>>>> See you can't signal the dma_fence waiting. Waiting for a 
>>>>>> dma_fence also means you wait for the GPU reset to finish.
>>>>>>
>>>>>> When we would signal the dma_fence during the GPU reset then we 
>>>>>> would run into memory corruption because the hardware jobs 
>>>>>> running after the GPU reset would access memory which is already 
>>>>>> freed.
>>>>>>
>>>>>>> >>> Lockdep also complains about this when it is used correctly. 
>>>>>>> The only reason it doesn't complain here is because you use an 
>>>>>>> atomic+wait_event instead of a locking primitive.
>>>>>>>
>>>>>>> Agree. This approach will escape the monitor of lockdep.  Its 
>>>>>>> goal is to block other threads when GPU recovery thread start. 
>>>>>>> But I couldn’t find a better method to solve this problem. Do 
>>>>>>> you have some suggestion?
>>>>>>>
>>>>>>
>>>>>> Well, completely abandon those change here.
>>>>>>
>>>>>> What we need to do is to identify where hardware access happens 
>>>>>> and then insert taking the read side of the GPU reset lock so 
>>>>>> that we don't wait for a dma_fence or allocate memory, but still 
>>>>>> protect the hardware from concurrent access and reset.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> Best Regards
>>>>>>>
>>>>>>> Dennis Li
>>>>>>>
>>>>>>> *From:* Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>> *Sent:* Thursday, March 18, 2021 4:59 PM
>>>>>>> *To:* Li, Dennis <Dennis.Li at amd.com>; 
>>>>>>> amd-gfx at lists.freedesktop.org; Deucher, Alexander 
>>>>>>> <Alexander.Deucher at amd.com>; Kuehling, Felix 
>>>>>>> <Felix.Kuehling at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>
>>>>>>> *Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to 
>>>>>>> enhance its stability
>>>>>>>
>>>>>>> Exactly that's what you don't seem to understand.
>>>>>>>
>>>>>>> The GPU reset doesn't complete the fences we wait for. It only 
>>>>>>> completes the hardware fences as part of the reset.
>>>>>>>
>>>>>>> So waiting for a fence while holding the reset lock is illegal 
>>>>>>> and needs to be avoided.
>>>>>>>
>>>>>>> Lockdep also complains about this when it is used correctly. The 
>>>>>>> only reason it doesn't complain here is because you use an 
>>>>>>> atomic+wait_event instead of a locking primitive.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>> ------------------------------------------------------------------------
>>>>>>>
>>>>>>> *Von:*Li, Dennis <Dennis.Li at amd.com <mailto:Dennis.Li at amd.com>>
>>>>>>> *Gesendet:* Donnerstag, 18. März 2021 09:28
>>>>>>> *An:* Koenig, Christian <Christian.Koenig at amd.com 
>>>>>>> <mailto:Christian.Koenig at amd.com>>; 
>>>>>>> amd-gfx at lists.freedesktop.org 
>>>>>>> <mailto:amd-gfx at lists.freedesktop.org> 
>>>>>>> <amd-gfx at lists.freedesktop.org 
>>>>>>> <mailto:amd-gfx at lists.freedesktop.org>>; Deucher, Alexander 
>>>>>>> <Alexander.Deucher at amd.com <mailto:Alexander.Deucher at amd.com>>; 
>>>>>>> Kuehling, Felix <Felix.Kuehling at amd.com 
>>>>>>> <mailto:Felix.Kuehling at amd.com>>; Zhang, Hawking 
>>>>>>> <Hawking.Zhang at amd.com <mailto:Hawking.Zhang at amd.com>>
>>>>>>> *Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to 
>>>>>>> enhance its stability
>>>>>>>
>>>>>>> >>> Those two steps need to be exchanged or otherwise it is 
>>>>>>> possible that new delayed work items etc are started before the 
>>>>>>> lock is taken.
>>>>>>> What about adding check for adev->in_gpu_reset in work item? If 
>>>>>>> exchange the two steps, it maybe introduce the deadlock.  For 
>>>>>>> example, the user thread hold the read lock and waiting for the 
>>>>>>> fence, if recovery thread try to hold write lock and then 
>>>>>>> complete fences, in this case, recovery thread will always be 
>>>>>>> blocked.
>>>>>>>
>>>>>>>
>>>>>>> Best Regards
>>>>>>> Dennis Li
>>>>>>> -----Original Message-----
>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com 
>>>>>>> <mailto:Christian.Koenig at amd.com>>
>>>>>>> Sent: Thursday, March 18, 2021 3:54 PM
>>>>>>> To: Li, Dennis <Dennis.Li at amd.com <mailto:Dennis.Li at amd.com>>; 
>>>>>>> amd-gfx at lists.freedesktop.org 
>>>>>>> <mailto:amd-gfx at lists.freedesktop.org>; Deucher, Alexander 
>>>>>>> <Alexander.Deucher at amd.com <mailto:Alexander.Deucher at amd.com>>; 
>>>>>>> Kuehling, Felix <Felix.Kuehling at amd.com 
>>>>>>> <mailto:Felix.Kuehling at amd.com>>; Zhang, Hawking 
>>>>>>> <Hawking.Zhang at amd.com <mailto:Hawking.Zhang at amd.com>>
>>>>>>> Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance 
>>>>>>> its stability
>>>>>>>
>>>>>>> Am 18.03.21 um 08:23 schrieb Dennis Li:
>>>>>>> > We have defined two variables in_gpu_reset and reset_sem in 
>>>>>>> adev object. The atomic type variable in_gpu_reset is used to 
>>>>>>> avoid recovery thread reenter and make lower functions return 
>>>>>>> more earlier when recovery start, but couldn't block recovery 
>>>>>>> thread when it access hardware. The r/w semaphore reset_sem is 
>>>>>>> used to solve these synchronization issues between recovery 
>>>>>>> thread and other threads.
>>>>>>> >
>>>>>>> > The original solution locked registers' access in lower 
>>>>>>> functions, which will introduce following issues:
>>>>>>> >
>>>>>>> > 1) many lower functions are used in both recovery thread and 
>>>>>>> others. Firstly we must harvest these functions, it is easy to 
>>>>>>> miss someones. Secondly these functions need select which lock 
>>>>>>> (read lock or write lock) will be used, according to the thread 
>>>>>>> it is running in. If the thread context isn't considered, the 
>>>>>>> added lock will easily introduce deadlock. Besides that, in most 
>>>>>>> time, developer easily forget to add locks for new functions.
>>>>>>> >
>>>>>>> > 2) performance drop. More lower functions are more frequently 
>>>>>>> called.
>>>>>>> >
>>>>>>> > 3) easily introduce false positive lockdep complaint, because 
>>>>>>> write lock has big range in recovery thread, but low level 
>>>>>>> functions will hold read lock may be protected by other locks in 
>>>>>>> other threads.
>>>>>>> >
>>>>>>> > Therefore the new solution will try to add lock protection for 
>>>>>>> ioctls of kfd. Its goal is that there are no threads except for 
>>>>>>> recovery thread or its children (for xgmi) to access hardware 
>>>>>>> when doing GPU reset and resume. So refine recovery thread as 
>>>>>>> the following:
>>>>>>> >
>>>>>>> > Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
>>>>>>> >     1). if failed, it means system had a recovery thread 
>>>>>>> running, current thread exit directly;
>>>>>>> >     2). if success, enter recovery thread;
>>>>>>> >
>>>>>>> > Step 1: cancel all delay works, stop drm schedule, complete 
>>>>>>> all unreceived fences and so on. It try to stop or pause other 
>>>>>>> threads.
>>>>>>> >
>>>>>>> > Step 2: call down_write(&adev->reset_sem) to hold write lock, 
>>>>>>> which will block recovery thread until other threads release 
>>>>>>> read locks.
>>>>>>>
>>>>>>> Those two steps need to be exchanged or otherwise it is possible 
>>>>>>> that new delayed work items etc are started before the lock is 
>>>>>>> taken.
>>>>>>>
>>>>>>> Just to make it clear until this is fixed the whole patch set is 
>>>>>>> a NAK.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> >
>>>>>>> > Step 3: normally, there is only recovery threads running to 
>>>>>>> access hardware, it is safe to do gpu reset now.
>>>>>>> >
>>>>>>> > Step 4: do post gpu reset, such as call all ips' resume functions;
>>>>>>> >
>>>>>>> > Step 5: atomic set adev->in_gpu_reset as 0, wake up other 
>>>>>>> threads and release write lock. Recovery thread exit normally.
>>>>>>> >
>>>>>>> > Other threads call the amdgpu_read_lock to synchronize with 
>>>>>>> recovery thread. If it finds that in_gpu_reset is 1, it should 
>>>>>>> release read lock if it has holden one, and then blocks itself 
>>>>>>> to wait for recovery finished event. If thread successfully hold 
>>>>>>> read lock and in_gpu_reset is 0, it continues. It will exit 
>>>>>>> normally or be stopped by recovery thread in step 1.
>>>>>>> >
>>>>>>> > Dennis Li (4):
>>>>>>> >    drm/amdgpu: remove reset lock from low level functions
>>>>>>> >    drm/amdgpu: refine the GPU recovery sequence
>>>>>>> >    drm/amdgpu: instead of using down/up_read directly
>>>>>>> >    drm/amdkfd: add reset lock protection for kfd entry functions
>>>>>>> >
>>>>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu.h |   6 +
>>>>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>>>>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 173 
>>>>>>> +++++++++++++-----
>>>>>>> > .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c |   8 -
>>>>>>> > drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c |   4 +-
>>>>>>> > drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c |   9 +-
>>>>>>> > drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c |   5 +-
>>>>>>> > drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c |   5 +-
>>>>>>> > drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 172 ++++++++++++++++-
>>>>>>> > drivers/gpu/drm/amd/amdkfd/kfd_priv.h |   3 +-
>>>>>>> > drivers/gpu/drm/amd/amdkfd/kfd_process.c |   4 +
>>>>>>> > .../amd/amdkfd/kfd_process_queue_manager.c |  17 ++
>>>>>>> >   12 files changed, 345 insertions(+), 75 deletions(-)
>>>>>>> >
>>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> amd-gfx mailing list
>>>>>> amd-gfx at lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20210408/189ae6c0/attachment-0001.htm>