[PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Thu Apr 8 16:08:02 UTC 2021
On 2021-04-08 4:32 a.m., Christian König wrote:
>
>
> Am 08.04.21 um 10:22 schrieb Christian König:
>> Hi Andrey,
>>
>> Am 07.04.21 um 21:44 schrieb Andrey Grodzovsky:
>>>
>>>
>>> On 2021-04-07 6:28 a.m., Christian König wrote:
>>>> Hi Andrey,
>>>>
>>>> Am 06.04.21 um 23:22 schrieb Andrey Grodzovsky:
>>>>>
>>>>> Hey Christian, Dennis, see below -
>>>>>
>>>>> On 2021-04-06 6:34 a.m., Christian König wrote:
>>>>>> Hi Andrey,
>>>>>>
>>>>>> well good question. My job is to watch over the implementation
>>>>>> and design, and while I always help, I can adjust anybody's schedule.
>>>>>>
>>>>>> Is the patch to print a warning when the hardware is accessed
>>>>>> without holding the locks merged yet? If not then that would
>>>>>> probably be a good starting point.
>>>>>
>>>>>
>>>>> It's merged into amd-staging-drm-next, and since I work on
>>>>> drm-misc-next I will cherry-pick it there.
>>>>>
>>>>
>>>> Ok good to know, I haven't tracked that one further.
>>>>
>>>>>
>>>>>>
>>>>>> Then we would need to unify this with the SRCU to make sure that
>>>>>> we have both the reset lock as well as block the hotplug code
>>>>>> from reusing the MMIO space.
>>>>>
>>>>> In my understanding there is a significant difference between
>>>>> handling GPU reset and unplug - while the GPU reset use case
>>>>> requires any HW-accessing code to block, wait for the reset to
>>>>> finish and then proceed, hot-unplug is permanent and hence there
>>>>> is no need to wait and proceed, but rather to abort at once.
>>>>>
>>>>
>>>> Yes, absolutely correct.
>>>>
>>>>> This is why I think that in any place where we already check for
>>>>> device reset we should also add a check for hot-unplug, but the
>>>>> handling would be different in that for hot-unplug we would abort
>>>>> instead of keep waiting.
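Just to illustrate what I mean, a rough sketch (the helper name is made
up; only drm_dev_is_unplugged() and the reset_sem from this series are
real):

static int amdgpu_hw_access_begin(struct amdgpu_device *adev)
{
        /* Hot-unplug is permanent - abort instead of waiting. */
        if (drm_dev_is_unplugged(adev_to_drm(adev)))
                return -ENODEV;

        /* GPU reset is transient - block until it finishes. */
        down_read(&adev->reset_sem);

        /* Re-check, the device may have gone away while we slept. */
        if (drm_dev_is_unplugged(adev_to_drm(adev))) {
                up_read(&adev->reset_sem);
                return -ENODEV;
        }

        return 0;
}

Callers would bail out on -ENODEV and only touch registers between a
successful _begin() and the matching up_read().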
>>>>>
>>>>
>>>> Yes, that's the rough picture in my head as well.
>>>>
>>>> Essentially Daniel's patch of having an
>>>> amdgpu_device_hwaccess_begin()/_end() was the right approach. You
>>>> just can't do it in the top level IOCTL handler, but rather need it
>>>> somewhere between the front end and the backend.
>>>
>>>
>>> Can you point me to which patch that was? I can't find it.
>>>
>>
>> What I meant was the approach in patch #3 of this series where he
>> replaced down_read()/up_read() with
>> amdgpu_read_lock()/amdgpu_read_unlock().
>>
>> I would just not call it amdgpu_read_lock()/amdgpu_read_unlock(), but
>> something more descriptive.
>>
>> Regards,
>> Christian.
>>
>>>
>>>>
>>>>> Similar to handling device reset, for unplug we obviously also
>>>>> need to stop and block any MMIO accesses once the device is
>>>>> unplugged and, as Daniel Vetter mentioned, we have to do it before
>>>>> finishing pci_remove (early device fini) and not later (when the
>>>>> last device reference is dropped from user space), in order to
>>>>> prevent MMIO space that we still access from being reused by other
>>>>> hot-plugged devices. As in the device reset case we need to cancel
>>>>> all delayed works, stop the drm scheduler and complete all
>>>>> unfinished fences (both HW and scheduler fences). You stated a
>>>>> strong objection to force-signalling scheduler fences from GPU
>>>>> reset, quote:
>>>>>
>>>>> "you can't signal the dma_fence waiting. Waiting for a dma_fence
>>>>> also means you wait for the GPU reset to finish. When we would
>>>>> signal the dma_fence during the GPU reset then we would run into
>>>>> memory corruption because the hardware jobs running after the GPU
>>>>> reset would access memory which is already freed."
>>>>>
>>>>> To my understanding this is the key difference with hot-unplug:
>>>>> the device is gone, all those concerns are irrelevant and hence we
>>>>> can actually force-signal the scheduler fences (setting an error
>>>>> on them beforehand) to force completion of any waiting clients
>>>>> such as IOCTLs or async page flips etc.
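For reference, the core of the force signaling I mean, on a single
fence (walking all outstanding HW/scheduler fences is driver specific
and not shown here):

static void force_complete_fence(struct dma_fence *fence)
{
        /* Set an error first so waiters can tell device removal apart
         * from normal completion, then signal. */
        dma_fence_set_error(fence, -ENODEV);
        dma_fence_signal(fence);
}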
>>>>>
>>>>
>>>> Yes, absolutely correct. That's what I also mentioned to Daniel.
>>>> When we are able to nuke the device and any memory access it might
>>>> do we can also signal the fences.
>>>>
>>>>> Beyond blocking all delayed works and scheduler threads we also
>>>>> need to guarantee that no IOCTL can access MMIO post device
>>>>> unplug, OR that in-flight IOCTLs are done before we finish
>>>>> pci_remove (amdgpu_pci_remove for us).
>>>>> For this I suggest we do something like what we worked on with
>>>>> Takashi Iwai, the ALSA maintainer, recently when he helped
>>>>> implementing PCI BAR move support for snd_hda_intel. Take a look at
>>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=cbaa324799718e2b828a8c7b5b001dd896748497
>>>>> and
>>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=e36365d9ab5bbc30bdc221ab4b3437de34492440
>>>>> We had the same issue there: how to prevent MMIO accesses while
>>>>> the BARs are migrating. What was done there is that a refcount was
>>>>> added to count all IOCTLs in flight; the BAR migration handler
>>>>> would block until that refcount dropped to 0 before proceeding,
>>>>> and any later IOCTL would stop and wait if the device was in the
>>>>> migration state. We don't even need the wait part - there is
>>>>> nothing to wait for, we just return -ENODEV in this case.
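A simplified sketch of that refcount scheme using generic kernel
primitives (all names here are made up for illustration, the actual
snd_hda patches in the links above differ in the details):

struct hotplug_guard {
        atomic_t                ioctl_count;    /* IOCTLs in flight */
        bool                    removing;       /* set on pci_remove */
        wait_queue_head_t       ioctl_drained;
};

static int guard_ioctl_enter(struct hotplug_guard *g)
{
        if (READ_ONCE(g->removing))
                return -ENODEV;         /* nothing to wait for, bail */

        atomic_inc(&g->ioctl_count);

        /* Close the race with the remove path setting 'removing'. */
        if (READ_ONCE(g->removing)) {
                if (atomic_dec_and_test(&g->ioctl_count))
                        wake_up(&g->ioctl_drained);
                return -ENODEV;
        }
        return 0;
}

static void guard_ioctl_exit(struct hotplug_guard *g)
{
        if (atomic_dec_and_test(&g->ioctl_count))
                wake_up(&g->ioctl_drained);
}

static void guard_begin_remove(struct hotplug_guard *g)
{
        WRITE_ONCE(g->removing, true);
        /* Wait until every IOCTL already in flight has finished. */
        wait_event(g->ioctl_drained, atomic_read(&g->ioctl_count) == 0);
}

Every IOCTL would be bracketed with guard_ioctl_enter()/_exit(), and
pci_remove would call guard_begin_remove() before tearing down the
MMIO mappings.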
>>>>>
>>>>
>>>> This is essentially what the DRM SRCU is doing as well.
>>>>
>>>> For the hotplug case we could do this at the top level since we can
>>>> signal the fence and don't need to block memory management.
>>>
>>>
>>> To make SRCU 'wait for' all IOCTLs in flight we would need to wrap
>>> every IOCTL (practically just the drm_ioctl function) with
>>> drm_dev_enter/drm_dev_exit - can we do that?
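Something along these lines, i.e. a single wrapper at the drm_ioctl
level (the wrapper name is made up; drm_dev_enter()/drm_dev_exit() and
drm_ioctl() are the existing APIs):

static long amdgpu_drm_ioctl_guarded(struct file *filp, unsigned int cmd,
                                     unsigned long arg)
{
        struct drm_file *file_priv = filp->private_data;
        struct drm_device *dev = file_priv->minor->dev;
        long ret;
        int idx;

        if (!drm_dev_enter(dev, &idx))
                return -ENODEV;         /* device already unplugged */

        ret = drm_ioctl(filp, cmd, arg);

        drm_dev_exit(idx);
        return ret;
}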
>>>
>
> Sorry totally missed this question.
>
> Yes, exactly that. As discussed for the hotplug case we can do this.
Thinking more about it - assuming we are treating synchronize_srcu as a
'wait for completion' of any in-flight {drm_dev_enter, drm_dev_exit}
scope, some of those scopes might do a dma_fence_wait inside. Since we
haven't force-signaled the fences yet we will end up in a deadlock. We
have to signal all the various fences before doing the 'wait for'. But
we can't signal the fences before setting 'dev->unplugged = true',
because otherwise we would fail to reject further CS and other stuff
which might create more fences that we were supposed to force-signal
and would now miss. Effectively, setting 'dev->unplugged = true' and
doing synchronize_srcu in one call, like drm_dev_unplug does, without
signalling all the fences in the device between these two steps looks
like a possible deadlock to me - what do you think?
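Sketch of the reader side I am worried about (purely illustrative,
assuming some in-flight path that waits on a fence inside its
drm_dev_enter/drm_dev_exit section):

static long wait_inside_enter_exit_section(struct drm_device *dev,
                                           struct dma_fence *fence)
{
        long r;
        int idx;

        if (!drm_dev_enter(dev, &idx))
                return -ENODEV;

        /* If drm_dev_unplug() has already set dev->unplugged and gone
         * straight to synchronize_srcu() without 'fence' being force
         * signaled first, this wait never returns, drm_dev_exit()
         * below is never reached and synchronize_srcu() never
         * completes either. */
        r = dma_fence_wait(fence, false);

        drm_dev_exit(idx);
        return r;
}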
Andrey
>
>>>
>>>>
>>>> But I'm not sure, maybe we should handle it the same way as reset
>>>> or maybe we should have it at the top level.
>>>
>>>
>>> If by top level you mean checking for device unplugged and bailing
>>> out at the entry to an IOCTL or right at the start of any
>>> work_item/timer function we have, then that seems better and clearer
>>> to me. Once we have flushed all of the in-flight ones there is no
>>> reason for them to execute any more when the device is unplugged.
>>>
>
> Well I'm open to both approaches. I just think having
> drm_dev_enter/exit on each work item would be more defensive in case
> we forget to cancel/sync one.
>
> Christian.
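As a defensive guard it would look roughly like this at the top of each
handler (the work item and its field name here are hypothetical):

static void amdgpu_example_work_handler(struct work_struct *work)
{
        struct amdgpu_device *adev =
                container_of(work, struct amdgpu_device, example_work);
        int idx;

        /* Even if we forget to cancel/flush this work on unplug, it
         * bails out here before touching any MMIO. */
        if (!drm_dev_enter(adev_to_drm(adev), &idx))
                return;

        /* ... register access goes here ... */

        drm_dev_exit(idx);
}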
>
>>> Andrey
>>>
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> The above approach should allow us to wait for all the IOCTLs in
>>>>> flight; together with stopping the scheduler threads and
>>>>> cancelling and flushing all in-flight work items and timers, I
>>>>> think it should give us a full solution for the hot-unplug case of
>>>>> preventing any MMIO accesses post device pci_remove.
>>>>>
>>>>> Let me know what you think guys.
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>> And then testing, testing, testing to see if we have missed
>>>>>> something.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>> Am 05.04.21 um 19:58 schrieb Andrey Grodzovsky:
>>>>>>>
>>>>>>> Dennis, Christian, are there any updates on the plan for how to
>>>>>>> move on with this? As you know I need very similar code for my
>>>>>>> upstreaming of device hot-unplug. My latest solution
>>>>>>> (https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html)
>>>>>>> was not acceptable because of the low level guards at the
>>>>>>> register accessor level, which were hurting performance.
>>>>>>> Basically I need a way to prevent any MMIO write accesses from
>>>>>>> the kernel driver after the device is removed (UMD accesses are
>>>>>>> taken care of by page-faulting a dummy page). We are now using
>>>>>>> the hot-unplug code for the Freemont program and so upstreaming
>>>>>>> has become more of a priority than before. This MMIO access
>>>>>>> issue is currently my main blocker for upstreaming. Is there any
>>>>>>> way I can assist in pushing this forward?
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>> On 2021-03-18 5:51 a.m., Christian König wrote:
>>>>>>>> Am 18.03.21 um 10:30 schrieb Li, Dennis:
>>>>>>>>>
>>>>>>>>> >>> The GPU reset doesn't complete the fences we wait for. It
>>>>>>>>> only completes the hardware fences as part of the reset.
>>>>>>>>>
>>>>>>>>> >>> So waiting for a fence while holding the reset lock is
>>>>>>>>> illegal and needs to be avoided.
>>>>>>>>>
>>>>>>>>> I understand your concern. It is more complex for DRM GFX,
>>>>>>>>> therefore I am abandoning adding lock protection for DRM
>>>>>>>>> ioctls for now. Maybe we can try to add all kernel dma_fence
>>>>>>>>> waits to a list and signal them all in the recovery thread. Do
>>>>>>>>> you have the same concern for the compute cases?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, compute (KFD) is even harder to handle.
>>>>>>>>
>>>>>>>> See you can't signal the dma_fence waiting. Waiting for a
>>>>>>>> dma_fence also means you wait for the GPU reset to finish.
>>>>>>>>
>>>>>>>> When we would signal the dma_fence during the GPU reset then we
>>>>>>>> would run into memory corruption because the hardware jobs
>>>>>>>> running after the GPU reset would access memory which is
>>>>>>>> already freed.
>>>>>>>>
>>>>>>>>> >>> Lockdep also complains about this when it is used
>>>>>>>>> correctly. The only reason it doesn't complain here is because
>>>>>>>>> you use an atomic+wait_event instead of a locking primitive.
>>>>>>>>>
>>>>>>>>> Agreed. This approach escapes lockdep's monitoring. Its goal
>>>>>>>>> is to block other threads when the GPU recovery thread starts.
>>>>>>>>> But I couldn't find a better method to solve this problem. Do
>>>>>>>>> you have a suggestion?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Well, completely abandon those changes here.
>>>>>>>>
>>>>>>>> What we need to do is to identify where hardware access happens
>>>>>>>> and then insert taking the read side of the GPU reset lock so
>>>>>>>> that we don't wait for a dma_fence or allocate memory, but
>>>>>>>> still protect the hardware from concurrent access and reset.
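For the record, the pattern described here in code form (function and
register are made up, only the reset_sem is from this series):

static int example_program_hw(struct amdgpu_device *adev)
{
        /* Take the read side only around the actual HW access ... */
        down_read(&adev->reset_sem);

        /* ... and neither wait for a dma_fence nor allocate memory
         * while holding it, otherwise we can deadlock against the
         * reset thread holding the write side. */
        WREG32(mmEXAMPLE_REG, 0x1);     /* hypothetical register */

        up_read(&adev->reset_sem);
        return 0;
}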
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>> Best Regards
>>>>>>>>>
>>>>>>>>> Dennis Li
>>>>>>>>>
>>>>>>>>> *From:* Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>> *Sent:* Thursday, March 18, 2021 4:59 PM
>>>>>>>>> *To:* Li, Dennis <Dennis.Li at amd.com>;
>>>>>>>>> amd-gfx at lists.freedesktop.org; Deucher, Alexander
>>>>>>>>> <Alexander.Deucher at amd.com>; Kuehling, Felix
>>>>>>>>> <Felix.Kuehling at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>
>>>>>>>>> *Subject:* RE: [PATCH 0/4] Refine GPU recovery sequence to
>>>>>>>>> enhance its stability
>>>>>>>>>
>>>>>>>>> Exactly that's what you don't seem to understand.
>>>>>>>>>
>>>>>>>>> The GPU reset doesn't complete the fences we wait for. It only
>>>>>>>>> completes the hardware fences as part of the reset.
>>>>>>>>>
>>>>>>>>> So waiting for a fence while holding the reset lock is illegal
>>>>>>>>> and needs to be avoided.
>>>>>>>>>
>>>>>>>>> Lockdep also complains about this when it is used correctly.
>>>>>>>>> The only reason it doesn't complain here is because you use an
>>>>>>>>> atomic+wait_event instead of a locking primitive.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> *From:* Li, Dennis <Dennis.Li at amd.com>
>>>>>>>>> *Sent:* Thursday, 18 March 2021 09:28
>>>>>>>>> *To:* Koenig, Christian <Christian.Koenig at amd.com>;
>>>>>>>>> amd-gfx at lists.freedesktop.org; Deucher, Alexander
>>>>>>>>> <Alexander.Deucher at amd.com>; Kuehling, Felix
>>>>>>>>> <Felix.Kuehling at amd.com>; Zhang, Hawking
>>>>>>>>> <Hawking.Zhang at amd.com>
>>>>>>>>> *Subject:* RE: [PATCH 0/4] Refine GPU recovery sequence to
>>>>>>>>> enhance its stability
>>>>>>>>>
>>>>>>>>> >>> Those two steps need to be exchanged or otherwise it is
>>>>>>>>> possible that new delayed work items etc are started before
>>>>>>>>> the lock is taken.
>>>>>>>>> What about adding a check for adev->in_gpu_reset in the work
>>>>>>>>> item? If we exchange the two steps, it may introduce a
>>>>>>>>> deadlock. For example, a user thread holds the read lock and
>>>>>>>>> waits for a fence; if the recovery thread tries to take the
>>>>>>>>> write lock and only then completes the fences, the recovery
>>>>>>>>> thread will be blocked forever.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best Regards
>>>>>>>>> Dennis Li
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>> Sent: Thursday, March 18, 2021 3:54 PM
>>>>>>>>> To: Li, Dennis <Dennis.Li at amd.com>;
>>>>>>>>> amd-gfx at lists.freedesktop.org; Deucher, Alexander
>>>>>>>>> <Alexander.Deucher at amd.com>; Kuehling, Felix
>>>>>>>>> <Felix.Kuehling at amd.com>; Zhang, Hawking
>>>>>>>>> <Hawking.Zhang at amd.com>
>>>>>>>>> Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to
>>>>>>>>> enhance its stability
>>>>>>>>>
>>>>>>>>> Am 18.03.21 um 08:23 schrieb Dennis Li:
>>>>>>>>> > We have defined two variables, in_gpu_reset and reset_sem,
>>>>>>>>> > in the adev object. The atomic variable in_gpu_reset is used
>>>>>>>>> > to prevent the recovery thread from re-entering and to make
>>>>>>>>> > lower level functions return earlier when recovery starts,
>>>>>>>>> > but it cannot block other threads from accessing the
>>>>>>>>> > hardware during recovery. The r/w semaphore reset_sem is
>>>>>>>>> > used to solve these synchronization issues between the
>>>>>>>>> > recovery thread and other threads.
>>>>>>>>> >
>>>>>>>>> > The original solution locked register access in the lower
>>>>>>>>> > level functions, which introduces the following issues:
>>>>>>>>> >
>>>>>>>>> > 1) Many lower level functions are used by both the recovery
>>>>>>>>> > thread and other threads. Firstly we must find all of these
>>>>>>>>> > functions, and it is easy to miss some. Secondly these
>>>>>>>>> > functions need to select which lock (read lock or write
>>>>>>>>> > lock) to use according to the thread they are running in; if
>>>>>>>>> > the thread context isn't considered, the added lock will
>>>>>>>>> > easily introduce a deadlock. Besides that, developers will
>>>>>>>>> > often forget to add locks for new functions.
>>>>>>>>> >
>>>>>>>>> > 2) Performance drop. The lower level a function is, the
>>>>>>>>> > more frequently it is called.
>>>>>>>>> >
>>>>>>>>> > 3) It easily introduces false positive lockdep complaints,
>>>>>>>>> > because the write lock covers a big range in the recovery
>>>>>>>>> > thread, while the low level functions holding the read lock
>>>>>>>>> > may be protected by other locks in other threads.
>>>>>>>>> >
>>>>>>>>> > Therefore the new solution tries to add lock protection for
>>>>>>>>> > the kfd ioctls. Its goal is that no threads except for the
>>>>>>>>> > recovery thread or its children (for xgmi) access the
>>>>>>>>> > hardware while doing GPU reset and resume. So the recovery
>>>>>>>>> > thread is refined as follows:
>>>>>>>>> >
>>>>>>>>> > Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
>>>>>>>>> > 1) If it fails, the system already has a recovery thread
>>>>>>>>> > running and the current thread exits directly;
>>>>>>>>> > 2) If it succeeds, enter the recovery thread;
>>>>>>>>> >
>>>>>>>>> > Step 1: Cancel all delayed works, stop the drm scheduler,
>>>>>>>>> > complete all unreceived fences and so on. This tries to stop
>>>>>>>>> > or pause the other threads.
>>>>>>>>> >
>>>>>>>>> > Step 2: Call down_write(&adev->reset_sem) to take the write
>>>>>>>>> > lock, which will block the recovery thread until the other
>>>>>>>>> > threads release their read locks.
>>>>>>>>>
>>>>>>>>> Those two steps need to be exchanged or otherwise it is
>>>>>>>>> possible that new delayed work items etc are started before
>>>>>>>>> the lock is taken.
>>>>>>>>>
>>>>>>>>> Just to make it clear: until this is fixed the whole patch set
>>>>>>>>> is a NAK.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> >
>>>>>>>>> > Step 3: Normally only the recovery thread is left running
>>>>>>>>> > and accessing the hardware, so it is safe to do the GPU
>>>>>>>>> > reset now.
>>>>>>>>> >
>>>>>>>>> > Step 4: Do the post GPU reset work, such as calling all
>>>>>>>>> > IPs' resume functions;
>>>>>>>>> >
>>>>>>>>> > Step 5: Atomically set adev->in_gpu_reset back to 0, wake up
>>>>>>>>> > the other threads and release the write lock. The recovery
>>>>>>>>> > thread exits normally.
>>>>>>>>> >
>>>>>>>>> > Other threads call amdgpu_read_lock to synchronize with the
>>>>>>>>> > recovery thread. If a thread finds that in_gpu_reset is 1,
>>>>>>>>> > it releases the read lock if it holds one and then blocks
>>>>>>>>> > itself waiting for the recovery-finished event. If the
>>>>>>>>> > thread successfully takes the read lock and in_gpu_reset is
>>>>>>>>> > 0, it continues. It will exit normally or be stopped by the
>>>>>>>>> > recovery thread in step 1.
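The described amdgpu_read_lock() behaviour, paraphrased in code - this
is my reading of the cover letter, not the actual patch, and
recovery_done_wq is an assumed wait queue:

static int amdgpu_read_lock_sketch(struct amdgpu_device *adev)
{
        for (;;) {
                down_read(&adev->reset_sem);

                if (!atomic_read(&adev->in_gpu_reset))
                        return 0;       /* no recovery running, go on */

                /* Recovery started: drop the read lock so the recovery
                 * thread can take the write side, wait for the
                 * "recovery finished" event, then retry. */
                up_read(&adev->reset_sem);
                wait_event(adev->recovery_done_wq,
                           !atomic_read(&adev->in_gpu_reset));
        }
}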
>>>>>>>>> >
>>>>>>>>> > Dennis Li (4):
>>>>>>>>> > drm/amdgpu: remove reset lock from low level functions
>>>>>>>>> > drm/amdgpu: refine the GPU recovery sequence
>>>>>>>>> > drm/amdgpu: instead of using down/up_read directly
>>>>>>>>> > drm/amdkfd: add reset lock protection for kfd entry functions
>>>>>>>>> >
>>>>>>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 6 +
>>>>>>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +-
>>>>>>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 173 +++++++++++++-----
>>>>>>>>> > .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 8 -
>>>>>>>>> > drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 4 +-
>>>>>>>>> > drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 9 +-
>>>>>>>>> > drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 5 +-
>>>>>>>>> > drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 5 +-
>>>>>>>>> > drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 172 ++++++++++++++++-
>>>>>>>>> > drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 3 +-
>>>>>>>>> > drivers/gpu/drm/amd/amdkfd/kfd_process.c | 4 +
>>>>>>>>> > .../amd/amdkfd/kfd_process_queue_manager.c | 17 ++
>>>>>>>>> > 12 files changed, 345 insertions(+), 75 deletions(-)
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>