[PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Christian König
ckoenig.leichtzumerken at gmail.com
Thu Apr 8 08:32:45 UTC 2021
On 08.04.21 at 10:22, Christian König wrote:
> Hi Andrey,
>
> On 07.04.21 at 21:44, Andrey Grodzovsky wrote:
>>
>>
>> On 2021-04-07 6:28 a.m., Christian König wrote:
>>> Hi Andrey,
>>>
>>> On 06.04.21 at 23:22, Andrey Grodzovsky wrote:
>>>>
>>>> Hey Christian, Dennis, see below -
>>>>
>>>> On 2021-04-06 6:34 a.m., Christian König wrote:
>>>>> Hi Andrey,
>>>>>
>>>>> well, good question. My job is to watch over the implementation and
>>>>> design, and while I always help I can't adjust anybody's schedule.
>>>>>
>>>>> Is the patch to print a warning when the hardware is accessed
>>>>> without holding the locks merged yet? If not then that would
>>>>> probably be a good starting point.
>>>>
>>>>
>>>> It's merged into amd-staging-drm-next, and since I work on
>>>> drm-misc-next I will cherry-pick it there.
>>>>
>>>
>>> Ok good to know, I haven't tracked that one further.
>>>
>>>>
>>>>>
>>>>> Then we would need to unify this with the SRCU to make sure that
>>>>> we both hold the reset lock and block the hotplug code from
>>>>> reusing the MMIO space.
>>>>
>>>> In my understanding there is a significant difference between the
>>>> handling of GPU reset and unplug: while the GPU reset use case
>>>> requires any HW-accessing code to block, wait for the reset to
>>>> finish and then proceed, hot-unplug
>>>> is permanent and hence there is no need to wait and proceed, but
>>>> rather to abort at once.
>>>>
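>>>> To put the difference in code terms (just a sketch; reset_sem is the
>>>> existing reset lock, the drm_dev_enter() usage is what unplug would
>>>> add):
>>>>
>>>>     /* GPU reset: block until the reset is done, then proceed. */
>>>>     down_read(&adev->reset_sem);
>>>>     /* ... MMIO access ... */
>>>>     up_read(&adev->reset_sem);
>>>>
>>>>     /* Hot-unplug: the device never comes back, so abort at once. */
>>>>     if (!drm_dev_enter(ddev, &idx))
>>>>             return -ENODEV;
>>>>     /* ... MMIO access ... */
>>>>     drm_dev_exit(idx);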
>>>
>>> Yes, absolutely correct.
>>>
>>>> This is why I think that in any place where we already check for
>>>> device reset we should also add a check for hot-unplug, but the
>>>> handling would be different
>>>> in that for hot-unplug we would abort instead of keep waiting.
>>>>
>>>
>>> Yes, that's the rough picture in my head as well.
>>>
>>> Essentially Daniel's patch of having an
>>> amdgpu_device_hwaccess_begin()/_end() was the right approach. You
>>> just can't do it in the top level IOCTL handler, but rather need it
>>> somewhere between the frontend and the backend.
>>
>>
>> Can you point me to which patch that was? I can't find it.
>>
>
> What I meant was the approach in patch #3 of this series, where he
> replaced the down_read/up_read with
> amdgpu_read_lock()/amdgpu_read_unlock().
>
> I would just not call it amdgpu_read_lock()/amdgpu_read_unlock(), but
> something more descriptive.
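>
> As a rough sketch of what I have in mind (the wait queue is
> hypothetical, the rest follows the semantics described below for
> amdgpu_read_lock):
>
>     /* Enter a HW access section; wait out any reset in flight. */
>     static void amdgpu_device_hwaccess_begin(struct amdgpu_device *adev)
>     {
>             down_read(&adev->reset_sem);
>             while (atomic_read(&adev->in_gpu_reset)) {
>                     up_read(&adev->reset_sem);
>                     /* reset_wait is a hypothetical wait queue woken
>                      * by the recovery thread when it is done. */
>                     wait_event(adev->reset_wait,
>                                !atomic_read(&adev->in_gpu_reset));
>                     down_read(&adev->reset_sem);
>             }
>     }
>
>     /* Leave the HW access section. */
>     static void amdgpu_device_hwaccess_end(struct amdgpu_device *adev)
>     {
>             up_read(&adev->reset_sem);
>     }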
>
> Regards,
> Christian.
>
>>
>>>
>>>> Similar to handling device reset, for unplug we obviously also need
>>>> to stop and block any MMIO accesses once the device is unplugged
>>>> and, as Daniel Vetter mentioned, we have to do it before finishing
>>>> pci_remove (early device fini)
>>>> and not later (when the last device reference is dropped from user
>>>> space) in order to prevent reuse of the MMIO space we still access
>>>> by other hot-plugged devices. As in the device reset case we need to
>>>> cancel all delayed works, stop the drm scheduler and complete all
>>>> unfinished fences (both HW and scheduler fences). You stated a
>>>> strong objection to force-signalling scheduler fences from GPU
>>>> reset, quote:
>>>>
>>>> "you can't signal the dma_fence waiting. Waiting for a dma_fence
>>>> also means you wait for the GPU reset to finish. When we would
>>>> signal the dma_fence during the GPU reset then we would run into
>>>> memory corruption because the hardware jobs running after the GPU
>>>> reset would access memory which is already freed."
>>>>
>>>> To my understanding this is the key difference with hot-unplug: the
>>>> device is gone, all those concerns are irrelevant, and hence we can
>>>> actually force-signal the scheduler fences (setting an error on
>>>> them first) to force completion of any
>>>> waiting clients such as IOCTLs or async page flips etc.
>>>>
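>>>> A minimal sketch of that force-signal idea (the helper is
>>>> hypothetical and assumes the drm scheduler's pending_list as in
>>>> current drm-misc-next):
>>>>
>>>>     static void amdgpu_force_complete_fences(struct drm_gpu_scheduler *sched)
>>>>     {
>>>>             struct drm_sched_job *job;
>>>>
>>>>             spin_lock(&sched->job_list_lock);
>>>>             list_for_each_entry(job, &sched->pending_list, list) {
>>>>                     struct dma_fence *f = &job->s_fence->finished;
>>>>
>>>>                     /* Device is gone: set the error first, then
>>>>                      * complete the fence so waiters can bail out. */
>>>>                     dma_fence_set_error(f, -ENODEV);
>>>>                     dma_fence_signal(f);
>>>>             }
>>>>             spin_unlock(&sched->job_list_lock);
>>>>     }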
>>>
>>> Yes, absolutely correct. That's what I also mentioned to Daniel.
>>> When we are able to nuke the device and any memory access it might
>>> do, we can also signal the fences.
>>>
>>>> Beyond blocking all delayed works and scheduler threads we also
>>>> need to guarantee that no IOCTL can access MMIO post device unplug,
>>>> OR that all in-flight IOCTLs are done before we finish pci_remove
>>>> (amdgpu_pci_remove for us).
>>>> For this I suggest we do something like what we worked on with
>>>> Takashi Iwai, the ALSA maintainer, recently when he helped
>>>> implement PCI BAR move support for snd_hda_intel. Take a look at
>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=cbaa324799718e2b828a8c7b5b001dd896748497
>>>> and
>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=e36365d9ab5bbc30bdc221ab4b3437de34492440
>>>> We had the same issue there: how to prevent MMIO accesses while
>>>> the BARs are migrating. What was done there is that a refcount was
>>>> added to count all IOCTLs in flight; the BAR migration handler would
>>>> block until that refcount dropped to 0 before proceeding, and any
>>>> later IOCTL would stop and wait while the device was in the
>>>> migration state.
>>>> For unplug we don't even need the wait part, there is nothing to
>>>> wait for; we just
>>>> return -ENODEV in this case.
>>>>
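>>>> A sketch of that refcount scheme adapted to unplug (all names here
>>>> are hypothetical, loosely modelled on the snd_hda_intel patches
>>>> above):
>>>>
>>>>     /* Called at the top of every IOCTL. */
>>>>     static int amdgpu_ioctl_begin(struct amdgpu_device *adev)
>>>>     {
>>>>             atomic_inc(&adev->ioctls_in_flight);
>>>>             if (atomic_read(&adev->unplugged)) {
>>>>                     /* Nothing to wait for, the device is gone. */
>>>>                     if (atomic_dec_and_test(&adev->ioctls_in_flight))
>>>>                             wake_up(&adev->ioctl_drain);
>>>>                     return -ENODEV;
>>>>             }
>>>>             return 0;
>>>>     }
>>>>
>>>>     /* Called on every IOCTL exit path. */
>>>>     static void amdgpu_ioctl_end(struct amdgpu_device *adev)
>>>>     {
>>>>             if (atomic_dec_and_test(&adev->ioctls_in_flight))
>>>>                     wake_up(&adev->ioctl_drain);
>>>>     }
>>>>
>>>>     /* In amdgpu_pci_remove(): flag the device, then drain. */
>>>>     atomic_set(&adev->unplugged, 1);
>>>>     wait_event(adev->ioctl_drain,
>>>>                atomic_read(&adev->ioctls_in_flight) == 0);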
>>>
>>> This is essentially what the DRM SRCU is doing as well.
>>>
>>> For the hotplug case we could do this at the top level, since we can
>>> signal the fence and don't need to block memory management.
>>
>>
>> To make SRCU 'wait for' all IOCTLs in flight we would need to wrap
>> every IOCTL (practically just the drm_ioctl function) with
>> drm_dev_enter/drm_dev_exit - can we do that?
>>
Sorry, I totally missed this question.
Yes, exactly that. As discussed, for the hotplug case we can do this.
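
For reference, roughly what that wrapping could look like (sketch only;
drm_dev_enter()/drm_dev_exit() are the existing SRCU helpers, the inner
dispatch function is hypothetical):

    long drm_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
    {
            struct drm_file *file_priv = filp->private_data;
            struct drm_device *dev = file_priv->minor->dev;
            long ret;
            int idx;

            if (!drm_dev_enter(dev, &idx))
                    return -ENODEV;    /* device already unplugged */

            ret = drm_ioctl_dispatch(filp, cmd, arg);    /* hypothetical */

            drm_dev_exit(idx);
            return ret;
    }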
>>
>>>
>>> But I'm not sure, maybe we should handle it the same way as reset or
>>> maybe we should have it at the top level.
>>
>>
>> If by top level you mean checking for device unplugged and bailing
>> out at the entry to the IOCTL, or right at the start of any
>> work_item/timer function we have, then that seems better and clearer
>> to me. Once we have flushed all of the ones in flight, there is no
>> reason for them to execute any more when the device is unplugged.
>>
Well, I'm open to both approaches. I just think having drm_dev_enter/exit
on each work item would be more defensive in case we forget to
cancel/sync one.
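
Roughly, each work item handler would then be guarded like this (a
sketch; the work item itself is made up):

    static void amdgpu_some_work_handler(struct work_struct *work)
    {
            struct amdgpu_device *adev =
                    container_of(work, struct amdgpu_device,
                                 some_work.work);
            int idx;

            if (!drm_dev_enter(adev_to_drm(adev), &idx))
                    return;    /* device gone, nothing left to do */

            /* ... hardware access ... */

            drm_dev_exit(idx);
    }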
Christian.
>> Andrey
>>
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>> The above approach should allow us to wait for all the IOCTLs in
>>>> flight; together with stopping the scheduler threads and cancelling
>>>> and flushing all in-flight work items and timers, I think it should
>>>> give us a full solution for the hot-unplug case
>>>> of preventing any MMIO accesses post device pci_remove.
>>>>
>>>> Let me know what you think guys.
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>> And then testing, testing, testing to see if we have missed something.
>>>>>
>>>>> Christian.
>>>>>
>>>>> On 05.04.21 at 19:58, Andrey Grodzovsky wrote:
>>>>>>
>>>>>> Dennis, Christian, are there any updates on the plan for how to
>>>>>> move on with this? As you know I need very similar code for my
>>>>>> upstreaming of device hot-unplug. My latest solution
>>>>>> (https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html)
>>>>>> was not acceptable because of the low-level guards at the register
>>>>>> accessor level, which were hurting performance. Basically I need a
>>>>>> way to prevent any MMIO write accesses from the kernel driver after
>>>>>> the device is removed (UMD accesses are taken care of by page
>>>>>> faulting to a dummy page). We are now using the hot-unplug code for
>>>>>> the Freemont program, and so upstreaming became more of a priority
>>>>>> than before. This MMIO access issue is currently my main blocker
>>>>>> for upstreaming. Is there any way I can assist in pushing this on?
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>> On 2021-03-18 5:51 a.m., Christian König wrote:
>>>>>>> On 18.03.21 at 10:30, Li, Dennis wrote:
>>>>>>>>
>>>>>>>> >>> The GPU reset doesn't complete the fences we wait for. It
>>>>>>>> only completes the hardware fences as part of the reset.
>>>>>>>>
>>>>>>>> >>> So waiting for a fence while holding the reset lock is
>>>>>>>> illegal and needs to be avoided.
>>>>>>>>
>>>>>>>> I understand your concern. It is more complex for DRM GFX,
>>>>>>>> therefore I am abandoning the lock protection for DRM ioctls for
>>>>>>>> now. Maybe we can try to put all kernel dma_fence waits in a
>>>>>>>> list and signal them all in the recovery thread. Do you have the
>>>>>>>> same concern for the compute cases?
>>>>>>>>
>>>>>>>
>>>>>>> Yes, compute (KFD) is even harder to handle.
>>>>>>>
>>>>>>> See you can't signal the dma_fence waiting. Waiting for a
>>>>>>> dma_fence also means you wait for the GPU reset to finish.
>>>>>>>
>>>>>>> When we would signal the dma_fence during the GPU reset then we
>>>>>>> would run into memory corruption because the hardware jobs
>>>>>>> running after the GPU reset would access memory which is already
>>>>>>> freed.
>>>>>>>
>>>>>>>> >>> Lockdep also complains about this when it is used
>>>>>>>> correctly. The only reason it doesn't complain here is because
>>>>>>>> you use an atomic+wait_event instead of a locking primitive.
>>>>>>>>
>>>>>>>> Agreed. This approach escapes lockdep's monitoring. Its
>>>>>>>> goal is to block other threads when the GPU recovery thread
>>>>>>>> starts, but I couldn't find a better method to solve this
>>>>>>>> problem. Do you have a suggestion?
>>>>>>>>
>>>>>>>
>>>>>>> Well, completely abandon those changes here.
>>>>>>>
>>>>>>> What we need to do is to identify where hardware access happens
>>>>>>> and then take the read side of the GPU reset lock there, so
>>>>>>> that we don't wait for a dma_fence or allocate memory, but still
>>>>>>> protect the hardware from concurrent access and reset.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Best Regards
>>>>>>>>
>>>>>>>> Dennis Li
>>>>>>>>
>>>>>>>> *From:* Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>> *Sent:* Thursday, March 18, 2021 4:59 PM
>>>>>>>> *To:* Li, Dennis <Dennis.Li at amd.com>;
>>>>>>>> amd-gfx at lists.freedesktop.org; Deucher, Alexander
>>>>>>>> <Alexander.Deucher at amd.com>; Kuehling, Felix
>>>>>>>> <Felix.Kuehling at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>
>>>>>>>> *Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to
>>>>>>>> enhance its stability
>>>>>>>>
>>>>>>>> Exactly that's what you don't seem to understand.
>>>>>>>>
>>>>>>>> The GPU reset doesn't complete the fences we wait for. It only
>>>>>>>> completes the hardware fences as part of the reset.
>>>>>>>>
>>>>>>>> So waiting for a fence while holding the reset lock is illegal
>>>>>>>> and needs to be avoided.
>>>>>>>>
>>>>>>>> Lockdep also complains about this when it is used correctly.
>>>>>>>> The only reason it doesn't complain here is because you use an
>>>>>>>> atomic+wait_event instead of a locking primitive.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> *From:* Li, Dennis <Dennis.Li at amd.com <mailto:Dennis.Li at amd.com>>
>>>>>>>> *Sent:* Thursday, March 18, 2021 09:28
>>>>>>>> *To:* Koenig, Christian <Christian.Koenig at amd.com
>>>>>>>> <mailto:Christian.Koenig at amd.com>>;
>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>> <mailto:amd-gfx at lists.freedesktop.org>
>>>>>>>> <amd-gfx at lists.freedesktop.org
>>>>>>>> <mailto:amd-gfx at lists.freedesktop.org>>; Deucher, Alexander
>>>>>>>> <Alexander.Deucher at amd.com <mailto:Alexander.Deucher at amd.com>>;
>>>>>>>> Kuehling, Felix <Felix.Kuehling at amd.com
>>>>>>>> <mailto:Felix.Kuehling at amd.com>>; Zhang, Hawking
>>>>>>>> <Hawking.Zhang at amd.com <mailto:Hawking.Zhang at amd.com>>
>>>>>>>> *Subject:* RE: [PATCH 0/4] Refine GPU recovery sequence to
>>>>>>>> enhance its stability
>>>>>>>>
>>>>>>>> >>> Those two steps need to be exchanged or otherwise it is
>>>>>>>> possible that new delayed work items etc are started before the
>>>>>>>> lock is taken.
>>>>>>>> What about adding a check for adev->in_gpu_reset in the work
>>>>>>>> item? If we exchange the two steps, that may introduce a
>>>>>>>> deadlock. For example, if a user thread holds the read lock and
>>>>>>>> waits for a fence while the recovery thread tries to take the
>>>>>>>> write lock before completing the fences, the recovery thread
>>>>>>>> will be blocked forever.
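>>>>>>>>
>>>>>>>> Spelled out, the problematic interleaving would be:
>>>>>>>>
>>>>>>>>     user thread                      recovery thread
>>>>>>>>     -----------                      ---------------
>>>>>>>>     down_read(&adev->reset_sem);
>>>>>>>>     dma_fence_wait(fence, ...);      down_write(&adev->reset_sem);
>>>>>>>>     /* sleeps until the fence        /* sleeps until all readers
>>>>>>>>        signals */                       are gone */
>>>>>>>>                                      /* the fence would only be
>>>>>>>>                                         signalled in step 1, which
>>>>>>>>                                         is never reached */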
>>>>>>>>
>>>>>>>>
>>>>>>>> Best Regards
>>>>>>>> Dennis Li
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com
>>>>>>>> <mailto:Christian.Koenig at amd.com>>
>>>>>>>> Sent: Thursday, March 18, 2021 3:54 PM
>>>>>>>> To: Li, Dennis <Dennis.Li at amd.com <mailto:Dennis.Li at amd.com>>;
>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>> <mailto:amd-gfx at lists.freedesktop.org>; Deucher, Alexander
>>>>>>>> <Alexander.Deucher at amd.com <mailto:Alexander.Deucher at amd.com>>;
>>>>>>>> Kuehling, Felix <Felix.Kuehling at amd.com
>>>>>>>> <mailto:Felix.Kuehling at amd.com>>; Zhang, Hawking
>>>>>>>> <Hawking.Zhang at amd.com <mailto:Hawking.Zhang at amd.com>>
>>>>>>>> Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to
>>>>>>>> enhance its stability
>>>>>>>>
>>>>>>>> On 18.03.21 at 08:23, Dennis Li wrote:
>>>>>>>> > We have defined two variables, in_gpu_reset and reset_sem, in
>>>>>>>> the adev object. The atomic variable in_gpu_reset is used to
>>>>>>>> avoid recovery thread reentry and to make the lower level
>>>>>>>> functions return
>>>>>>>> earlier when recovery starts, but it cannot block the recovery
>>>>>>>> thread when it accesses hardware. The r/w semaphore reset_sem is
>>>>>>>> used to solve the synchronization issues between the recovery
>>>>>>>> thread and other threads.
>>>>>>>> >
>>>>>>>> > The original solution locked register access in the lower
>>>>>>>> level functions, which introduces the following issues:
>>>>>>>> >
>>>>>>>> > 1) Many lower level functions are used by both the recovery
>>>>>>>> thread and others. Firstly we must find all of these functions,
>>>>>>>> and it is easy to
>>>>>>>> miss some. Secondly these functions need to select which lock
>>>>>>>> (read lock or write lock) to use according to the thread
>>>>>>>> they run in. If the thread context isn't considered, the
>>>>>>>> added lock will easily introduce deadlocks. Besides that,
>>>>>>>> developers easily forget to add locks for new functions.
>>>>>>>> >
>>>>>>>> > 2) Performance drop. The lower level functions are called much
>>>>>>>> more frequently.
>>>>>>>> >
>>>>>>>> > 3) It easily introduces false positive lockdep complaints,
>>>>>>>> because the
>>>>>>>> write lock has a big range in the recovery thread, while the low
>>>>>>>> level functions holding the read lock may be protected by other
>>>>>>>> locks in other threads.
>>>>>>>> >
>>>>>>>> > Therefore the new solution tries to add lock protection
>>>>>>>> for the kfd ioctls. Its goal is that no threads except for the
>>>>>>>> recovery thread or its children (for xgmi) access
>>>>>>>> hardware while doing GPU reset and resume. So the recovery
>>>>>>>> thread is refined as the following steps (a condensed code
>>>>>>>> skeleton follows the step list):
>>>>>>>> >
>>>>>>>> > Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
>>>>>>>> > 1). if it fails, it means the system already has a recovery
>>>>>>>> thread running; the current thread exits directly;
>>>>>>>> > 2). if it succeeds, enter the recovery thread;
>>>>>>>> >
>>>>>>>> > Step 1: cancel all delayed works, stop the drm scheduler,
>>>>>>>> complete
>>>>>>>> all unreceived fences and so on. This tries to stop or pause the
>>>>>>>> other threads.
>>>>>>>> >
>>>>>>>> > Step 2: call down_write(&adev->reset_sem) to take the write
>>>>>>>> lock,
>>>>>>>> which blocks the recovery thread until the other threads release
>>>>>>>> their read locks.
>>>>>>>>
>>>>>>>> Those two steps need to be exchanged or otherwise it is
>>>>>>>> possible that new delayed work items etc are started before the
>>>>>>>> lock is taken.
>>>>>>>>
>>>>>>>> Just to make it clear: until this is fixed, the whole patch
>>>>>>>> set is a NAK.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> >
>>>>>>>> > Step 3: normally only the recovery thread is now running and
>>>>>>>> accessing hardware, so it is safe to do the gpu reset.
>>>>>>>> >
>>>>>>>> > Step 4: do the post gpu reset work, such as calling all IPs'
>>>>>>>> resume
>>>>>>>> functions;
>>>>>>>> >
>>>>>>>> > Step 5: atomically set adev->in_gpu_reset back to 0, wake up
>>>>>>>> the other
>>>>>>>> threads and release the write lock. The recovery thread exits
>>>>>>>> normally.
>>>>>>>> >
>>>>>>>> > Other threads call amdgpu_read_lock to synchronize with the
>>>>>>>> recovery thread. If a thread finds that in_gpu_reset is 1, it
>>>>>>>> releases the read lock if it holds one and then blocks itself
>>>>>>>> to wait for the recovery finished event. If a thread successfully
>>>>>>>> takes the read lock and in_gpu_reset is 0, it continues. It will
>>>>>>>> exit normally or be stopped by the recovery thread in step 1.
>>>>>>>> >
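>>>>>>>> > A condensed skeleton of the sequence above (error handling and
>>>>>>>> > the xgmi hive details omitted; the stop helper and reset_wait
>>>>>>>> > wait queue are hypothetical):
>>>>>>>> >
>>>>>>>> >     int amdgpu_device_gpu_recover(struct amdgpu_device *adev)
>>>>>>>> >     {
>>>>>>>> >             /* Step 0: only one recovery thread at a time. */
>>>>>>>> >             if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
>>>>>>>> >                     return 0;
>>>>>>>> >
>>>>>>>> >             /* Step 1: stop the producers of new work. */
>>>>>>>> >             amdgpu_stop_other_threads(adev);  /* hypothetical */
>>>>>>>> >
>>>>>>>> >             /* Step 2: wait for the readers to drain. */
>>>>>>>> >             down_write(&adev->reset_sem);
>>>>>>>> >
>>>>>>>> >             /* Steps 3 + 4: reset the ASIC, resume the IPs. */
>>>>>>>> >             amdgpu_do_asic_reset(adev);       /* simplified */
>>>>>>>> >
>>>>>>>> >             /* Step 5: let the other threads back in. */
>>>>>>>> >             atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>> >             up_write(&adev->reset_sem);
>>>>>>>> >             wake_up_all(&adev->reset_wait);   /* hypothetical */
>>>>>>>> >             return 0;
>>>>>>>> >     }
>>>>>>>> >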
>>>>>>>> > Dennis Li (4):
>>>>>>>> > drm/amdgpu: remove reset lock from low level functions
>>>>>>>> > drm/amdgpu: refine the GPU recovery sequence
>>>>>>>> > drm/amdgpu: instead of using down/up_read directly
>>>>>>>> > drm/amdkfd: add reset lock protection for kfd entry functions
>>>>>>>> >
>>>>>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 6 +
>>>>>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +-
>>>>>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 173
>>>>>>>> +++++++++++++-----
>>>>>>>> > .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 8 -
>>>>>>>> > drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 4 +-
>>>>>>>> > drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 9 +-
>>>>>>>> > drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 5 +-
>>>>>>>> > drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 5 +-
>>>>>>>> > drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 172 ++++++++++++++++-
>>>>>>>> > drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 3 +-
>>>>>>>> > drivers/gpu/drm/amd/amdkfd/kfd_process.c | 4 +
>>>>>>>> > .../amd/amdkfd/kfd_process_queue_manager.c | 17 ++
>>>>>>>> > 12 files changed, 345 insertions(+), 75 deletions(-)
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>>>>