[PATCH v4 00/14] RFC Support hot device unplug in amdgpu

Tue Feb 9 15:34:32 UTC 2021

On 2/9/21 4:50 AM, Daniel Vetter wrote:
> On Mon, Feb 08, 2021 at 11:01:14PM -0500, Andrey Grodzovsky wrote:
>> On 2/8/21 2:27 AM, Daniel Vetter wrote:
>>> On Mon, Feb 8, 2021 at 6:59 AM Andrey Grodzovsky
>>> <Andrey.Grodzovsky at amd.com> wrote:
>>>> On 1/20/21 10:59 AM, Daniel Vetter wrote:
>>>>> On Wed, Jan 20, 2021 at 3:20 PM Andrey Grodzovsky
>>>>> <Andrey.Grodzovsky at amd.com> wrote:
>>>>>> On 1/20/21 4:05 AM, Daniel Vetter wrote:
>>>>>>> On Tue, Jan 19, 2021 at 01:18:15PM -0500, Andrey Grodzovsky wrote:
>>>>>>>> On 1/19/21 1:08 PM, Daniel Vetter wrote:
>>>>>>>>> On Tue, Jan 19, 2021 at 6:31 PM Andrey Grodzovsky
>>>>>>>>> <Andrey.Grodzovsky at amd.com> wrote:
>>>>>>>>>> On 1/19/21 9:16 AM, Daniel Vetter wrote:
>>>>>>>>>>> On Mon, Jan 18, 2021 at 04:01:09PM -0500, Andrey Grodzovsky wrote:
>>>>>>>>>>>> Until now extracting a card either by physical extraction (e.g. eGPU with
>>>>>>>>>>>> thunderbolt connection or by emulation through  syfs -> /sys/bus/pci/devices/device_id/remove)
>>>>>>>>>>>> would cause random crashes in user apps. The random crashes in apps were
>>>>>>>>>>>> mostly due to the app having mapped a device backed BO into its address
>>>>>>>>>>>> space was still trying to access the BO while the backing device was gone.
>>>>>>>>>>>> To answer this first problem Christian suggested to fix the handling of mapped
>>>>>>>>>>>> memory in the clients when the device goes away by forcibly unmap all buffers the
>>>>>>>>>>>> user processes has by clearing their respective VMAs mapping the device BOs.
>>>>>>>>>>>> Then when the VMAs try to fill in the page tables again we check in the fault
>>>>>>>>>>>> handlerif the device is removed and if so, return an error. This will generate a
>>>>>>>>>>>> SIGBUS to the application which can then cleanly terminate.This indeed was done
>>>>>>>>>>>> but this in turn created a problem of kernel OOPs were the OOPSes were due to the
>>>>>>>>>>>> fact that while the app was terminating because of the SIGBUSit would trigger use
>>>>>>>>>>>> after free in the driver by calling to accesses device structures that were already
>>>>>>>>>>>> released from the pci remove sequence.This was handled by introducing a 'flush'
>>>>>>>>>>>> sequence during device removal were we wait for drm file reference to drop to 0
>>>>>>>>>>>> meaning all user clients directly using this device terminated.
>>>>>>>>>>>>
>>>>>>>>>>>> v2:
>>>>>>>>>>>> Based on discussions in the mailing list with Daniel and Pekka [1] and based on the document
>>>>>>>>>>>> produced by Pekka from those discussions [2] the whole approach with returning SIGBUS and
>>>>>>>>>>>> waiting for all user clients having CPU mapping of device BOs to die was dropped.
>>>>>>>>>>>> Instead as per the document suggestion the device structures are kept alive until
>>>>>>>>>>>> the last reference to the device is dropped by user client and in the meanwhile all existing and new CPU mappings of the BOs
>>>>>>>>>>>> belonging to the device directly or by dma-buf import are rerouted to per user
>>>>>>>>>>>> process dummy rw page.Also, I skipped the 'Requirements for KMS UAPI' section of [2]
>>>>>>>>>>>> since i am trying to get the minimal set of requirements that still give useful solution
>>>>>>>>>>>> to work and this is the'Requirements for Render and Cross-Device UAPI' section and so my
>>>>>>>>>>>> test case is removing a secondary device, which is render only and is not involved
>>>>>>>>>>>> in KMS.
>>>>>>>>>>>>
>>>>>>>>>>>> v3:
>>>>>>>>>>>> More updates following comments from v2 such as removing loop to find DRM file when rerouting
>>>>>>>>>>>> page faults to dummy page,getting rid of unnecessary sysfs handling refactoring and moving
>>>>>>>>>>>> prevention of GPU recovery post device unplug from amdgpu to scheduler layer.
>>>>>>>>>>>> On top of that added unplug support for the IOMMU enabled system.
>>>>>>>>>>>>
>>>>>>>>>>>> v4:
>>>>>>>>>>>> Drop last sysfs hack and use sysfs default attribute.
>>>>>>>>>>>> Guard against write accesses after device removal to avoid modifying released memory.
>>>>>>>>>>>> Update dummy pages handling to on demand allocation and release through drm managed framework.
>>>>>>>>>>>> Add return value to scheduler job TO handler (by Luben Tuikov) and use this in amdgpu for prevention
>>>>>>>>>>>> of GPU recovery post device unplug
>>>>>>>>>>>> Also rebase on top of drm-misc-mext instead of amd-staging-drm-next
>>>>>>>>>>>>
>>>>>>>>>>>> With these patches I am able to gracefully remove the secondary card using sysfs remove hook while glxgears
>>>>>>>>>>>> is running off of secondary card (DRI_PRIME=1) without kernel oopses or hangs and keep working
>>>>>>>>>>>> with the primary card or soft reset the device without hangs or oopses
>>>>>>>>>>>>
>>>>>>>>>>>> TODOs for followup work:
>>>>>>>>>>>> Convert AMDGPU code to use devm (for hw stuff) and drmm (for sw stuff and allocations) (Daniel)
>>>>>>>>>>>> Support plugging the secondary device back after unplug - currently still experiencing HW error on plugging back.
>>>>>>>>>>>> Add support for 'Requirements for KMS UAPI' section of [2] - unplugging primary, display connected card.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] - Discussions during v3 of the patchset https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.spinics.net%2Flists%2Famd-gfx%2Fmsg55576.html&data=04%7C01%7CAndrey.Grodzovsky%40amd.com%7C49a8be16944441aca3ce08d8cce01c88%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637484610239738569%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=PkvEkJt2zQ9G0dpiGYtSuVLpziwQ0FklkjdMLV2ZOnE%3D&reserved=0
>>>>>>>>>>>> [2] - drm/doc: device hot-unplug for userspace https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.spinics.net%2Flists%2Fdri-devel%2Fmsg259755.html&data=04%7C01%7CAndrey.Grodzovsky%40amd.com%7C49a8be16944441aca3ce08d8cce01c88%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637484610239738569%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=akobkV%2FtjcDck7sHlGbgE5n1FCyGmcklhu%2Bx3goGRuw%3D&reserved=0
>>>>>>>>>>>> [3] - Related gitlab ticket https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1081&data=04%7C01%7CAndrey.Grodzovsky%40amd.com%7C49a8be16944441aca3ce08d8cce01c88%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637484610239738569%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=P0O5VHLkRczar2XHXyM8UysK%2BaD4hpUuoT%2FCBCnOW2g%3D&reserved=0
>>>>>>>>>>> btw have you tried this out with some of the igts we have? core_hotunplug
>>>>>>>>>>> is the one I'm thinking of. Might be worth to extend this for amdgpu
>>>>>>>>>>> specific stuff (like run some batches on it while hotunplugging).
>>>>>>>>>> No, I mostly used just running glxgears while testing which covers already
>>>>>>>>>> exported/imported dma-buf case and a few manually hacked tests in libdrm amdgpu
>>>>>>>>>> test suite
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Since there's so many corner cases we need to test here (shared dma-buf,
>>>>>>>>>>> shared dma_fence) I think it would make sense to have a shared testcase
>>>>>>>>>>> across drivers.
>>>>>>>>>> Not familiar with IGT too much, is there an easy way to setup shared dma bufs
>>>>>>>>>> and fences
>>>>>>>>>> use cases there or you mean I need to add them now ?
>>>>>>>>> We do have test infrastructure for all of that, but the hotunplug test
>>>>>>>>> doesn't have that yet I think.
>>>>>>>>>
>>>>>>>>>>> Only specific thing would be some hooks to keep the gpu
>>>>>>>>>>> busy in some fashion while we yank the driver.
>>>>>>>>>> Do you mean like staring X and some active rendering on top (like glxgears)
>>>>>>>>>> automatically from within IGT ?
>>>>>>>>> Nope, igt is meant to be bare metal testing so you don't have to drag
>>>>>>>>> the entire winsys around (which in a wayland world, is not really good
>>>>>>>>> for driver testing anyway, since everything is different). We use this
>>>>>>>>> for our pre-merge ci for drm/i915.
>>>>>>>> So i keep it busy by X/glxgers which is manual operation. What you suggest
>>>>>>>> then is some client within IGT which opens the device and starts submitting jobs
>>>>>>>> (which is much like what libdrm amdgpu tests already do) ? And this
>>>>>>>> part is the amdgou specific code I just need to port from libdrm to here ?
>>>>>>> Yup. For i915 tests we have an entire library already for small workloads,
>>>>>>> including some that just spin forever (useful for reset testing and could
>>>>>>> also come handy for unload testing).
>>>>>>> -Daniel
>>>>>> Does it mean I would have to drag in the entire infrastructure code from
>>>>>> within libdrm amdgpu code that allows for command submissions through
>>>>>> our IOCTLs ?
>>>>> No it's perfectly fine to use libdrm in igt tests, we do that too. I
>>>>> just mean we have some additional helpers to submit specific workloads
>>>>> for intel gpu, like rendercpy to move data with the 3d engine (just
>>>>> using copy engines only isn't good enough sometimes for testing), or
>>>>> the special hanging batchbuffers we use for reset testing, or in
>>>>> general for having precise control over race conditions and things
>>>>> like that.
>>>>>
>>>>> One thing that was somewhat annoying for i915 but shouldn't be a
>>>>> problem for amdgpu is that igt builds on intel. So we have stub
>>>>> functions for libdrm-intel, since libdrm-intel doesn't build on arm.
>>>>> Shouldn't be a problem for you.
>>>>> -Daniel
>>>> Tested with igt hot-unplug test. Passed unbind_rebind, unplug-rescan,
>>>> hot-unbind-rebind and hotunplug-rescan
>>>> if disabling the rescan part as I don't support plug-back for now. Also added
>>>> command submission for amdgpu.
>>>> Attached a draft of submitting workload while unbinding the driver or simulating
>>>> detach. Catched 2 issues with unpug if command submission in flight  during
>>>> unplug -
>>>> (unsignaled fence causing a hang in amdgpu_cs_sync and hitting a BUG_ON in
>>>> gfx_v9_0_ring_emit_patch_cond_exec whic is expected i guess).
>>>> Guess glxgears command submissions is at a much slower rate so this was missed.
>>>> Is that what you meant for this test ?
>>> Yup. Would be good if you can submit this one for inclusion.
>>> -Daniel
>>
>> Will do together with exported dma-buf test once I do it.
>>
>> P.S How am i supposed to do exported fence test. Exporting a fence from
>> device A, importing it into device B, unplugging
>> device A then signaling the fence from device B - this supposed to call a
>> fence cb which was registered
>> by the exporter which by now is dead and hence will cause a 'use after free' ?
> Yeah in the end we'd need 2 hw devices for testing full fence
> functionality. A useful intermediate step would be to just export the
> fence (either as sync_file, which I think amdgpu doesn't support because
> no android egl support in mesa) or drm_syncobj (which you can do as
> standalone fd too iirc), and then just using the fence a bit from
> userspace (like wait on it or get its status) after the device is
> unplugged.
>
> I think this should cover most of the cross-driver issues that fences
> bring in, and I think for the other problems we can worry once we spot.
> -Daniel


OK, will write up all the tests and submit a merge request for all of them 
together to IGT gitlab

Andrey


>
>> Andrey
>>
>>>> Andrey
>>>>
>>>>
>>>>>> Andrey
>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>>>> But just to get it started
>>>>>>>>>>> you can throw in entirely amdgpu specific subtests and just share some of
>>>>>>>>>>> the test code.
>>>>>>>>>>> -Daniel
>>>>>>>>>> Im general, I wasn't aware of this test suite and looks like it does what i test
>>>>>>>>>> among other stuff.
>>>>>>>>>> I will definitely  try to run with it although the rescan part will not work as
>>>>>>>>>> plugging
>>>>>>>>>> the device back is in my TODO list and not part of the scope for this patchset
>>>>>>>>>> and so I will
>>>>>>>>>> probably comment the re-scan section out while testing.
>>>>>>>>> amd gem has been using libdrm-amd thus far iirc, but for things like
>>>>>>>>> this I think it'd be worth to at least consider switching. Display
>>>>>>>>> team has already started to use some of the test and contribute stuff
>>>>>>>>> (I think the VRR testcase is from amd).
>>>>>>>>> -Daniel
>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> Andrey Grodzovsky (13):
>>>>>>>>>>>>         drm/ttm: Remap all page faults to per process dummy page.
>>>>>>>>>>>>         drm: Unamp the entire device address space on device unplug
>>>>>>>>>>>>         drm/ttm: Expose ttm_tt_unpopulate for driver use
>>>>>>>>>>>>         drm/sched: Cancel and flush all oustatdning jobs before finish.
>>>>>>>>>>>>         drm/amdgpu: Split amdgpu_device_fini into early and late
>>>>>>>>>>>>         drm/amdgpu: Add early fini callback
>>>>>>>>>>>>         drm/amdgpu: Register IOMMU topology notifier per device.
>>>>>>>>>>>>         drm/amdgpu: Fix a bunch of sdma code crash post device unplug
>>>>>>>>>>>>         drm/amdgpu: Remap all page faults to per process dummy page.
>>>>>>>>>>>>         dmr/amdgpu: Move some sysfs attrs creation to default_attr
>>>>>>>>>>>>         drm/amdgpu: Guard against write accesses after device removal
>>>>>>>>>>>>         drm/sched: Make timeout timer rearm conditional.
>>>>>>>>>>>>         drm/amdgpu: Prevent any job recoveries after device is unplugged.
>>>>>>>>>>>>
>>>>>>>>>>>> Luben Tuikov (1):
>>>>>>>>>>>>         drm/scheduler: Job timeout handler returns status
>>>>>>>>>>>>
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu.h               |  11 +-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c      |  17 +--
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_device.c        | 149 ++++++++++++++++++++--
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c           |  20 ++-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c         |  15 ++-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c          |   2 +-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h          |   1 +
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c           |   9 ++
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c       |  25 ++--
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c           |  26 ++--
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h           |   3 +-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_job.c           |  19 ++-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c           |  12 +-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_object.c        |  10 ++
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_object.h        |   2 +
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c           |  53 +++++---
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h           |   3 +
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c           |   1 +
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c          |  70 ++++++++++
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h          |  52 +-------
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c           |  21 ++-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c            |   8 +-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c      |  14 +-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/cik_ih.c               |   2 +-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/cz_ih.c                |   2 +-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/iceland_ih.c           |   2 +-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/navi10_ih.c            |   2 +-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/psp_v11_0.c            |  16 +--
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/psp_v12_0.c            |   8 +-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/psp_v3_1.c             |   8 +-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/si_ih.c                |   2 +-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/tonga_ih.c             |   2 +-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/vega10_ih.c            |   2 +-
>>>>>>>>>>>>        drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  12 +-
>>>>>>>>>>>>        drivers/gpu/drm/amd/include/amd_shared.h          |   2 +
>>>>>>>>>>>>        drivers/gpu/drm/drm_drv.c                         |   3 +
>>>>>>>>>>>>        drivers/gpu/drm/etnaviv/etnaviv_sched.c           |  10 +-
>>>>>>>>>>>>        drivers/gpu/drm/lima/lima_sched.c                 |   4 +-
>>>>>>>>>>>>        drivers/gpu/drm/panfrost/panfrost_job.c           |   9 +-
>>>>>>>>>>>>        drivers/gpu/drm/scheduler/sched_main.c            |  18 ++-
>>>>>>>>>>>>        drivers/gpu/drm/ttm/ttm_bo_vm.c                   |  82 +++++++++++-
>>>>>>>>>>>>        drivers/gpu/drm/ttm/ttm_tt.c                      |   1 +
>>>>>>>>>>>>        drivers/gpu/drm/v3d/v3d_sched.c                   |  32 ++---
>>>>>>>>>>>>        include/drm/gpu_scheduler.h                       |  17 ++-
>>>>>>>>>>>>        include/drm/ttm/ttm_bo_api.h                      |   2 +
>>>>>>>>>>>>        45 files changed, 583 insertions(+), 198 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> 2.7.4
>>>>>>>>>>>>
>>>