[PATCH v4 07/14] drm/amdgpu: Register IOMMU topology notifier per device.

Thu Jan 21 10:48:40 UTC 2021

On Wed, Jan 20, 2021 at 8:16 PM Andrey Grodzovsky
<Andrey.Grodzovsky at amd.com> wrote:
>
>
> On 1/20/21 3:38 AM, Daniel Vetter wrote:
> > On Wed, Jan 20, 2021 at 5:21 AM Andrey Grodzovsky
> > <Andrey.Grodzovsky at amd.com> wrote:
> >>
> >> On 1/19/21 5:01 PM, Daniel Vetter wrote:
> >>> On Tue, Jan 19, 2021 at 10:22 PM Andrey Grodzovsky
> >>> <Andrey.Grodzovsky at amd.com> wrote:
> >>>> On 1/19/21 8:45 AM, Daniel Vetter wrote:
> >>>>
> >>>> On Tue, Jan 19, 2021 at 09:48:03AM +0100, Christian König wrote:
> >>>>
> >>>> Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:
> >>>>
> >>>> Handle all DMA IOMMU gropup related dependencies before the
> >>>> group is removed.
> >>>>
> >>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
> >>>> ---
> >>>>     drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  5 ++++
> >>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 46 ++++++++++++++++++++++++++++++
> >>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c   |  2 +-
> >>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h   |  1 +
> >>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 10 +++++++
> >>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  2 ++
> >>>>     6 files changed, 65 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>>> index 478a7d8..2953420 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>>> @@ -51,6 +51,7 @@
> >>>>     #include <linux/dma-fence.h>
> >>>>     #include <linux/pci.h>
> >>>>     #include <linux/aer.h>
> >>>> +#include <linux/notifier.h>
> >>>>     #include <drm/ttm/ttm_bo_api.h>
> >>>>     #include <drm/ttm/ttm_bo_driver.h>
> >>>> @@ -1041,6 +1042,10 @@ struct amdgpu_device {
> >>>>     bool                            in_pci_err_recovery;
> >>>>     struct pci_saved_state          *pci_state;
> >>>> +
> >>>> + struct notifier_block nb;
> >>>> + struct blocking_notifier_head notifier;
> >>>> + struct list_head device_bo_list;
> >>>>     };
> >>>>     static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>> index 45e23e3..e99f4f1 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>> @@ -70,6 +70,8 @@
> >>>>     #include <drm/task_barrier.h>
> >>>>     #include <linux/pm_runtime.h>
> >>>> +#include <linux/iommu.h>
> >>>> +
> >>>>     MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
> >>>>     MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
> >>>>     MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
> >>>> @@ -3200,6 +3202,39 @@ static const struct attribute *amdgpu_dev_attributes[] = {
> >>>>     };
> >>>> +static int amdgpu_iommu_group_notifier(struct notifier_block *nb,
> >>>> +     unsigned long action, void *data)
> >>>> +{
> >>>> + struct amdgpu_device *adev = container_of(nb, struct amdgpu_device, nb);
> >>>> + struct amdgpu_bo *bo = NULL;
> >>>> +
> >>>> + /*
> >>>> + * Following is a set of IOMMU group dependencies taken care of before
> >>>> + * device's IOMMU group is removed
> >>>> + */
> >>>> + if (action == IOMMU_GROUP_NOTIFY_DEL_DEVICE) {
> >>>> +
> >>>> + spin_lock(&ttm_bo_glob.lru_lock);
> >>>> + list_for_each_entry(bo, &adev->device_bo_list, bo) {
> >>>> + if (bo->tbo.ttm)
> >>>> + ttm_tt_unpopulate(bo->tbo.bdev, bo->tbo.ttm);
> >>>> + }
> >>>> + spin_unlock(&ttm_bo_glob.lru_lock);
> >>>>
> >>>> That approach won't work. ttm_tt_unpopulate() might sleep on an IOMMU lock.
> >>>>
> >>>> You need to use a mutex here or even better make sure you can access the
> >>>> device_bo_list without a lock in this moment.
> >>>>
> >>>> I'd also be worried about the notifier mutex getting really badly in the
> >>>> way.
> >>>>
> >>>> Plus I'm worried why we even need this, it sounds a bit like papering over
> >>>> the iommu subsystem. Assuming we clean up all our iommu mappings in our
> >>>> device hotunplug/unload code, why do we still need to have an additional
> >>>> iommu notifier on top, with all kinds of additional headaches? The iommu
> >>>> shouldn't clean up before the devices in its group have cleaned up.
> >>>>
> >>>> I think we need more info here on what the exact problem is first.
> >>>> -Daniel
> >>>>
> >>>>
> >>>> Originally I experienced the  crash bellow on IOMMU enabled device, it happens post device removal from PCI topology -
> >>>> during shutting down of user client holding last reference to drm device file (X in my case).
> >>>> The crash is because by the time I get to this point struct device->iommu_group pointer is NULL
> >>>> already since the IOMMU group for the device is unset during PCI removal. So this contradicts what you said above
> >>>> that the iommu shouldn't clean up before the devices in its group have cleaned up.
> >>>> So instead of guessing when is the right place to place all IOMMU related cleanups it makes sense
> >>>> to get notification from IOMMU subsystem in the form of event IOMMU_GROUP_NOTIFY_DEL_DEVICE
> >>>> and use that place to do all the relevant cleanups.
> >>> Yeah that goes boom, but you shouldn't need this special iommu cleanup
> >>> handler. Making sure that all the dma-api mappings are gone needs to
> >>> be done as part of the device hotunplug, you can't delay that to the
> >>> last drm_device cleanup.
> >>>
> >>> So I most of the patch here with pulling that out (should be outright
> >>> removed from the final release code even) is good, just not yet how
> >>> you call that new code. Probably these bits (aside from walking all
> >>> buffers and unpopulating the tt) should be done from the early_free
> >>> callback you're adding.
> >>>
> >>> Also what I just realized: For normal unload you need to make sure the
> >>> hw is actually stopped first, before we unmap buffers. Otherwise
> >>> driver unload will likely result in wedged hw, probably not what you
> >>> want for debugging.
> >>> -Daniel
> >> Since device removal from IOMMU group and this hook in particular
> >> takes place before call to amdgpu_pci_remove essentially it means
> >> that for IOMMU use case the entire amdgpu_device_fini_hw function
> >> shouold be called here to stop the HW instead from amdgpu_pci_remove.
> > The crash you showed was on final drm_close, which should happen after
> > device removal, so that's clearly buggy. If the iommu subsystem
> > removes stuff before the driver could clean up already, then I think
> > that's an iommu bug or dma-api bug. Just plain using dma_map/unmap and
> > friends really shouldn't need notifier hacks like you're implementing
> > here. Can you pls show me a backtrace where dma_unmap_sg blows up when
> > it's put into the pci_driver remove callback?
>
>
> It's not blowing up and has the same effect as using this notifier because setting
> of device->iommu_group pointer to NULL takes place at the same call stack but
> after amdgpu_pci_remove is called (see pci_stop_and_remove_bus_device).
> But i think that using notifier callback is better then just sticking
> the cleanup code in amdgpu_pci_remove because this is IOMMU specific cleanup and
> also coupled
> in the code to the place where device->iommu_group is unset.

Notifiers are a locking pain, plus dma_unmap_* is really just plain
normal driver cleanup work. If you want neat&automatic cleanup, look
at devm_ family of functions. There's imo really no reason to have
this notifier, and only reasons against it.

I think intermediately the cleanest solution is to put each cleanup
into a corresponding hw_fini callback (early_free is maybe a bit
ambiguous in what exactly it means).
-Daniel

> Andrey
>
>
> >
> >> Looking at this from another perspective, AFAIK on each new device probing
> >> either due to PCI bus rescan or driver reload we are resetting the ASIC before doing
> >> any init operations (assuming we successfully gained MMIO access) and so maybe
> >> your concern is not an issue ?
> > Reset on probe is too late. The problem is that if you just remove the
> > driver, your device is doing dma at that moment. And you kinda have to
> > stop that before you free the mappings/memory. Of course when the
> > device is actually hotunplugged, then dma is guaranteed to have
> > stopped already. I'm not sure whether disabling the pci device is
> > enough to make sure no more dma happens, could be that's enough.
> > -Daniel
> >
> >> Andrey
> >>
> >>
> >>>> Andrey
> >>>>
> >>>>
> >>>> [  123.810074 <   28.126960>] BUG: kernel NULL pointer dereference, address: 00000000000000c8
> >>>> [  123.810080 <    0.000006>] #PF: supervisor read access in kernel mode
> >>>> [  123.810082 <    0.000002>] #PF: error_code(0x0000) - not-present page
> >>>> [  123.810085 <    0.000003>] PGD 0 P4D 0
> >>>> [  123.810089 <    0.000004>] Oops: 0000 [#1] SMP NOPTI
> >>>> [  123.810094 <    0.000005>] CPU: 5 PID: 1418 Comm: Xorg:shlo4 Tainted: G           O      5.9.0-rc2-dev+ #59
> >>>> [  123.810096 <    0.000002>] Hardware name: System manufacturer System Product Name/PRIME X470-PRO, BIOS 4406 02/28/2019
> >>>> [  123.810105 <    0.000009>] RIP: 0010:iommu_get_dma_domain+0x10/0x20
> >>>> [  123.810108 <    0.000003>] Code: b0 48 c7 87 98 00 00 00 00 00 00 00 31 c0 c3 b8 f4 ff ff ff eb a6 0f 1f 40 00 0f 1f 44 00 00 48 8b 87 d0 02 00 00 55 48 89 e5 <48> 8b 80 c8 00 00 00 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 48
> >>>> [  123.810111 <    0.000003>] RSP: 0018:ffffa2e201f7f980 EFLAGS: 00010246
> >>>> [  123.810114 <    0.000003>] RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000000
> >>>> [  123.810116 <    0.000002>] RDX: 0000000000001000 RSI: 00000000bf5cb000 RDI: ffff93c259dc60b0
> >>>> [  123.810117 <    0.000001>] RBP: ffffa2e201f7f980 R08: 0000000000000000 R09: 0000000000000000
> >>>> [  123.810119 <    0.000002>] R10: ffffa2e201f7faf0 R11: 0000000000000001 R12: 00000000bf5cb000
> >>>> [  123.810121 <    0.000002>] R13: 0000000000001000 R14: ffff93c24cef9c50 R15: ffff93c256c05688
> >>>> [  123.810124 <    0.000003>] FS:  00007f5e5e8d3700(0000) GS:ffff93c25e940000(0000) knlGS:0000000000000000
> >>>> [  123.810126 <    0.000002>] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>> [  123.810128 <    0.000002>] CR2: 00000000000000c8 CR3: 000000027fe0a000 CR4: 00000000003506e0
> >>>> [  123.810130 <    0.000002>] Call Trace:
> >>>> [  123.810136 <    0.000006>]  __iommu_dma_unmap+0x2e/0x100
> >>>> [  123.810141 <    0.000005>]  ? kfree+0x389/0x3a0
> >>>> [  123.810144 <    0.000003>]  iommu_dma_unmap_page+0xe/0x10
> >>>> [  123.810149 <    0.000005>] dma_unmap_page_attrs+0x4d/0xf0
> >>>> [  123.810159 <    0.000010>]  ? ttm_bo_del_from_lru+0x8e/0xb0 [ttm]
> >>>> [  123.810165 <    0.000006>] ttm_unmap_and_unpopulate_pages+0x8e/0xc0 [ttm]
> >>>> [  123.810252 <    0.000087>] amdgpu_ttm_tt_unpopulate+0xaa/0xd0 [amdgpu]
> >>>> [  123.810258 <    0.000006>]  ttm_tt_unpopulate+0x59/0x70 [ttm]
> >>>> [  123.810264 <    0.000006>]  ttm_tt_destroy+0x6a/0x70 [ttm]
> >>>> [  123.810270 <    0.000006>] ttm_bo_cleanup_memtype_use+0x36/0xa0 [ttm]
> >>>> [  123.810276 <    0.000006>]  ttm_bo_put+0x1e7/0x400 [ttm]
> >>>> [  123.810358 <    0.000082>]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
> >>>> [  123.810440 <    0.000082>] amdgpu_gem_object_free+0x37/0x50 [amdgpu]
> >>>> [  123.810459 <    0.000019>]  drm_gem_object_free+0x35/0x40 [drm]
> >>>> [  123.810476 <    0.000017>] drm_gem_object_handle_put_unlocked+0x9d/0xd0 [drm]
> >>>> [  123.810494 <    0.000018>] drm_gem_object_release_handle+0x74/0x90 [drm]
> >>>> [  123.810511 <    0.000017>]  ? drm_gem_object_handle_put_unlocked+0xd0/0xd0 [drm]
> >>>> [  123.810516 <    0.000005>]  idr_for_each+0x4d/0xd0
> >>>> [  123.810534 <    0.000018>]  drm_gem_release+0x20/0x30 [drm]
> >>>> [  123.810550 <    0.000016>]  drm_file_free+0x251/0x2a0 [drm]
> >>>> [  123.810567 <    0.000017>] drm_close_helper.isra.14+0x61/0x70 [drm]
> >>>> [  123.810583 <    0.000016>]  drm_release+0x6a/0xe0 [drm]
> >>>> [  123.810588 <    0.000005>]  __fput+0xa2/0x250
> >>>> [  123.810592 <    0.000004>]  ____fput+0xe/0x10
> >>>> [  123.810595 <    0.000003>]  task_work_run+0x6c/0xa0
> >>>> [  123.810600 <    0.000005>]  do_exit+0x376/0xb60
> >>>> [  123.810604 <    0.000004>]  do_group_exit+0x43/0xa0
> >>>> [  123.810608 <    0.000004>]  get_signal+0x18b/0x8e0
> >>>> [  123.810612 <    0.000004>]  ? do_futex+0x595/0xc20
> >>>> [  123.810617 <    0.000005>]  arch_do_signal+0x34/0x880
> >>>> [  123.810620 <    0.000003>]  ? check_preempt_curr+0x50/0x60
> >>>> [  123.810623 <    0.000003>]  ? ttwu_do_wakeup+0x1e/0x160
> >>>> [  123.810626 <    0.000003>]  ? ttwu_do_activate+0x61/0x70
> >>>> [  123.810630 <    0.000004>] exit_to_user_mode_prepare+0x124/0x1b0
> >>>> [  123.810635 <    0.000005>] syscall_exit_to_user_mode+0x31/0x170
> >>>> [  123.810639 <    0.000004>]  do_syscall_64+0x43/0x80
> >>>>
> >>>>
> >>>> Andrey
> >>>>
> >>>>
> >>>>
> >>>> Christian.
> >>>>
> >>>> +
> >>>> + if (adev->irq.ih.use_bus_addr)
> >>>> + amdgpu_ih_ring_fini(adev, &adev->irq.ih);
> >>>> + if (adev->irq.ih1.use_bus_addr)
> >>>> + amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
> >>>> + if (adev->irq.ih2.use_bus_addr)
> >>>> + amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
> >>>> +
> >>>> + amdgpu_gart_dummy_page_fini(adev);
> >>>> + }
> >>>> +
> >>>> + return NOTIFY_OK;
> >>>> +}
> >>>> +
> >>>> +
> >>>>     /**
> >>>>      * amdgpu_device_init - initialize the driver
> >>>>      *
> >>>> @@ -3304,6 +3339,8 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> >>>>     INIT_WORK(&adev->xgmi_reset_work, amdgpu_device_xgmi_reset_func);
> >>>> + INIT_LIST_HEAD(&adev->device_bo_list);
> >>>> +
> >>>>     adev->gfx.gfx_off_req_count = 1;
> >>>>     adev->pm.ac_power = power_supply_is_system_supplied() > 0;
> >>>> @@ -3575,6 +3612,15 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> >>>>     if (amdgpu_device_cache_pci_state(adev->pdev))
> >>>>     pci_restore_state(pdev);
> >>>> + BLOCKING_INIT_NOTIFIER_HEAD(&adev->notifier);
> >>>> + adev->nb.notifier_call = amdgpu_iommu_group_notifier;
> >>>> +
> >>>> + if (adev->dev->iommu_group) {
> >>>> + r = iommu_group_register_notifier(adev->dev->iommu_group, &adev->nb);
> >>>> + if (r)
> >>>> + goto failed;
> >>>> + }
> >>>> +
> >>>>     return 0;
> >>>>     failed:
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> >>>> index 0db9330..486ad6d 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> >>>> @@ -92,7 +92,7 @@ static int amdgpu_gart_dummy_page_init(struct amdgpu_device *adev)
> >>>>      *
> >>>>      * Frees the dummy page used by the driver (all asics).
> >>>>      */
> >>>> -static void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev)
> >>>> +void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev)
> >>>>     {
> >>>>     if (!adev->dummy_page_addr)
> >>>>     return;
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
> >>>> index afa2e28..5678d9c 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
> >>>> @@ -61,6 +61,7 @@ int amdgpu_gart_table_vram_pin(struct amdgpu_device *adev);
> >>>>     void amdgpu_gart_table_vram_unpin(struct amdgpu_device *adev);
> >>>>     int amdgpu_gart_init(struct amdgpu_device *adev);
> >>>>     void amdgpu_gart_fini(struct amdgpu_device *adev);
> >>>> +void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev);
> >>>>     int amdgpu_gart_unbind(struct amdgpu_device *adev, uint64_t offset,
> >>>>           int pages);
> >>>>     int amdgpu_gart_map(struct amdgpu_device *adev, uint64_t offset,
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> >>>> index 6cc9919..4a1de69 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> >>>> @@ -94,6 +94,10 @@ static void amdgpu_bo_destroy(struct ttm_buffer_object *tbo)
> >>>>     }
> >>>>     amdgpu_bo_unref(&bo->parent);
> >>>> + spin_lock(&ttm_bo_glob.lru_lock);
> >>>> + list_del(&bo->bo);
> >>>> + spin_unlock(&ttm_bo_glob.lru_lock);
> >>>> +
> >>>>     kfree(bo->metadata);
> >>>>     kfree(bo);
> >>>>     }
> >>>> @@ -613,6 +617,12 @@ static int amdgpu_bo_do_create(struct amdgpu_device *adev,
> >>>>     if (bp->type == ttm_bo_type_device)
> >>>>     bo->flags &= ~AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED;
> >>>> + INIT_LIST_HEAD(&bo->bo);
> >>>> +
> >>>> + spin_lock(&ttm_bo_glob.lru_lock);
> >>>> + list_add_tail(&bo->bo, &adev->device_bo_list);
> >>>> + spin_unlock(&ttm_bo_glob.lru_lock);
> >>>> +
> >>>>     return 0;
> >>>>     fail_unreserve:
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> >>>> index 9ac3756..5ae8555 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> >>>> @@ -110,6 +110,8 @@ struct amdgpu_bo {
> >>>>     struct list_head shadow_list;
> >>>>     struct kgd_mem                  *kfd_bo;
> >>>> +
> >>>> + struct list_head bo;
> >>>>     };
> >>>>     static inline struct amdgpu_bo *ttm_to_amdgpu_bo(struct ttm_buffer_object *tbo)
> >>>
> >
> >

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch