[PATCH] drm/amdgpu: fix reset domain xgmi hive info reference leak

Kim, Jonathan Jonathan.Kim at amd.com
Thu Aug 11 15:34:59 UTC 2022


[Public]

> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling at amd.com>
> Sent: August 11, 2022 11:19 AM
> To: amd-gfx at lists.freedesktop.org; Kim, Jonathan <Jonathan.Kim at amd.com>
> Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference
> leak
>
> Am 2022-08-11 um 09:42 schrieb Jonathan Kim:
> > When an xgmi node is added to the hive, it takes another hive
> > reference for its reset domain.
> >
> > This extra reference was not dropped on device removal from the
> > hive so drop it.
> >
> > Signed-off-by: Jonathan Kim <jonathan.kim at amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 3 +++
> >   1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> > index 1b108d03e785..560bf1c98f08 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> > @@ -731,6 +731,9 @@ int amdgpu_xgmi_remove_device(struct
> amdgpu_device *adev)
> >     mutex_unlock(&hive->hive_lock);
> >
> >     amdgpu_put_xgmi_hive(hive);
> > +   /* device is removed from the hive so remove its reset domain
> reference */
> > +   if (adev->reset_domain && adev->reset_domain == hive-
> >reset_domain)
> > +           amdgpu_put_xgmi_hive(hive);
>
> This is some messed up reference counting. If you need an extra
> reference from the reset_domain to the hive, that should be owned by the
> reset_domain and dropped when the reset_domain is destroyed. And it's
> only one reference for the reset_domain, not one reference per adev in
> the reset_domain.

Cc'ing Andrey.

What you're saying seems to make more sense to me, but what I got from an offline conversation with Andrey
was that the reset domain reference per device was intentional.
Maybe Andrey can comment here.

>
> What you're doing here looks like every adev that's in a reset_domain of
> its hive has two references to the hive. And if you're dropping the
> extra reference here, it still leaves the reset_domain with a dangling
> pointer to a hive that may no longer exist. So this extra reference is
> kind of pointless.

Yes.  Currently one reference is fetched from the device's lifetime on the hive and the other is from the
per-device reset domain.

Snippet from amdgpu_device_ip_init:
        /**
         * In case of XGMI grab extra reference for reset domain for this device
         */
        if (adev->gmc.xgmi.num_physical_nodes > 1) {
                if (amdgpu_xgmi_add_device(adev) == 0) { <- [JK] reference is fetched here
                        struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev); <- [JK] then here again

                        if (!hive->reset_domain ||
                            !amdgpu_reset_get_reset_domain(hive->reset_domain)) {
                                r = -ENOENT;
                                goto init_failed;
                        }

                        /* Drop the early temporary reset domain we created for device */
                        amdgpu_reset_put_reset_domain(adev->reset_domain);
                        adev->reset_domain = hive->reset_domain;
                }
        }

One of these never gets dropped so a leak happens.
So either the extra reference has to be dropped on device removal from the hive or from what you've mentioned,
the reset_domain reference fetch should be fixed to grab at the hive/reset_domain level.

Thanks,

Jon

>
> Regards,
>    Felix
>
>
> >     adev->hive = NULL;
> >
> >     if (atomic_dec_return(&hive->number_devices) == 0) {


More information about the amd-gfx mailing list