[PATCH] drm/amdgpu: fix reset domain xgmi hive info reference leak

Andrey Grodzovsky andrey.grodzovsky at amd.com
Thu Aug 11 16:43:16 UTC 2022


On 2022-08-11 11:34, Kim, Jonathan wrote:
> [Public]
>
>> -----Original Message-----
>> From: Kuehling, Felix <Felix.Kuehling at amd.com>
>> Sent: August 11, 2022 11:19 AM
>> To: amd-gfx at lists.freedesktop.org; Kim, Jonathan <Jonathan.Kim at amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference
>> leak
>>
>> Am 2022-08-11 um 09:42 schrieb Jonathan Kim:
>>> When an xgmi node is added to the hive, it takes another hive
>>> reference for its reset domain.
>>>
>>> This extra reference was not dropped on device removal from the
>>> hive so drop it.
>>>
>>> Signed-off-by: Jonathan Kim <jonathan.kim at amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 3 +++
>>>    1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>>> index 1b108d03e785..560bf1c98f08 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>>> @@ -731,6 +731,9 @@ int amdgpu_xgmi_remove_device(struct
>> amdgpu_device *adev)
>>>      mutex_unlock(&hive->hive_lock);
>>>
>>>      amdgpu_put_xgmi_hive(hive);
>>> +   /* device is removed from the hive so remove its reset domain
>> reference */
>>> +   if (adev->reset_domain && adev->reset_domain == hive-
>>> reset_domain)
>>> +           amdgpu_put_xgmi_hive(hive);
>> This is some messed up reference counting. If you need an extra
>> reference from the reset_domain to the hive, that should be owned by the
>> reset_domain and dropped when the reset_domain is destroyed. And it's
>> only one reference for the reset_domain, not one reference per adev in
>> the reset_domain.
> Cc'ing Andrey.
>
> What you're saying seems to make more sense to me, but what I got from an offline conversation with Andrey
> was that the reset domain reference per device was intentional.
> Maybe Andrey can comment here.
>
>> What you're doing here looks like every adev that's in a reset_domain of
>> its hive has two references to the hive. And if you're dropping the
>> extra reference here, it still leaves the reset_domain with a dangling
>> pointer to a hive that may no longer exist. So this extra reference is
>> kind of pointless.


reset_domain doesn't have any references to the hive, the hive has a 
reference to reset_domain


> Yes.  Currently one reference is fetched from the device's lifetime on the hive and the other is from the
> per-device reset domain.
>
> Snippet from amdgpu_device_ip_init:
>          /**
>           * In case of XGMI grab extra reference for reset domain for this device
>           */
>          if (adev->gmc.xgmi.num_physical_nodes > 1) {
>                  if (amdgpu_xgmi_add_device(adev) == 0) { <- [JK] reference is fetched here


amdgpu_xgmi_add_device calls  amdgpu_get_xgmi_hive and only on the first 
time amdgpu_get_xgmi_hive is called and hive is actually allocated and 
initialized  will we proceed
to creating the reset domain either from scratch (first creation of the 
hive) or by taking reference from adev (see [1])



[1] - 
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c#L394

>                          struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev); <- [JK] then here again


So here I don't see how an extra reference to reset_domain is taken if 
amdgpu_get_xgmi_hive returns early since the hive already created and 
exists in the global hive container ?

Johantan - can u please show the exact flow how recount leak on 
reset_domain is happening ?

Andrey


>
>                          if (!hive->reset_domain ||
>                              !amdgpu_reset_get_reset_domain(hive->reset_domain)) {
>                                  r = -ENOENT;
>                                  goto init_failed;
>                          }
>
>                          /* Drop the early temporary reset domain we created for device */
>                          amdgpu_reset_put_reset_domain(adev->reset_domain);
>                          adev->reset_domain = hive->reset_domain;
>                  }
>          }
>
> One of these never gets dropped so a leak happens.
> So either the extra reference has to be dropped on device removal from the hive or from what you've mentioned,
> the reset_domain reference fetch should be fixed to grab at the hive/reset_domain level.
>
> Thanks,
>
> Jon
>
>> Regards,
>>     Felix
>>
>>
>>>      adev->hive = NULL;
>>>
>>>      if (atomic_dec_return(&hive->number_devices) == 0) {


More information about the amd-gfx mailing list