[PATCH] drm/amdgpu: fix reset domain xgmi hive info reference leak
Felix Kuehling
felix.kuehling at amd.com
Thu Aug 11 15:18:49 UTC 2022
Am 2022-08-11 um 09:42 schrieb Jonathan Kim:
> When an xgmi node is added to the hive, it takes another hive
> reference for its reset domain.
>
> This extra reference was not dropped on device removal from the
> hive so drop it.
>
> Signed-off-by: Jonathan Kim <jonathan.kim at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> index 1b108d03e785..560bf1c98f08 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> @@ -731,6 +731,9 @@ int amdgpu_xgmi_remove_device(struct amdgpu_device *adev)
> mutex_unlock(&hive->hive_lock);
>
> amdgpu_put_xgmi_hive(hive);
> + /* device is removed from the hive so remove its reset domain reference */
> + if (adev->reset_domain && adev->reset_domain == hive->reset_domain)
> + amdgpu_put_xgmi_hive(hive);
This is some messed up reference counting. If you need an extra
reference from the reset_domain to the hive, that should be owned by the
reset_domain and dropped when the reset_domain is destroyed. And it's
only one reference for the reset_domain, not one reference per adev in
the reset_domain.
What you're doing here looks like every adev that's in a reset_domain of
its hive has two references to the hive. And if you're dropping the
extra reference here, it still leaves the reset_domain with a dangling
pointer to a hive that may no longer exist. So this extra reference is
kind of pointless.
Regards,
Felix
> adev->hive = NULL;
>
> if (atomic_dec_return(&hive->number_devices) == 0) {
More information about the amd-gfx
mailing list