[PATCH] drm/amdgpu: fix reset domain xgmi hive info reference leak

Felix Kuehling felix.kuehling at amd.com
Thu Aug 11 15:18:49 UTC 2022


Am 2022-08-11 um 09:42 schrieb Jonathan Kim:
> When an xgmi node is added to the hive, it takes another hive
> reference for its reset domain.
>
> This extra reference was not dropped on device removal from the
> hive so drop it.
>
> Signed-off-by: Jonathan Kim <jonathan.kim at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> index 1b108d03e785..560bf1c98f08 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> @@ -731,6 +731,9 @@ int amdgpu_xgmi_remove_device(struct amdgpu_device *adev)
>   	mutex_unlock(&hive->hive_lock);
>   
>   	amdgpu_put_xgmi_hive(hive);
> +	/* device is removed from the hive so remove its reset domain reference */
> +	if (adev->reset_domain && adev->reset_domain == hive->reset_domain)
> +		amdgpu_put_xgmi_hive(hive);

This is some messed up reference counting. If you need an extra 
reference from the reset_domain to the hive, that should be owned by the 
reset_domain and dropped when the reset_domain is destroyed. And it's 
only one reference for the reset_domain, not one reference per adev in 
the reset_domain.

What you're doing here looks like every adev that's in a reset_domain of 
its hive has two references to the hive. And if you're dropping the 
extra reference here, it still leaves the reset_domain with a dangling 
pointer to a hive that may no longer exist. So this extra reference is 
kind of pointless.

Regards,
   Felix


>   	adev->hive = NULL;
>   
>   	if (atomic_dec_return(&hive->number_devices) == 0) {


More information about the amd-gfx mailing list