[Patch v3 3/4] drm/amdkfd: refactor runtime pm for baco

Felix Kuehling felix.kuehling at amd.com
Fri Feb 7 21:49:57 UTC 2020


One more nit-pick and one error-handling problem inline.

On 2020-02-06 7:09 p.m., Rajneesh Bhardwaj wrote:
> So far the kfd driver implemented same routines for runtime and system
> wide suspend and resume (s2idle or mem). During system wide suspend the
> kfd aquires an atomic lock that prevents any more user processes to
> create queues and interact with kfd driver and amd gpu. This mechanism
> created problem when amdgpu device is runtime suspended with BACO
> enabled. Any application that relies on kfd driver fails to load because
> the driver reports a locked kfd device since gpu is runtime suspended.
>
> However, in an ideal case, when gpu is runtime  suspended the kfd driver
> should be able to:
>
>   - auto resume amdgpu driver whenever a client requests compute service
>   - prevent runtime suspend for amdgpu  while kfd is in use
>
> This change refactors the amdgpu and amdkfd drivers to support BACO and
> runtime power management.
>
> Reviewed-by: Oak Zeng <oak.zeng at amd.com>
> Reviewed-by: Felix Kuehling <felix.kuehling at amd.com>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 +++----
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 ++---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +--
>   drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 29 +++++++++-------
>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h      |  1 +
>   drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 40 ++++++++++++++++++++--
>   6 files changed, 68 insertions(+), 26 deletions(-)
>
[snip]
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index 98dcbb96b2e2..6d6c25fe2677 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -31,6 +31,7 @@
>   #include <linux/compat.h>
>   #include <linux/mman.h>
>   #include <linux/file.h>
> +#include <linux/pm_runtime.h>
>   #include "amdgpu_amdkfd.h"
>   #include "amdgpu.h"
>   
> @@ -527,6 +528,16 @@ static void kfd_process_destroy_pdds(struct kfd_process *p)
>   		kfree(pdd->qpd.doorbell_bitmap);
>   		idr_destroy(&pdd->alloc_idr);
>   
> +		/*
> +		 * before destroying pdd, make sure to report availability
> +		 * for auto suspend
> +		 */
> +		if (pdd->runtime_inuse) {
> +			pm_runtime_mark_last_busy(pdd->dev->ddev->dev);
> +			pm_runtime_put_autosuspend(pdd->dev->ddev->dev);
> +			pdd->runtime_inuse = false;
> +		}
> +
>   		kfree(pdd);
>   	}
>   }
> @@ -844,6 +855,7 @@ struct kfd_process_device *kfd_create_process_device_data(struct kfd_dev *dev,
>   	pdd->process = p;
>   	pdd->bound = PDD_UNBOUND;
>   	pdd->already_dequeued = false;
> +	pdd->runtime_inuse = false;
>   	list_add(&pdd->per_device_list, &p->per_device_data);
>   
>   	/* Init idr used for memory handle translation */
> @@ -933,15 +945,39 @@ struct kfd_process_device *kfd_bind_process_to_device(struct kfd_dev *dev,
>   		return ERR_PTR(-ENOMEM);
>   	}
>   
> +	/*
> +	 * signal runtime-pm system to auto resume and prevent
> +	 * further runtime suspend once device pdd is created until
> +	 * pdd is destroyed.
> +	 */
> +	if (!pdd->runtime_inuse) {
> +		err = pm_runtime_get_sync(dev->ddev->dev);
> +		if (err < 0)
> +			return ERR_PTR(err);
> +	}
> +
>   	err = kfd_iommu_bind_process_to_device(pdd);
>   	if (err)
> -		return ERR_PTR(err);
> +		goto out;
>   
>   	err = kfd_process_device_init_vm(pdd, NULL);
>   	if (err)
> -		return ERR_PTR(err);
> +		goto out;
> +
> +	if (!err)

This "if" is also redundant. If there was an error, you already did goto 
out. pdd->runtime_inuse should be set whenever we return successfully 
from this function, so logically there should be no extra "if".


> +		/*
> +		 * make sure that runtime_usage counter is incremented
> +		 * just once per pdd
> +		 */
> +		pdd->runtime_inuse = true;
>   
>   	return pdd;
> +
> +out:
> +	/* balance runpm reference count and exit with error */

I think you need an "if (!pdd->runtime_inuse)" here. If this function 
didn't call pm_runtime_get_sync above, you shouldn't do the cleanup 
below. Otherwise you risk getting unbalanced usage counters. In other 
words, you need to use the same condition for pm_runtime_get_sync and 
the cleanup.

Regards,
   Felix


> +	pm_runtime_mark_last_busy(dev->ddev->dev);
> +	pm_runtime_put_autosuspend(dev->ddev->dev);
> +	return ERR_PTR(err);
>   }
>   
>   struct kfd_process_device *kfd_get_first_process_device_data(


More information about the amd-gfx mailing list