[Patch v3 3/4] drm/amdkfd: refactor runtime pm for baco
Felix Kuehling
felix.kuehling at amd.com
Fri Feb 7 21:49:57 UTC 2020
One more nit-pick and one error-handling problem inline.
On 2020-02-06 7:09 p.m., Rajneesh Bhardwaj wrote:
> So far the kfd driver implemented the same routines for runtime and system
> wide suspend and resume (s2idle or mem). During system wide suspend the
> kfd acquires an atomic lock that prevents any more user processes from
> creating queues and interacting with the kfd driver and amd gpu. This mechanism
> created a problem when the amdgpu device is runtime suspended with BACO
> enabled. Any application that relies on the kfd driver fails to load because
> the driver reports a locked kfd device since the gpu is runtime suspended.
>
> However, in an ideal case, when gpu is runtime suspended the kfd driver
> should be able to:
>
> - auto resume amdgpu driver whenever a client requests compute service
> - prevent runtime suspend for amdgpu while kfd is in use
>
> This change refactors the amdgpu and amdkfd drivers to support BACO and
> runtime power management.
>
> Reviewed-by: Oak Zeng <oak.zeng at amd.com>
> Reviewed-by: Felix Kuehling <felix.kuehling at amd.com>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 +++----
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 8 ++---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +--
> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 29 +++++++++-------
> drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1 +
> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 40 ++++++++++++++++++++--
> 6 files changed, 68 insertions(+), 26 deletions(-)
>
[snip]
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index 98dcbb96b2e2..6d6c25fe2677 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -31,6 +31,7 @@
> #include <linux/compat.h>
> #include <linux/mman.h>
> #include <linux/file.h>
> +#include <linux/pm_runtime.h>
> #include "amdgpu_amdkfd.h"
> #include "amdgpu.h"
>
> @@ -527,6 +528,16 @@ static void kfd_process_destroy_pdds(struct kfd_process *p)
> kfree(pdd->qpd.doorbell_bitmap);
> idr_destroy(&pdd->alloc_idr);
>
> + /*
> + * before destroying pdd, make sure to report availability
> + * for auto suspend
> + */
> + if (pdd->runtime_inuse) {
> + pm_runtime_mark_last_busy(pdd->dev->ddev->dev);
> + pm_runtime_put_autosuspend(pdd->dev->ddev->dev);
> + pdd->runtime_inuse = false;
> + }
> +
> kfree(pdd);
> }
> }
> @@ -844,6 +855,7 @@ struct kfd_process_device *kfd_create_process_device_data(struct kfd_dev *dev,
> pdd->process = p;
> pdd->bound = PDD_UNBOUND;
> pdd->already_dequeued = false;
> + pdd->runtime_inuse = false;
> list_add(&pdd->per_device_list, &p->per_device_data);
>
> /* Init idr used for memory handle translation */
> @@ -933,15 +945,39 @@ struct kfd_process_device *kfd_bind_process_to_device(struct kfd_dev *dev,
> return ERR_PTR(-ENOMEM);
> }
>
> + /*
> + * signal runtime-pm system to auto resume and prevent
> + * further runtime suspend once device pdd is created until
> + * pdd is destroyed.
> + */
> + if (!pdd->runtime_inuse) {
> + err = pm_runtime_get_sync(dev->ddev->dev);
> + if (err < 0)
> + return ERR_PTR(err);
> + }
> +
> err = kfd_iommu_bind_process_to_device(pdd);
> if (err)
> - return ERR_PTR(err);
> + goto out;
>
> err = kfd_process_device_init_vm(pdd, NULL);
> if (err)
> - return ERR_PTR(err);
> + goto out;
> +
> + if (!err)
This "if" is also redundant. If there was an error, you already did goto
out. pdd->runtime_inuse should be set whenever we return successfully
from this function, so logically there should be no extra "if".
> + /*
> + * make sure that runtime_usage counter is incremented
> + * just once per pdd
> + */
> + pdd->runtime_inuse = true;
>
> return pdd;
> +
> +out:
> + /* balance runpm reference count and exit with error */
I think you need an "if (!pdd->runtime_inuse)" here. If this function
didn't call pm_runtime_get_sync above, you shouldn't do the cleanup
below. Otherwise you risk getting unbalanced usage counters. In other
words, you need to use the same condition for pm_runtime_get_sync and
the cleanup.
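To illustrate the balancing rule, here is a minimal userspace sketch. It is
not the driver code: the mock_* names, the plain integer usage counter and
the init_err parameter (standing in for a failure of
kfd_iommu_bind_process_to_device or kfd_process_device_init_vm) are all
assumptions made up for this example. It only models the "same condition
for get and put" logic, with the redundant "if (!err)" dropped.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for struct device and struct kfd_process_device. */
struct mock_dev { int usage_count; };
struct mock_pdd { struct mock_dev *dev; bool runtime_inuse; };

/* Models only the usage counter side of pm_runtime_get_sync(). */
static int mock_pm_runtime_get_sync(struct mock_dev *d)
{
	d->usage_count++;
	return 0;
}

/* Models only the usage counter side of pm_runtime_put_autosuspend(). */
static void mock_pm_runtime_put_autosuspend(struct mock_dev *d)
{
	assert(d->usage_count > 0);	/* catches an unbalanced put */
	d->usage_count--;
}

/* init_err simulates the result of the two init calls in the patch. */
int mock_bind(struct mock_pdd *pdd, int init_err)
{
	int err;

	/* Take a reference only if this pdd does not hold one yet. */
	if (!pdd->runtime_inuse) {
		err = mock_pm_runtime_get_sync(pdd->dev);
		if (err < 0)
			return err;
	}

	err = init_err;
	if (err)
		goto out;

	/* Success path: no extra "if", the flag is always set here. */
	pdd->runtime_inuse = true;
	return 0;

out:
	/* Drop the reference only if this call took it (same condition). */
	if (!pdd->runtime_inuse)
		mock_pm_runtime_put_autosuspend(pdd->dev);
	return err;
}
```

In this model, a failing call on a pdd that already has runtime_inuse set
leaves the counter untouched, while a failing call on a fresh pdd returns it
to its previous value. Without the guard at "out", the first case would drop
a reference the call never took.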
Regards,
Felix
> + pm_runtime_mark_last_busy(dev->ddev->dev);
> + pm_runtime_put_autosuspend(dev->ddev->dev);
> + return ERR_PTR(err);
> }
>
> struct kfd_process_device *kfd_get_first_process_device_data(