[PATCH v2 2/2] drm/amdkfd: pause autosuspend when creating pdd

Fri Oct 11 16:21:25 UTC 2024

[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling at amd.com>
> Sent: Friday, October 11, 2024 11:17 AM
> To: Joshi, Mukul <Mukul.Joshi at amd.com>; Li, Yunxiang (Teddy)
> <Yunxiang.Li at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com>; Gui, Jack <Jack.Gui at amd.com>
> Subject: Re: [PATCH v2 2/2] drm/amdkfd: pause autosuspend when creating pdd
>
>
> On 2024-10-11 11:07, Joshi, Mukul wrote:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> >> -----Original Message-----
> >> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
> >> Yunxiang Li
> >> Sent: Thursday, October 10, 2024 12:18 PM
> >> To: amd-gfx at lists.freedesktop.org
> >> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian
> >> <Christian.Koenig at amd.com>; Gui, Jack <Jack.Gui at amd.com>; Li,
> >> Yunxiang
> >> (Teddy) <Yunxiang.Li at amd.com>
> >> Subject: [PATCH v2 2/2] drm/amdkfd: pause autosuspend when creating
> >> pdd
> >>
> >> When using MES creating a pdd will require talking to the GPU to
> >> setup the relevant context. The code here forgot to wake up the GPU
> >> in case it was in suspend, this causes KVM to EFAULT for passthrough GPU
> for example.
> >>
> >> Also, change the other place where we pause suspend to use the
> >> cleaner pm_runtime_resume_and_get helper.
> >>
> >> Fixes: cc009e613de6 ("drm/amdkfd: Add KFD support for soc21 v3")
> >> Signed-off-by: Yunxiang Li <Yunxiang.Li at amd.com>
> >> ---
> >>
> >> It is unclear to me if kfd_process_destroy_pdds also have this
> >> problem, or is freeing gtt mem guaranteed to succeed even with the GPU in
> suspend.
> >>
> >>  drivers/gpu/drm/amd/amdkfd/kfd_process.c | 13 +++++++++----
> >>  1 file changed, 9 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >> index c9ce95bc3ec8e..c3d510f1dabfa 100644
> >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >> @@ -1633,12 +1633,19 @@ struct kfd_process_device
> >> *kfd_create_process_device_data(struct kfd_node *dev,
> >>       atomic64_set(&pdd->evict_duration_counter, 0);
> >>
> >>       if (dev->kfd->shared_resources.enable_mes) {
> >> +             retval = pm_runtime_resume_and_get(bdev);
> >> +             if (retval < 0) {
> >> +                     pr_err("failed to stop autosuspend\n");
> >> +                     goto err_free_pdd;
> >> +             }
> >
> > I am not 100% sure if it makes sense to resume here and then put it
> > back on suspend queue after the allocation. And then again call
> pm_runtime_resume in kfd_bind_process_to_device().
> > Maybe you can set pdd->runtime_inuse = true here itself and not call the
> pm_runtime_put_autosuspend here.
> >
> > I will probably let @Kuehling, Felix comment on this.
>
> I don't think we want to set pdd->runtime_inuse here. That would make all GPUs
> like they're always in use, no matter if the application is actually using them. I think
> this part of the change is OK.
>
Ok thanks. I got confused and thought we would call kfd_bind_to_process() everytime.
But that’s not the case.

> >
> > Regards,
> > Mukul
> >
> >>               retval = amdgpu_amdkfd_alloc_gtt_mem(adev,
> >>                                               AMDGPU_MES_PROC_CTX_SIZE,
> >>                                               &pdd->proc_ctx_bo,
> >>                                               &pdd->proc_ctx_gpu_addr,
> >>                                               &pdd->proc_ctx_cpu_ptr,
> >>                                               false);
> >> +             pm_runtime_mark_last_busy(bdev);
> >> +             pm_runtime_put_autosuspend(bdev);
> >>               if (retval) {
> >>                       dev_err(bdev,
> >>                               "failed to allocate process context
> >> bo\n"); @@ -
> >> 1768,11 +1775,9 @@ struct kfd_process_device
> >> *kfd_bind_process_to_device(struct kfd_node *dev,
> >>        * pdd is destroyed.
> >>        */
> >>       if (!pdd->runtime_inuse) {
> >> -             err = pm_runtime_get_sync(bdev);
> >> -             if (err < 0) {
> >> -                     pm_runtime_put_autosuspend(adev_to_drm(dev->adev)-
> >>> dev);
> >> +             err = pm_runtime_resume_and_get(bdev);
>
> I got confused here, looking at the number of interfaces available in the pm_runtime
> header file. There seem to be many ways to do the same thing. There may be
> subtle difference between them. It's not obvious to me that this change is correct or
> harmless.
>
I think it is OK. The function description above pm_runtime_get_sync() says to consider using
pm_runtime_resume_and_get() instead.

Regards,
Mukul

> Regards,
>   Felix
>
>
> >> +             if (err < 0)
> >>                       return ERR_PTR(err);
> >> -             }
> >>       }
> >>
> >>       /*
> >> --
> >> 2.34.1
> >