[PATCH v2 2/2] drm/amdkfd: pause autosuspend when creating pdd

Felix Kuehling felix.kuehling at amd.com
Fri Oct 11 15:17:13 UTC 2024


On 2024-10-11 11:07, Joshi, Mukul wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Yunxiang
>> Li
>> Sent: Thursday, October 10, 2024 12:18 PM
>> To: amd-gfx at lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian
>> <Christian.Koenig at amd.com>; Gui, Jack <Jack.Gui at amd.com>; Li, Yunxiang
>> (Teddy) <Yunxiang.Li at amd.com>
>> Subject: [PATCH v2 2/2] drm/amdkfd: pause autosuspend when creating pdd
>>
>> When using MES creating a pdd will require talking to the GPU to setup the relevant
>> context. The code here forgot to wake up the GPU in case it was in suspend, this
>> causes KVM to EFAULT for passthrough GPU for example.
>>
>> Also, change the other place where we pause suspend to use the cleaner
>> pm_runtime_resume_and_get helper.
>>
>> Fixes: cc009e613de6 ("drm/amdkfd: Add KFD support for soc21 v3")
>> Signed-off-by: Yunxiang Li <Yunxiang.Li at amd.com>
>> ---
>>
>> It is unclear to me if kfd_process_destroy_pdds also have this problem, or is freeing
>> gtt mem guaranteed to succeed even with the GPU in suspend.
>>
>>  drivers/gpu/drm/amd/amdkfd/kfd_process.c | 13 +++++++++----
>>  1 file changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> index c9ce95bc3ec8e..c3d510f1dabfa 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> @@ -1633,12 +1633,19 @@ struct kfd_process_device
>> *kfd_create_process_device_data(struct kfd_node *dev,
>>       atomic64_set(&pdd->evict_duration_counter, 0);
>>
>>       if (dev->kfd->shared_resources.enable_mes) {
>> +             retval = pm_runtime_resume_and_get(bdev);
>> +             if (retval < 0) {
>> +                     pr_err("failed to stop autosuspend\n");
>> +                     goto err_free_pdd;
>> +             }
> 
> I am not 100% sure if it makes sense to resume here and then put it back on suspend queue after the
> allocation. And then again call pm_runtime_resume in kfd_bind_process_to_device().
> Maybe you can set pdd->runtime_inuse = true here itself and not call the pm_runtime_put_autosuspend here.
> 
> I will probably let @Kuehling, Felix comment on this.

I don't think we want to set pdd->runtime_inuse here. That would make all GPUs like they're always in use, no matter if the application is actually using them. I think this part of the change is OK.

> 
> Regards,
> Mukul
> 
>>               retval = amdgpu_amdkfd_alloc_gtt_mem(adev,
>>                                               AMDGPU_MES_PROC_CTX_SIZE,
>>                                               &pdd->proc_ctx_bo,
>>                                               &pdd->proc_ctx_gpu_addr,
>>                                               &pdd->proc_ctx_cpu_ptr,
>>                                               false);
>> +             pm_runtime_mark_last_busy(bdev);
>> +             pm_runtime_put_autosuspend(bdev);
>>               if (retval) {
>>                       dev_err(bdev,
>>                               "failed to allocate process context bo\n"); @@ -
>> 1768,11 +1775,9 @@ struct kfd_process_device
>> *kfd_bind_process_to_device(struct kfd_node *dev,
>>        * pdd is destroyed.
>>        */
>>       if (!pdd->runtime_inuse) {
>> -             err = pm_runtime_get_sync(bdev);
>> -             if (err < 0) {
>> -                     pm_runtime_put_autosuspend(adev_to_drm(dev->adev)-
>>> dev);
>> +             err = pm_runtime_resume_and_get(bdev);

I got confused here, looking at the number of interfaces available in the pm_runtime header file. There seem to be many ways to do the same thing. There may be subtle difference between them. It's not obvious to me that this change is correct or harmless.

Regards,
  Felix


>> +             if (err < 0)
>>                       return ERR_PTR(err);
>> -             }
>>       }
>>
>>       /*
>> --
>> 2.34.1
> 


More information about the amd-gfx mailing list