[PATCH v3 2/2] drm/amdkfd: pause autosuspend when creating pdd

Felix Kuehling felix.kuehling at amd.com
Wed Dec 4 23:44:01 UTC 2024


On 2024-12-04 18:36, Felix Kuehling wrote:
>
> On 2024-12-03 09:30, Yunxiang Li wrote:
>> When using MES creating a pdd will require talking to the GPU to setup
>> the relevant context. The code here forgot to wake up the GPU in case it
>> was in suspend, this causes KVM to EFAULT for passthrough GPU for
>> example. This issue can be masked if the GPU was woken up by other
>> things (e.g. opening the KMS node) first and have not yet gone to sleep.
>>
>> Fixes: cc009e613de6 ("drm/amdkfd: Add KFD support for soc21 v3")
>> Signed-off-by: Yunxiang Li <Yunxiang.Li at amd.com>
>> ---
>> v3: remove the cleanup in kfd_bind_process_to_device and document why
>> this issue doesn't always happen
>>
>>   drivers/gpu/drm/amd/amdkfd/kfd_process.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> index 555a892fcf963..c81c020af75d1 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> @@ -1635,12 +1635,19 @@ struct kfd_process_device 
>> *kfd_create_process_device_data(struct kfd_node *dev,
>>       atomic64_set(&pdd->evict_duration_counter, 0);
>>         if (dev->kfd->shared_resources.enable_mes) {
>> +        retval = pm_runtime_resume_and_get(bdev);
>> +        if (retval < 0) {
>> +            pr_err("failed to stop autosuspend\n");
>> +            goto err_free_pdd;
>> +        }
>>           retval = amdgpu_amdkfd_alloc_gtt_mem(adev,
>>                           AMDGPU_MES_PROC_CTX_SIZE,
>>                           &pdd->proc_ctx_bo,
>>                           &pdd->proc_ctx_gpu_addr,
>>                           &pdd->proc_ctx_cpu_ptr,
>>                           false);
>
> As far as I can see from grepping the code, this BO is never used. It 
> is allocated here and freed in kfd_process_destroy_pdds, and that's it.
>
> I see a different proc_ctx_bo allocation in amdgpu_mes_create_process 
> but I don't see that function being called anywhere. Either my grep-Fu 
> is getting rusty, or there is some dead code and data structures 
> surrounding MES here.
>
> So unless I'm missing something, we can just remove this proc_ctx_bo 
> completely.

OK, I was missing that proc_ctx_gpu_addr is used in add_queue_mes.

One other suggestion would be to do the allocation of proc_ctx_bo in a 
lazy fashion when the first queue is created in a process. Then it would 
naturally happen after the GPU has been taken out of runtime-PM in 
kfd_bind_process_to_device. This would avoid unnecessarily waking up all 
GPUs in the system every time a KFD process is started.

Regards,
   Felix


>
> Regards,
>   Felix
>
>
>
>> +        pm_runtime_mark_last_busy(bdev);
>> +        pm_runtime_put_autosuspend(bdev);
>>           if (retval) {
>>               dev_err(bdev,
>>                   "failed to allocate process context bo\n");


More information about the amd-gfx mailing list