[PATCH v3 2/2] drm/amdkfd: pause autosuspend when creating pdd
Felix Kuehling
felix.kuehling at amd.com
Wed Dec 4 23:44:01 UTC 2024
On 2024-12-04 18:36, Felix Kuehling wrote:
>
> On 2024-12-03 09:30, Yunxiang Li wrote:
>> When using MES creating a pdd will require talking to the GPU to setup
>> the relevant context. The code here forgot to wake up the GPU in case it
>> was in suspend, this causes KVM to EFAULT for passthrough GPU for
>> example. This issue can be masked if the GPU was woken up by other
>> things (e.g. opening the KMS node) first and have not yet gone to sleep.
>>
>> Fixes: cc009e613de6 ("drm/amdkfd: Add KFD support for soc21 v3")
>> Signed-off-by: Yunxiang Li <Yunxiang.Li at amd.com>
>> ---
>> v3: remove the cleanup in kfd_bind_process_to_device and document why
>> this issue doesn't always happen
>>
>> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> index 555a892fcf963..c81c020af75d1 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> @@ -1635,12 +1635,19 @@ struct kfd_process_device
>> *kfd_create_process_device_data(struct kfd_node *dev,
>> atomic64_set(&pdd->evict_duration_counter, 0);
>> if (dev->kfd->shared_resources.enable_mes) {
>> + retval = pm_runtime_resume_and_get(bdev);
>> + if (retval < 0) {
>> + pr_err("failed to stop autosuspend\n");
>> + goto err_free_pdd;
>> + }
>> retval = amdgpu_amdkfd_alloc_gtt_mem(adev,
>> AMDGPU_MES_PROC_CTX_SIZE,
>> &pdd->proc_ctx_bo,
>> &pdd->proc_ctx_gpu_addr,
>> &pdd->proc_ctx_cpu_ptr,
>> false);
>
> As far as I can see from grepping the code, this BO is never used. It
> is allocated here and freed in kfd_process_destroy_pdds, and that's it.
>
> I see a different proc_ctx_bo allocation in amdgpu_mes_create_process
> but I don't see that function being called anywhere. Either my grep-Fu
> is getting rusty, or there is some dead code and data structures
> surrounding MES here.
>
> So unless I'm missing something, we can just remove this proc_ctx_bo
> completely.
OK, I was missing that proc_ctx_gpu_addr is used in add_queue_mes.
One other suggestion would be to do the allocation of proc_ctx_bo in a
lazy fashion when the first queue is created in a process. Then it would
naturally happen after the GPU has been taken out of runtime-PM in
kfd_bind_process_to_device. This would avoid unnecessarily waking up all
GPUs in the system every time a KFD process is started.
Regards,
Felix
>
> Regards,
> Felix
>
>
>
>> + pm_runtime_mark_last_busy(bdev);
>> + pm_runtime_put_autosuspend(bdev);
>> if (retval) {
>> dev_err(bdev,
>> "failed to allocate process context bo\n");
More information about the amd-gfx
mailing list