[PATCH v3 2/2] drm/amdkfd: pause autosuspend when creating pdd

Felix Kuehling felix.kuehling at amd.com
Wed Dec 4 23:36:29 UTC 2024


On 2024-12-03 09:30, Yunxiang Li wrote:
> When using MES, creating a pdd requires talking to the GPU to set up
> the relevant context. The code here forgot to wake up the GPU in case it
> was suspended, which causes KVM to EFAULT for a passthrough GPU, for
> example. This issue can be masked if the GPU was woken up by other
> things (e.g. opening the KMS node) first and has not yet gone to sleep.
>
> Fixes: cc009e613de6 ("drm/amdkfd: Add KFD support for soc21 v3")
> Signed-off-by: Yunxiang Li <Yunxiang.Li at amd.com>
> ---
> v3: remove the cleanup in kfd_bind_process_to_device and document why
> this issue doesn't always happen
>
>   drivers/gpu/drm/amd/amdkfd/kfd_process.c | 7 +++++++
>   1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index 555a892fcf963..c81c020af75d1 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -1635,12 +1635,19 @@ struct kfd_process_device *kfd_create_process_device_data(struct kfd_node *dev,
>   	atomic64_set(&pdd->evict_duration_counter, 0);
>   
>   	if (dev->kfd->shared_resources.enable_mes) {
> +		retval = pm_runtime_resume_and_get(bdev);
> +		if (retval < 0) {
> +			pr_err("failed to stop autosuspend\n");
> +			goto err_free_pdd;
> +		}
>   		retval = amdgpu_amdkfd_alloc_gtt_mem(adev,
>   						AMDGPU_MES_PROC_CTX_SIZE,
>   						&pdd->proc_ctx_bo,
>   						&pdd->proc_ctx_gpu_addr,
>   						&pdd->proc_ctx_cpu_ptr,
>   						false);

As far as I can see from grepping the code, this BO is never used. It is 
allocated here and freed in kfd_process_destroy_pdds, and that's it.

I see a different proc_ctx_bo allocation in amdgpu_mes_create_process 
but I don't see that function being called anywhere. Either my grep-Fu 
is getting rusty, or there is some dead code and data structures 
surrounding MES here.

So unless I'm missing something, we can just remove this proc_ctx_bo 
completely.

Regards,
   Felix



> +		pm_runtime_mark_last_busy(bdev);
> +		pm_runtime_put_autosuspend(bdev);
>   		if (retval) {
>   			dev_err(bdev,
>   				"failed to allocate process context bo\n");
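
For readers following the hunk above, the runtime PM bracket it adds (resume and take a reference, touch the hardware, then drop the reference so autosuspend can kick in again) can be sketched in user space roughly as below. This is only an illustration under stated assumptions, not the kernel code: struct fake_dev and all fake_* helpers are hypothetical stand-ins that mimic the documented behaviour of pm_runtime_resume_and_get() and pm_runtime_put_autosuspend(), and create_context_sketch() is an invented name.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a runtime-PM-managed device. */
struct fake_dev {
	bool suspended;   /* true while the device is runtime-suspended */
	int usage_count;  /* mimics the runtime PM usage counter */
	bool fail_resume; /* knob for the sketch: make the resume attempt fail */
};

/*
 * Mimics pm_runtime_resume_and_get(): on success the device is awake and
 * a reference is held; on failure no reference is left behind.
 */
static int fake_resume_and_get(struct fake_dev *dev)
{
	if (dev->fail_resume)
		return -EIO;
	dev->usage_count++;
	dev->suspended = false;
	return 0;
}

/*
 * Mimics pm_runtime_put_autosuspend(): drops the reference. The real
 * kernel would suspend the device later, once the autosuspend delay
 * expires; this sketch simply leaves it awake.
 */
static void fake_put_autosuspend(struct fake_dev *dev)
{
	assert(dev->usage_count > 0); /* catch an unbalanced put */
	dev->usage_count--;
}

/* Stand-in for any operation that needs the hardware to be awake. */
static int fake_talk_to_gpu(const struct fake_dev *dev)
{
	return dev->suspended ? -EFAULT : 0;
}

/* The bracket: wake up, do the work, release on success and on failure. */
int create_context_sketch(struct fake_dev *dev)
{
	int ret = fake_resume_and_get(dev);

	if (ret < 0)
		return ret; /* no reference held, nothing to undo */

	ret = fake_talk_to_gpu(dev);
	fake_put_autosuspend(dev);
	return ret;
}
```

The point of the shape is that the reference is dropped before the result of the hardware operation is checked, so both the success and the error path leave the usage counter balanced, which is what the quoted hunk does around amdgpu_amdkfd_alloc_gtt_mem().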

