[PATCH] drm/amdkfd: fix mes set shader debugger process management

Liu, Shaoyun Shaoyun.Liu at amd.com
Tue Dec 12 21:00:18 UTC 2023


[AMD Official Use Only - General]

Does this requires the  new MES FW for this process_ctx_flush  requirement ?  Can driver side add logic to guaranty when  call SET_SHADER_DEBUGGER, the process address  is always valid ?

Regards
Shaoyun.liu


-----Original Message-----
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Eric Huang
Sent: Tuesday, December 12, 2023 12:49 PM
To: Kim, Jonathan <Jonathan.Kim at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Wong, Alice <Shiwei.Wong at amd.com>; Kuehling, Felix <Felix.Kuehling at amd.com>; Kasiviswanathan, Harish <Harish.Kasiviswanathan at amd.com>
Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process management


On 2023-12-11 16:16, Jonathan Kim wrote:
> MES provides the driver a call to explicitly flush stale process
> memory within the MES to avoid a race condition that results in a
> fatal memory violation.
>
> When SET_SHADER_DEBUGGER is called, the driver passes a memory address
> that represents a process context address MES uses to keep track of
> future per-process calls.
>
> Normally, MES will purge its process context list when the last queue
> has been removed.  The driver, however, can call SET_SHADER_DEBUGGER
> regardless of whether a queue has been added or not.
>
> If SET_SHADER_DEBUGGER has been called with no queues as the last call
> prior to process termination, the passed process context address will
> still reside within MES.
>
> On a new process call to SET_SHADER_DEBUGGER, the driver may end up
> passing an identical process context address value (based on
> per-process gpu memory address) to MES but is now pointing to a new
> allocated buffer object during KFD process creation.  Since the MES is
> unaware of this, access of the passed address points to the stale
> object within MES and triggers a fatal memory violation.
>
> The solution is for KFD to explicitly flush the process context
> address from MES on process termination.
>
> Note that the flush call and the MES debugger calls use the same MES
> interface but are separated as KFD calls to avoid conflicting with
> each other.
>
> Signed-off-by: Jonathan Kim <jonathan.kim at amd.com>
> Tested-by: Alice Wong <shiwei.wong at amd.com>
Reviewed-by: Eric Huang <jinhuieric.huang at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c       | 31 +++++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h       | 10 +++---
>   .../amd/amdkfd/kfd_process_queue_manager.c    |  1 +
>   drivers/gpu/drm/amd/include/mes_v11_api_def.h |  3 +-
>   4 files changed, 40 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> index e544b823abf6..e98de23250dc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> @@ -916,6 +916,11 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device *adev,
>       op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
>       op_input.set_shader_debugger.process_context_addr = process_context_addr;
>       op_input.set_shader_debugger.flags.u32all = flags;
> +
> +     /* use amdgpu mes_flush_shader_debugger instead */
> +     if (op_input.set_shader_debugger.flags.process_ctx_flush)
> +             return -EINVAL;
> +
>       op_input.set_shader_debugger.spi_gdbg_per_vmid_cntl = spi_gdbg_per_vmid_cntl;
>       memcpy(op_input.set_shader_debugger.tcp_watch_cntl, tcp_watch_cntl,
>                       sizeof(op_input.set_shader_debugger.tcp_watch_cntl));
> @@ -935,6 +940,32 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device *adev,
>       return r;
>   }
>
> +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device *adev,
> +                                  uint64_t process_context_addr) {
> +     struct mes_misc_op_input op_input = {0};
> +     int r;
> +
> +     if (!adev->mes.funcs->misc_op) {
> +             DRM_ERROR("mes flush shader debugger is not supported!\n");
> +             return -EINVAL;
> +     }
> +
> +     op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
> +     op_input.set_shader_debugger.process_context_addr = process_context_addr;
> +     op_input.set_shader_debugger.flags.process_ctx_flush = true;
> +
> +     amdgpu_mes_lock(&adev->mes);
> +
> +     r = adev->mes.funcs->misc_op(&adev->mes, &op_input);
> +     if (r)
> +             DRM_ERROR("failed to set_shader_debugger\n");
> +
> +     amdgpu_mes_unlock(&adev->mes);
> +
> +     return r;
> +}
> +
>   static void
>   amdgpu_mes_ring_to_queue_props(struct amdgpu_device *adev,
>                              struct amdgpu_ring *ring,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> index 894b9b133000..7d4f93fea937 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> @@ -296,9 +296,10 @@ struct mes_misc_op_input {
>                       uint64_t process_context_addr;
>                       union {
>                               struct {
> -                                     uint64_t single_memop : 1;
> -                                     uint64_t single_alu_op : 1;
> -                                     uint64_t reserved: 30;
> +                                     uint32_t single_memop : 1;
> +                                     uint32_t single_alu_op : 1;
> +                                     uint32_t reserved: 29;
> +                                     uint32_t process_ctx_flush: 1;
>                               };
>                               uint32_t u32all;
>                       } flags;
> @@ -374,7 +375,8 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device *adev,
>                               const uint32_t *tcp_watch_cntl,
>                               uint32_t flags,
>                               bool trap_en);
> -
> +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device *adev,
> +                             uint64_t process_context_addr);
>   int amdgpu_mes_add_ring(struct amdgpu_device *adev, int gang_id,
>                       int queue_type, int idx,
>                       struct amdgpu_mes_ctx_data *ctx_data, diff --git
> a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> index 77f493262e05..8e55e78fce4e 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> @@ -87,6 +87,7 @@ void kfd_process_dequeue_from_device(struct kfd_process_device *pdd)
>               return;
>
>       dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
> +     amdgpu_mes_flush_shader_debugger(dev->adev, pdd->proc_ctx_gpu_addr);
>       pdd->already_dequeued = true;
>   }
>
> diff --git a/drivers/gpu/drm/amd/include/mes_v11_api_def.h
> b/drivers/gpu/drm/amd/include/mes_v11_api_def.h
> index 1fbfd1aa987e..ec5b9ab67c5e 100644
> --- a/drivers/gpu/drm/amd/include/mes_v11_api_def.h
> +++ b/drivers/gpu/drm/amd/include/mes_v11_api_def.h
> @@ -572,7 +572,8 @@ struct SET_SHADER_DEBUGGER {
>               struct {
>                       uint32_t single_memop : 1;  /* SQ_DEBUG.single_memop */
>                       uint32_t single_alu_op : 1; /* SQ_DEBUG.single_alu_op */
> -                     uint32_t reserved : 30;
> +                     uint32_t reserved : 29;
> +                     uint32_t process_ctx_flush : 1;
>               };
>               uint32_t u32all;
>       } flags;



More information about the amd-gfx mailing list