[PATCH] drm/amdkfd: sever xgmi io link if host driver has disable sharing

Kim, Jonathan Jonathan.Kim at amd.com
Wed Oct 16 19:08:00 UTC 2024


[Public]

I got James' email address wrong in the Tested-by tag. CC'ing James.

> -----Original Message-----
> From: Kim, Jonathan <Jonathan.Kim at amd.com>
> Sent: Wednesday, October 16, 2024 11:59 AM
> To: amd-gfx at lists.freedesktop.org
> Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan at amd.com>; Kuehling, Felix
> <Felix.Kuehling at amd.com>; Kim, Jonathan <Jonathan.Kim at amd.com>; Kim,
> Jonathan <Jonathan.Kim at amd.com>; James Yao <yiqing at yao.amd.com>
> Subject: [PATCH] drm/amdkfd: sever xgmi io link if host driver has disable sharing
>
> From: Jonathan Kim <Jonathan.Kim at amd.com>
>
> Host drivers can create partial hives per guest by disabling xgmi sharing
> between certain peers in the main hive.
> Typically, these partial hives are fully connected per guest session.
> In the event that the host makes a mistake by adding a non-shared node
> to a guest session, have the KFD reflect sharing disabled by severing
> the IO link.
>
> Signed-off-by: Jonathan Kim <jonathan.kim at amd.com>
> Tested-by: James Yao <yiqing at yao.amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 17 +++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h |  2 ++
>  drivers/gpu/drm/amd/amdkfd/kfd_crat.c    |  3 +++
>  3 files changed, 22 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> index fcdbcff57632..1d50f327eb08 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> @@ -801,6 +801,23 @@ int amdgpu_xgmi_get_num_links(struct amdgpu_device *adev,
>       return  -EINVAL;
>  }
>
> +bool amdgpu_xgmi_get_is_sharing_enabled(struct amdgpu_device *adev,
> +                                     struct amdgpu_device *peer_adev)
> +{
> +     struct psp_xgmi_topology_info *top = &adev->psp.xgmi_context.top_info;
> +     int i;
> +
> +     /* Sharing should always be enabled for non-SRIOV. */
> +     if (!amdgpu_sriov_vf(adev))
> +             return true;
> +
> +     for (i = 0; i < top->num_nodes; ++i)
> +             if (top->nodes[i].node_id == peer_adev->gmc.xgmi.node_id)
> +                     return !!top->nodes[i].is_sharing_enabled;
> +
> +     return false;
> +}
> +
>  /*
>   * Devices that support extended data require the entire hive to initialize with
>   * the shared memory buffer flag set.
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> index 41d5f97fc77a..8cc7ab38db7c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> @@ -66,6 +66,8 @@ int amdgpu_xgmi_get_hops_count(struct amdgpu_device *adev,
>               struct amdgpu_device *peer_adev);
>  int amdgpu_xgmi_get_num_links(struct amdgpu_device *adev,
>               struct amdgpu_device *peer_adev);
> +bool amdgpu_xgmi_get_is_sharing_enabled(struct amdgpu_device *adev,
> +                                     struct amdgpu_device *peer_adev);
>  uint64_t amdgpu_xgmi_get_relative_phy_addr(struct amdgpu_device *adev,
>                                          uint64_t addr);
>  static inline bool amdgpu_xgmi_same_hive(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> index 48caecf7e72e..723f1220e1cc 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> @@ -28,6 +28,7 @@
>  #include "kfd_topology.h"
>  #include "amdgpu.h"
>  #include "amdgpu_amdkfd.h"
> +#include "amdgpu_xgmi.h"
>
>  /* GPU Processor ID base for dGPUs for which VCRAT needs to be created.
>   * GPU processor ID are expressed with Bit[31]=1.
> @@ -2329,6 +2330,8 @@ static int kfd_create_vcrat_image_gpu(void *pcrat_image,
>                               continue;
>                       if (peer_dev->gpu->kfd->hive_id != kdev->kfd->hive_id)
>                               continue;
> +                     if (!amdgpu_xgmi_get_is_sharing_enabled(kdev->adev, peer_dev->gpu->adev))
> +                             continue;
>                       sub_type_hdr = (typeof(sub_type_hdr))(
>                               (char *)sub_type_hdr +
>                               sizeof(struct crat_subtype_iolink));
> --
> 2.34.1
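
For anyone who wants to reason about the behaviour outside the kernel, here is a minimal user-space sketch of the check the patch adds. Everything prefixed with mock_ is a hypothetical stand-in and not part of amdgpu: mock_device flattens the pieces of struct amdgpu_device the patch reads (the SR-IOV VF state, the xgmi node_id, and the PSP topology node list), and mock_count_io_links only approximates the peer loop in kfd_create_vcrat_image_gpu() by counting the links that would survive the new check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_MAX_NODES 8

/* Hypothetical stand-in for one entry of the PSP xgmi topology table. */
struct mock_node_info {
	uint64_t node_id;
	uint8_t is_sharing_enabled;
};

/* Hypothetical, flattened stand-in for the amdgpu_device fields used. */
struct mock_device {
	bool is_sriov_vf;	/* models amdgpu_sriov_vf(adev) */
	uint64_t node_id;	/* models adev->gmc.xgmi.node_id */
	uint32_t num_nodes;	/* models top_info.num_nodes */
	struct mock_node_info nodes[MOCK_MAX_NODES];
};

/*
 * Same decision order as the patch: bare metal always shares; under
 * SR-IOV the peer must be listed in the topology with sharing enabled;
 * a peer missing from the topology is treated as not shared.
 */
static bool mock_is_sharing_enabled(const struct mock_device *dev,
				    const struct mock_device *peer)
{
	uint32_t i;

	assert(dev != NULL && peer != NULL);

	if (!dev->is_sriov_vf)
		return true;

	for (i = 0; i < dev->num_nodes; ++i) {
		if (dev->nodes[i].node_id == peer->node_id)
			return !!dev->nodes[i].is_sharing_enabled;
	}

	return false;
}

/*
 * Count the xgmi IO links devs[self] would report to its peers.
 * Peers that fail the sharing check are skipped, which is what
 * "severing the IO link" amounts to in this model.
 */
static size_t mock_count_io_links(const struct mock_device *devs,
				  size_t num_devs, size_t self)
{
	size_t i, links = 0;

	for (i = 0; i < num_devs; ++i) {
		if (i == self)
			continue;
		if (!mock_is_sharing_enabled(&devs[self], &devs[i]))
			continue;
		links++;
	}

	return links;
}
```

In this model, a VF whose topology lists one peer with sharing enabled and one with sharing disabled reports a single link, and a peer the topology does not list at all is also skipped. The same device with is_sriov_vf cleared reports a link to every peer.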


