[PATCH] drm/amdkfd: Ensure gpu_id is unique
Felix Kuehling
felix.kuehling at amd.com
Mon May 6 22:14:12 UTC 2024
On 2024-05-06 17:10, Harish Kasiviswanathan wrote:
> On 2024-05-06 16:30, Felix Kuehling wrote:
>> On 2024-05-03 18:06, Harish Kasiviswanathan wrote:
>>> gpu_id needs to be unique for user space to identify GPUs via KFD
>>> interface. In the current implementation there is a very small
>>> probability of having non unique gpu_ids.
>>>
>>> v2: Add check to confirm if gpu_id is unique. If not unique, find one
>>> Changed commit header to reflect the above
>>>
>>> Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 26 ++++++++++++++++++++++-
>>> 1 file changed, 25 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>>> index b93913934b03..01d4c2e10c6d 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>>> @@ -1095,6 +1095,8 @@ static uint32_t kfd_generate_gpu_id(struct kfd_node *gpu)
>>> uint32_t hashout;
>>> uint32_t buf[8];
>>> uint64_t local_mem_size;
>>> + struct kfd_topology_device *dev;
>>> + bool is_unique;
>>> int i;
>>> if (!gpu)
>>> @@ -1115,6 +1117,28 @@ static uint32_t kfd_generate_gpu_id(struct kfd_node *gpu)
>>> for (i = 0, hashout = 0; i < 8; i++)
>>> hashout ^= hash_32(buf[i], KFD_GPU_ID_HASH_WIDTH);
>>> + /* hash generated could be non-unique. Check if it is unique.
>>> + * If not unique increment till unique one is found. In case
>>> + * of overflow, restart from 1
>>> + */
>>> + down_read(&topology_lock);
>>> + do {
>>> + is_unique = true;
>>> + list_for_each_entry(dev, &topology_device_list, list) {
>>> + if (dev->gpu && dev->gpu_id == hashout) {
>>> + is_unique = false;
>>> + break;
>>> + }
>>> + }
>>> + if (unlikely(!is_unique)) {
>>> + hashout = (hashout + 1) &
>>> + ((1 << KFD_GPU_ID_HASH_WIDTH) - 1);
>>> + if (!hashout)
>>> + hashout = 1;
>> This doesn't catch the case that hashout was 0 before incrementing it, and was found to be unique.
> I didn't actively think about this case when I sent the patch out. However, we don't have gpu_id to be 0. There are places where gpu_id=0 means it is CPU node
I think we make that assumption in a few places, both in kernel mode and
user mode, e.g.:
struct kfd_process_device *kfd_process_device_data_by_id(struct kfd_process *p, uint32_t gpu_id)
{
int i;
if (gpu_id) {
for (i = 0; i < p->n_pdds; i++) {
struct kfd_process_device *pdd = p->pdds[i];
if (pdd->user_gpu_id == gpu_id)
return pdd;
}
}
return NULL;
}
Or in the Thunk in hsaKmtGetNodeProperties:
/* For CPU only node don't add any additional GPU memory banks. */
if (gpu_id) {
uint64_t base, limit;
if (is_dgpu)
NodeProperties->NumMemoryBanks += NUM_OF_DGPU_HEAPS;
else
NodeProperties->NumMemoryBanks += NUM_OF_IGPU_HEAPS;
if (fmm_get_aperture_base_and_limit(FMM_MMIO, gpu_id, &base,
&limit) == HSAKMT_STATUS_SUCCESS)
NodeProperties->NumMemoryBanks += 1;
}
Regards,
Felix
>
>> Regards,
>> Felix
>>
>>
>>> + }
>>> + } while (!is_unique);
>>> + up_read(&topology_lock);
>>> +
>>> return hashout;
>>> }
>>> /* kfd_assign_gpu - Attach @gpu to the correct kfd topology device. If
>>> @@ -1946,7 +1970,6 @@ int kfd_topology_add_device(struct kfd_node *gpu)
>>> struct amdgpu_gfx_config *gfx_info = &gpu->adev->gfx.config;
>>> struct amdgpu_cu_info *cu_info = &gpu->adev->gfx.cu_info;
>>> - gpu_id = kfd_generate_gpu_id(gpu);
>>> if (gpu->xcp && !gpu->xcp->ddev) {
>>> dev_warn(gpu->adev->dev,
>>> "Won't add GPU to topology since it has no drm node assigned.");
>>> @@ -1969,6 +1992,7 @@ int kfd_topology_add_device(struct kfd_node *gpu)
>>> if (res)
>>> return res;
>>> + gpu_id = kfd_generate_gpu_id(gpu);
>>> dev->gpu_id = gpu_id;
>>> gpu->id = gpu_id;
>>>
More information about the amd-gfx
mailing list