[PATCH v3] drm/amdgpu: skip xcp drm device allocation when out of drm resource

Sat Aug 12 17:00:20 UTC 2023

On 8/12/2023 6:14 PM, James Zhu wrote:
> 
> On 2023-08-11 21:39, Lazar, Lijo wrote:
>>
>> A dynamic partition switch could happen later.  The switch could still 
>> be successful in terms of hardware,
> [JZ] This only skips render node assignment and removes visibility in
> user space; the xcp continues to be generated as usual, so the switch
> should work as usual.

The switch is not useful to the user unless apps can make use of the
render nodes. A 'success' from the hardware perspective is not a
'success' for users if they cannot make use of the extra partition.

>> and hence gives a false feeling of success even if there are no render 
>> nodes available for any app to make use of the partition.
> [JZ] From the driver perspective, the switch is a real success; treat the
> last one as harvested in user space. There is a warning in the kernel log,
> and the final solution for more than 64 nodes is ongoing.

The render nodes are allocated during driver load, so the message will
go unnoticed. We could still allow the switch, but a message such as
'only x/y (x out of y nodes) are usable' should be printed during a
partition switch. The worst case is only 1 out of N, meaning no benefit;
in that case the user may switch back to normal mode to make use of the
full compute power.
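
As a rough illustration of the kind of message meant here, the sketch
below counts the partitions that have a drm device assigned and formats
an 'x/y usable' warning. It is not taken from the patch: struct xcp,
xcp_count_usable() and xcp_format_switch_warning() are hypothetical
stand-ins written as plain user-space C, and the MAX_XCP value is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_XCP 8 /* assumed value; the patch only uses it as a loop bound */

/* Hypothetical, simplified stand-ins for the driver structures. */
struct drm_device { int unused; };
struct xcp { struct drm_device *ddev; };

/* Count the partitions that have a drm device (and so a render node). */
static int xcp_count_usable(const struct xcp *xcp, int num_xcps)
{
	int i, usable = 0;

	assert(num_xcps >= 1 && num_xcps <= MAX_XCP);
	for (i = 0; i < num_xcps; i++)
		if (xcp[i].ddev)
			usable++;
	return usable;
}

/*
 * Write a warning into buf when some partitions are not usable.
 * Returns 0 (and an empty string) when every partition has a node,
 * otherwise the snprintf() result.
 */
static int xcp_format_switch_warning(char *buf, size_t len,
				     const struct xcp *xcp, int num_xcps)
{
	int usable = xcp_count_usable(xcp, num_xcps);

	if (usable == num_xcps) {
		if (len)
			buf[0] = '\0';
		return 0;
	}
	return snprintf(buf, len,
			"only %d/%d partitions have a render node and are usable",
			usable, num_xcps);
}
```

In a driver, the equivalent text would presumably go through dev_warn()
at partition switch time rather than into a buffer.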

>>
>> Also, a kfd node is not expected to have a valid xcp pointer on 
>> devices without partition.
> [JZ] won't affect xcp pointer, only ddev.
>> This access could break then gpu->xcp->ddev.
> [JZ] added skip when ddev==NULL

What I meant is that xcp in a kfd node could be NULL on SOCs like the NV
series. There should be a check for xcp before accessing ddev -
https://elixir.bootlin.com/linux/v6.5-rc5/source/drivers/gpu/drm/amd/amdkfd/kfd_device.c#L794
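
A minimal sketch of such a check, under the assumption that a NULL xcp
simply means "no partitioning" and the node should be added as before.
The structures are simplified user-space stand-ins, and the helper name
kfd_node_lacks_drm_node() is hypothetical, not an existing kfd function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; the real structures carry many more fields. */
struct drm_device { int unused; };
struct amdgpu_xcp { struct drm_device *ddev; };
struct kfd_node { struct amdgpu_xcp *xcp; };

/*
 * True only when the node belongs to a partition that has no drm device.
 * A node without a partition (xcp == NULL) is never reported as lacking
 * one, so xcp->ddev is not dereferenced in that case.
 */
static bool kfd_node_lacks_drm_node(const struct kfd_node *gpu)
{
	assert(gpu != NULL);
	return gpu->xcp && !gpu->xcp->ddev;
}
```

With a helper like this, the condition in kfd_topology_add_device()
would test the helper instead of reading gpu->xcp->ddev directly.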

Thanks,
Lijo

>>
>> Thanks,
>> Lijo
>> ------------------------------------------------------------------------
>> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of 
>> James Zhu <James.Zhu at amd.com>
>> *Sent:* Saturday, August 12, 2023 2:36:27 AM
>> *To:* amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
>> *Cc:* Lin, Amber <Amber.Lin at amd.com>; Zhu, James <James.Zhu at amd.com>; 
>> Kasiviswanathan, Harish <Harish.Kasiviswanathan at amd.com>; Koenig, 
>> Christian <Christian.Koenig at amd.com>
>> *Subject:* [PATCH v3] drm/amdgpu: skip xcp drm device allocation when 
>> out of drm resource
>> Return 0 when drm device alloc failed with -ENOSPC in
>> order to allow amdgpu driver loading. But the xcp without
>> drm device node assigned won't be visible in user space.
>> This helps amdgpu driver loading on systems which have more
>> than 64 nodes, the current limitation.
>>
>> The proposal to add more drm nodes is being discussed in public,
>> and will support up to 2^20 nodes in total.
>> kernel drm:
>> https://lore.kernel.org/lkml/20230724211428.3831636-1-michal.winiarski@intel.com/T/
>> libdrm:
>> https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/305
>>
>> Signed-off-by: James Zhu <James.Zhu at amd.com>
>> Acked-by: Christian König <christian.koenig at amd.com>
>>
>> -v2: added warning message
>> -v3: use dev_warn
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c   | 13 ++++++++++++-
>>  drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 10 +++++++++-
>>  2 files changed, 21 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> index 9c9cca129498..565a1fa436d4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> @@ -239,8 +239,13 @@ static int amdgpu_xcp_dev_alloc(struct amdgpu_device *adev)
>>
>>          for (i = 1; i < MAX_XCP; i++) {
>>                  ret = amdgpu_xcp_drm_dev_alloc(&p_ddev);
>> -               if (ret)
>> +               if (ret == -ENOSPC) {
>> +                       dev_warn(adev->dev,
>> +                       "Skip xcp node #%d when out of drm node resource.", i);
>> +                       return 0;
>> +               } else if (ret) {
>>                          return ret;
>> +               }
>>
>>                  /* Redirect all IOCTLs to the primary device */
>>                  adev->xcp_mgr->xcp[i].rdev = p_ddev->render->dev;
>> @@ -328,6 +333,9 @@ int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
>>                  return 0;
>>
>>          for (i = 1; i < MAX_XCP; i++) {
>> +               if (!adev->xcp_mgr->xcp[i].ddev)
>> +                       break;
>> +
>>                  ret = drm_dev_register(adev->xcp_mgr->xcp[i].ddev, ent->driver_data);
>>                  if (ret)
>>                          return ret;
>> @@ -345,6 +353,9 @@ void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
>>                  return;
>>
>>          for (i = 1; i < MAX_XCP; i++) {
>> +               if (!adev->xcp_mgr->xcp[i].ddev)
>> +                       break;
>> +
>>                  p_ddev = adev->xcp_mgr->xcp[i].ddev;
>>                  drm_dev_unplug(p_ddev);
>>                  p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>> index 3b0749390388..310df98ba46a 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>> @@ -1969,8 +1969,16 @@ int kfd_topology_add_device(struct kfd_node *gpu)
>>          int i;
>>          const char *asic_name = amdgpu_asic_name[gpu->adev->asic_type];
>>
>> +
>>          gpu_id = kfd_generate_gpu_id(gpu);
>> -       pr_debug("Adding new GPU (ID: 0x%x) to topology\n", gpu_id);
>> +       if (!gpu->xcp->ddev) {
>> +               dev_warn(gpu->adev->dev,
>> +               "Won't add GPU (ID: 0x%x) to topology since it has no drm node assigned.",
>> +               gpu_id);
>> +               return 0;
>> +       } else {
>> +               pr_debug("Adding new GPU (ID: 0x%x) to topology\n", gpu_id);
>> +       }
>>
>>          /* Check to see if this gpu device exists in the topology_device_list.
>>           * If so, assign the gpu to that device,
>> -- 
>> 2.34.1
>>