[PATCH v3] drm/amdgpu: skip xcp drm device allocation when out of drm resource
Lazar, Lijo
lijo.lazar at amd.com
Sat Aug 12 17:00:20 UTC 2023
On 8/12/2023 6:14 PM, James Zhu wrote:
>
> On 2023-08-11 21:39, Lazar, Lijo wrote:
>>
>> [AMD Official Use Only - General]
>>
>>
>> A dynamic partition switch could happen later. The switch could still
>> be successful in terms of hardware,
> [JZ] Only ignore render node assignment, and remove visibility in user
> space, xcp continues to be generated as usual. so switch should work as
> usual
Switch is not useful for the user unless the apps can make use of the
render nodes. A 'success' from hardware perspective doesn't turn out to
be a 'success' for users eventually to make use of the extra partition.
>> and hence gives a false feeling of success even if there are no render
>> nodes available for any app to make use of the partition.
> [JZ] from driver prospective, the switch is real success, treat the last
> one harvested in user space.. there is warning in kernel log, and final
> solution for more than 64 nodes is on-going
The render nodes are allocated during driver load and the message will
go unnoticed. We could still allow the switch, but the message should be
there during a partition switch like 'only x/y (x out of y nodes) are
usable'. The worst case is - only 1 out of N meaning no benefit - and in
that case user may switch back to normal mode to make use of full
compute power.
>>
>> Also, a kfd node is not expected to have a valid xcp pointer on
>> devices without partition.
> [JZ] won't affect xcp pointer, only ddev.
>> This access could break then gpu->xcp->ddev.
> [JZ] added skip when ddev==NULL
What I meant is xcp in kfd node could be NULL on SOCs like NV series.
There should be a check for xcp before accessing ddev -
https://elixir.bootlin.com/linux/v6.5-rc5/source/drivers/gpu/drm/amd/amdkfd/kfd_device.c#L794
Thanks,
Lijo
>>
>> Thanks,
>> Lijo
>> ------------------------------------------------------------------------
>> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
>> James Zhu <James.Zhu at amd.com>
>> *Sent:* Saturday, August 12, 2023 2:36:27 AM
>> *To:* amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
>> *Cc:* Lin, Amber <Amber.Lin at amd.com>; Zhu, James <James.Zhu at amd.com>;
>> Kasiviswanathan, Harish <Harish.Kasiviswanathan at amd.com>; Koenig,
>> Christian <Christian.Koenig at amd.com>
>> *Subject:* [PATCH v3] drm/amdgpu: skip xcp drm device allocation when
>> out of drm resource
>> Return 0 when drm device alloc failed with -ENOSPC in
>> order to allow amdgpu drive loading. But the xcp without
>> drm device node assigned won't be visiable in user space.
>> This helps amdgpu driver loading on system which has more
>> than 64 nodes, the current limitation.
>>
>> The proposal to add more drm nodes is discussed in public,
>> which will support up to 2^20 nodes totally.
>> kernel drm:
>> https://lore.kernel.org/lkml/20230724211428.3831636-1-michal.winiarski@intel.com/T/
>> libdrm:
>> https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/305
>>
>> Signed-off-by: James Zhu <James.Zhu at amd.com>
>> Acked-by: Christian König <christian.koenig at amd.com>
>>
>> -v2: added warning message
>> -v3: use dev_warn
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 13 ++++++++++++-
>> drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 10 +++++++++-
>> 2 files changed, 21 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> index 9c9cca129498..565a1fa436d4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> @@ -239,8 +239,13 @@ static int amdgpu_xcp_dev_alloc(struct
>> amdgpu_device *adev)
>>
>> for (i = 1; i < MAX_XCP; i++) {
>> ret = amdgpu_xcp_drm_dev_alloc(&p_ddev);
>> - if (ret)
>> + if (ret == -ENOSPC) {
>> + dev_warn(adev->dev,
>> + "Skip xcp node #%d when out of drm node
>> resource.", i);
>> + return 0;
>> + } else if (ret) {
>> return ret;
>> + }
>>
>> /* Redirect all IOCTLs to the primary device */
>> adev->xcp_mgr->xcp[i].rdev = p_ddev->render->dev;
>> @@ -328,6 +333,9 @@ int amdgpu_xcp_dev_register(struct amdgpu_device
>> *adev,
>> return 0;
>>
>> for (i = 1; i < MAX_XCP; i++) {
>> + if (!adev->xcp_mgr->xcp[i].ddev)
>> + break;
>> +
>> ret = drm_dev_register(adev->xcp_mgr->xcp[i].ddev,
>> ent->driver_data);
>> if (ret)
>> return ret;
>> @@ -345,6 +353,9 @@ void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
>> return;
>>
>> for (i = 1; i < MAX_XCP; i++) {
>> + if (!adev->xcp_mgr->xcp[i].ddev)
>> + break;
>> +
>> p_ddev = adev->xcp_mgr->xcp[i].ddev;
>> drm_dev_unplug(p_ddev);
>> p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>> index 3b0749390388..310df98ba46a 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>> @@ -1969,8 +1969,16 @@ int kfd_topology_add_device(struct kfd_node *gpu)
>> int i;
>> const char *asic_name = amdgpu_asic_name[gpu->adev->asic_type];
>>
>> +
>> gpu_id = kfd_generate_gpu_id(gpu);
>> - pr_debug("Adding new GPU (ID: 0x%x) to topology\n", gpu_id);
>> + if (!gpu->xcp->ddev) {
>> + dev_warn(gpu->adev->dev,
>> + "Won't add GPU (ID: 0x%x) to topology since it has no
>> drm node assigned.",
>> + gpu_id);
>> + return 0;
>> + } else {
>> + pr_debug("Adding new GPU (ID: 0x%x) to topology\n",
>> gpu_id);
>> + }
>>
>> /* Check to see if this gpu device exists in the
>> topology_device_list.
>> * If so, assign the gpu to that device,
>> --
>> 2.34.1
>>
More information about the amd-gfx
mailing list