[PATCH] drm/amdkfd: increase max number of queues per process

Mon Mar 24 21:59:10 UTC 2025

On 2025-03-24 17:21, Alex Deucher wrote:
> On Mon, Mar 24, 2025 at 5:07 PM Eric Huang <jinhuieric.huang at amd.com> wrote:
>>
>> On 2025-03-24 15:32, Alex Deucher wrote:
>>> On Mon, Mar 24, 2025 at 1:26 PM Eric Huang <jinhuieric.huang at amd.com> wrote:
>>>> kfdtest KFDQMTest.OverSubscribeCpQueues in multi-GPU mode
>>>> fails on gfx v9.4.3+NPS4+CPX, which has 64 GPU nodes. The
>>>> test creates 65x64=4160 queues, but the limit of 1024 set by
>>>> KFD_MAX_NUM_OF_QUEUES_PER_PROCESS is not enough, so the
>>>> test fails in find_available_queue_slot(). Increasing the
>>>> limit makes the test pass.
>>>>
>>>> Signed-off-by: Eric Huang <jinhuieric.huang at amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>> index f6aedf69c644..054a78207ffe 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>> @@ -94,7 +94,7 @@
>>>>           ((typeof(ptr_to_struct)) kzalloc(sizeof(*ptr_to_struct), GFP_KERNEL))
>>>>
>>>>    #define KFD_MAX_NUM_OF_PROCESSES 512
>>>> -#define KFD_MAX_NUM_OF_QUEUES_PER_PROCESS 1024
>>>> +#define KFD_MAX_NUM_OF_QUEUES_PER_PROCESS 4160
>>> Doesn't this limit have more to do with the number of doorbells you
>>> can fit into a 4K page?  If you only allocate 4K for doorbells how can
>>> you increase this?
>> The doorbell area is allocated dynamically as multiple pages based on
>> KFD_MAX_NUM_OF_QUEUES_PER_PROCESS in KFD. Currently, with the macro set
>> to 1024, 2 pages are allocated; after changing it to 4160, 9 pages will
>> be allocated. Please refer to the function kfd_allocate_process_doorbells().
> Thanks for the details.  Since most apps don't use that many, it seems
> like a waste of doorbells.  Should this be limited to certain
> partition modes?

No, it is generic for all GPU nodes/partitions available per process. It
just raises the maximum number of queues per process, at the cost of more
doorbell memory, some of which goes unused by most applications.
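
For reference, the numbers quoted in this thread (4160 queues, 2 pages vs.
9 pages) can be reproduced with a small standalone sketch. It assumes
8-byte doorbells and 4 KiB pages, neither of which is stated above, and
the helper names are hypothetical; this is not the kernel code in
kfd_allocate_process_doorbells(), only the round-up arithmetic it is
described as doing.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for illustration only (not taken from the thread). */
#define SKETCH_PAGE_SIZE     4096u  /* bytes per page */
#define SKETCH_DOORBELL_SIZE 8u     /* bytes per doorbell */

/* Queues the test needs: one batch of queues per GPU node. */
static size_t queues_needed(size_t gpu_nodes, size_t queues_per_node)
{
    return gpu_nodes * queues_per_node;
}

/* Pages needed to hold one doorbell per queue, rounded up to whole pages. */
static size_t doorbell_pages(size_t max_queues)
{
    size_t bytes = max_queues * SKETCH_DOORBELL_SIZE;

    /* Guard against multiplication overflow for very large inputs. */
    assert(bytes / SKETCH_DOORBELL_SIZE == max_queues);

    return (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

Under those assumptions, queues_needed(64, 65) gives 4160,
doorbell_pages(1024) gives 2 and doorbell_pages(4160) gives 9, matching
the page counts mentioned above.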

Thanks,
Eric

>
> Alex
>
>> Thanks,
>> Eric
>>
>>> Alex
>>>
>>>>    /*
>>>>     * Size of the per-process TBA+TMA buffer: 2 pages
>>>> --
>>>> 2.34.1
>>>>