[PATCH 1/2] drm/amdkfd: Fix GWS queue count

Felix Kuehling felix.kuehling at amd.com
Thu Apr 14 18:04:35 UTC 2022


Am 2022-04-14 um 13:56 schrieb Yat Sin, David:
>> -----Original Message-----
>> From: Kuehling, Felix <Felix.Kuehling at amd.com>
>> Sent: Thursday, April 14, 2022 11:42 AM
>> To: Yat Sin, David <David.YatSin at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH 1/2] drm/amdkfd: Fix GWS queue count
>>
>>
>> Am 2022-04-14 um 11:08 schrieb Yat Sin, David:
>>>> -----Original Message-----
>>>> From: Kuehling, Felix <Felix.Kuehling at amd.com>
>>>> Sent: Thursday, April 14, 2022 10:52 AM
>>>> To: Yat Sin, David <David.YatSin at amd.com>;
>>>> amd-gfx at lists.freedesktop.org
>>>> Subject: Re: [PATCH 1/2] drm/amdkfd: Fix GWS queue count
>>>>
>>>>
>>>> Am 2022-04-13 um 11:51 schrieb David Yat Sin:
>>>>> Queue can be inactive during process termination. This would cause
>>>>> dqm->gws_queue_count to not be decremented. There can only be 1
>> GWS
>>>>> queue per device process so moving the logic out of loop.
>>>>>
>>>>> Signed-off-by: David Yat Sin <david.yatsin at amd.com>
>>>>> ---
>>>>>     .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c    | 12 ++++++-
>> ----
>>>> -
>>>>>     1 file changed, 6 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>> index acf4f7975850..7c107b88d944 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>> @@ -1945,17 +1945,17 @@ static int process_termination_cpsch(struct
>>>> device_queue_manager *dqm,
>>>>>     		else if (q->properties.type ==
>>>> KFD_QUEUE_TYPE_SDMA_XGMI)
>>>>>     			deallocate_sdma_queue(dqm, q);
>>>>>
>>>>> -		if (q->properties.is_active) {
>>>>> +		if (q->properties.is_active)
>>>>>     			decrement_queue_count(dqm, q->properties.type);
>>>>> -			if (q->properties.is_gws) {
>>>>> -				dqm->gws_queue_count--;
>>>>> -				qpd->mapped_gws_queue = false;
>>>>> -			}
>>>>> -		}
>>>> This looks like the original intention was to decrement the counter
>>>> for inactive GWS queues. This makes sense because this counter is
>>>> used to determine whether the runlist is oversubscribed. Inactive
>>>> queues are not in the runlist, so they should not be counted.
>>>>
>>>> I see that the counter is updated when queues are activated and
>>>> deactivated in update_queue. So IMO this patch is both incorrect and
>>>> unnecessary. Did you see an actual problem with the GWS counter during
>> process termination?
>>>> If so, I'd like to know more to understand the root cause.
>>>>
>>>> Regards,
>>>>      Felix
>>> Yes, when using a unit test (for example KFDGWSTest.Semaphore), 1. Put
>>> a sleep in the middle of the application (after calling
>>> hsaKmtAllocQueueGWS) 2. Run application and kill the application which it
>> is in sleep (application never calls queue.Destroy()).
>>> Then inside kernel dqm->gws_queue_count keeps incrementing each time
>> the experiment is repeated and never goes back to 0. This seems to be
>> because the sleep causes the process to be evicted and q-
>>> properties.is_active is false so code does not enter that if statement.
>> That's weird. Can you find out why it's not getting there? In the test you
>> describe, I would expect the queue to be active, so the counter should be
>> decremented by the existing code.
>>
>> Does the test evict the queues for some reason? is_active gets set to false
>> when a queue is evicted.
> Yes, the queue is evicted because of the sleep() call in userspace.

Sleep() shouldn't cause queue evictions. Evictions are usually temporary 
due to memory manager events (memory evictions or MMU notifiers). More 
permanent evictions can happen after VM faults. Does the test cause a VM 
fault before the sleep?


>
>> Looks like we're missing code to update the
>> gws_queue_count in evict/restore_process_queues_cpsch (it is present in
>> evict/restore_process_queues_nocpsch).
> I think this is the actual problem and the increment/decrement should be done there. I did not realize the dqm->gws_queue_count only counts not-evicted queues. I will post new patch with this change instead.

Thanks,
   Felix


>
>> Maybe the management of this counter should be moved into
>> increment/decrement_queue_count, so we don't need to duplicate it in
>> many places.
> ACK
>
> Regards,
> David
>
>> Regards,
>>     Felix
>>
>>
>>> Regards,
>>> David
>>>
>>>>>     		dqm->total_queue_count--;
>>>>>     	}
>>>>>
>>>>> +	if (qpd->mapped_gws_queue) {
>>>>> +		dqm->gws_queue_count--;
>>>>> +		qpd->mapped_gws_queue = false;
>>>>> +	}
>>>>> +
>>>>>     	/* Unregister process */
>>>>>     	list_for_each_entry_safe(cur, next_dpn, &dqm->queues, list) {
>>>>>     		if (qpd == cur->qpd) {


More information about the amd-gfx mailing list