[PATCH 1/2] drm/amdkfd: Fix GWS queue count

Felix Kuehling felix.kuehling at amd.com
Thu Apr 14 15:42:21 UTC 2022


Am 2022-04-14 um 11:08 schrieb Yat Sin, David:
>> -----Original Message-----
>> From: Kuehling, Felix <Felix.Kuehling at amd.com>
>> Sent: Thursday, April 14, 2022 10:52 AM
>> To: Yat Sin, David <David.YatSin at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH 1/2] drm/amdkfd: Fix GWS queue count
>>
>>
>> Am 2022-04-13 um 11:51 schrieb David Yat Sin:
>>> Queue can be inactive during process termination. This would cause
>>> dqm->gws_queue_count to not be decremented. There can only be 1 GWS
>>> queue per device process so moving the logic out of loop.
>>>
>>> Signed-off-by: David Yat Sin <david.yatsin at amd.com>
>>> ---
>>>    .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c    | 12 ++++++-----
>> -
>>>    1 file changed, 6 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>> index acf4f7975850..7c107b88d944 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>> @@ -1945,17 +1945,17 @@ static int process_termination_cpsch(struct
>> device_queue_manager *dqm,
>>>    		else if (q->properties.type ==
>> KFD_QUEUE_TYPE_SDMA_XGMI)
>>>    			deallocate_sdma_queue(dqm, q);
>>>
>>> -		if (q->properties.is_active) {
>>> +		if (q->properties.is_active)
>>>    			decrement_queue_count(dqm, q->properties.type);
>>> -			if (q->properties.is_gws) {
>>> -				dqm->gws_queue_count--;
>>> -				qpd->mapped_gws_queue = false;
>>> -			}
>>> -		}
>> This looks like the original intention was to decrement the counter for inactive
>> GWS queues. This makes sense because this counter is used to determine
>> whether the runlist is oversubscribed. Inactive queues are not in the runlist,
>> so they should not be counted.
>>
>> I see that the counter is updated when queues are activated and deactivated
>> in update_queue. So IMO this patch is both incorrect and unnecessary. Did
>> you see an actual problem with the GWS counter during process termination?
>> If so, I'd like to know more to understand the root cause.
>>
>> Regards,
>>     Felix
> Yes, when using a unit test (for example KFDGWSTest.Semaphore),
> 1. Put a sleep in the middle of the application (after calling hsaKmtAllocQueueGWS)
> 2. Run application and kill the application which it is in sleep (application never calls queue.Destroy()).
>
> Then inside kernel dqm->gws_queue_count keeps incrementing each time the experiment is repeated and never goes back to 0. This seems to be because the sleep causes the process to be evicted and q->properties.is_active is false so code does not enter that if statement.

That's weird. Can you find out why it's not getting there? In the test 
you describe, I would expect the queue to be active, so the counter 
should be decremented by the existing code.

Does the test evict the queues for some reason? is_active gets set to 
false when a queue is evicted. Looks like we're missing code to update 
the gws_queue_count in evict/restore_process_queues_cpsch (it is present 
in evict/restore_process_queues_nocpsch).

Maybe the management of this counter should be moved into 
increment/decrement_queue_count, so we don't need to duplicate it in 
many places.

Regards,
   Felix


>
> Regards,
> David
>
>>>    		dqm->total_queue_count--;
>>>    	}
>>>
>>> +	if (qpd->mapped_gws_queue) {
>>> +		dqm->gws_queue_count--;
>>> +		qpd->mapped_gws_queue = false;
>>> +	}
>>> +
>>>    	/* Unregister process */
>>>    	list_for_each_entry_safe(cur, next_dpn, &dqm->queues, list) {
>>>    		if (qpd == cur->qpd) {


More information about the amd-gfx mailing list