[PATCH] drm/amdgpu: Fix KFD oversubscription by tracking queues correctly

Felix Kuehling felix.kuehling at amd.com
Thu Jul 13 19:35:07 UTC 2017


On 17-07-13 03:15 PM, Jay Cornwall wrote:
> On Thu, Jul 13, 2017, at 13:36, Andres Rodriguez wrote:
>> On 2017-07-12 02:26 PM, Jay Cornwall wrote:
>>> The number of compute queues available to the KFD was erroneously
>>> calculated as 64. Only the first MEC can execute compute queues and
>>> it has 32 queue slots.
>>>
>>> This caused the oversubscription limit to be calculated incorrectly,
>>> leading to a missing chained runlist command at the end of an
>>> oversubscribed runlist.
>>>
>>> Change-Id: Ic4a139c04b8a6d025fbb831a0a67e98728bfe461
>>> Signed-off-by: Jay Cornwall <Jay.Cornwall at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> index 7060daf..aa4006a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> @@ -140,7 +140,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
>>>   		/* According to linux/bitmap.h we shouldn't use bitmap_clear if
>>>   		 * nbits is not compile time constant
>>>   		 */
>>> -		last_valid_bit = adev->gfx.mec.num_mec
>>> +		last_valid_bit = 1 /* only first MEC can have compute queues */
>> Hey Jay,
>>
>> Minor nitpick. We already have some similar resource patching in 
>> kgd2kfd_device_init(), and I think it would be good to keep all of these 
>> together.
> OK. I see shared_resources.num_mec is set to 1 in kgd2kfd_device_init.
> That's not very clear (the number of MECs doesn't change) and num_mec
> doesn't appear to be used anywhere except in dead code in kfd_device.c.
> That code also runs after the queue bitmap setup.
>
> How about I remove that field entirely?
Yeah, that's fine with me.
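
For reference, the arithmetic in the patch description can be sketched as
below. This is a standalone illustration, not the driver code: the
example_* names and constants are hypothetical, and the two-MEC count is
an assumption inferred from the 64 vs. 32 totals quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants for illustration only. */
#define EXAMPLE_NUM_MEC        2u  /* assumed: 64 total / 32 per MEC */
#define EXAMPLE_SLOTS_PER_MEC 32u  /* from the thread: first MEC has 32 slots */

/* Queue slots counted when `mecs_used` MECs are included in the total. */
static unsigned int example_queue_slots(unsigned int mecs_used)
{
	assert(mecs_used <= EXAMPLE_NUM_MEC);
	return mecs_used * EXAMPLE_SLOTS_PER_MEC;
}

/* 64-bit mask with the lowest `valid_bits` bits set, i.e. the queue
 * slots that would be treated as usable compute queues. */
static uint64_t example_queue_mask(unsigned int valid_bits)
{
	assert(valid_bits <= 64u);
	if (valid_bits == 64u)
		return UINT64_MAX; /* avoid an undefined 64-bit shift */
	return (UINT64_C(1) << valid_bits) - 1u;
}
```

Counting both MECs gives example_queue_slots(2), the erroneous 64;
counting only the first gives example_queue_slots(1), the 32 slots that
can actually run compute queues.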