[PATCH 1/2] drm/amdgpu: optimize VRAM allocation when using drm buddy

Wed Feb 15 22:18:35 UTC 2023

Am 2023-02-15 um 05:44 schrieb Christian König:
> Am 15.02.23 um 03:51 schrieb Xiao, Shane:
>> For public review
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>> Sent: Wednesday, February 15, 2023 3:02 AM
>>> To: Xiao, Shane <shane.xiao at amd.com>; Paneer Selvam, Arunpravin
>>> <Arunpravin.PaneerSelvam at amd.com>
>>> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation when 
>>> using
>>> drm buddy
>>>
>>> Am 14.02.23 um 15:53 schrieb Xiao, Shane:
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>> Sent: Tuesday, February 14, 2023 8:41 PM
>>>>> To: Xiao, Shane <shane.xiao at amd.com>; brahma_sw_dev
>>>>> <brahma_sw_dev at amd.com>
>>>>> Cc: Paneer Selvam, Arunpravin <Arunpravin.PaneerSelvam at amd.com>
>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation when
>>>>> using drm buddy
>>>>>
>>>>> Am 14.02.23 um 12:18 schrieb Shane Xiao:
>>>>>> Since the VRAM manager changed from drm mm to drm buddy. It's not
>>>>>> necessary to allocate 2MB aligned VRAM for more than 2MB unaligned
>>>>>> size, and then do trim. This method improves the allocation
>>>>>> efficiency and reduces memory fragmentation.
>>>>> Well that is a trade off.
>>>>>
>>>>> Allocating the BO as one contiguous chunk and then trimming is
>>>>> beneficial because if we then later need it contiguous we don't need
>>>>> to re-allocate and copy. This can be needed to display something for
>>> example.
>> Hi Christian,
>>
>> This case means that you allocate BO that is unnecessary to be 
>> continuous at first time,
>> and latter the BO should be continuous. I'm not familiar with 
>> display. Could you give me
>> a few more specific examples ?
>
> On most generations DCE/DCN hardware needs the buffer contiguous to be 
> able to scanout from it.
>
> Only newer APUs can use S/G to scanout from system memory pages.
>
>>>> Yes, I agree that one contiguous chunk may get beneficial sometimes.
>>>> But as far as I know, you cannot guarantee that amdgpu_vram_mgr_new
>>> can get one contiguous chunk  if you don't set TTM_PL_FLAG_CONTIGUOUS
>>> flags.
>>>> For example, if you want to allocate 4M+4K BO, it will allocate one 
>>>> 4M block
>>> + one 2M block which is unnecessary to be continuous, then 2M block 
>>> will be
>>> trimmed.
>>>
>>> Oh, that's indeed not something which should happen. Sounds more like a
>>> bug fix then.
>> Yes, I think this case should not be happened.
>> Actually, I'm not sure that why the allocated BO should be aligned 
>> with pages_per_block, which is set to 2MB by default.
>> Does this help improve performance when allocating 2M or above BO?
>>  From my point of view, the TLB may be one of reason of this. But I'm 
>> not sure about this.
>
> Yes, we try to use allocations which are as contiguous as much as 
> possible for better TLB usage.
>
> Especially for some compute use cases this can make a >20% performance 
> difference.
We actually found that >2MB virtual address alignment was hurting 
performance due to cache line aliasing. So we can't take advantage of 
 >2MB pages in our page tables.

Regards,
   Felix


>
> Regards,
> Christian.
>
>>
>> Best Regards,
>> Shane
>>
>>>>> On the other hand I completely agree allocating big and then trimming
>>>>> creates more fragmentation than necessary.
>>>>>
>>>>> Do you have some test case which can show the difference?
>>>> I have use rocrtst to show the difference.
>>>> The attachment is shown that after applying this patch, the order < 
>>>> 9 total
>>> vram size decrease from 99MB to 43MB.
>>>> And the latter has more higher order block memory.
>>> Arun can you take a look? That problem here sounds important.
>>>
>>> Thanks,
>>> Christian.
>>>
>>>>> BTW: No need to discuss that on the internal mailing list, please use
>>>>> the public one instead.
>>>>>
>>>> I will send it to public. Thank you for your remind.
>>>>
>>>> Best Regards,
>>>> Shane
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Signed-off-by: Shane Xiao <shane.xiao at amd.com>
>>>>>> ---
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 2 +-
>>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>>>> index 75c80c557b6e..3fea58f9427c 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>>>> @@ -453,7 +453,7 @@ static int amdgpu_vram_mgr_new(struct
>>>>> ttm_resource_manager *man,
>>>>>>             /* Limit maximum size to 2GiB due to SG table 
>>>>>> limitations */
>>>>>>             size = min(remaining_size, 2ULL << 30);
>>>>>>
>>>>>> -        if (size >= (u64)pages_per_block << PAGE_SHIFT)
>>>>>> +        if (!(size % ((u64)pages_per_block << PAGE_SHIFT)))
>>>>>>                 min_block_size = (u64)pages_per_block <<
>>>>> PAGE_SHIFT;
>>>>>>             cur_size = size;
>