[PATCH 1/2] drm/amdgpu: optimize VRAM allocation when using drm buddy

Christian König christian.koenig at amd.com
Thu Feb 16 07:09:15 UTC 2023


On 16.02.23 07:48, Xiao, Shane wrote:
>> -----Original Message-----
>> From: Kuehling, Felix <Felix.Kuehling at amd.com>
>> Sent: Thursday, February 16, 2023 6:19 AM
>> To: Christian König <ckoenig.leichtzumerken at gmail.com>; Xiao, Shane
>> <shane.xiao at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>;
>> Paneer Selvam, Arunpravin <Arunpravin.PaneerSelvam at amd.com>
>> Cc: amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation when using
>> drm buddy
>>
>>
>> On 2023-02-15 05:44, Christian König wrote:
>>> On 15.02.23 03:51, Xiao, Shane wrote:
>>>> For public review
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>> Sent: Wednesday, February 15, 2023 3:02 AM
>>>>> To: Xiao, Shane <shane.xiao at amd.com>; Paneer Selvam, Arunpravin
>>>>> <Arunpravin.PaneerSelvam at amd.com>
>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation when
>>>>> using drm buddy
>>>>>
>>>>> On 14.02.23 15:53, Xiao, Shane wrote:
>>>>>>> -----Original Message-----
>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>> Sent: Tuesday, February 14, 2023 8:41 PM
>>>>>>> To: Xiao, Shane <shane.xiao at amd.com>; brahma_sw_dev
>>>>>>> <brahma_sw_dev at amd.com>
>>>>>>> Cc: Paneer Selvam, Arunpravin
>> <Arunpravin.PaneerSelvam at amd.com>
>>>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation
>> when
>>>>>>> using drm buddy
>>>>>>>
>>>>>>> On 14.02.23 12:18, Shane Xiao wrote:
>>>>>>>> Since the VRAM manager changed from drm mm to drm buddy, it's
>>>>>>>> no longer necessary to allocate 2MB-aligned VRAM for sizes
>>>>>>>> larger than 2MB that are not 2MB-aligned and then trim. This
>>>>>>>> improves allocation efficiency and reduces memory fragmentation.
>>>>>>> Well that is a trade off.
>>>>>>>
>>>>>>> Allocating the BO as one contiguous chunk and then trimming is
>>>>>>> beneficial because if we later need it contiguous we don't
>>>>>>> need to re-allocate and copy. This can be needed, for example,
>>>>>>> to display something.
>>>> Hi Christian,
>>>>
>>>> This case means that you allocate a BO that does not need to be
>>>> contiguous at first, but later the BO must be contiguous. I'm
>>>> not familiar with display. Could you give me a few more specific
>>>> examples?
>>> On most generations, DCE/DCN hardware needs the buffer to be
>>> contiguous in order to scan out from it.
>>>
>>> Only newer APUs can use S/G to scanout from system memory pages.
>>>
>>>>>> Yes, I agree that one contiguous chunk may be beneficial sometimes.
>>>>>> But as far as I know, you cannot guarantee that amdgpu_vram_mgr_new
>>>>>> can get one contiguous chunk if you don't set the
>>>>>> TTM_PL_FLAG_CONTIGUOUS flag.
>>>>>> For example, if you want to allocate a 4M+4K BO, it will allocate
>>>>>> one 4M block + one 2M block which need not be contiguous, and then
>>>>>> the 2M block will be trimmed.
>>>>>
>>>>> Oh, that's indeed not something which should happen. Sounds more
>>>>> like a bug fix then.
>>>> Yes, I think this case should not happen.
>>>> Actually, I'm not sure why the allocated BO should be aligned
>>>> with pages_per_block, which is set to 2MB by default.
>>>> Does this help improve performance when allocating a 2M or larger BO?
>>>> From my point of view, the TLB may be one reason for this. But I'm
>>>> not sure about this.
>>> Yes, we try to use allocations which are as contiguous as possible
>>> for better TLB usage.
>>>
>>> Especially for some compute use cases this can make a >20% performance
>>> difference.
>> We actually found that >2MB virtual address alignment was hurting
>> performance due to cache line aliasing. So we can't take advantage of  >2MB
>> pages in our page tables.
>>
>> Regards,
>>     Felix
> Yes, if we want to take advantage of 2M TLB usage, we should keep the virtual address aligned.
>
> As for the cache line aliasing issue you mentioned, I'm confused about it.
> If a 2MB-aligned VA gets the right PA from the TLB or page table and the cache line addressing mode is not changed,
> the cache line aliasing issue should not happen here.
> Is there something wrong with my understanding? Or maybe there is some background that I don't know.

The problem is with virtual address alignments > 2MiB (or whatever the 
big cache line size is).

Let's assume an example where you have a lot of buffers, each 66MiB in 
size. When you align those to 2MiB in the virtual address space, you end 
up with

64MiB..2MiB..62MiB..4MiB..60MiB... etc...

in your address space. In this configuration, each 2MiB cache line is 
used equally.

But if you align the buffers to, say, the next power of two (128MiB), 
you end up like this:

64MiB..2MiB..62MiB hole..64MiB..2MiB..62MiB hole... etc....

In this case the first 2MiB cache line of each buffer is used twice as 
much as all the other cache lines. This can hurt performance very badly.
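To make the effect above concrete, here is a rough sketch of the set-usage math. All the parameters are illustrative assumptions for this example, not real cache geometry: a 2 MiB "big" cache line granularity, a cache that wraps around every 128 MiB (64 sets), and 64 buffers of 66 MiB each.

```python
MiB = 1 << 20
LINE = 2 * MiB            # assumed granularity of the "big" cache lines
NUM_SETS = 64             # hypothetical: the cache wraps around every 128 MiB
BUF_SIZE = 66 * MiB       # buffer size from the example above
NUM_BUFS = 64

def set_usage(alignment):
    """Place NUM_BUFS buffers back to back with the given alignment and
    count how many 2 MiB lines of buffer data land on each cache set."""
    usage = [0] * NUM_SETS
    addr = 0
    for _ in range(NUM_BUFS):
        addr = -(-addr // alignment) * alignment   # align start address up
        for off in range(0, BUF_SIZE, LINE):
            usage[((addr + off) // LINE) % NUM_SETS] += 1
        addr += BUF_SIZE
    return usage

tight = set_usage(2 * MiB)    # pack buffers at 2 MiB alignment
pow2 = set_usage(128 * MiB)   # align each buffer to the next power of two

print(max(tight), min(tight))   # 33 33 -> every set carries the same load
print(max(pow2), min(pow2))     # 64 0  -> some sets hot, others idle
```

With tight 2 MiB packing the 66 MiB stride walks through all 64 sets evenly; with 128 MiB alignment every buffer maps onto the same sets, so those sets see roughly double the traffic while the rest sit idle.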

Regards,
Christian.

> Best Regards,
> Shane
>   
>>
>>> Regards,
>>> Christian.
>>>
>>>> Best Regards,
>>>> Shane
>>>>
>>>>>>> On the other hand I completely agree allocating big and then
>>>>>>> trimming creates more fragmentation than necessary.
>>>>>>>
>>>>>>> Do you have some test case which can show the difference?
>>>>>> I have used rocrtst to show the difference.
>>>>>> The attachment shows that after applying this patch, the total
>>>>>> VRAM size in blocks of order < 9 decreases from 99MB to 43MB,
>>>>>> and there are more higher-order blocks.
>>>>> Arun, can you take a look? The problem here sounds important.
>>>>>
>>>>> Thanks,
>>>>> Christian.
>>>>>
>>>>>>> BTW: No need to discuss that on the internal mailing list, please
>>>>>>> use the public one instead.
>>>>>>>
>>>>>> I will send it to the public list. Thank you for the reminder.
>>>>>>
>>>>>> Best Regards,
>>>>>> Shane
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Signed-off-by: Shane Xiao <shane.xiao at amd.com>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 2 +-
>>>>>>>>      1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>>>>>> index 75c80c557b6e..3fea58f9427c 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>>>>>> @@ -453,7 +453,7 @@ static int amdgpu_vram_mgr_new(struct
>>>>>>> ttm_resource_manager *man,
>>>>>>>>              /* Limit maximum size to 2GiB due to SG table
>>>>>>>> limitations */
>>>>>>>>              size = min(remaining_size, 2ULL << 30);
>>>>>>>>
>>>>>>>> -        if (size >= (u64)pages_per_block << PAGE_SHIFT)
>>>>>>>> +        if (!(size % ((u64)pages_per_block << PAGE_SHIFT)))
>>>>>>>>                  min_block_size = (u64)pages_per_block <<
>>>>>>> PAGE_SHIFT;
>>>>>>>>              cur_size = size;
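To illustrate what the one-line change in the quoted hunk does, here is a small Python sketch of the two conditions. The function name, the 4 KiB fallback, and the constants are illustrative simplifications, not the actual driver code:

```python
PAGE_SHIFT = 12
pages_per_block = 512                        # 2 MiB worth of 4 KiB pages
block_bytes = pages_per_block << PAGE_SHIFT  # 2 MiB

def pick_min_block_size(size, patched):
    """Mirror the condition from the hunk above (simplified sketch)."""
    if patched:
        use_big = size % block_bytes == 0    # only for exact 2 MiB multiples
    else:
        use_big = size >= block_bytes        # any request >= 2 MiB
    return block_bytes if use_big else 1 << PAGE_SHIFT

size = (4 << 20) + (4 << 10)                 # the 4M+4K example from the thread
print(pick_min_block_size(size, patched=False))  # 2097152: over-allocate, then trim
print(pick_min_block_size(size, patched=True))   # 4096: no trimming needed
```

Before the patch, any request of at least 2 MiB forced a 2 MiB minimum block size, so the unaligned 4K tail pulled in a whole 2 MiB block that then had to be trimmed; after the patch, only exact 2 MiB multiples do.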


