[PATCH 1/2] drm/amdgpu: optimize VRAM allocation when using drm buddy

Christian König ckoenig.leichtzumerken at gmail.com
Fri Feb 17 09:37:52 UTC 2023


On 17.02.23 at 04:11, Xiao, Shane wrote:
>
>> -----Original Message-----
>> From: Paneer Selvam, Arunpravin <Arunpravin.PaneerSelvam at amd.com>
>> Sent: Thursday, February 16, 2023 3:24 PM
>> To: Koenig, Christian <Christian.Koenig at amd.com>; Xiao, Shane
>> <shane.xiao at amd.com>; Kuehling, Felix <Felix.Kuehling at amd.com>;
>> Christian König <ckoenig.leichtzumerken at gmail.com>
>> Cc: amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation when using
>> drm buddy
>>
>> This patch seems to pass the rocm memory stress test case.
>> Reviewed-by: Arunpravin Paneer Selvam
>> <Arunpravin.PaneerSelvam at amd.com>
> Hi Christian,
>
> Do you think we should upstream this patch?

Arun already gave his rb for this. I only see one option for further 
improvement, and that is to avoid the "% pages_per_block".

I think pages_per_block is a power of two (but you need to double-check 
this), so you can use something like "& (pages_per_block - 1)" instead, 
which should have less overhead.
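If pages_per_block is indeed a power of two, the equivalence can be 
sketched like this (a standalone illustration with made-up helper 
names, not the actual driver code):

```c
#include <stdint.h>

/* Remainder via division, as in "size % pages_per_block". */
static inline uint64_t rem_mod(uint64_t size, uint64_t block_bytes)
{
	return size % block_bytes;
}

/* Remainder via mask; valid only when block_bytes is a power of two. */
static inline uint64_t rem_mask(uint64_t size, uint64_t block_bytes)
{
	return size & (block_bytes - 1);
}
```

For block_bytes = 2MiB, both return 4KiB for a 4MiB+4KiB size, but the 
mask form avoids a 64-bit division.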

But that's only a minimal optimization. Either way feel free to push 
this to amd-staging-drm-next with Arun's rb added.

Regards,
Christian.

>
> Best Regards,
> Shane
>
>
>> On 2/16/2023 12:39 PM, Christian König wrote:
>>> On 16.02.23 at 07:48, Xiao, Shane wrote:
>>>>> -----Original Message-----
>>>>> From: Kuehling, Felix <Felix.Kuehling at amd.com>
>>>>> Sent: Thursday, February 16, 2023 6:19 AM
>>>>> To: Christian König <ckoenig.leichtzumerken at gmail.com>; Xiao, Shane
>>>>> <shane.xiao at amd.com>; Koenig, Christian
>> <Christian.Koenig at amd.com>;
>>>>> Paneer Selvam, Arunpravin <Arunpravin.PaneerSelvam at amd.com>
>>>>> Cc: amd-gfx at lists.freedesktop.org
>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation when
>>>>> using drm buddy
>>>>>
>>>>>
>>>>> On 2023-02-15 at 05:44, Christian König wrote:
>>>>>> On 15.02.23 at 03:51, Xiao, Shane wrote:
>>>>>>> For public review
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>> Sent: Wednesday, February 15, 2023 3:02 AM
>>>>>>>> To: Xiao, Shane <shane.xiao at amd.com>; Paneer Selvam,
>> Arunpravin
>>>>>>>> <Arunpravin.PaneerSelvam at amd.com>
>>>>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation
>>>>>>>> when using drm buddy
>>>>>>>>
>>>>>>>> On 14.02.23 at 15:53, Xiao, Shane wrote:
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>>>> Sent: Tuesday, February 14, 2023 8:41 PM
>>>>>>>>>> To: Xiao, Shane <shane.xiao at amd.com>; brahma_sw_dev
>>>>>>>>>> <brahma_sw_dev at amd.com>
>>>>>>>>>> Cc: Paneer Selvam, Arunpravin
>>>>> <Arunpravin.PaneerSelvam at amd.com>
>>>>>>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation
>>>>> when
>>>>>>>>>> using drm buddy
>>>>>>>>>>
>>>>>>>>>> On 14.02.23 at 12:18, Shane Xiao wrote:
>>>>>>>>>>> Since the VRAM manager changed from drm mm to drm buddy, it's
>>>>>>>>>>> no longer necessary to allocate 2MB-aligned VRAM for an
>>>>>>>>>>> unaligned size larger than 2MB and then trim it. This method
>>>>>>>>>>> improves allocation efficiency and reduces memory fragmentation.
>>>>>>>>>> Well that is a trade off.
>>>>>>>>>>
>>>>>>>>>> Allocating the BO as one contiguous chunk and then trimming is
>>>>>>>>>> beneficial: if we later need it contiguous, we don't need to
>>>>>>>>>> re-allocate and copy. This can be needed, for example, to
>>>>>>>>>> display something.
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> This case means that you allocate a BO that doesn't need to be
>>>>>>> contiguous at first, and later the BO needs to be contiguous.
>>>>>>> I'm not familiar with display. Could you give me a few more
>>>>>>> specific examples?
>>>>>> On most generations the DCE/DCN hardware needs the buffer to be
>>>>>> contiguous to be able to scan out from it.
>>>>>>
>>>>>> Only newer APUs can use S/G to scanout from system memory pages.
>>>>>>
>>>>>>>>> Yes, I agree that one contiguous chunk may be beneficial
>>>>>>>>> sometimes. But as far as I know, you cannot guarantee that
>>>>>>>>> amdgpu_vram_mgr_new returns one contiguous chunk if you don't
>>>>>>>>> set the TTM_PL_FLAG_CONTIGUOUS flag.
>>>>>>>>> For example, if you want to allocate a 4M+4K BO, it will
>>>>>>>>> allocate one 4M block + one 2M block which doesn't need to be
>>>>>>>>> contiguous, and then the 2M block will be trimmed.
>>>>>>>>
>>>>>>>> Oh, that's indeed not something which should happen. Sounds more
>>>>>>>> like a bug fix then.
>>>>>>> Yes, I think this case should not happen.
>>>>>>> Actually, I'm not sure why the allocated BO should be aligned
>>>>>>> to pages_per_block, which is set to 2MB by default.
>>>>>>> Does this help improve performance when allocating a BO of 2MB
>>>>>>> or above? From my point of view, the TLB may be one reason for
>>>>>>> this, but I'm not sure about that.
>>>>>> Yes, we try to make allocations as contiguous as possible for
>>>>>> better TLB usage.
>>>>>>
>>>>>> Especially for some compute use cases this can make a >20%
>>>>>> performance difference.
>>>>> We actually found that >2MB virtual address alignment was hurting
>>>>> performance due to cache line aliasing. So we can't take advantage
>>>>> of  >2MB pages in our page tables.
>>>>>
>>>>> Regards,
>>>>>      Felix
>>>> Yes, if we want to take advantage of 2M TLB entries, we should
>>>> keep the virtual address aligned.
>>>>
>>>> Regarding the cache line aliasing issue you mentioned, I'm
>>>> confused about this.
>>>> If a 2MB-aligned VA gets the right PA from the TLB or page table
>>>> and the cache line addressing mode is not changed, the cache line
>>>> aliasing issue should not happen here.
>>>> Is there something wrong with my understanding? Or is there some
>>>> background that I don't know about?
>>> The problem is with virtual address alignments > 2MiB (or whatever the
>>> big cache line size is).
>>>
>>> Let's assume an example where you have a lot of buffers, each
>>> 66MiB in size. When you align those to 2MiB in the virtual address
>>> space, you end up with
>>>
>>> 64MiB..2MiB..62MiB..4MiB..60MiB... etc...
>>>
>>> in your address space. In this configuration each 2MiB cache line
>>> is equally used.
>>>
>>> But if you align the buffers to, say, the next power of two
>>> (128MiB), you end up like this:
>>>
>>> 64MiB..2MiB..62MiB hole..64MiB..2MiB..62MiB hole... etc....
>>>
>>> In this case the first 2MiB cache line of each buffer is used twice as
>>> much as all the other cache lines. This can hurt performance very badly.
>>>
>>> Regards,
>>> Christian.
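The layout argument above can be sketched numerically. The snippet
below (an illustration with made-up helper names, not driver code)
packs 66MiB buffers back to back with a given alignment and counts how
many of them start at the same offset within a 128MiB window, i.e. how
many would hit the same large cache line:

```c
#include <stdint.h>

static uint64_t align_up(uint64_t x, uint64_t a)
{
	/* round x up to the next multiple of a (a must be a power of two) */
	return (x + a - 1) & ~(a - 1);
}

/* Lay out n buffers of 66MiB each, aligned to 'align', and count how
 * many start at offset 0 within a 128MiB window. */
static int starts_in_same_set(int n, uint64_t align)
{
	const uint64_t size = 66ULL << 20;
	uint64_t va = 0;
	int hits = 0;
	int i;

	for (i = 0; i < n; i++) {
		va = align_up(va, align);
		if ((va & ((128ULL << 20) - 1)) == 0)
			hits++;
		va += size;
	}
	return hits;
}
```

With 2MiB alignment the buffer starts are spread across the 128MiB
window, while with 128MiB alignment every buffer's first 2MiB lands on
the same offset, matching the "first cache line used twice as much"
observation.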
>>>
>>>> Best Regards,
>>>> Shane
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> Best Regards,
>>>>>>> Shane
>>>>>>>
>>>>>>>>>> On the other hand I completely agree allocating big and then
>>>>>>>>>> trimming creates more fragmentation than necessary.
>>>>>>>>>>
>>>>>>>>>> Do you have some test case which can show the difference?
>>>>>>>>> I have used rocrtst to show the difference.
>>>>>>>>> The attachment shows that after applying this patch, the
>>>>>>>>> total VRAM size in blocks of order < 9 decreases from 99MB
>>>>>>>>> to 43MB, and there are more higher-order memory blocks.
>>>>>>>> Arun, can you take a look? The problem here sounds important.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>> BTW: No need to discuss that on the internal mailing list,
>>>>>>>>>> please use the public one instead.
>>>>>>>>>>
>>>>>>>>> I will send it to the public list. Thank you for the reminder.
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Shane
>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Shane Xiao <shane.xiao at amd.com>
>>>>>>>>>>> ---
>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 2 +-
>>>>>>>>>>>       1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>>>>>>>>> index 75c80c557b6e..3fea58f9427c 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>>>>>>>>> @@ -453,7 +453,7 @@ static int amdgpu_vram_mgr_new(struct ttm_resource_manager *man,
>>>>>>>>>>>               /* Limit maximum size to 2GiB due to SG table limitations */
>>>>>>>>>>>               size = min(remaining_size, 2ULL << 30);
>>>>>>>>>>>
>>>>>>>>>>> -        if (size >= (u64)pages_per_block << PAGE_SHIFT)
>>>>>>>>>>> +        if (!(size % ((u64)pages_per_block << PAGE_SHIFT)))
>>>>>>>>>>>                   min_block_size = (u64)pages_per_block << PAGE_SHIFT;
>>>>>>>>>>>               cur_size = size;
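The behavioral change of the one-liner can be illustrated standalone
(a sketch with assumed values: pages_per_block = 512 with 4KiB pages,
i.e. 2MiB blocks; these are not the driver's actual symbols):

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGES_PER_BLOCK 512 /* assumed default: 512 * 4KiB = 2MiB */

/* Old condition: any size of at least 2MiB forces a 2MiB min block
 * size, so a 4MiB+4KiB request gets an extra 2MiB block that must
 * later be trimmed down to 4KiB. */
static int old_uses_2m_min(uint64_t size)
{
	return size >= ((uint64_t)PAGES_PER_BLOCK << PAGE_SHIFT);
}

/* New condition: only sizes that are an exact multiple of 2MiB use
 * the 2MiB min block size; the 4KiB tail can be allocated directly. */
static int new_uses_2m_min(uint64_t size)
{
	return !(size % ((uint64_t)PAGES_PER_BLOCK << PAGE_SHIFT));
}
```

For a 4MiB+4KiB request, the old condition holds and the new one does
not, so the trim (and the fragmentation it causes) is avoided; for an
exact 4MiB request, both conditions agree.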
