[PATCH 1/2] drm/amdgpu: optimize VRAM allocation when using drm buddy

Arunpravin Paneer Selvam arunpravin.paneerselvam at amd.com
Thu Feb 16 07:24:01 UTC 2023


This patch seems to pass the rocm memory stress test case.
Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam at amd.com>

On 2/16/2023 12:39 PM, Christian König wrote:
> Am 16.02.23 um 07:48 schrieb Xiao, Shane:
>>> -----Original Message-----
>>> From: Kuehling, Felix <Felix.Kuehling at amd.com>
>>> Sent: Thursday, February 16, 2023 6:19 AM
>>> To: Christian König <ckoenig.leichtzumerken at gmail.com>; Xiao, Shane
>>> <shane.xiao at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>;
>>> Paneer Selvam, Arunpravin <Arunpravin.PaneerSelvam at amd.com>
>>> Cc: amd-gfx at lists.freedesktop.org
>>> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation when 
>>> using
>>> drm buddy
>>>
>>>
>>> Am 2023-02-15 um 05:44 schrieb Christian König:
>>>> Am 15.02.23 um 03:51 schrieb Xiao, Shane:
>>>>> For public review
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>> Sent: Wednesday, February 15, 2023 3:02 AM
>>>>>> To: Xiao, Shane <shane.xiao at amd.com>; Paneer Selvam, Arunpravin
>>>>>> <Arunpravin.PaneerSelvam at amd.com>
>>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation when
>>>>>> using drm buddy
>>>>>>
>>>>>> Am 14.02.23 um 15:53 schrieb Xiao, Shane:
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>> Sent: Tuesday, February 14, 2023 8:41 PM
>>>>>>>> To: Xiao, Shane <shane.xiao at amd.com>; brahma_sw_dev
>>>>>>>> <brahma_sw_dev at amd.com>
>>>>>>>> Cc: Paneer Selvam, Arunpravin
>>> <Arunpravin.PaneerSelvam at amd.com>
>>>>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation
>>> when
>>>>>>>> using drm buddy
>>>>>>>>
>>>>>>>> Am 14.02.23 um 12:18 schrieb Shane Xiao:
>>>>>>>>> Since the VRAM manager changed from drm mm to drm buddy, it is
>>>>>>>>> no longer necessary to allocate 2MB-aligned VRAM for sizes
>>>>>>>>> larger than 2MB that are not 2MB-aligned and then trim the
>>>>>>>>> excess. This improves allocation efficiency and reduces
>>>>>>>>> memory fragmentation.
>>>>>>>> Well that is a trade off.
>>>>>>>>
>>>>>>>> Allocating the BO as one contiguous chunk and then trimming is
>>>>>>>> beneficial because if we then later need it contiguous we don't
>>>>>>>> need to re-allocate and copy. This can be needed to display
>>>>>>>> something for
>>>>>> example.
>>>>> Hi Christian,
>>>>>
>>>>> This case means that you allocate a BO that does not need to be
>>>>> contiguous at first, but later the BO must become contiguous. I'm
>>>>> not familiar with display. Could you give me a few more specific
>>>>> examples?
>>>> On most generations the DCE/DCN hardware needs the buffer to be
>>>> contiguous to be able to scan out from it.
>>>>
>>>> Only newer APUs can use S/G to scanout from system memory pages.
>>>>
>>>>>>> Yes, I agree that one contiguous chunk may be beneficial
>>>>>>> sometimes. But as far as I know, you cannot guarantee that
>>>>>>> amdgpu_vram_mgr_new gets one contiguous chunk if you don't set
>>>>>>> the TTM_PL_FLAG_CONTIGUOUS flag.
>>>>>>> For example, if you want to allocate a 4M+4K BO, it will
>>>>>>> allocate one 4M block + one 2M block, which are not guaranteed
>>>>>>> to be contiguous, and then the 2M block will be trimmed.
>>>>>>
>>>>>> Oh, that's indeed not something which should happen. Sounds more
>>>>>> like a bug fix then.
>>>>> Yes, I think this case should not happen.
>>>>> Actually, I'm not sure why the allocated BO should be aligned to
>>>>> pages_per_block, which is set to 2MB by default.
>>>>> Does this help improve performance when allocating BOs of 2M or
>>>>> larger? From my point of view, the TLB may be one reason for
>>>>> this, but I'm not sure.
>>>> Yes, we try to use allocations which are as contiguous as
>>>> possible for better TLB usage.
>>>>
>>>> Especially for some compute use cases this can make a >20% performance
>>>> difference.
>>> We actually found that >2MB virtual address alignment was hurting
>>> performance due to cache line aliasing. So we can't take advantage
>>> of >2MB pages in our page tables.
>>>
>>> Regards,
>>>     Felix
>> Yes, if we want to take advantage of 2M TLB entries, we should keep
>> the virtual address aligned.
>>
>> As for the cache line aliasing issue you mentioned, I'm confused
>> about it. If a 2MB-aligned VA gets the right PA from the TLB or
>> page table and the cache line addressing mode is unchanged, the
>> cache line aliasing issue should not happen here.
>> Is there something wrong with my understanding? Or maybe there is
>> some background that I don't know.
>
> The problem is with virtual address alignments > 2MiB (or whatever the 
> big cache line size is).
>
> Let's assume an example where you have a lot of buffers, each 66MiB
> in size. When you align those to 2MiB in the virtual address space
> you end up with
>
> 64MiB..2MiB..62MiB..4MiB..60MiB... etc...
>
> in your address space. In this configuration each 2MiB cache line is
> equally used.
>
> But if you align the buffers to say the next power of two (128MiB) you 
> end up like this:
>
> 64MiB..2MiB..62MiB hole..64MiB..2MiB..62MiB hole... etc....
>
> In this case the first 2MiB cache line of each buffer is used twice as 
> much as all the other cache lines. This can hurt performance very badly.
>
> Regards,
> Christian.
>
>> Best Regards,
>> Shane
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Best Regards,
>>>>> Shane
>>>>>
>>>>>>>> On the other hand I completely agree allocating big and then
>>>>>>>> trimming creates more fragmentation than necessary.
>>>>>>>>
>>>>>>>> Do you have some test case which can show the difference?
>>>>>>> I have used rocrtst to show the difference.
>>>>>>> The attachment shows that after applying this patch, the total
>>>>>>> VRAM size in blocks of order < 9 decreases from 99MB to 43MB,
>>>>>>> and the latter case has more higher-order blocks.
>>>>>> Arun, can you take a look? The problem here sounds important.
>>>>>>
>>>>>> Thanks,
>>>>>> Christian.
>>>>>>
>>>>>>>> BTW: No need to discuss that on the internal mailing list, please
>>>>>>>> use the public one instead.
>>>>>>>>
>>>>>>> I will send it to the public list. Thank you for the reminder.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Shane
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>> Signed-off-by: Shane Xiao <shane.xiao at amd.com>
>>>>>>>>> ---
>>>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 2 +-
>>>>>>>>>      1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>>>>>>> index 75c80c557b6e..3fea58f9427c 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>>>>>>> @@ -453,7 +453,7 @@ static int amdgpu_vram_mgr_new(struct ttm_resource_manager *man,
>>>>>>>>>              /* Limit maximum size to 2GiB due to SG table limitations */
>>>>>>>>>              size = min(remaining_size, 2ULL << 30);
>>>>>>>>>
>>>>>>>>> -        if (size >= (u64)pages_per_block << PAGE_SHIFT)
>>>>>>>>> +        if (!(size % ((u64)pages_per_block << PAGE_SHIFT)))
>>>>>>>>>                  min_block_size = (u64)pages_per_block << PAGE_SHIFT;
>>>>>>>>>              cur_size = size;
>


