[Intel-xe] [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager

Wed Feb 22 11:39:15 UTC 2023

Hi Thomas,

Am 22.02.23 um 12:00 schrieb Thomas Hellström:
> Hi, Christian,
>
> So I resurrected Maarten's previous patch series around this (the 
> amdgpu suballocator) slightly modified the code to match the API of 
> this patch series, re-introduced the per-allocation alignment as per a 
> previous review comment from you on that series, and made 
> checkpatch.pl pass mostly, except for pre-existing style problems, and 
> added / fixed some comments. No memory corruption seen so far on 
> limited Xe testing.
>
> To move this forward I suggest starting with that as a common drm 
> suballocator. I'll post the series later today. We can follow up with 
> potential simplifactions lif needed.
>
> I also made a kunit test also reporting some timing information. Will 
> post that as a follow up. Some interesting preliminary conclusions:
>
> * drm_mm is per se not a cpu hog, If the rb tree processing is 
> disabled and the EVICT algorithm is changed from MRU to ring-like LRU 
> traversal, it's more or less just as fast as the ring suballocator.
>
> * With a single ring, and the suballocation buffer never completely 
> filled (no sleeps) the amd suballocator is a bit faster per allocation 
> / free. (Around 250 ns instead of 350). Allocation is slightly slower 
> on the amdgpu one, freeing is faster, mostly due to the locking 
> overhead incurred when setting up the fence callbacks, and for 
> avoiding irq-disabled processing on the one I proposed.

For some more realistic numbers try to signal the fence from another 
CPU. Alternative you can invalidate all the CPU read cache lines touched 
by the fence callback so that they need to be read in again from the 
allocating CPU.

>
> * With multiple rings and varying allocation sizes and signalling 
> times creating fragmentation, the picture becomes different as the 
> amdgpu allocator starts to sleep/throttle already round 50% - 75% 
> fill. The one I proposed between 75% to 90% fill, and once that 
> happens, the CPU cost of putting to sleep and waking up should really 
> shadow the above numbers.
>
> So it's really a tradeoff. Where IMO also code size and 
> maintainability should play a role.
>
> Also I looked at the history of the amdgpu allocator originating back 
> to Radeon 2012-ish, but couldn't find any commits mentioning fence 
> callbacks nor problem with those. Could you point me to that discussion?

Uff that was ~10 years ago. I don't think I can find that again.

Regards,
Christian.

>
> Thanks,
>
> Thomas
>
>
>
> On 2/17/23 14:51, Thomas Hellström wrote:
>>
>> On 2/17/23 14:18, Christian König wrote:
>>> Am 17.02.23 um 14:10 schrieb Thomas Hellström:
>>>> [SNIP]
>>>>>>>>
>>>>>>>> Any chance you could do a quick performance comparison? If not, 
>>>>>>>> anything against merging this without the amd / radeon changes 
>>>>>>>> until we can land a simpler allocator?
>>>>>>>
>>>>>>> Only if you can stick the allocator inside Xe and not drm, cause 
>>>>>>> this seems to be for a different use case than the allocators 
>>>>>>> inside radeon/amdgpu.
>>>>>>
>>>>>> Hmm. No It's allocating in a ring-like fashion as well. Let me 
>>>>>> put together a unit test for benchmaking. I think it would be a 
>>>>>> failure for the community to end up with three separate 
>>>>>> suballocators doing the exact same thing for the same problem, 
>>>>>> really.
>>>>>
>>>>> Well exactly that's the point. Those allocators aren't the same 
>>>>> because they handle different problems.
>>>>>
>>>>> The allocator in radeon is simpler because it only had to deal 
>>>>> with a limited number of fence timelines. The one in amdgpu is a 
>>>>> bit more complex because of the added complexity for more fence 
>>>>> timelines.
>>>>>
>>>>> We could take the one from amdgpu and use it for radeon and others 
>>>>> as well, but the allocator proposed here doesn't even remotely 
>>>>> matches the requirements.
>>>>
>>>> But again, what *are* those missing requirements exactly? What is 
>>>> the pathological case you see for the current code?
>>>
>>> Well very low CPU overhead and don't do anything in a callback.
>>
>> Well, dma_fence_wait_any() will IIRC register callbacks on all 
>> affected fences, although admittedly there is no actual allocator 
>> processing in them.
>>
>>>
>>>>
>>>> From what I can tell the amdgpu suballocator introduces excessive 
>>>> complexity to coalesce waits for fences from the same contexts, 
>>>> whereas the present code just frees from the fence callback if the 
>>>> fence wasn't already signaled.
>>>
>>> And this is exactly the design we had previously which we removed 
>>> after Dave stumbled over tons of problems with it.
>>
>> So is the worry that those problems have spilled over in this code 
>> then? It's been pretty extensively tested, or is it you should never 
>> really use dma-fence callbacks?
>>
>>>
>>>> The fence signalling code that fires that callback is typcally 
>>>> always run anyway on scheduler fences.
>>>>
>>>> The reason we had for not using the amdgpu suballocator as 
>>>> originally planned was that this complexity made it very hard for 
>>>> us to undertand it and to fix issues we had with it.
>>>
>>> Well what are those problems? The idea is actually not that hardware 
>>> to understand.
>>
>> We hit memory corruption, and we spent substantially more time trying 
>> to debug it than to put together this patch, while never really 
>> understanding what  happened, nor why you don't see that with amdgpu.
>>
>>>
>>> We could simplify it massively for the cost of only waiting for the 
>>> oldest fence if that helps.
>>
>> Let me grab the latest version from amdgpu and give it a try again, 
>> but yes I think that to make it common code we'll need it simpler 
>> (and my personal wish would be to separate the allocator 
>> functionality a bit more from the fence waiting, which I guess should 
>> be OK if the fence waiting is vastly simplified).
>>
>> /Thomas
>>
>>
>>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Regards,
>>>>
>>>> Thomas
>>>