[Intel-xe] [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager

Thomas Hellström thomas.hellstrom at linux.intel.com
Wed Feb 22 15:58:45 UTC 2023


On 2/22/23 15:20, Christian König wrote:
> Am 22.02.23 um 14:54 schrieb Thomas Hellström:
>> Hi,
>>
>> On 2/22/23 12:39, Christian König wrote:
>>> Hi Thomas,
>>>
>>> Am 22.02.23 um 12:00 schrieb Thomas Hellström:
>>>> Hi, Christian,
>>>>
>>>> So I resurrected Maarten's previous patch series around this (the 
>>>> amdgpu suballocator), slightly modified the code to match the API of 
>>>> this patch series, re-introduced the per-allocation alignment as per 
>>>> a previous review comment from you on that series, made checkpatch.pl 
>>>> pass mostly (except for pre-existing style problems), and added / 
>>>> fixed some comments. No memory corruption seen so far in limited Xe 
>>>> testing.
>>>>
>>>> To move this forward I suggest starting with that as a common drm 
>>>> suballocator. I'll post the series later today. We can follow up 
>>>> with potential simplifications if needed.
>>>>
>>>> I also made a kunit test that reports some timing information. 
>>>> Will post that as a follow up. Some interesting preliminary 
>>>> conclusions:
>>>>
>>>> * drm_mm is per se not a CPU hog. If the rb-tree processing is 
>>>> disabled and the EVICT algorithm is changed from MRU to ring-like 
>>>> LRU traversal, it's more or less just as fast as the ring 
>>>> suballocator.
>>>>
>>>> * With a single ring, and the suballocation buffer never completely 
>>>> filled (no sleeps), the amdgpu suballocator is a bit faster per 
>>>> allocation / free (around 250 ns instead of 350 ns). Allocation is 
>>>> slightly slower on the amdgpu one, freeing is faster, mostly due to 
>>>> the locking overhead incurred when setting up the fence callbacks, 
>>>> and due to avoiding the irq-disabled processing of the one I proposed.
>>>
>>> For some more realistic numbers, try to signal the fence from another 
>>> CPU. Alternatively you can invalidate all the CPU read cache lines 
>>> touched by the fence callback so that they need to be read in again 
>>> from the allocating CPU.
>>
>> Fences are signalled using hrtimer-driven fake "ring"s, so they should 
>> probably be distributed among CPUs in a pretty realistic way. But 
>> anyway, I agree results obtained from that kunit test can and should 
>> be challenged before we actually use them for improvements.
>
> I would double check that. My expectation is that hr-timers execute by 
> default on the CPU from which they are started.

Hmm, since I'm not using the _PINNED hrtimer flag I'd expect them to be 
more distributed, but you're right, they weren't; only rather few timer 
expiries came from other CPUs. With signalling forced onto other CPUs, 
the figures are around 500 ns for the amdgpu variant and around 900 ns 
for the fence-callback one. Still, sleeping starts around 50-75% fill 
with the amdgpu variant.
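
For reference, the fake rings in the kunit test signal their fences from 
an hrtimer callback roughly like the sketch below (names such as 
fake_ring are made up for illustration; this is not the actual test code):

#include <linux/dma-fence.h>
#include <linux/hrtimer.h>
#include <linux/ktime.h>

struct fake_ring {
	struct hrtimer timer;
	struct dma_fence *fence;	/* fence to signal when the timer fires */
};

static enum hrtimer_restart fake_ring_expire(struct hrtimer *timer)
{
	struct fake_ring *ring = container_of(timer, struct fake_ring, timer);

	/* Hard-irq context, typically on the CPU that armed the timer. */
	dma_fence_signal(ring->fence);
	return HRTIMER_NORESTART;
}

static void fake_ring_start(struct fake_ring *ring, u64 delay_ns)
{
	hrtimer_init(&ring->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	ring->timer.function = fake_ring_expire;
	/*
	 * Without HRTIMER_MODE_REL_PINNED the timer could in principle be
	 * queued elsewhere, but it normally expires on the starting CPU,
	 * hence the separate cross-CPU measurement above.
	 */
	hrtimer_start(&ring->timer, ns_to_ktime(delay_ns), HRTIMER_MODE_REL);
}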

>
>>
>>>
>>>>
>>>> * With multiple rings and varying allocation sizes and signalling 
>>>> times creating fragmentation, the picture becomes different, as the 
>>>> amdgpu allocator starts to sleep/throttle already at around 50% - 
>>>> 75% fill, the one I proposed between 75% and 90% fill, and once that 
>>>> happens, the CPU cost of putting to sleep and waking up should 
>>>> really overshadow the above numbers.
>>>>
>>>> So it's really a tradeoff. Where IMO also code size and 
>>>> maintainability should play a role.
>>>>
>>>> Also I looked at the history of the amdgpu allocator originating 
>>>> back to Radeon 2012-ish, but couldn't find any commits mentioning 
>>>> fence callbacks nor problem with those. Could you point me to that 
>>>> discussion?
>>>
>>> Uff that was ~10 years ago. I don't think I can find that again.
>>
>> OK, fair enough. But what was the objective reasoning against using 
>> fence callbacks for this sort of stuff, was it unforeseen locking 
>> problems, caching issues or something else?
>
> Well, cache line bouncing is one major problem. Also take a look at 
> the discussion about using list_head in interrupt handlers; that 
> should be easy to find on LWN.
>
> The allocator usually manages enough memory so that it never runs into 
> waiting for anything; only in extreme cases like GPU resets do we 
> actually wait for allocations to be freed.

I guess this varies with the application, but it can be remedied by just 
adding more managed memory if needed.
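
The remedy is basically just the size passed at manager init time. A 
minimal usage sketch (take the names and signatures of 
drm_suballoc_manager_init() / drm_suballoc_new() / drm_suballoc_free() as 
illustrative of this series, not as a definitive API reference):

#include <drm/drm_suballoc.h>
#include <linux/sizes.h>

/* Illustrative only; locking context and cleanup omitted. */
static int example_emit(struct drm_suballoc_manager *sa_manager,
			struct dma_fence *fence, size_t size)
{
	struct drm_suballoc *sa;

	/*
	 * May sleep/throttle when the managed range is nearly full;
	 * sizing the manager generously at init time, e.g.
	 * drm_suballoc_manager_init(sa_manager, SZ_1M, SZ_4K),
	 * makes that unlikely in practice.
	 */
	sa = drm_suballoc_new(sa_manager, size, GFP_KERNEL, true, 0);
	if (IS_ERR(sa))
		return PTR_ERR(sa);

	/* ... emit commands into the sub-allocated range ... */

	/* Hand the range back; it is reused once @fence has signalled. */
	drm_suballoc_free(sa, fence);
	return 0;
}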

/Thomas


>
> So the only cache line which is accessed from more than one CPU 
> should be the one holding the signaled flag of the fence.
>
> When you move list work into the interrupt handler you have at least 
> three cache lines which start to bounce between different CPUs.
>
> Regards,
> Christian.
>
>>
>> Thanks,
>>
>> Thomas
>>
>>
>>
>>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Thomas
>>>>
>>>>
>>>>
>>>> On 2/17/23 14:51, Thomas Hellström wrote:
>>>>>
>>>>> On 2/17/23 14:18, Christian König wrote:
>>>>>> Am 17.02.23 um 14:10 schrieb Thomas Hellström:
>>>>>>> [SNIP]
>>>>>>>>>>>
>>>>>>>>>>> Any chance you could do a quick performance comparison? If 
>>>>>>>>>>> not, anything against merging this without the amd / radeon 
>>>>>>>>>>> changes until we can land a simpler allocator?
>>>>>>>>>>
>>>>>>>>>> Only if you can stick the allocator inside Xe and not drm, 
>>>>>>>>>> because this seems to be for a different use case than the 
>>>>>>>>>> allocators inside radeon/amdgpu.
>>>>>>>>>
>>>>>>>>> Hmm, no, it's allocating in a ring-like fashion as well. Let me 
>>>>>>>>> put together a unit test for benchmarking. I think it would be 
>>>>>>>>> a failure for the community to end up with three separate 
>>>>>>>>> suballocators doing the exact same thing for the same problem, 
>>>>>>>>> really.
>>>>>>>>
>>>>>>>> Well exactly that's the point. Those allocators aren't the same 
>>>>>>>> because they handle different problems.
>>>>>>>>
>>>>>>>> The allocator in radeon is simpler because it only had to deal 
>>>>>>>> with a limited number of fence timelines. The one in amdgpu is 
>>>>>>>> a bit more complex because it has to handle more fence timelines.
>>>>>>>>
>>>>>>>> We could take the one from amdgpu and use it for radeon and 
>>>>>>>> others as well, but the allocator proposed here doesn't even 
>>>>>>>> remotely match the requirements.
>>>>>>>
>>>>>>> But again, what *are* those missing requirements exactly? What 
>>>>>>> is the pathological case you see for the current code?
>>>>>>
>>>>>> Well, very low CPU overhead and not doing anything in a callback.
>>>>>
>>>>> Well, dma_fence_wait_any() will IIRC register callbacks on all 
>>>>> affected fences, although admittedly there is no actual allocator 
>>>>> processing in them.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> From what I can tell the amdgpu suballocator introduces 
>>>>>>> excessive complexity to coalesce waits for fences from the same 
>>>>>>> contexts, whereas the present code just frees from the fence 
>>>>>>> callback if the fence wasn't already signaled.
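
For concreteness, "frees from the fence callback" above means the pattern 
below; a simplified sketch with made-up names, not the actual Xe code:

#include <linux/dma-fence.h>
#include <linux/slab.h>

/* Hypothetical struct and helpers, just to illustrate the pattern. */
struct suballoc_block {
	struct dma_fence_cb cb;
	/* ... range bookkeeping ... */
};

static void suballoc_block_release(struct suballoc_block *block)
{
	/* Return the range to the manager (details omitted). */
	kfree(block);
}

static void suballoc_fence_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
	/* Runs from the fence signalling path. */
	suballoc_block_release(container_of(cb, struct suballoc_block, cb));
}

static void suballoc_block_free(struct suballoc_block *block,
				struct dma_fence *fence)
{
	/* A non-zero return (-ENOENT) means the fence already signalled,
	 * so free immediately instead of deferring to the callback. */
	if (dma_fence_add_callback(fence, &block->cb, suballoc_fence_cb))
		suballoc_block_release(block);
}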
>>>>>>
>>>>>> And this is exactly the design we had previously, which we removed 
>>>>>> after Dave stumbled over tons of problems with it.
>>>>>
>>>>> So is the worry that those problems have spilled over into this code 
>>>>> then? It's been pretty extensively tested. Or is it that one should 
>>>>> never really use dma-fence callbacks?
>>>>>
>>>>>>
>>>>>>> The fence signalling code that fires that callback is typically 
>>>>>>> always run anyway on scheduler fences.
>>>>>>>
>>>>>>> The reason we had for not using the amdgpu suballocator as 
>>>>>>> originally planned was that this complexity made it very hard 
>>>>>>> for us to understand it and to fix issues we had with it.
>>>>>>
>>>>>> Well what are those problems? The idea is actually not that 
>>>>>> hard to understand.
>>>>>
>>>>> We hit memory corruption, and we spent substantially more time 
>>>>> trying to debug it than to put together this patch, while never 
>>>>> really understanding what happened, nor why you don't see that 
>>>>> with amdgpu.
>>>>>
>>>>>>
>>>>>> We could simplify it massively at the cost of only waiting for 
>>>>>> the oldest fence if that helps.
>>>>>
>>>>> Let me grab the latest version from amdgpu and give it a try 
>>>>> again, but yes, I think that to make it common code we'll need it 
>>>>> simpler (and my personal wish would be to separate the allocator 
>>>>> functionality a bit more from the fence waiting, which I guess 
>>>>> should be OK if the fence waiting is vastly simplified).
>>>>>
>>>>> /Thomas
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Thomas
>>>>>>
>>>
>

