[RFC 0/5] Discussion around eviction improvements

Christian König christian.koenig at amd.com
Tue May 14 15:47:32 UTC 2024


Am 14.05.24 um 17:14 schrieb Tvrtko Ursulin:
>
> On 13/05/2024 14:49, Tvrtko Ursulin wrote:
>>
>> On 09/05/2024 13:40, Tvrtko Ursulin wrote:
>>>
>>> On 08/05/2024 19:09, Tvrtko Ursulin wrote:
>>>> From: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>>>
>>>> Last few days I was looking at the situation with VRAM over 
>>>> subscription, what
>>>> happens versus what perhaps should happen. Browsing through the 
>>>> driver and
>>>> running some simple experiments.
>>>>
>>>> I ended up with this patch series which, as a disclaimer, may be 
>>>> completely
>>>> wrong but as I found some suspicious things, to me at least, I 
>>>> thought it was a
>>>> good point to stop and request some comments.
>>>>
>>>> To perhaps summarise what are the main issues I think I found:
>>>>
>>>>   * Migration rate limiting does not bother knowing if actual 
>>>> migration happened
>>>>     and so can over-account and unfairly penalise.
>>>>
>>>>   * Migration rate limiting does not even work, at least not for 
>>>> the common case
>>>>     where userspace configures VRAM+GTT. It thinks it can stop 
>>>> migration attempts
>>>>     by playing with bo->allowed_domains vs bo->preferred domains 
>>>> but, both from
>>>>     the code, and from empirical experiments, I see that not 
>>>> working at all. Both
>>>>     masks are identical so fiddling with them achieves nothing.
>>>>
>>>>   * Idea of the fallback placement only works when VRAM has free 
>>>> space. As soon
>>>>     as it does not, ttm_resource_compatible is happy to leave the 
>>>> buffers in the
>>>>     secondary placement forever.
>>>>
>>>>   * Driver thinks it will be re-validating evicted buffers on the 
>>>> next submission
>>>>     but it does not for the very common case of VRAM+GTT because it 
>>>> only checks
>>>>     if current placement is *none* of the preferred placements.
>>>>
>>>> All those problems are addressed in individual patches.
>>>>
>>>> End result of this series appears to be driver which will try 
>>>> harder to move
>>>> buffers back into VRAM, but will be (more) correctly throttled in 
>>>> doing so by
>>>> the existing rate limiting logic.
>>>>
>>>> I have run a quick benchmark of Cyberpunk 2077 and cannot say that 
>>>> I saw a
>>>> change but that could be a good thing too. At least I did not break 
>>>> anything,
>>>> perhaps.. On one occassion I did see the rate limiting logic get 
>>>> confused while
>>>> for a period of few minutes it went to a mode where it was 
>>>> constantly giving a
>>>> high migration budget. But that recovered itself when I switched 
>>>> clients and did
>>>> not come back so I don't know. If there is something wrong there I 
>>>> don't think
>>>> it would be caused by any patches in this series.
>>>
>>> Since yesterday I also briefly tested with Far Cry New Dawn. One run 
>>> each so possibly doesn't mean anything apart that there isn't a 
>>> regression aka migration throttling is keeping things at bay even 
>>> with increased requests to migrate things back to VRAM:
>>>
>>>               before         after
>>> min/avg/max fps        36/44/54        37/45/55
>>>
>>> Cyberpunk 2077 from yesterday was similarly close:
>>>
>>>          26.96/29.59/30.40    29.70/30.00/30.32
>>>
>>> I guess the real story is proper DGPU where misplaced buffers have a 
>>> real cost.
>>
>> I found one game which regresses spectacularly badly with this series 
>> - Assasin's Creed Valhalla. The built-in benchmark at least. The game 
>> appears to have a working set much larger than the other games I 
>> tested, around 5GiB total during the benchmark. And for some reason 
>> migration throttling totally fails to put it in check. I will be 
>> investigating this shortly.
>
> I think that the conclusion is everything I attempted to add relating 
> to TTM_PL_PREFERRED does not really work as I initially thought it 
> did. Therefore please imagine this series as only containing patches 
> 1, 2 and 5.

Noted (and I had just started to wrap my head around that idea).

>
> (And FWIW it was quite annoying to get to the bottom of since for some 
> reason the system exibits some sort of a latching behaviour, where on 
> some boots and/or some minutes of runtime things were fine, and then 
> it would latch onto a mode where the TTM_PL_PREFERRED induced breakage 
> would show. And sometimes this breakage would appear straight away. Odd.)

Welcome to my world. You improve one use case and four other get a 
penalty. Even when you know the code and potential use cases inside out 
it's really hard to predict how some applications and the core memory 
management behave sometimes.

>
> I still need to test though if the subset of patches manage to achieve 
> some positive improvement on their own. It is possible, as patch 5 
> marks more buffers for re-validation so once overcommit subsides they 
> would get promoted to preferred placement straight away. And 1&2 are 
> notionally fixes for migration throttling so at least in broad sense 
> should be still valid as discussion points.

Yeah, especially 5 kind of makes sense but could potentially lead to 
higher overhead. Need to see how we can better handle that.

Regards,
Christian.

>
> Regards,
>
> Tvrtko
>
>>>> Series is probably rough but should be good enough for dicsussion. 
>>>> I am curious
>>>> to hear if I identified at least something correctly as a real 
>>>> problem.
>>>>
>>>> It would also be good to hear what are the suggested games to check 
>>>> and see
>>>> whether there is any improvement.
>>>>
>>>> Cc: Christian König <christian.koenig at amd.com>
>>>> Cc: Friedrich Vock <friedrich.vock at gmx.de>
>>>>
>>>> Tvrtko Ursulin (5):
>>>>    drm/amdgpu: Fix migration rate limiting accounting
>>>>    drm/amdgpu: Actually respect buffer migration budget
>>>>    drm/ttm: Add preferred placement flag
>>>>    drm/amdgpu: Use preferred placement for VRAM+GTT
>>>>    drm/amdgpu: Re-validate evicted buffers
>>>>
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     | 38 
>>>> +++++++++++++++++-----
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  8 +++--
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c     | 21 ++++++++++--
>>>>   drivers/gpu/drm/ttm/ttm_resource.c         | 13 +++++---
>>>>   include/drm/ttm/ttm_placement.h            |  3 ++
>>>>   5 files changed, 65 insertions(+), 18 deletions(-)
>>>>



More information about the amd-gfx mailing list