[RFC 0/5] Discussion around eviction improvements

Tvrtko Ursulin tursulin at ursulin.net
Tue May 14 15:14:02 UTC 2024


On 13/05/2024 14:49, Tvrtko Ursulin wrote:
> 
> On 09/05/2024 13:40, Tvrtko Ursulin wrote:
>>
>> On 08/05/2024 19:09, Tvrtko Ursulin wrote:
>>> From: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>>
>>> Over the last few days I was looking at the situation with VRAM
>>> over-subscription, what happens versus what perhaps should happen,
>>> browsing through the driver and running some simple experiments.
>>>
>>> I ended up with this patch series which, as a disclaimer, may be
>>> completely wrong, but since I found some things that look suspicious,
>>> to me at least, it seemed like a good point to stop and request some
>>> comments.
>>>
>>> To summarise the main issues I think I found:
>>>
>>>   * Migration rate limiting does not check whether an actual migration
>>>     happened and so can over-account and unfairly penalise.
>>>
>>>   * Migration rate limiting does not even work, at least not for the
>>>     common case where userspace configures VRAM+GTT. It thinks it can
>>>     stop migration attempts by playing with bo->allowed_domains vs
>>>     bo->preferred_domains but, both from reading the code and from
>>>     empirical experiments, I see that this does not work at all. Both
>>>     masks are identical, so fiddling with them achieves nothing (see
>>>     the sketch below this list).
>>>
>>>   * The idea of the fallback placement only works when VRAM has free
>>>     space. As soon as it does not, ttm_resource_compatible() is happy
>>>     to leave the buffers in the secondary placement forever.
>>>
>>>   * The driver thinks it will be re-validating evicted buffers on the
>>>     next submission, but it does not for the very common case of
>>>     VRAM+GTT because it only checks whether the current placement is
>>>     *none* of the preferred placements.
>>>
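>>> To make the second point concrete, below is a hedged, simplified sketch
>>> of roughly the shape of the submission-time validation path (loosely
>>> modelled on amdgpu_cs_bo_validate(); the function name here is made up
>>> and most details are omitted). With a BO created as VRAM+GTT the
>>> preferred and allowed masks are identical, so falling back to the
>>> "throttled" domain selection changes nothing:
>>>
>>>   static int bo_validate_sketch(struct amdgpu_cs_parser *p,
>>>                                 struct amdgpu_bo *bo)
>>>   {
>>>           struct ttm_operation_ctx ctx = { true, false };
>>>           uint32_t domain;
>>>           int r;
>>>
>>>           /* Budget left: ask for the preferred placement. */
>>>           if (p->bytes_moved < p->bytes_moved_threshold)
>>>                   domain = bo->preferred_domains; /* VRAM | GTT */
>>>           else
>>>                   domain = bo->allowed_domains;   /* also VRAM | GTT */
>>>
>>>   retry:
>>>           amdgpu_bo_placement_from_domain(bo, domain);
>>>           r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
>>>
>>>           /* Account the migration budget consumed by this validate. */
>>>           p->bytes_moved += ctx.bytes_moved;
>>>
>>>           /* Widening to allowed_domains on -ENOMEM is also a no-op here. */
>>>           if (r == -ENOMEM && domain != bo->allowed_domains) {
>>>                   domain = bo->allowed_domains;
>>>                   goto retry;
>>>           }
>>>
>>>           return r;
>>>   }
>>>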
>>> All those problems are addressed in individual patches.
>>>
>>> The end result of this series appears to be a driver which will try
>>> harder to move buffers back into VRAM, but which will be (more)
>>> correctly throttled in doing so by the existing rate limiting logic.
>>>
>>> I have run a quick benchmark of Cyberpunk 2077 and cannot say that I
>>> saw a change, but that could be a good thing too. At least I did not
>>> break anything, perhaps. On one occasion I did see the rate limiting
>>> logic get confused: for a period of a few minutes it went into a mode
>>> where it was constantly giving out a high migration budget. But that
>>> recovered itself when I switched clients and did not come back, so I
>>> don't know. If there is something wrong there, I don't think it would
>>> be caused by any patches in this series.
>>
>> Since yesterday I have also briefly tested with Far Cry New Dawn. One
>> run each, so it possibly does not mean anything beyond showing that
>> there is no regression, i.e. migration throttling is keeping things at
>> bay even with the increased requests to migrate things back to VRAM:
>>
>>                        before      after
>> min/avg/max fps        36/44/54    37/45/55
>>
>> Cyberpunk 2077 from yesterday was similarly close:
>>
>>          26.96/29.59/30.40    29.70/30.00/30.32
>>
>> I guess the real story is a proper dGPU, where misplaced buffers have a
>> real cost.
> 
> I found one game which regresses spectacularly badly with this series -
> Assassin's Creed Valhalla. The built-in benchmark at least. The game
> appears to have a working set much larger than the other games I tested,
> around 5GiB total during the benchmark. And for some reason migration
> throttling totally fails to keep it in check. I will be investigating
> this shortly.

I think the conclusion is that everything I attempted to add relating to
TTM_PL_PREFERRED does not really work as I initially thought it did.
Therefore please imagine this series as containing only patches 1, 2 and 5.

(And FWIW it was quite annoying to get to the bottom of, since for some
reason the system exhibits some sort of latching behaviour, where on some
boots and/or after some minutes of runtime things were fine, and then it
would latch onto a mode where the TTM_PL_PREFERRED-induced breakage would
show. And sometimes this breakage would appear straight away. Odd.)

I still need to test, though, whether the subset of patches manages to
achieve some positive improvement on its own. It is possible, since patch 5
marks more buffers for re-validation, so once overcommit subsides they
would get promoted to their preferred placement straight away (a sketch of
the check in question follows below). And patches 1 and 2 are notionally
fixes for migration throttling, so at least in a broad sense they should
still be valid as discussion points.
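
To illustrate why VRAM+GTT buffers currently slip through, here is a hedged
sketch of the kind of check involved; the helper name is illustrative and
not the exact in-tree code. A BO is only queued for re-validation when its
current placement matches none of its preferred domains, so a VRAM+GTT BO
which has been evicted to GTT still "matches" and is never considered for
promotion back to VRAM:

  /* Illustrative only - not the exact in-tree logic. */
  static bool bo_needs_revalidation(struct amdgpu_bo *bo)
  {
          uint32_t current_domain =
                  amdgpu_mem_type_to_domain(bo->tbo.resource->mem_type);

          /*
           * With preferred_domains = VRAM | GTT, a BO evicted from VRAM
           * to GTT still intersects the mask, so this returns false and
           * the BO is never marked for re-validation.
           */
          return !(bo->preferred_domains & current_domain);
  }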

Regards,

Tvrtko

>>> The series is probably rough but should be good enough for discussion.
>>> I am curious to hear whether I identified at least something correctly
>>> as a real problem.
>>>
>>> It would also be good to hear which games would be suggested for
>>> checking whether there is any improvement.
>>>
>>> Cc: Christian König <christian.koenig at amd.com>
>>> Cc: Friedrich Vock <friedrich.vock at gmx.de>
>>>
>>> Tvrtko Ursulin (5):
>>>    drm/amdgpu: Fix migration rate limiting accounting
>>>    drm/amdgpu: Actually respect buffer migration budget
>>>    drm/ttm: Add preferred placement flag
>>>    drm/amdgpu: Use preferred placement for VRAM+GTT
>>>    drm/amdgpu: Re-validate evicted buffers
>>>
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     | 38 +++++++++++++++++-----
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  8 +++--
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c     | 21 ++++++++++--
>>>   drivers/gpu/drm/ttm/ttm_resource.c         | 13 +++++---
>>>   include/drm/ttm/ttm_placement.h            |  3 ++
>>>   5 files changed, 65 insertions(+), 18 deletions(-)
>>>

