[RFC PATCH 00/18] TTM interface for managing VRAM oversubscription
Marek Olšák
maraeo at gmail.com
Thu Apr 25 13:22:55 UTC 2024
The most extreme ping-ponging is mitigated by throttling buffer moves
in the kernel, but that only works without VM_ALWAYS_VALID, where BO
priorities can also be set in the BO list. A better approach that works
with VM_ALWAYS_VALID would be nice.
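
(For reference, the per-buffer priority mentioned above is the bo_priority
field in the existing amdgpu BO-list uAPI in include/uapi/drm/amdgpu_drm.h;
the comments below are descriptive, not quoted from the header:

struct drm_amdgpu_bo_list_entry {
    /* GEM handle of a buffer accessed by the submission */
    __u32 bo_handle;
    /* per-submission priority hint for this buffer */
    __u32 bo_priority;
};

This hint is only available on the BO-list path, i.e. not with
VM_ALWAYS_VALID.)
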
Marek
On Wed, Apr 24, 2024 at 1:12 PM Friedrich Vock <friedrich.vock at gmx.de> wrote:
>
> Hi everyone,
>
> recently I've been looking into remedies for apps (in particular, newer
> games) that experience significant performance loss when they start to
> hit VRAM limits, especially on older or lower-end cards that struggle
> to fit both desktop apps and all the game data into VRAM at once.
>
> The root of the problem lies in the fact that from userspace's POV,
> buffer eviction is very opaque: Userspace applications/drivers cannot
> tell how oversubscribed VRAM is, nor do they have fine-grained control
> over which buffers get evicted. At the same time, with GPU APIs becoming
> increasingly lower-level and GPU-driven, only the application itself
> can know which buffers are used within a particular submission, and
> how important each buffer is. For this, GPU APIs include interfaces
> to query oversubscription and specify memory priorities: In Vulkan,
> oversubscription can be queried through the VK_EXT_memory_budget
> extension. Different buffers can also be assigned priorities via the
> VK_EXT_pageable_device_local_memory extension. Modern games, especially
> D3D12 games via vkd3d-proton, rely on oversubscription being reported and
> priorities being respected in order to perform their memory management.
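>
> As an illustration of that userspace side (not part of this series), an
> application can query the per-heap budget and adjust an allocation's
> residency priority roughly like this. The function and variable names are
> made up, and vkSetDeviceMemoryPriorityEXT is assumed to have been loaded
> via vkGetDeviceProcAddr (or a loader such as volk):
>
> #include <inttypes.h>
> #include <stdio.h>
> #include <vulkan/vulkan.h>
>
> /* Query per-heap budget/usage (VK_EXT_memory_budget) and lower the
>  * residency priority of a rarely-used allocation
>  * (VK_EXT_pageable_device_local_memory). Assumes both extensions were
>  * enabled at device creation. */
> static void query_budget_and_deprioritize(VkPhysicalDevice pdev, VkDevice dev,
>                                           VkDeviceMemory rarely_used_memory)
> {
>     VkPhysicalDeviceMemoryBudgetPropertiesEXT budget = {
>         .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_BUDGET_PROPERTIES_EXT,
>     };
>     VkPhysicalDeviceMemoryProperties2 props = {
>         .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_PROPERTIES_2,
>         .pNext = &budget,
>     };
>     uint32_t i;
>
>     vkGetPhysicalDeviceMemoryProperties2(pdev, &props);
>     for (i = 0; i < props.memoryProperties.memoryHeapCount; i++)
>         printf("heap %u: %" PRIu64 " of %" PRIu64 " bytes used\n",
>                i, budget.heapUsage[i], budget.heapBudget[i]);
>
>     /* 0.0 = evict first, 1.0 = keep resident if at all possible */
>     vkSetDeviceMemoryPriorityEXT(dev, rarely_used_memory, 0.25f);
> }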
>
> However, relaying this information to the kernel via the current KMD uAPIs
> is not possible. On AMDGPU for example, all work submissions include a
> "bo list" that contains any buffer object that is accessed during the
> course of the submission. If VRAM is oversubscribed and a buffer in the
> list was evicted to system memory, that buffer is moved back to VRAM
> (potentially evicting other unused buffers).
>
> Since the usermode driver doesn't know what buffers are used by the
> application, its only choice is to submit a bo list that contains every
> buffer the application has allocated. In case of VRAM oversubscription,
> it is highly likely that some of the application's buffers were evicted,
> which almost guarantees that some buffers will get moved around. Since
> the bo list is only known at submit time, this also means the buffers
> will get moved right before submitting application work, which is the
> worst possible time to move buffers from a latency perspective. Another
> consequence of the large bo list is that nearly all memory from other
> applications will be evicted, too. When different applications (e.g. game
> and compositor) submit work one after the other, this causes a ping-pong
> effect where each app's submission evicts the other app's memory,
> resulting in a large amount of unnecessary moves.
>
> This overly aggressive eviction behavior led to RADV adopting a change
> that effectively allows all VRAM allocations to reside in system memory
> [1]. This worked around the ping-ponging/excessive buffer moving problem,
> but also meant that any memory evicted to system memory would forever
> stay there, regardless of how VRAM is used.
>
> My proposal aims at providing a middle ground between these extremes.
> The goals I want to meet are:
> - Userspace is accurately informed about VRAM oversubscription/how much
> VRAM has been evicted
> - Buffer eviction respects priorities set by userspace
> - Wasteful ping-ponging is avoided to the extent possible
>
> I have been testing out some prototypes, and came up with this rough
> sketch of an API:
>
> - For each ttm_resource_manager, the amount of evicted memory is tracked
> (similarly to how "usage" tracks the memory usage). When memory is
> evicted via ttm_bo_evict, the size of the evicted memory is added, when
> memory is un-evicted (see below), its size is subtracted. The amount of
> evicted memory for e.g. VRAM can be queried by userspace via an ioctl.
>
> - Each ttm_resource_manager maintains a list of evicted buffer objects.
>
> - ttm_mem_unevict walks the list of evicted bos for a given
> ttm_resource_manager and tries moving evicted resources back. When a
> buffer is freed, this function is called to immediately restore some
> evicted memory.
>
> - Each ttm_buffer_object independently tracks the mem_type it wants
> to reside in.
>
> - ttm_bo_try_unevict is added as a helper function which attempts to
> move the buffer to its preferred mem_type. If no space is available
> there, it fails with -ENOSPC/-ENOMEM.
>
> - Similar to how ttm_bo_evict works, each driver can implement
> uneviction_valuable/unevict_flags callbacks to control buffer
> un-eviction.
>
> This is what patches 1-10 accomplish (together with an amdgpu
> implementation utilizing the new API).
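>
> As a rough illustration of how these pieces could fit together (all field,
> flag and parameter names here are illustrative; the actual patches may
> differ):
>
> /* Additions sketched above (illustrative names):
>  *
>  * struct ttm_resource_manager gains
>  *     uint64_t         evicted;       bytes currently evicted from this manager
>  *     struct list_head evicted_bos;   BOs that would prefer to live here
>  *
>  * struct ttm_buffer_object gains
>  *     uint32_t         preferred_mem_type;  e.g. TTM_PL_VRAM
>  *     struct list_head evicted_link;        entry in the manager's evicted_bos list
>  *
>  * struct ttm_device_funcs gains, mirroring eviction_valuable/evict_flags:
>  *     bool (*uneviction_valuable)(struct ttm_buffer_object *bo);
>  *     void (*unevict_flags)(struct ttm_buffer_object *bo,
>  *                           struct ttm_placement *placement);
>  */
>
> /* Try to move one evicted BO back to its preferred mem_type without evicting
>  * anything else; fails with -ENOSPC/-ENOMEM if it doesn't fit. Caller holds
>  * the BO's reservation. */
> int ttm_bo_try_unevict(struct ttm_buffer_object *bo)
> {
>     struct ttm_operation_ctx ctx = {
>         .interruptible = false,
>         .no_wait_gpu = false,
>         .no_evict = true,   /* hypothetical flag from patch 5 */
>     };
>     struct ttm_placement placement;
>
>     if (bo->bdev->funcs->uneviction_valuable &&
>         !bo->bdev->funcs->uneviction_valuable(bo))
>         return -EBUSY;      /* driver doesn't want this BO moved back yet */
>
>     bo->bdev->funcs->unevict_flags(bo, &placement);
>     return ttm_bo_validate(bo, &placement, &ctx);
> }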
>
> Userspace priorities could then be implemented as follows:
>
> - TTM already manages priorities for each buffer object. These priorities
> can be updated by userspace via a GEM_OP ioctl to inform the kernel
> which buffers should be evicted before others. If an ioctl increases
> the priority of a buffer, ttm_bo_try_unevict is called on that buffer to
> try and move it back (potentially evicting buffers with a lower
> priority)
>
> - Buffers should never be evicted by other buffers with equal/lower
> priority, but if there is a buffer with lower priority occupying VRAM,
> it should be evicted in favor of the higher-priority one. This prevents
> ping-ponging between buffers that try evicting each other and is
> trivially implementable with an early-exit in ttm_mem_evict_first.
>
> This is covered in patches 11-15, with the new features exposed to
> userspace in patches 16-18.
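>
> For the early-exit itself, a minimal sketch: the LRU walk in
> ttm_mem_evict_first proceeds from the lowest priority bucket upwards, so the
> first BO at or above the requesting priority means nothing further may be
> evicted for this request (the requesting-priority parameter is
> illustrative):
>
> /* A BO may only be evicted on behalf of a request of strictly higher
>  * priority. */
> static bool ttm_bo_evictable_for(struct ttm_buffer_object *victim,
>                                  unsigned int requesting_priority)
> {
>     return victim->priority < requesting_priority;
> }
>
> /* In ttm_mem_evict_first's LRU walk (illustrative):
>  *
>  *     if (!ttm_bo_evictable_for(bo, requesting_priority))
>  *         return -ENOSPC;      everything left is at least as important
>  */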
>
> I also have a RADV branch utilizing this API at [2], which I use for
> testing.
>
> This implementation is still very much WIP, although the D3D12 games I
> tested already seemed to benefit from it. Nevertheless, there are still
> quite a few TODOs and unresolved questions/problems.
>
> Some kernel drivers (e.g i915) already use TTM priorities for
> kernel-internal purposes. Of course, some of the highest priorities
> should stay reserved for these purposes (with userspace being able to
> use the lower priorities).
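>
> One possible way to carve up the (bumped) priority range along those lines,
> purely as an illustration; none of these names exist in TTM or amdgpu today:
>
> #define TTM_NUM_PRIORITIES          8  /* after "drm/ttm: Bump BO priority count" */
> #define TTM_NUM_KERNEL_PRIORITIES   2  /* reserved, never reachable from userspace */
> #define TTM_MAX_USER_PRIORITY       (TTM_NUM_PRIORITIES - TTM_NUM_KERNEL_PRIORITIES - 1)
>
> /* Clamp a priority coming in through the GEM_OP ioctl so userspace can never
>  * claim a kernel-reserved slot. */
> static unsigned int amdgpu_gem_clamp_user_priority(unsigned int user_priority)
> {
>     return min(user_priority, (unsigned int)TTM_MAX_USER_PRIORITY);
> }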
>
> Another problem with priorities is the possibility of apps starving other
> apps by occupying all of VRAM with high-priority allocations. A possible
> solution could be to restrict the highest priority/priorities
> to important apps like compositors.
>
> Tying into this problem, only apps that are actively cooperating
> to reduce memory pressure can benefit from the current memory priority
> implementation. Eventually the priority system could also be utilized
> to benefit all applications, for example with the desktop environment
> boosting the priority of the currently-focused app/its cgroup (to
> provide the best QoS to the apps the user is actively using). A full
> implementation of this is probably out of scope for this initial proposal,
> but it's worth considering as a possible future use
> of the priority API.
>
> I'm primarily looking to integrate this into amdgpu to solve the
> issues I've seen there, but I'm also interested in feedback from
> other drivers. Is this something you'd be interested in? Do you
> have any objections/comments/questions about my proposed design?
>
> Thanks,
> Friedrich
>
> [1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6833
> [2] https://gitlab.freedesktop.org/pixelcluster/mesa/-/tree/spilling
>
> Friedrich Vock (18):
> drm/ttm: Add tracking for evicted memory
> drm/ttm: Add per-BO eviction tracking
> drm/ttm: Implement BO eviction tracking
> drm/ttm: Add driver funcs for uneviction control
> drm/ttm: Add option to evict no BOs in operation
> drm/ttm: Add public buffer eviction/uneviction functions
> drm/amdgpu: Add TTM uneviction control functions
> drm/amdgpu: Don't try moving BOs to preferred domain before submit
> drm/amdgpu: Don't mark VRAM as a busy placement for VRAM|GTT resources
> drm/amdgpu: Don't add GTT to initial domains after failing to allocate
> VRAM
> drm/ttm: Bump BO priority count
> drm/ttm: Do not evict BOs with higher priority
> drm/ttm: Implement ttm_bo_update_priority
> drm/ttm: Consider BOs placed in non-favorite locations evicted
> drm/amdgpu: Set a default priority for user/kernel BOs
> drm/amdgpu: Implement SET_PRIORITY GEM op
> drm/amdgpu: Implement EVICTED_VRAM query
> drm/amdgpu: Bump minor version
>
> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 -
> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 191 +---------------
> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h | 4 -
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 25 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 26 ++-
> drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 4 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 50 ++++
> drivers/gpu/drm/ttm/ttm_bo.c | 253 ++++++++++++++++++++-
> drivers/gpu/drm/ttm/ttm_bo_util.c | 3 +
> drivers/gpu/drm/ttm/ttm_device.c | 1 +
> drivers/gpu/drm/ttm/ttm_resource.c | 19 +-
> include/drm/ttm/ttm_bo.h | 22 ++
> include/drm/ttm/ttm_device.h | 28 +++
> include/drm/ttm/ttm_resource.h | 11 +-
> include/uapi/drm/amdgpu_drm.h | 3 +
> 17 files changed, 430 insertions(+), 218 deletions(-)
>
> --
> 2.44.0
>