[PATCH v4 0/4] TTM unlockable restartable LRU list iteration

Fri Mar 8 07:43:55 UTC 2024

The patches were tested on an AMD platform.
I ran a repeated stress test with Unigine Heaven: fill memory (VRAM + GTT +
system swap), then free it.
No errors or warnings appeared in the kernel log.
Are there any specific tests you would suggest?

Regards,
S.Amarnath

On 3/6/2024 12:31 PM, Thomas Hellström wrote:
> This patch-set is a prerequisite for a standalone TTM shrinker
> and for exhaustive TTM eviction using sleeping dma_resv locks,
> which is the motivation for it.
>
> Currently, when the TTM LRU list lock is unlocked, iteration needs
> to be restarted from the beginning rather than from the next LRU
> list node. This can be a big problem: if eviction or shrinking
> fails for whatever reason after the unlock, restarting is likely
> to cause the same failure over and over again.
>
> There are various schemes for continuing the list iteration
> from where we left off. One such scheme, used by the GEM LRU
> list traversal, is to pull items already considered off the
> LRU list and reinsert them when iteration is done.
> This has the drawback that concurrent list iteration doesn't see
> the complete list (which is bad for exhaustive eviction). It also
> doesn't lend itself well to bulk-move sublists, since these are
> split when their items are temporarily pulled from the list and
> moved to the list tail.
>
> The approach taken here is that list iterators insert themselves
> into the list at the next position using a special list node.
> Iteration then uses that list node as the starting point when
> restarting. Concurrent iterators just skip over the special list
> nodes.
>
> This is implemented in patches 1 and 2.
>
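To make the hitch idea above concrete, here is a rough userspace sketch. It is not the code from patches 1 and 2: the names (lru_item, lru_cursor, cursor_next and so on) are made up for illustration, locking is left out entirely, and it only assumes what the paragraph above says, namely that the iterator parks a special node right after the item it last returned and that other iterators skip such nodes.

```c
#include <stddef.h>

/* Minimal circular doubly linked list, loosely modelled on list_head. */
struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *n) { n->prev = n->next = n; }

static void list_del(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

/* Insert n directly after pos. */
static void list_add_after(struct list_node *n, struct list_node *pos)
{
	n->next = pos->next;
	n->prev = pos;
	pos->next->prev = n;
	pos->next = n;
}

/* Insert n at the tail, i.e. directly before head. */
static void list_add_tail(struct list_node *n, struct list_node *head)
{
	list_add_after(n, head->prev);
}

/* Hypothetical item types: real entries and iterator hitches. */
enum lru_item_type { LRU_RESOURCE, LRU_HITCH };

struct lru_item {
	struct list_node link;	/* must stay the first member, see cursor_next() */
	enum lru_item_type type;
	int id;			/* payload, only meaningful for LRU_RESOURCE */
};

struct lru_cursor {
	struct lru_item hitch;	/* the special node marking our position */
	struct list_node *head;
};

/* Start iterating: park the hitch at the front of the list. */
static void cursor_init(struct lru_cursor *c, struct list_node *head)
{
	list_init(&c->hitch.link);
	c->hitch.type = LRU_HITCH;
	c->hitch.id = -1;
	c->head = head;
	list_add_after(&c->hitch.link, head);
}

/*
 * Return the next real item after our hitch and move the hitch to just
 * after it. Hitches of other iterators are skipped. Returns NULL at the
 * end of the list. In real code this would run with the LRU lock held.
 */
static struct lru_item *cursor_next(struct lru_cursor *c)
{
	struct list_node *n;

	for (n = c->hitch.link.next; n != c->head; n = n->next) {
		/* Valid because link is the first member of struct lru_item. */
		struct lru_item *it = (struct lru_item *)n;

		if (it->type != LRU_RESOURCE)
			continue;
		list_del(&c->hitch.link);
		list_add_after(&c->hitch.link, n);
		return it;
	}
	return NULL;
}

/* Done iterating: take the hitch off the list again. */
static void cursor_fini(struct lru_cursor *c) { list_del(&c->hitch.link); }
```

The point of the sketch is that the position lives in the list itself. If the item returned by cursor_next() is removed while the lock is dropped, the hitch is still linked in, so the next call simply continues from there instead of from the list head.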
> For bulk move sublists the approach is the same, but when a bulk
> move sublist is moved to the tail, the iterator is moved with it,
> causing us to skip parts of the list. That is undesirable.
> Patch 3 deals with that: when an iterator detects that it is
> traversing a sublist, it registers with the ttm_lru_bulk_move
> struct using a linked list, and when that bulk move sublist
> is moved to the tail, any iterator registered with it is
> first moved to the tail of the sublist.
>
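The bulk move case can be sketched the same way. Again this is only an illustration under assumptions, not the code from patch 3: struct bulk_range is a made-up stand-in for ttm_lru_bulk_move, it tracks at most one registered hitch instead of a linked list of them, a negative id marks a hitch, and "moved to the tail of the sublist" is taken to mean that the hitch is placed just after the last sublist item, so that it stays behind when the sublist itself is moved.

```c
#include <stddef.h>

struct bulk_range;

/* Circular doubly linked list node; id < 0 marks an iterator hitch. */
struct node {
	struct node *prev, *next;
	int id;
	struct bulk_range *bulk;  /* item: owning sublist; hitch: registration */
};

/* Hypothetical stand-in for ttm_lru_bulk_move: one contiguous sublist. */
struct bulk_range {
	struct node *first, *last;
	struct node *hitch;	/* the single registered hitch, or NULL */
};

static void node_init(struct node *n, int id)
{
	n->prev = n->next = n;
	n->id = id;
	n->bulk = NULL;
}

static void node_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

/* Insert n directly after pos. */
static void node_add_after(struct node *n, struct node *pos)
{
	n->next = pos->next;
	n->prev = pos;
	pos->next->prev = n;
	pos->next = n;
}

/*
 * Move the hitch past the next real item and return that item's id, or
 * -1 at the end of the list. If the item belongs to a sublist, the hitch
 * registers with it; any previous registration is dropped.
 */
static int iter_next(struct node *head, struct node *hitch)
{
	struct node *n;

	for (n = hitch->next; n != head; n = n->next) {
		if (n->id < 0)
			continue;	/* another iterator's hitch */
		node_del(hitch);
		node_add_after(hitch, n);
		if (hitch->bulk)
			hitch->bulk->hitch = NULL;
		hitch->bulk = n->bulk;
		if (n->bulk)
			n->bulk->hitch = hitch;
		return n->id;
	}
	return -1;
}

/* Move the sublist [first, last] to the list tail. */
static void bulk_move_tail(struct node *head, struct bulk_range *b)
{
	/* First park a registered hitch just after the sublist. */
	if (b->hitch) {
		node_del(b->hitch);
		node_add_after(b->hitch, b->last);
		b->hitch->bulk = NULL;
		b->hitch = NULL;
	}
	/* Cut the range out of the list ... */
	b->first->prev->next = b->last->next;
	b->last->next->prev = b->first->prev;
	/* ... and splice it back in directly before head. */
	b->first->prev = head->prev;
	head->prev->next = b->first;
	b->last->next = head;
	head->prev = b->last;
}
```

With items 1..5 and a sublist covering 2 and 3, an iterator that has just returned 2 would travel to the tail with the sublist and never see 4 and 5. In this sketch it is parked after 3 first, so it stays in front of 4 and continues with 4, 5 and then the moved sublist.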
> The restartable property is used in patch 4 to restart swapout if
> needed, but the main purpose is to pave the way for the shrinker
> and for exhaustive eviction.
>
> v2:
> - Rework patch 3 completely.
> v3:
> - Fix a NULL pointer dereference found by Xe CI.
> v4:
> - Remove some leftover code causing build problems.
>
> Cc: Somalapuram Amaranath <Amaranath.Somalapuram at amd.com>
> Cc: Christian König <christian.koenig at amd.com>
> Cc: <dri-devel at lists.freedesktop.org>
>
> Thomas Hellström (4):
>    drm/ttm: Allow TTM LRU list nodes of different types
>    drm/ttm: Use LRU hitches
>    drm/ttm, drm/amdgpu, drm/xe: Consider hitch moves within bulk sublist
>      moves
>    drm/ttm: Allow continued swapout after -ENOSPC failure
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c |   4 +
>   drivers/gpu/drm/ttm/ttm_bo.c           |   1 +
>   drivers/gpu/drm/ttm/ttm_device.c       |  33 +++-
>   drivers/gpu/drm/ttm/ttm_resource.c     | 228 ++++++++++++++++++++-----
>   drivers/gpu/drm/xe/xe_vm.c             |   4 +
>   include/drm/ttm/ttm_device.h           |   2 +
>   include/drm/ttm/ttm_resource.h         |  96 +++++++++--
>   7 files changed, 308 insertions(+), 60 deletions(-)
>