[PATCH 0/4] TTM unlockable restartable LRU list iteration

Fri Feb 16 14:00:11 UTC 2024

Am 16.02.24 um 14:13 schrieb Thomas Hellström:
> This patch-set is a prerequisite for a standalone TTM shrinker
> and for exhaustive TTM eviction using sleeping dma_resv locks,
> which is the motivation it.
>
> Currently when unlocking the TTM lru list lock, iteration needs
> to be restarted from the beginning, rather from the next LRU list
> node. This can potentially be a big problem, because if eviction
> or shrinking fails for whatever reason after unlock, restarting
> is likely to cause the same failure over and over again.

Oh, yes please. I have been working on that problem before as well, but 
wasn't able to come up with something working.

> There are various schemes to be able to continue the list
> iteration from where we left off. One such scheme used by the
> GEM LRU list traversal is to pull items already considered off
> the LRU list and reinsert them when iteration is done.
> This has the drawback that concurrent list iteration doesn't see
> the complete list (which is bad for exhaustive eviction) and also
> doesn't lend itself well to bulk-move sublists since these will
> be split in the process where items from those lists are
> temporarily pulled from the list and moved to the list tail.

Completely agree that this is not a desirable solution.

> The approach taken here is that list iterators insert themselves
> into the list next position using a special list node. Iteration
> is then using that list node as starting point when restarting.
> Concurrent iterators just skip over the special list nodes.
>
> This is implemented in patch 1 and 2.
>
> For bulk move sublist the approach is the same, but when a bulk
> move sublist is moved to the tail, the iterator is also moved,
> causing us to skip parts of the list. That is undesirable.
> Patch 3 deals with that, and when iterator detects it is
> traversing a sublist, it inserts a second restarting point just
> after the sublist and if the sublist is moved to the tail,
> it just uses the second restarting point instead.
>
> This is implemented in patch 3.

Interesting approach, that is probably even better than what I tried.

My approach was basically to not only lock the current BO, but also the 
next one. Since only a locked BO can move on the LRU we effectively 
created an anchor.

Before I dig into the code a couple of questions:
1. How do you distinct BOs and iteration anchors on the LRU?
2. How do you detect that a bulk list moved on the LRU?
3. How do you remove the iteration anchors from the bulk list?

Regards,
Christian.

>
> The restartable property is used in patch 4 to restart swapout if
> needed, but the main purpose is this paves the way for
> shrinker- and exhaustive eviction.
>
> Cc: Christian König <christian.koenig at amd.com>
> Cc: <dri-devel at lists.freedesktop.org>
>
> Thomas Hellström (4):
>    drm/ttm: Allow TTM LRU list nodes of different types
>    drm/ttm: Use LRU hitches
>    drm/ttm: Consider hitch moves within bulk sublist moves
>    drm/ttm: Allow continued swapout after -ENOSPC falure
>
>   drivers/gpu/drm/ttm/ttm_bo.c       |   1 +
>   drivers/gpu/drm/ttm/ttm_device.c   |  33 +++--
>   drivers/gpu/drm/ttm/ttm_resource.c | 202 +++++++++++++++++++++++------
>   include/drm/ttm/ttm_resource.h     |  91 +++++++++++--
>   4 files changed, 267 insertions(+), 60 deletions(-)
>