[Intel-xe] bulk_move in ttm_resource manager
Thomas Hellström
thomas.hellstrom at linux.intel.com
Thu Oct 5 08:36:39 UTC 2023
On Wed, 2023-10-04 at 14:44 +0200, Christian König wrote:
> On 04.10.23 09:17, Thomas Hellström wrote:
> > On Wed, 2023-10-04 at 03:52 +0000, Zeng, Oak wrote:
> > > Hi Christian,
> > >
> > > As a follow-up to this thread:
> > > https://www.spinics.net/lists/dri-devel/msg410740.html, I started
> > > the work of moving the LRU out of ttm_resource_manager and making
> > > it a common library for both TTM and SVM. While looking into the
> > > details of the bulk_move in the TTM resource manager, I found a
> > > potential problem:
> > >
> > > For simplicity, let’s say we only have one memory type and one
> > > priority, so the TTM resource manager maintains a single global
> > > LRU list. Let’s say this list has 10 nodes, node1 to node10.
> > >
> > > But the lru_bulk_move is per VM. Let’s say vm1 has a bulk_move
> > > covering the node range [node4, node7] and vm2 has a bulk_move
> > > covering the node range [node6, node9]. Notice that those two
> > > ranges overlap. Since two VMs can add nodes to the LRU
> > > concurrently, I think this scenario can happen.
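
For readers following along: the per-VM cursor being discussed is,
roughly, a pair of first/last pointers that TTM keeps per memory type
and priority, simplified here from include/drm/ttm/ttm_resource.h:

    struct ttm_lru_bulk_move_pos {
            struct ttm_resource *first; /* first resource of the range */
            struct ttm_resource *last;  /* last resource of the range */
    };

    struct ttm_lru_bulk_move {
            /* one contiguous LRU range per memory type and priority */
            struct ttm_lru_bulk_move_pos
                    pos[TTM_NUM_MEM_TYPES][TTM_MAX_BO_PRIORITY];
    };
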
>
> That can't happen. See what ttm_resource_move_to_lru_tail() does when
> the BO has a bulk move associated with it.
>
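
For reference, the mechanism is roughly the following, paraphrased
from drivers/gpu/drm/ttm/ttm_resource.c (details trimmed): a resource
that belongs to a bulk move is only ever re-inserted at the tail of
its own [first, last] range, never at the global LRU tail, so the
per-VM ranges stay contiguous and disjoint:

    static void ttm_lru_bulk_move_pos_tail(struct ttm_lru_bulk_move_pos *pos,
                                           struct ttm_resource *res)
    {
            if (pos->last != res) {
                    if (pos->first == res)
                            pos->first = list_next_entry(res, lru);
                    /* Re-insert directly behind this VM's current
                     * last entry, staying inside the VM's range. */
                    list_move(&res->lru, &pos->last->lru);
                    pos->last = res;
            }
    }
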
> > >
> > > Now suppose we perform a bulk move for vm1, moving [node4, node7]
> > > to the tail of the LRU list. The LRU after this bulk move will
> > > be: node1, node2, node3, node8, node9, node10, node4, node5,
> > > node6, node7. Notice that for vm2’s bulk_move, the first pointer
> > > (pointing to node6) now sits after the last pointer (pointing to
> > > node9), which doesn’t make sense.
> > >
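
As an aside, the splice a bulk move performs is just
list_bulk_move_tail() from include/linux/list.h, which moves a
contiguous [first, last] run to the list tail in O(1); this is also
why the run must stay contiguous for the cursors to remain valid:

    static inline void list_bulk_move_tail(struct list_head *head,
                                           struct list_head *first,
                                           struct list_head *last)
    {
            /* Unlink the whole [first, last] run... */
            first->prev->next = last->next;
            last->next->prev = first->prev;

            /* ...and splice it in just before head, i.e. at the
             * tail of the LRU. */
            head->prev->next = first;
            first->prev = head->prev;
            last->next = head;
            head->prev = last;
    }
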
> > > Is this a real problem? As I understand it, this would only mess
> > > up the LRU list order; there wouldn’t be any functional problem.
> > > If it is a real problem, should we make the bulk_move global
> > > instead of per-VM?
> > >
> > > Thanks,
> > > Oak
> > >
> > FWIW I have a patch set that converts the TTM bulk move code to
> > use sublists; a list item is either a resource or a sublist, and
> > performing a bulk move essentially just moves the sublist. Bumping
> > a resource's LRU position within a VM would touch only the sublist.
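
To sketch the concept (hypothetical code, names invented for
illustration; the real patches are in the MR linked below):

    /* An LRU entry is either a resource or a nested sublist. */
    struct lru_entry {
            struct list_head link;  /* position in the parent list */
            bool is_sublist;        /* resource or nested sublist? */
    };

    struct lru_sublist {
            struct lru_entry entry; /* one node in the parent LRU */
            struct list_head items; /* this VM's resources */
    };

    /* A bulk move is then a single list_move_tail() of entry.link;
     * bumping an individual resource only reorders 'items'. */
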
>
> That sounds like my very first attempt at bulk moves, which we
> abandoned for various reasons.
>
> That's easily >5 years ago, but the history of that should still be
> on the mailing list if I'm not completely mistaken.
This one?
https://lists.freedesktop.org/archives/amd-gfx/2018-August/025016.html
No, in that case it's very different. Or is it an even earlier version?
/Thomas
>
> Regards,
> Christian.
>
> >
> > Currently the functionality and TTM API are essentially the same,
> > but when experimenting with LRU traversal for exhaustive WW-locking
> > eviction this concept was easier to use. Also, hopefully this would
> > reduce fragility and improve understanding, since a scenario like
> > the above could never really happen...
> >
> > Let me know if I should send it out as an RFC.
> >
> > Code is here:
> > https://gitlab.freedesktop.org/drm/xe/kernel/-/merge_requests/351/commits
> >
> > /Thomas
> >
>