bulk_move in ttm_resource manager
Christian König
christian.koenig at amd.com
Thu Oct 5 10:44:30 UTC 2023
Am 05.10.23 um 10:36 schrieb Thomas Hellström:
> On Wed, 2023-10-04 at 14:44 +0200, Christian König wrote:
>> Am 04.10.23 um 09:17 schrieb Thomas Hellström:
>>> On Wed, 2023-10-04 at 03:52 +0000, Zeng, Oak wrote:
>>>> Hi Christian,
>>>>
>>>> As a follow up to this thread:
>>>> https://www.spinics.net/lists/dri-devel/msg410740.html, I started
>>>> the
>>>> work of moving the lru out of ttm_resource_manager and make it a
>>>> common library for both ttm and svm. While look into the details
>>>> of
>>>> the bulk_move in ttm resource manager, I found a potential
>>>> problem:
>>>>
>>>> For simplicity, let’s say we only have one memory type and one
>>>> priority, so ttm resource manager only maintains one global lru
>>>> list.
>>>> Let’s say this list has 10 nodes, node1 to node10.
>>>>
>>>> But the lru_bulk_move is per vm. Let’s say vm1 has a bulk_move
>>>> covering node range [node4, node7] and vm2 has a bulk_move
>>>> covering
>>>> node range [node6, node9]. Notice those two range has an overlap.
>>>> Since two vm can simultaneously add nodes to lru, I think this
>>>> scenario can happen.
>> That can't happen. See what ttm_resource_move_to_lru_tail() does when
>> the BO has a bulk move associated with it.
>>
>>>>
>>>> Now if we perform a bulk move for vm1, moving [node4, node7] to
>>>> the
>>>> tail of the lru list. The lru after this bulk move will be:
>>>> node1,
>>>> node2, node3,node8,node9, node10, node4, node5, node6, node7. Now
>>>> notice that for vm2’s bulk_move, the first pointer (pointing to
>>>> node6) is actually after the last pointer (pointing to node9),
>>>> which
>>>> doesn’t make sense.
>>>>
>>>> Is this a real problem? As I understand it, with this issue, we
>>>> only
>>>> mess up the lru list order, but there won’t be any functional
>>>> problem. If it is a real problem, should we make the bulk_move
>>>> global
>>>> instead of per vm based?
>>>>
>>>> Thanks,
>>>> Oak
>>>>
>>> FWIW I have a patch set that converts the TTM bulk move code to
>>> using
>>> sublists; a list item is either a resource or a sublist, and when
>>> performing a bulk move essentially the sublist is moved. Bumping
>>> resource LRU within a VM would touch only the sublist.
>> That sounds like my very first attempt at bulk moves which we
>> abandoned
>> for various reasons.
>>
>> That's easily >5years ago, but the history of that should still be on
>> the mailing list if I'm not completely mistaken.
> This here?
>
> https://lists.freedesktop.org/archives/amd-gfx/2018-August/025016.html
>
> No, in that case it's very different. Or is it an even earlier version?
No, that was even earlier. Basically the first version I discussed with
Chunming.
The issue was simple that when you have a hierarchically LRU you also
need a multi layer cursor and make sure that you have a single lock for
everything.
This is multi layer cursor is complicated to implement and contradicts
the idea that we want to walk the LRU with anchors and dropping locks in
between (not that we ever implemented that, but it would still be nice
to have).
In general when you use some hierarchical LRU you just move the
complexity from the insert function to the walk function. And I don't
think we would win much with that.
Regards,
Christian.
>
> /Thomas
>
>
>> Regards,
>> Christian.
>>
>>> Currently functionality and TTM API is essentially the same but
>>> when
>>> experimenting with LRU traversal for exhaustive WW-locking eviction
>>> this concept was easier to use. Also hopefully this would reduce
>>> fragility and improve understanding since a scenario like the above
>>> could really never happen...
>>>
>>> Let me know if I should send it out as an RFC.
>>>
>>> Code is here:
>>> https://gitlab.freedesktop.org/drm/xe/kernel/-/merge_requests/351/commits
>>>
>>> /Thomas
>>>
>>>
>>>
>>>
>>>
More information about the dri-devel
mailing list