[RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
Christian König
christian.koenig at amd.com
Mon Sep 2 11:01:45 UTC 2024
On 30.08.24 at 00:12, Matthew Brost wrote:
> On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
>> On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
>>> But as Sima pointed out in private communication, exhaustive eviction
>>> is not really needed for faulting to make (crawling) progress.
>>> Watermarks and VRAM trylock shrinking should suffice, since we're
>>> strictly only required to service a single gpu page granule at a time.
>>>
>>> However, ordinary bo-based jobs would still like to be able to
>>> completely evict SVM vram. Whether that is important enough to strive
>>> for is of course up for discussion.
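For illustration, a minimal sketch of what that "VRAM trylock shrinking"
could look like, assuming a driver-private VRAM LRU and a hypothetical
my_evict_one_bo() helper (neither is in the series, and LRU locking is
elided); the only point is that a fault needs a single page granule, so a
trylock-only walk that bails out early is enough:

#include <linux/dma-resv.h>
#include <linux/list.h>

/* Made-up BO wrapper, just for this sketch. */
struct my_vram_bo {
        struct list_head lru;
        struct dma_resv *resv;
};

/* Hypothetical: evict one BO's VRAM, return number of pages freed. */
long my_evict_one_bo(struct my_vram_bo *bo);

static long my_vram_trylock_shrink(struct list_head *vram_lru,
                                   unsigned long pages_needed)
{
        struct my_vram_bo *bo;
        long freed = 0;

        list_for_each_entry(bo, vram_lru, lru) {
                /* Trylock only: never wait on a contended reservation here. */
                if (!dma_resv_trylock(bo->resv))
                        continue;

                freed += my_evict_one_bo(bo);
                dma_resv_unlock(bo->resv);

                /* A fault only needs one GPU page granule, so stop early. */
                if (freed >= pages_needed)
                        break;
        }
        return freed;
}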
>> My take is that you don't win anything for exhaustive eviction by having
>> the dma_resv somewhere in there for svm allocations. Roughly for split lru
>> world, where svm ignores bo/dma_resv:
>>
>> When evicting vram from the ttm side we'll fairly switch between selecting
>> bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
>> will eventually succeed in vacuuming up everything (with a few retries
>> perhaps, if we're not yet at the head of the ww ticket queue).
>>
>> svm pages we need to try to evict anyway - there's no guarantee, because
>> the core mm might be holding temporary page references (which block
> Yeah, but I think you could kill the app then - not suggesting we
> should, but we could. To me this is akin to a CPU fault where we fail
> to migrate the device pages - the migration layer doc says when this
> happens, kick it back to user space and segfault the app.
That's most likely a bad idea. That the core holds a temporary page
reference can happen at any time without any wrongdoing by the
application, e.g. for direct I/O, swapping etc...
So you can't punish the application with a segfault just because you
happen to be unable to migrate a page while it still has a reference.
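To make that concrete, here is a rough sketch built on the generic
migrate_vma_* flow (the copy step my_copy_to_vram() and the surrounding
driver glue are made up, not the code from this series): pages the core mm
holds a reference on simply come back without MIGRATE_PFN_MIGRATE set, and
the right reaction is to skip them and let a later fault retry, not to
segfault the application.

#include <linux/migrate.h>
#include <linux/mm.h>

/* Hypothetical: allocate device pages for dst[] and copy the data over. */
void my_copy_to_vram(struct migrate_vma *args);

static int my_migrate_range_to_vram(struct vm_area_struct *vma,
                                    unsigned long start, unsigned long end,
                                    unsigned long *src, unsigned long *dst,
                                    void *pgmap_owner)
{
        struct migrate_vma args = {
                .vma         = vma,
                .start       = start,
                .end         = end,
                .src         = src,
                .dst         = dst,
                .pgmap_owner = pgmap_owner,
                .flags       = MIGRATE_VMA_SELECT_SYSTEM,
        };
        unsigned long i;
        int ret;

        ret = migrate_vma_setup(&args);
        if (ret)
                return ret;

        for (i = 0; i < args.npages; i++) {
                /*
                 * A temporary reference (direct I/O, swap, GUP, ...) or a
                 * failed page trylock means this page is not migratable
                 * right now: MIGRATE_PFN_MIGRATE is clear. Skip it and let
                 * a later fault retry - do not kill the application.
                 */
                if (!(args.src[i] & MIGRATE_PFN_MIGRATE))
                        args.dst[i] = 0;
        }

        if (args.cpages)
                my_copy_to_vram(&args);

        migrate_vma_pages(&args);
        migrate_vma_finalize(&args);

        return args.cpages == args.npages ? 0 : -EAGAIN;
}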
Regards,
Christian.
>
> My last patch in the series adds some asserts to see if this ever
> happens; so far it never has. If it occurs we could handle it gracefully
> by aborting the migration, I guess... I think the user really needs to do
> something a bit crazy to trigger this condition - I don't think the core
> randomly grabs refs to device pages, but I could be wrong.
>
>> migration) or have the page locked (which also blocks the migration). But
>> as long as those two steps succeed, we'll win and get the pages. There
>> might be some thrashing against concurrent svm faults stealing them again,
>> but they have a disadvantage since they can't steal dma_resv-locked bos.
>> And if it's still too much we can stall them in the page allocator.
>>
>> So it's not entirely reliable, but should be close enough.
>>
>> Now for bo-based svm the picture isn't any different, because holding
>> dma_resv is not actually enough to migrate svm mappings. We still need to
>> hope there are no temporary page references around, and we still need to
>> succeed at locking the page. And the migration code only does trylocks,
>> because that's its deadlock-prevention algorithm for when different
>> migrations need the same set of pages but acquire them in a different
>> order. So we win nothing.
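A tiny illustration of why those trylocks matter (not code from the series;
collect_for_migration() is made up): if two migrations want an overlapping
set of pages in a different order, sleeping on a page lock can deadlock
(A holds X and waits for Y, B holds Y and waits for X), so the only safe
move is to trylock and skip.

#include <linux/pagemap.h>
#include <linux/list.h>

/*
 * Collect as many folios as we can for migration. Contended folios are
 * skipped, which is what breaks the ABBA cycle; the caller must unlock
 * whatever was collected once it is done with it.
 */
static unsigned long collect_for_migration(struct folio **folios,
                                           unsigned long n,
                                           struct list_head *out)
{
        unsigned long i, collected = 0;

        for (i = 0; i < n; i++) {
                if (!folio_trylock(folios[i]))
                        continue;       /* contended: back off, never sleep */
                list_add_tail(&folios[i]->lru, out);
                collected++;
        }
        return collected;
}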
> Ok, maybe my statement above is false...
>
> Wouldn't the only time this fails be when another migration is in
> flight (e.g. a CPU fault) and they race? Then the eviction will naturally
> happen via the refcount being dropped by the other migration. I guess I
> likely need to update my eviction code to not free the TTM resource if
> not all pages are migrated.
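For what it's worth, a rough sketch of that bookkeeping (my_svm_bo and
my_evict_svm_range() are hypothetical; ttm_resource_free() is the real TTM
call, used here as a stand-in for whatever the actual eviction path does):
the VRAM resource is only released once every device page in the range has
really moved out.

#include <drm/ttm/ttm_bo.h>
#include <drm/ttm/ttm_resource.h>
#include <linux/errno.h>

/* Made-up SVM BO wrapper, just for this sketch. */
struct my_svm_bo {
        struct ttm_buffer_object ttm;
        unsigned long npages;
};

/* Hypothetical: try to migrate all device pages out, return how many moved. */
unsigned long my_evict_svm_range(struct my_svm_bo *bo);

static int my_evict_svm_bo(struct my_svm_bo *bo)
{
        unsigned long migrated = my_evict_svm_range(bo);

        if (migrated < bo->npages) {
                /*
                 * Some pages still have users (e.g. a racing CPU fault holds
                 * references), so their VRAM must stay allocated: keep the
                 * TTM resource and ask the caller to retry later.
                 */
                return -EAGAIN;
        }

        /* Everything moved out, so the backing VRAM can really be freed. */
        ttm_resource_free(&bo->ttm, &bo->ttm.resource);
        return 0;
}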
>
>> Worse, if dma_resv does actually hold up svm migration and reclaim, then
>> we potentially deadlock because that lock is for a bigger range than
>> individual pages (or folios). And the core mm assumes that it can get out
>> of a deadlock bind by (at least stochastically) eventually succeeding in
>> acquiring/locking down a single page.
>>
>> This means we cannot use dma_resv tricks to give the ttm world an
>> advantage in exhaustive eviction against concurrent svm faults. Or at
>> least no more than we can achieve without them, by just stalling svm
>> faults that need to allocate gpu memory (but that stalling must happen
>> without holding locks or we're busted).
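As a strawman for the "stall svm faults without holding locks" part,
something along these lines (the waitqueue and the flag are made up, not
from the series) - the one hard rule being that the wait happens before the
fault path takes any locks:

#include <linux/wait.h>
#include <linux/atomic.h>

/* Made-up throttle state, not from the series. */
static DECLARE_WAIT_QUEUE_HEAD(my_vram_wq);
static atomic_t my_exhaustive_eviction_busy = ATOMIC_INIT(0);

/* Called from the SVM fault handler before it takes any locks. */
static void my_svm_fault_throttle(void)
{
        /*
         * Sleeping here is fine only because no locks are held yet;
         * doing this after locking would recreate the inversion
         * discussed above.
         */
        wait_event(my_vram_wq, !atomic_read(&my_exhaustive_eviction_busy));
}

/* Called by the exhaustive-eviction path when it is done. */
static void my_exhaustive_eviction_done(void)
{
        atomic_set(&my_exhaustive_eviction_busy, 0);
        wake_up_all(&my_vram_wq);
}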
>>
> I'm a little lost here on the deadlock case. Do you mean that when we try
> to evict an SVM BO we trigger reclaim by allocating system pages and can
> deadlock? Doesn't TTM already have this dependency when evicting non-SVM
> BOs?
>
>> So the only benefit I'm seeing is the unified lru, which I'm not sure is
>> worth it. There's also a bit of an lru design tension here, because for the bo
> Well also not rewriting the world...
>
> Matt
>
>> world we want objects that are locked to stay on the lru, so that the
>> competing processes can figure out who has the winning ww ticket. The core
>> mm design otoh does isolate pages and remove them from the lru when
>> they're acquired, so that they don't gunk up other processes trying
>> to make forward progress and are better hidden. Which reduces temporary
>> page references (from lru walks) preventing migration and stuff like that.
>>
>> Cheers, Sima
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch