[RFC PATCH 23/28] drm/xe: Add SVM VRAM migration

Mon Sep 2 12:50:40 UTC 2024

On Mon, Sep 02, 2024 at 01:01:45PM +0200, Christian König wrote:
> Am 30.08.24 um 00:12 schrieb Matthew Brost:
> > On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
> > > On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
> > > > But as Sima pointed out in private communication, exhaustive eviction
> > > > is not really needed for faulting to make (crawling) progress.
> > > > Watermarks and VRAM trylock shrinking should suffice, since we're
> > > > strictly only required to service a single gpu page granule at a time.
> > > > 
> > > > However, ordinary bo-based jobs would still like to be able to
> > > > completely evict SVM vram. Whether that is important enough to strive
> > > > for is ofc up for discussion.
> > > My take is that you don't win anything for exhaustive eviction by having
> > > the dma_resv somewhere in there for svm allocations. Roughly for split lru
> > > world, where svm ignores bo/dma_resv:
> > > 
> > > When evicting vram from the ttm side we'll fairly switch between selecting
> > > bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
> > > will eventually succeed in vacuuming up everything (with a few retries
> > > perhaps, if we're not yet at the head of the ww ticket queue).
> > > 
> > > svm pages we need to try to evict anyway - there's no guarantee, becaue
> > > the core mm might be holding temporary page references (which block
> > Yea, but think you can could kill the app then - not suggesting we
> > should but could. To me this is akin to a CPU fault and not being able
> > to migrate the device pages - the migration layer doc says when this
> > happens kick this to user space and segfault the app.
> 
> That's most likely a bad idea. That the core holds a temporary page
> reference can happen any time without any bad doing from the application.
> E.g. for direct I/O, swapping etc...
> 
> So you can't punish the application with a segfault if you happen to not be
> able to migrate a page because it has a reference.

See my other reply, it even happens as a direct consequence of a 2nd
thread trying to migrate the exact same page from vram to sram. And that
really is a core use case.

RESo yeah, we really can't SIGBUS on this case.
-Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch