[RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation
Simona Vetter
simona.vetter at ffwll.ch
Wed Sep 25 11:41:41 UTC 2024
On Tue, Sep 24, 2024 at 07:36:21PM +0000, Matthew Brost wrote:
> On Tue, Sep 24, 2024 at 11:16:01AM +0200, Simona Vetter wrote:
> > On Tue, Aug 27, 2024 at 07:48:33PM -0700, Matthew Brost wrote:
> > > Continuation of SVM work by Oak Zeng [1][2] based on community feedback.
> > > Introduces GPU SVM layer and new Xe uAPI. Supports GPU page faults for
> > > system allocations (e.g., malloc), runtime allocations (e.g., binding a
> > > BO), migration to and from VRAM, and unified eviction (BO and SVM VRAM
> > > allocations can evict each other). Fully tested; more on this below.
> > >
> > > The patch breakdown is as follows:
> > >
> > > 1. Preparation patches already on the list [3].
> > > - Patches 1-3.
> > > - Please refrain from reviewing these here.
> > > 2. New migrate layer functionality
> > > - Patch 4.
> > > - Required for eviction to avoid locking inversion between
> > > dma-resv and mmap lock.
> > > 3. GPU SVM.
> > > - Patch 5.
> > > - This is what needs community review.
> > > - Inspired by GPUVM.
> > > - Kernel doc should explain design principles.
> > > - There is certainly room for optimization of the implementation
> > > and improvements with existing core MM interaction. Pulling in
> > > pending DMA mapping work [4] and additional core MM support
> > > for SVM is also likely desired. However, this serves as a good
> > > starting point for any SVM discussions and could be used as a
> > > stepping stone to future core MM work.
> > > 4. Basic SVM support in Xe (i.e., SRAM backing only).
> > > - Patches 6-15.
> > > - The uAPI in the patch could benefit from community input.
> > > 5. SVM VRAM migration support in Xe.
> > > - Patches 16-23.
> > > - Using TTM BOs for SVM VRAM allocations could use community
> > > input. Patch 23 has a detailed explanation of this design
> > > choice in the commit message.
> > > 6. SVM eviction support in Xe.
> > > - Patch 24.
> > > - Should work with exhaustive eviction [5] when it merges.
> > > 7. Xe SVM debug / tuning.
> > > - Patches 25-28.
> > >
> > > Kernel documentation and commit messages are relatively light, aside
> > > from the GPU SVM and uAPI patches, as this is an RFC.
> > >
> > > Testing has been conducted quite thoroughly with new IGT [6]. Various
> > > system allocation types (malloc, mmap, mmap flags, huge pages, different
> > > sizes, different alignments), mixing runtime allocations, unmapping
> > > corners, invalid faults, and eviction have been tested. Testing scales
> > > from single thread to multiple threads and multiple processes. Tests
> > > pass on LNL, BMG, PVC 1 tile, and PVC 2 tile.
> > >
> > > Follow-up work not included in this series:
> > >
> > > 1. Multiple GPU support.
> > > - This is likely to follow or occur in parallel to this work.
> > > 2. Userptr unification with GPU SVM.
> > > - This is essentially designed in my head (likely involving a
> > > few new GPU SVM layer functions) but would require some fairly
> > > invasive changes to Xe KMD to test out. Therefore, I would
> > > like GPU SVM to be reviewed first before proceeding with these
> > > changes.
> > > 3. Madvise and prefetch IOCTLs
> > > - This is likely to follow or occur in parallel to this work.
> > >
> > > Given the size of the series, I have pushed a GitLab branch for
> > > reference [7].
> > >
> > > Matt
> > >
> > > [1] https://patchwork.freedesktop.org/series/128910/
> > > [2] https://patchwork.freedesktop.org/series/132229/
> > > [3] https://patchwork.freedesktop.org/series/137805/
> > > [4] https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@kernel.org/
> > > [5] https://patchwork.freedesktop.org/series/133643/
> > > [6] https://patchwork.freedesktop.org/patch/610942/?series=137545&rev=2
> > > [7] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-post-8-27-24/-/tree/post?ref_type=heads
> >
> > Ok rather late, I wanted to type this up 2 weeks ago or so, but alas here
> > it is finally. I think with all the experiments and discussions I have
>
> Thanks for putting this together and all the initial reviews.
>
> > a fairly clear understanding of the really tricky parts of svm (thanks a
> > lot to Matt for all the work done). From my side the key points, sorted
> > roughly by how important I think they are:
> >
>
> I've read this through and pretty much agree with everything here, so I
> won't have a detailed response to every point; there isn't much to say
> aside from "I agree". A few minor comments below.
>
> > 1. migrate_to_ram path
> >
> > I think this is the absolute centerpiece of making sure we're aligned
> > with core mm and don't suffer from deadlocks or livelocks fundamental to
> > the gpusvm library design. So this part imo needs to be solid for the
> > first version we merge. But of course any core mm changes Matt prototyped
> > shouldn't gate merging the drm side since they're nicely decoupled; we
> > only need those to validate the design in testing.
> >
> > I think the key points are:
> >
> > - we rely on the migration pte, temporary page references and page lock
> > only, which, with the core mm changes Matt worked on, gives us guaranteed
> > reliable migration back to system memory. And we need that, or svm
> > essentially becomes unusable as a concept.
> >
> > - we need to support partial migration, including the worst case fallback
> > of only migrating that single page core mm managed to trylock for us
> > while holding the pagetable lock.
> >
> > Since we have guaranteed migration back to system memory we can make the
> > assumption on the gpu fault handling side that we will only ever handle
> > ranges that are entirely in vram (by throwing any partial migrations
> > out). Needs a retry loop for that on the gpu fault side, but I no longer
> > see an issue with that assumption otherwise, so partial vram support is
> > not needed for merging, or even later until we have a driver that
> > requires it.
> >
>
> I think pretty quickly we will add partial vram support / mixed mappings
> but likely will not be in initial merge.
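For illustration only, the shape of a migrate_to_ram() callback under the
constraints above (migration pte, temporary page references and page lock
only, partial migration allowed) could look roughly like the sketch below.
It loosely follows lib/test_hmm.c and handles just the single faulting page
as the worst-case fallback; xe_svm_owner() and xe_svm_copy_to_sram() are
made-up placeholders, not the actual Xe code, and a real driver would blit
asynchronously and wait on the fence before migrate_vma_pages().

#include <linux/memremap.h>
#include <linux/migrate.h>
#include <linux/mm.h>

static vm_fault_t xe_svm_migrate_to_ram(struct vm_fault *vmf)
{
	struct migrate_vma args = {
		.vma		= vmf->vma,
		.start		= vmf->address,
		.end		= vmf->address + PAGE_SIZE,
		.fault_page	= vmf->page,
		.pgmap_owner	= xe_svm_owner(vmf->page),	/* placeholder */
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
	};
	unsigned long src = 0, dst = 0;
	struct page *dpage;

	args.src = &src;
	args.dst = &dst;

	/* Collects and locks the page, installs the migration pte. */
	if (migrate_vma_setup(&args))
		return VM_FAULT_SIGBUS;

	/*
	 * Worst-case partial migration: anything core mm could not trylock
	 * for us is simply not marked for migration and gets skipped.
	 */
	if (src & MIGRATE_PFN_MIGRATE) {
		dpage = alloc_page(GFP_HIGHUSER);
		if (dpage) {
			lock_page(dpage);
			dst = migrate_pfn(page_to_pfn(dpage));
			/*
			 * Blit VRAM -> system memory (placeholder).  No
			 * driver memory-management locks may be taken here.
			 */
			xe_svm_copy_to_sram(migrate_pfn_to_page(src), dpage);
		}
	}

	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);

	return 0;
}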
>
> > - no other driver locks related to that memory range in any way are
> > allowed, and ideally we also test with the forced fallback to mmap_read
> > lock in do_swap_page removed, i.e. calling migrate_to_ram while holding
> > only the read vma lock. Of course driver locks for blitting are allowed;
> > it's only locks related to managing physical memory which are
> > problematic and could result in deadlocks.
> >
> > - the drm side must uphold the guarantee of not having elevated page
> > references without holding the page lock. Otherwise there's a race and
> > we cannot guarantee migration to sram.
> >
> > - also through the try_to_migrate maze we'll hit our own gpu pte
> > invalidate paths, so there are some requirements there too. But I've put
> > the discussion for that at the very bottom, since most of the gpu pte
> > locking questions are imo not that important, and definitely not
> > important for the first version we merge.
> >
> > Everything else below I think we can sort out post merge and just need
> > rough alignment on the design.
> >
> > 2. eviction
> >
> > Requirements much like migrate_to_ram, because otherwise we break the
> > migration guarantee:
> >
> > - Only looking at physical memory data structures and locks, no looking
> > at mm/vma structs or relying on those being locked. We rely entirely on
> > reverse maps from try_to_migrate to find all the mappings on both the cpu
> > and gpu side (on the cpu side only zone device swap or migration pte
> > entries, ofc).
> >
> > - Partial migration needs to work to make sure we can get out of any
> > low memory bind.
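A rough sketch of what eviction looks like when it stays purely on the
physical side, with no mm or vma involvement, using the existing
migrate_device_range()/migrate_device_pages()/migrate_device_finalize()
helpers much like nouveau's chunk eviction does (the series adds its own
helper in patch 4); struct xe_vram_chunk and the per-chunk bookkeeping
below are made up for illustration, and the blit is a placeholder:

#include <linux/migrate.h>
#include <linux/mm.h>

/* Hypothetical per-chunk bookkeeping, for illustration only. */
struct xe_vram_chunk {
	unsigned long base_pfn;		/* first device pfn of the chunk */
	unsigned long npages;
};

static int xe_svm_evict_chunk(struct xe_vram_chunk *chunk)
{
	unsigned long npages = chunk->npages;
	unsigned long *src, *dst;
	unsigned long i;
	int ret = 0;

	src = kvcalloc(npages, sizeof(*src), GFP_KERNEL);
	dst = kvcalloc(npages, sizeof(*dst), GFP_KERNEL);
	if (!src || !dst) {
		ret = -ENOMEM;
		goto out;
	}

	/*
	 * Walks the device pfns, trylocks what it can and installs
	 * migration entries via the rmap (try_to_migrate()); no vma or
	 * mmap lock involved at all.
	 */
	migrate_device_range(src, chunk->base_pfn, npages);

	for (i = 0; i < npages; i++) {
		struct page *spage;

		/* Partial migration: skip anything that was not isolated. */
		if (!(src[i] & MIGRATE_PFN_MIGRATE))
			continue;

		spage = alloc_page(GFP_HIGHUSER);
		if (!spage)
			continue;	/* again, partial is fine */

		lock_page(spage);
		dst[i] = migrate_pfn(page_to_pfn(spage));
		/* Blit VRAM page i -> spage here (placeholder). */
	}

	migrate_device_pages(src, dst, npages);
	migrate_device_finalize(src, dst, npages);
out:
	kvfree(src);
	kvfree(dst);
	return ret;
}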
> >
> > 3. gpu fault side
> >
> > - We can only rely on mmap_read for hmm_range_fault. And ideally we
> > should assume that's not held anywhere else, since with per-vma locking
> > I'd expect the mm/vma locking will move into hmm_range_fault. This also
> > means no looking at the vma beyond just passing it around as currently
> > needed by core mm functions.
> >
> > - Big retry loop to handle all races with the mmu notifier under the gpu
> > pagetable locks/mmu notifier range lock/whatever we end up calling
> > those. Races (especially against concurrent eviction/migrate_to_ram)
> > should _not_ be handled on the fault side by trying to hold locks
> > instead.
> >
> > - Long term I think we need to be able to handle concurrent faults, even
> > on hw where there's only one gpu fault handling queue. For performance
> > we absolutely want to prefault aggressively, and that likely happens
> > through cpu ioctls that are entirely independent from the gpu fault
> > handling.
> >
>
> I agree the long term goal is to handle concurrent GPU faults, and with a
> bit of finer grained locking in Xe I have already made this work. The
> biggest part which needs to be parallel is the migration code, which takes
> up roughly 98% of the time in the GPU fault handler for a 2M fault, with a
> split of 90% in the migrate_vma_* functions and 8% in the copy job.
> I've seen tests that mirror multiple EUs from the same process taking
> concurrent GPU faults show large gains in perf. Also the CPU fault
> handler is concurrent, so it makes sense that GPU faults are concurrent
> too.
>
> GPU faults being concurrent should also enable concurrent prefetches
> from CPU IOCTLs, which is also likely desired.
>
> I'm not going to include this or any of the other optimizations I have
> worked on in what I try to merge initially, though, as I want to keep this
> as simple as possible and also don't want to throw more code at the list
> until a working baseline is in.
>
> > Short term, enough (driver-side) locking to make sure this doesn't go
> > boom will do; I think just some design goal documentation here on how
> > to achieve that is all we need.
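The gpu fault side retry loop being described is essentially the standard
mmu_interval_notifier / hmm_range_fault pattern from
Documentation/mm/hmm.rst. A minimal sketch; struct gpusvm_range, its
pt_lock and gpusvm_bind_pages() are placeholder names for illustration, not
anything from the patches:

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

/* Hypothetical per-range state, for illustration only. */
struct gpusvm_range {
	struct mmu_interval_notifier notifier;
	struct mutex pt_lock;	/* protects GPU PTEs + seqno, nothing else */
	unsigned long start, end;
	void *owner;		/* dev_private_owner for our VRAM pages */
};

static int gpusvm_fault_range(struct gpusvm_range *r, struct mm_struct *mm,
			      unsigned long *hmm_pfns)
{
	struct hmm_range range = {
		.notifier		= &r->notifier,
		.start			= r->start,
		.end			= r->end,
		.hmm_pfns		= hmm_pfns,
		.default_flags		= HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
		.dev_private_owner	= r->owner,
	};
	int err;

again:
	range.notifier_seq = mmu_interval_read_begin(range.notifier);

	/* Only mmap_read is held around hmm_range_fault(), nothing else. */
	mmap_read_lock(mm);
	err = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (err == -EBUSY)
		goto again;
	if (err)
		return err;

	/*
	 * The "pagetable lock": the same lock the mmu notifier invalidate
	 * callback takes.  Any race with invalidation, eviction or
	 * migrate_to_ram is handled by retrying, not by holding more locks.
	 */
	mutex_lock(&r->pt_lock);
	if (mmu_interval_read_retry(&r->notifier, range.notifier_seq)) {
		mutex_unlock(&r->pt_lock);
		goto again;
	}
	err = gpusvm_bind_pages(r, hmm_pfns);	/* program GPU PTEs (placeholder) */
	mutex_unlock(&r->pt_lock);

	return err;
}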
> >
> > 4. physical memory to virtual backpointer
> >
> > No. Doesn't work :-) Also it's only used in the eviction/migrate_to_ram
> > path, and I think Matt already fixed this all anyway.
> >
> > 5. gpu pagetable locking
> >
> > Or mmu notifier range locking or whatever you want to call it. Like on
> > the cpu side, this should _only_ protect the pagetable entries and,
> > additionally for us, the mmu notifier seqno tracking, nothing else.
> >
> > Any races due to concurrent eviction/migrate_to_ram/gpu fault/prefault
> > need to be handled by retrying outside of holding the pagetable locks. If
> > we try to impose additional consistency guarantees we'll fall over and
> > have a livelock/deadlock fight with core mm in migrate_to_ram. This part
> > is required I think for the first version, but we already need that anyway
> > to make migrate_to_ram work properly.
> >
> > For the actual data structure/locking design I think anything on the
> > design line between a single global lock and the radix tree over-the-top
> > scalable per-pagetable (spin)lock design of the core mm is fine.
> >
>
> I've seen the bind step in servicing GPU faults take barely any time, so
> having the GPU page tables protected by the VM's dma-resv lock seems fine
> in Xe. This is really up to each driver to decide how it wants to handle,
> too.
I think if we go to a more fine-grained approach it'd make sense to have
this common, especially if we go with something more radix-tree based.
It's quite tricky to get right in the more extreme cases (*shudders*
looking at the cpu pgtable walking code), and with mmu notifier seqno
handling there's some additional complexity. And for drivers which don't
need that fine-grained approach it shouldn't hurt.
Also with pagetable lock here I meant the one you're also taking from the
mmu notifier invalidate callback, so unless I'm completely lost that's not
the dma_resv lock of the vm. That is iirc what you call "driver lock" and
I think for initial merging your current approach is entirely fine. It
would also need some changes for concurrent gpu pte (pre-)fault handling,
but that's not what I'm talking about here directly.
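For reference, the invalidate side of that pattern, i.e. what "pagetable
lock" refers to here (again just standard mmu_interval_notifier usage,
reusing the placeholder gpusvm_range from the fault-side sketch above;
gpusvm_invalidate_ptes() is likewise made up):

static bool gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
				       const struct mmu_notifier_range *range,
				       unsigned long cur_seq)
{
	struct gpusvm_range *r = container_of(mni, struct gpusvm_range,
					      notifier);

	if (!mmu_notifier_range_blockable(range))
		return false;

	mutex_lock(&r->pt_lock);
	mmu_interval_set_seq(mni, cur_seq);
	/*
	 * Zap GPU PTEs covering [range->start, range->end) and issue a TLB
	 * invalidation (placeholder).  No memory-management locks in here,
	 * since this also runs underneath try_to_migrate().
	 */
	gpusvm_invalidate_ptes(r, range->start, range->end);
	mutex_unlock(&r->pt_lock);

	return true;
}

static const struct mmu_interval_notifier_ops gpusvm_notifier_ops = {
	.invalidate = gpusvm_notifier_invalidate,
};

In this shape the notifier would be registered over the tracked range with
mmu_interval_notifier_insert(), and eviction or migrate_to_ram reaching the
GPU mappings via try_to_migrate() ends up in this same callback.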
> > The design here with 3 levels (mmu notifier, range, struct page) wouldn't
> > be my first choice, but clearly fits on that line so imo is fine for
> > initial merging. We might want to make sure that the range locking (I
> > guess mostly relevant for the invalidate side, drivers don't see much
> > else) is somewhat abstracted so we can easily change that post-merge, but
> > not required imo at all.
> >
> > For consensus documentation I'd recommend a todo or design documentation
> > patch, where we put down both the current design and why it's like that,
> > and some of the longer term goals. Then get that acked (imo needs at least
> > one other driver that's seriously interested in this, plus I think an ack
> > from Danilo for gpuvm interactions), then merge that. SVM is tricky enough
> > that I think this would be really useful to make sure we're not
> > unnecessarily stuck in limbo.
> >
>
> I'll include a TODO or design documentation in the next rev.
>
> > From my side again I think the only part we really have to get right from
> > the start is migrate_to_ram. And I'm confident we've got that now really
> > solid.
> >
>
> I think most, if not all, of the 5 points will be addressed in my next
> rev. Anything that isn't falls into an 'optimization we can do later'
> category, and the design should be coded in a way that these
> optimizations can easily be added.
Just quickly read through your comments, and aside from the pagetable
locking one (which I think is just a bit of unclarity, not disagreement) I
agree with all of them.
-Sima
> Matt
>
> > Oh also you need userspace ofc :-)
> >
> > Cheers, Sima
> >
> > > Matthew Brost (28):
> > > dma-buf: Split out dma fence array create into alloc and arm functions
> > > drm/xe: Invalidate media_gt TLBs in PT code
> > > drm/xe: Retry BO allocation
> > > mm/migrate: Add migrate_device_vma_range
> > > drm/gpusvm: Add support for GPU Shared Virtual Memory
> > > drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATON flag
> > > drm/xe: Add SVM init / fini to faulting VMs
> > > drm/xe: Add dma_addr res cursor
> > > drm/xe: Add SVM range invalidation
> > > drm/gpuvm: Add DRM_GPUVA_OP_USER
> > > drm/xe: Add (re)bind to SVM page fault handler
> > > drm/xe: Add SVM garbage collector
> > > drm/xe: Add unbind to SVM garbage collector
> > > drm/xe: Do not allow system allocator VMA unbind if the GPU has
> > > bindings
> > > drm/xe: Enable system allocator uAPI
> > > drm/xe: Add migrate layer functions for SVM support
> > > drm/xe: Add SVM device memory mirroring
> > > drm/xe: Add GPUSVM copy SRAM / VRAM vfunc functions
> > > drm/xe: Update PT layer to understand ranges in VRAM
> > > drm/xe: Add Xe SVM populate_vram_pfn vfunc
> > > drm/xe: Add Xe SVM vram_release vfunc
> > > drm/xe: Add BO flags required for SVM
> > > drm/xe: Add SVM VRAM migration
> > > drm/xe: Basic SVM BO eviction
> > > drm/xe: Add SVM debug
> > > drm/xe: Add modparam for SVM notifier size
> > > drm/xe: Add modparam for SVM prefault
> > > drm/gpusvm: Ensure all pages migrated upon eviction
> > >
> > > drivers/dma-buf/dma-fence-array.c | 78 +-
> > > drivers/gpu/drm/xe/Makefile | 4 +-
> > > drivers/gpu/drm/xe/drm_gpusvm.c | 2213 ++++++++++++++++++++++++++
> > > drivers/gpu/drm/xe/drm_gpusvm.h | 415 +++++
> > > drivers/gpu/drm/xe/xe_bo.c | 54 +-
> > > drivers/gpu/drm/xe/xe_bo.h | 2 +
> > > drivers/gpu/drm/xe/xe_bo_types.h | 3 +
> > > drivers/gpu/drm/xe/xe_device_types.h | 8 +
> > > drivers/gpu/drm/xe/xe_gt_pagefault.c | 17 +-
> > > drivers/gpu/drm/xe/xe_migrate.c | 150 ++
> > > drivers/gpu/drm/xe/xe_migrate.h | 10 +
> > > drivers/gpu/drm/xe/xe_module.c | 7 +
> > > drivers/gpu/drm/xe/xe_module.h | 2 +
> > > drivers/gpu/drm/xe/xe_pt.c | 456 +++++-
> > > drivers/gpu/drm/xe/xe_pt.h | 3 +
> > > drivers/gpu/drm/xe/xe_pt_types.h | 2 +
> > > drivers/gpu/drm/xe/xe_res_cursor.h | 50 +-
> > > drivers/gpu/drm/xe/xe_svm.c | 775 +++++++++
> > > drivers/gpu/drm/xe/xe_svm.h | 70 +
> > > drivers/gpu/drm/xe/xe_tile.c | 5 +
> > > drivers/gpu/drm/xe/xe_vm.c | 286 +++-
> > > drivers/gpu/drm/xe/xe_vm.h | 15 +-
> > > drivers/gpu/drm/xe/xe_vm_types.h | 44 +
> > > include/drm/drm_gpuvm.h | 5 +
> > > include/linux/dma-fence-array.h | 6 +
> > > include/linux/migrate.h | 3 +
> > > include/uapi/drm/xe_drm.h | 19 +-
> > > mm/migrate_device.c | 53 +
> > > 28 files changed, 4615 insertions(+), 140 deletions(-)
> > > create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
> > > create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
> > > create mode 100644 drivers/gpu/drm/xe/xe_svm.c
> > > create mode 100644 drivers/gpu/drm/xe/xe_svm.h
> > >
> > > --
> > > 2.34.1
> > >
> >
> > --
> > Simona Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch