[PATCH v2 29/29] drm/doc: gpusvm: Add GPU SVM documentation

Tue Dec 17 23:14:47 UTC 2024

On Mon, Dec 02, 2024 at 02:00:44PM +0100, Thomas Hellström wrote:
> On Tue, 2024-10-15 at 20:25 -0700, Matthew Brost wrote:
> > Add documentation for agree upon GPU SVM design principles, current
> > status, and future plans.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> > ---
> >  Documentation/gpu/rfc/gpusvm.rst | 70
> > ++++++++++++++++++++++++++++++++
> >  Documentation/gpu/rfc/index.rst  |  4 ++
> >  2 files changed, 74 insertions(+)
> >  create mode 100644 Documentation/gpu/rfc/gpusvm.rst
> > 
> > diff --git a/Documentation/gpu/rfc/gpusvm.rst
> > b/Documentation/gpu/rfc/gpusvm.rst
> > new file mode 100644
> > index 000000000000..2d3f79a6c30a
> > --- /dev/null
> > +++ b/Documentation/gpu/rfc/gpusvm.rst
> > @@ -0,0 +1,70 @@
> > +===============
> > +GPU SVM Section
> > +===============
> > +
> > +Agreed upon design principles
> > +=============================
> > +
> > +* migrate_to_ram path
> > +	* Rely on core MM concepts (migration ptes, page refs, and
> > page locking)
> > +	  only
> > +	* No driver specific locks other than locks for hardware
> > interaction in
> > +	  this path
> 
> We have previously been discussing the bo lock to protect the bo from
> eviction during migrate, if the vram allocation is bo-based. This is a
> cross-driver lock with a well-established locking order and I suggest
> we allow this. Apart from that I think the above statement needs some

Not allowing additional locks was suggested by Sima and I think it makes
sense but I do agree taking a dma-resv in migrate_to_ram would be safe.
However, the way GPU SVM is structured there is not any hooks to enable
this.

> elaboration: What is the problem we are trying to avoid with driver-
> specific locks, written so that it's easy to understand it's a bad
> idea.
> 

Sure, will try to elaborate.

> > +	* Partial migration is supported
> 
> Exactly what do we mean by partial migration.
> 

Will

> > +	* Driver handles mixed migrations via retry loops rather
> > than locking
> > +* Eviction
> > +	* Only looking at physical memory datastructures and locks
> as opposed to...
> 

As opposed looking at virtual memory data structures. Can elaborate why
this is a bad idea too - basically mremap + fork fall apart when you
start looking at virtual things.

> > +	* No looking at mm/vma structs or relying on those being
> > locked
> We're violating this with the current implementation, aren't we?
> 

Aside from when calling migrate_vma_* or creating initial range. Can
elaborate on this and certainly say drivers should not look at CPU VMA.

> 
> > +* GPU fault side
> > +	* mmap_read only used around core MM functions which require
> > this lock
> > +	* Big retry loop to handle all races with the mmu notifier
> > under the gpu
> > +	  pagetable locks/mmu notifier range lock/whatever we end up
> > calling
> > +          those
> > +	* Races (especially against concurrent
> > eviction/migrate_to_ram) should
> > +	  not be handled on the fault side by trying to hold locks
> 
> This actually contradicts my comment written above about using the bo
> lock to block eviction here. The alternative would be to pin vram
> allocations during migration until the mm_truct has references on the
> allocation, but it'd be good to clarify exactly why locking here is a
> bad idea, and why we can't rely on lockdep?
> 

I'll try to clarify.

> > +* Physical memory to virtual backpointer
> > +	* Does not work, no pointers from physical memory to virtual
> > should
> > +	  exist
> 
> We actually still have the private zdd structure, but it's strictly not
> to virtual but to the allocation metadata. Is it verified that the
> zone_device_data field is allowed to be modified by the pagemap between
> allocation and migration?
> 

We don't modify zdd between allocation and migration aside from the ref
count of the zdd.

> 
> > +* GPU pagetable locking
> > +	* Notifier lock only protects range tree, pages, pagetable
> > entries, and
> > +	  mmu notifier seqno tracking, it is not a global lock to
> > protect
> > +          against races
> > +	* All races handled with big retry as mentioned above
> 
> Adding a note here about "pages valid" for subranges rather than
> relying on the wider notifer seqno. I.E. a subrange can be valid even
> if the notifier seqno says otherwise.
> 

Sure.

> Performance considerations:
> Perhaps mention that notifier (core mm) performance is more important
> than gpu fault (driver) performance when considering optimizations that
> improves one at the cost of the other?
> 

Hmm, that is kinda speculation IMO. I have heard that feedback but
unsure if I agree with it nor do we have any data to backup that claim.
I'd rather not write something down like this that is based on
speculation. I do agree we should profile the code to really understand
the hot spots and write down our findings once we have done that.

> > +
> > +Overview of current design
> > +==========================
> > +
> > +Current design is simple as possible to get a working basline in
> > which can built
> 
> can be built
> 

+1 

Matt

> > +upon.
> > +
> > +.. kernel-doc:: drivers/gpu/drm/xe/drm_gpusvm.c
> > +   :doc: Overview
> > +   :doc: Locking
> > +   :doc: Migrataion
> > +   :doc: Partial Unmapping of Ranges
> > +   :doc: Examples
> > +
> > +Possible future design features
> > +===============================
> > +
> > +* Concurrent GPU faults
> > +	* CPU faults are concurrent so makes sense to have
> > concurrent GPU faults
> > +	* Should be possible with fined grained locking in the
> > driver GPU
> > +	  fault handler
> > +	* No expected GPU SVM changes required
> > +* Ranges with mixed system and device pages
> > +	* Can be added if required to drm_gpusvm_get_pages fairly
> > easily
> > +* Multi-GPU support
> > +	* Work in progress and patches expected after initially
> > landing on GPU
> > +	  SVM
> > +	* Ideally can be done with little to no changes to GPU SVM
> > +* Drop ranges in favor of radix tree
> > +	* May be desirable for faster notifiers
> > +* Compound device pages
> > +	* Nvidia, AMD, and Intel all have agreed expensive core MM
> > functions in
> > +	  migrate device layer are a performance bottleneck, having
> > compound
> > +	  device pages should help increase performance by reducing
> > the number
> > +	  of these expensive calls
> > +* Higher order dma mapping for migration
> > +	* 4k dma mapping adversely affects migration performance on
> > Intel
> > +	  hardware, higher order (2M) dma mapping should help here
> > diff --git a/Documentation/gpu/rfc/index.rst
> > b/Documentation/gpu/rfc/index.rst
> > index 476719771eef..396e535377fb 100644
> > --- a/Documentation/gpu/rfc/index.rst
> > +++ b/Documentation/gpu/rfc/index.rst
> > @@ -16,6 +16,10 @@ host such documentation:
> >  * Once the code has landed move all the documentation to the right
> > places in
> >    the main core, helper or driver sections.
> >  
> > +.. toctree::
> > +
> > +    gpusvm.rst
> > +
> >  .. toctree::
> >  
> >      i915_gem_lmem.rst
> 
> Thanks,
> Thomas
>