Separating xe_vma- and page-table state

Thomas Hellström thomas.hellstrom at linux.intel.com
Thu Mar 14 08:52:23 UTC 2024


On Wed, 2024-03-13 at 17:06 +0000, Zeng, Oak wrote:
> Hi Thomas,
> 
> For simplicity of the discussion, let's forget about BO vm_bind and
> memory attributes for a moment... Only consider the system
> allocator. So with the scheme below, we have a gigantic xe_vma in the
> background holding some immutable state, never split, and we have
> mutable page-table state which is created dynamically on GPU access
> and destroyed on CPU munmap/invalidation.
> 
> For the mutable page-table state, you would maintain another RB-tree
> so you can search it, as I did in my POC; that tree lives in xe_svm.
> For the BO driver you don't need this extra tree, you just need the
> xe_vma tree, as xe_vma has a 1:1 mapping with page-table state for
> the BO driver...
> 
> I can see how this scheme aligns with my POC...
> 
> Mapping this scheme to the userptr "free without vm_unbind" case, I
> can see that when the user frees, we can destroy the page-table state
> in the mmu notifier callback while keeping the xe_vma. Is this also
> how you look at it?
> 
> I should add that the "free without vm_unbind" issue should only
> affect our decision temporarily: once the system allocator is ready,
> UMD won't need the userptr vm_bind anymore, so the problem will be
> solved more cleanly by the system allocator - UMD just drops the
> vm_bind and things work through the system allocator. I guess what
> users really need is a system allocator, but since we didn't have one
> at the time, userptr was used instead. In the long term, the system
> allocator should eventually replace userptr.

I mostly agree on the above, I think.

> 
> One thing I can't picture clearly is, how hard is it to change the
> current xekmd to separate xe_vma into mutable and immutable parts?

It's not that hard at all, it's mostly changing the xe_pt.c interfaces.
An obstacle, though, is that we don't want to do this before Matt's big
vm_bind refactoring is reviewed and in place.
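
Just to sketch what I mean (names made up for illustration, not the
actual xe_pt.c interfaces): the PT code would take the vma as const and
own the mutable state itself, something like:

#include <linux/rbtree.h>
#include <linux/types.h>

struct xe_vm;
struct xe_vma;	/* immutable: range, attributes; never modified by xe_pt.c */

/* Mutable page-table state, created and destroyed by the PT code. */
struct xe_pt_state {
	struct rb_node node;	/* indexed per VM (or per hmmptr vma) */
	u64 start;
	u64 end;
	/* PTE / invalidation bookkeeping goes here */
};

/*
 * Binding creates xe_pt_state for a range of a (const) vma, unbinding
 * destroys it. The const makes ownership explicit: xe_pt.c never
 * touches vma state.
 */
int xe_pt_bind_range(struct xe_vm *vm, const struct xe_vma *vma,
		     u64 start, u64 end, struct xe_pt_state **state);
void xe_pt_unbind_range(struct xe_vm *vm, struct xe_pt_state *state);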

>  
> 
> Is the split scheme, with xe_vma maintaining both mutable and
> immutable state, simpler? It doesn't have the xe_svm concept: no
> xe_svm_range/page-table state, a single RB tree per gpuvm, no need to
> re-construct xe_vma... Depending on how we want to solve the
> multiple-device problem, the xe_svm concept could come back, though...

For the ordinary VMA types we have today (userptr / BO / NULL) it's
neither simpler nor more complex IMO, but separating the state makes
the code clearer and hopefully easier to maintain.

For hmmptr/SVM it's too early to answer. It really depends on whether
1) we do a 1:1 mapping between xe_vma and svm_range, or 2) we do a 1:N
mapping between xe_vma and svm_range. Both approaches probably have
their benefits, so I'd tend to favour Matt's suggestion that we start
off with 1), make it work, and then do a POC with 2) to see what it
looks like.
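
As a very rough sketch (again with made-up names, not actual xe
structures), 2) would mean the hmmptr vma owning a sparse set of
populated page-table ranges:

#include <linux/rbtree.h>
#include <linux/types.h>

/* One populated, aligned chunk (e.g. 2MiB) of an hmmptr - mutable. */
struct xe_pt_range {
	struct rb_node vma_node;	/* keyed by start in the vma's tree */
	u64 start;
	u64 end;
	/* PTE / invalidation bookkeeping goes here */
};

/* The hmmptr vma itself - immutable range and attributes. */
struct xe_hmmptr_vma {
	u64 start;
	u64 end;
	/* immutable memory attributes (madvise-type state) */
	struct rb_root_cached pt_ranges;	/* mutable: sparse populated chunks */
};

With 1), each populated chunk would instead be an xe_vma of its own,
and the pt_ranges tree collapses into a single embedded page-table
state.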

Comments, suggestions?

/Thomas


> 
> Oak
> 
> > -----Original Message-----
> > From: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> > Sent: Wednesday, March 13, 2024 6:56 AM
> > To: Brost, Matthew <matthew.brost at intel.com>; Zeng, Oak
> > <oak.zeng at intel.com>
> > Cc: intel-xe at lists.freedesktop.org
> > Subject: Re: Separating xe_vma- and page-table state
> > 
> > On Wed, 2024-03-13 at 01:27 +0000, Matthew Brost wrote:
> > > On Tue, Mar 12, 2024 at 05:02:20PM -0600, Zeng, Oak wrote:
> > > > Hi Thomas,
> > > 
> > > 
> > 
> > ....
> > 
> > > Thomas:
> > > 
> > > I like the idea of the VMAs in the PT code functions being marked
> > > as const and having the xe_pt_state as non-const. It makes
> > > ownership very clear.
> > > 
> > > Not sure how that will fit into [1], as that series passes around
> > > a "struct xe_vm_ops", which is a list of "struct xe_vma_op". It
> > > does this to make "struct xe_vm_ops" a single atomic operation.
> > > The VMAs are extracted from either the GPUVM base operation or
> > > "struct xe_vma_op". Maybe these can be const? I'll look into that,
> > > but it might not work out in practice.
> > > 
> > > Agreed, and also unsure how a 1:N xe_vma <-> xe_pt_state
> > > relationship fits in with hmmptrs. Could you explain your thinking
> > > here?
> > 
> > There is a need for hmmptrs to be sparse. When we fault, we create a
> > chunk of PTEs that we populate. This chunk could potentially be large
> > and cover the whole CPU vma, or it could be limited to, say, 2MiB and
> > aligned to allow for large page-table entries. In Oak's POC these
> > chunks are called "svm ranges".
> > 
> > So the question arises, how do we map that to the current vma
> > management and page-table code? There are basically two ways:
> > 
> > 1) Split VMAs so that they are either fully populated or
> > unpopulated; each svm_range becomes an xe_vma.
> > 2) Create an xe_pt_range / xe_pt_state (or whatever we call it) with
> > a 1:1 mapping with the svm_range and a 1:N mapping with xe_vmas.
> > 
> > Initially my thinking was that 1) would be the simplest approach with
> > the code we have today. I raised that briefly with Sima and he
> > answered "And why would we want to do that?", and the answer at hand
> > was of course that the page-table code works with vmas. Or rather
> > that we mix vma state (the hmmptr range / attributes) and page-table
> > state (the regions of the hmmptr that are actually populated), so it
> > would be a consequence of our current implementation (and its
> > limitations).
> > 
> > With the suggestion to separate vma state and pt state, the xe_svm
> > ranges map to pt state and are managed per hmmptr vma. The vmas would
> > then be split mainly as a result of UMD mapping something else (a bo)
> > on top, or UMD giving new memory attributes for a range (madvise-type
> > operations).
> > 
> > /Thomas
> > 
> 


