Separating xe_vma- and page-table state

Zeng, Oak oak.zeng at intel.com
Thu Mar 14 16:00:53 UTC 2024


Hi Thomas, Matt,

So separating xe_vma and page-table state is a good thing to have: it cleans up the current code and makes it easier to maintain. We want to do it regardless of whether the system allocator/hmmptr design requires it. Since I wasn't able to see how hard it is, or how much of the current code it can clean up, I also asked Himal to take a closer look at it.
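For concreteness, here is a rough sketch of what the separation could look like (struct names and fields below are hypothetical, just to illustrate the idea of an immutable xe_vma plus a mutable, PT-owned state object; this is not the real xe code):

#include <linux/rbtree.h>
#include <linux/types.h>

/* Hypothetical sketch only - not the actual xe structures. */
struct xe_vma {				/* immutable after creation */
	u64 start, end;			/* GPU VA range from vm_bind */
	/* memory attributes, userptr/hmmptr info, BO backing, ... */
};

struct xe_pt_state {			/* mutable, owned by xe_pt.c */
	struct xe_vma *vma;		/* 1:1 today; 1:N for hmmptr */
	u64 start, end;			/* populated sub-range of the vma */
	struct rb_node node;		/* lookup tree entry */
	/* page-table directories/entries backing [start, end) */
};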

Here are my thoughts on the two approaches for the system allocator:

1) Split the gigantic vma during GPU page fault (so maintain a 1:1 mapping between xe_vma and page-table-state/svm_range):
	Since Matt already has some POC code, I should be able to quickly put a series together for review. From a design perspective, I think this is cleaner/simpler than 2): you have a single RB-tree per GPUVM, which I think is what we want. A small sketch contrasting the two layouts follows this list.

2) Non-split (so a 1:N mapping):
	This depends on the work to separate xe_vma and page-table state, which in turn depends on Matt's big vm_bind refactoring series. Time-wise it is therefore going to take longer, which is the main reason I want to go with 1) first.
	Once we have separated xe_vma and page-table state, this won't be hard, based on my POC series.
	As Matt mentioned, a benefit would be less vma segmentation.
	From a design perspective, I think this matches the gigantic vm_bind and sparse page-table population design. We would have to maintain another RB-tree for page-table-state/svm_range, and two RB-trees for one GPUVM feels a little awkward to me.
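To make the data-structure difference concrete, here is a rough sketch (struct names are invented for illustration only, not actual xe code): with 1), a single per-GPUVM RB-tree answers both "what are the attributes" and "what is populated"; with 2), a second tree of populated ranges is needed:

#include <linux/rbtree.h>

/* Illustrative sketch only - not real xe structures. */
struct gpuvm_layout_split {		/* approach 1): split on fault */
	struct rb_root vma_tree;	/* each populated chunk is its own
					 * xe_vma, so xe_vma == svm_range */
};

struct gpuvm_layout_nonsplit {		/* approach 2): 1:N mapping */
	struct rb_root vma_tree;	/* a few gigantic xe_vmas (attributes) */
	struct rb_root range_tree;	/* many populated svm_ranges /
					 * page-table-state objects */
};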


So here is my plan in summary: I will continue with approach 1); at the same time, Himal will evaluate separating xe_vma and page-table state.
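As a side note on 1), the fault handler would carve out an aligned chunk around the faulting address rather than populating the whole CPU vma; something along these lines (illustrative helper only, 2MiB chunk size assumed, not actual xe code):

#include <linux/align.h>
#include <linux/minmax.h>
#include <linux/sizes.h>
#include <linux/types.h>

#define SVM_CHUNK_SIZE	SZ_2M	/* assumed chunk size, aligned for large PTEs */

/* Clamp the faulting address to an aligned chunk inside the CPU vma. */
static void svm_fault_chunk(u64 fault_addr, u64 vma_start, u64 vma_end,
			    u64 *chunk_start, u64 *chunk_end)
{
	*chunk_start = max(vma_start, ALIGN_DOWN(fault_addr, SVM_CHUNK_SIZE));
	*chunk_end = min(vma_end, *chunk_start + SVM_CHUNK_SIZE);
}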

Thanks,
Oak

> -----Original Message-----
> From: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> Sent: Thursday, March 14, 2024 4:52 AM
> To: Zeng, Oak <oak.zeng at intel.com>; Brost, Matthew
> <matthew.brost at intel.com>
> Cc: intel-xe at lists.freedesktop.org
> Subject: Re: Separating xe_vma- and page-table state
> 
> On Wed, 2024-03-13 at 17:06 +0000, Zeng, Oak wrote:
> > Hi Thomas,
> >
> > For simplicity of the discussion, let's forget about BO vm_bind and
> > forget about memory attributes for a moment... Only consider the
> > system allocator. So with the scheme below, we have a gigantic xe_vma
> > in the background holding some immutable state, never split. And we
> > have mutable page-table state which is created during GPU access and
> > destroyed dynamically during CPU munmap/invalidation.
> >
> > For the mutable page-table state, you would maintain another RB-tree
> > so you can search it, as I did in my POC; the tree is in xe_svm. For
> > the BO driver, you don't need this extra tree, you just need the
> > xe_vma tree, as xe_vma has a 1:1 mapping with page-table state for the
> > BO driver...
> >
> > I can see this scheme aligns with my POC....
> >
> > Mapping this scheme to the userptr "free without vm_unbind" thing, I
> > can see that when the user frees, we can destroy the page-table state
> > in the mmu notifier callback while keeping the xe_vma. Is this also
> > how you look at it?
> >
> > I should say, the "free without vm_unbind" thing should only affect
> > our decision temporarily: once the system allocator is ready, UMD
> > won't need the userptr vm_bind anymore, so the problem will be solved
> > more cleanly with the system allocator - UMD just removes the vm_bind
> > and things magically work with the system allocator. I guess what
> > users really need is a system allocator, but we didn't have it at the
> > time, so the userptr technique was used. Long term, the system
> > allocator should eventually replace userptr.
> 
> I mostly agree on the above, I think.
> 
> >
> > One thing I can't picture clearly is: how hard is it to change the
> > current xekmd to separate xe_vma into mutable and immutable parts?
> 
> It's not that hard at all, it's mostly changing the xe_pt.c interfaces.
> An obstacle, though, is that we don't want to do this before Matt's big
> vm_bind refactoring is reviewed and in place.
> 
> >
> >
> > Is the split scheme, with xe_vma maintaining both mutable and
> > immutable state, simpler? It doesn't have the xe_svm concept: no
> > xe_svm_range/page-table state, a single RB-tree per gpuvm, no need to
> > re-construct xe_vma... Depending on how we want to solve the
> > multiple-device problem, the xe_svm concept can come back though...
> 
> For the ordinary VMA types we have today, Userptr / BO / NULL, it's
> neither simpler nor more complex IMO, but it makes the code clearer and
> hopefully easier to maintain.
> 
> For the hmmptr/SVM system it's too early to answer. Here it really
> depends on whether 1) we do a 1:1 mapping between xe_vma and svm_range,
> or whether 2) we do a 1:N mapping of xe_vma and svm_range. Probably both
> approaches have their benefits, so I'd tend to favour Matt's suggestion
> that we start off with 1), make it work, and then do a POC with 2) to
> see what it looks like.
> 
> Comments, suggestions?
> 
> /Thomas
> 
> 
> >
> > Oak
> >
> > > -----Original Message-----
> > > From: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> > > Sent: Wednesday, March 13, 2024 6:56 AM
> > > To: Brost, Matthew <matthew.brost at intel.com>; Zeng, Oak
> > > <oak.zeng at intel.com>
> > > Cc: intel-xe at lists.freedesktop.org
> > > Subject: Re: Separating xe_vma- and page-table state
> > >
> > > On Wed, 2024-03-13 at 01:27 +0000, Matthew Brost wrote:
> > > > On Tue, Mar 12, 2024 at 05:02:20PM -0600, Zeng, Oak wrote:
> > > > > Hi Thomas,
> > > >
> > > >
> > >
> > > ....
> > >
> > > > Thomas:
> > > >
> > > > I like the idea of the VMAs in the PT code functions being marked
> > > > as const and having the xe_pt_state as non-const. It makes
> > > > ownership very clear.
> > > >
> > > > Not sure how that will fit into [1], as that series passes around
> > > > a "struct xe_vm_ops" which is a list of "struct xe_vma_op". It does
> > > > this to make "struct xe_vm_ops" a single atomic operation. The VMAs
> > > > are extracted from either the GPUVM base operation or "struct
> > > > xe_vma_op". Maybe these can be const? I'll look into that but this
> > > > might not work out in practice.
> > > >
> > > > Agree, also unsure how a 1:N xe_vma <-> xe_pt_state relationship
> > > > fits with hmmptrs. Could you explain your thinking here?
> > >
> > > There is a need for hmmptrs to be sparse. When we fault, we create a
> > > chunk of PTEs that we populate. This chunk could potentially be
> > > large, covering the whole CPU vma, or it could be limited to, say,
> > > 2MiB and aligned to allow for large page-table entries. In Oak's POC
> > > these chunks are called "svm ranges".
> > >
> > > So the question arises, how do we map that to the current vma
> > > management and page-table code? There are basically two ways:
> > >
> > > 1) Split VMAs so they are either fully populated or unpopulated;
> > > each svm_range becomes an xe_vma.
> > > 2) Create an xe_pt_range / xe_pt_state (or whatever) with a 1:1
> > > mapping to the svm_range and a 1:N mapping to xe_vmas.
> > >
> > > Initially my thinking was that 1) would be the simplest approach
> > > with the code we have today. I raised that briefly with Sima and he
> > > answered "And why would we want to do that?", and the answer at hand
> > > was of course that the page-table code works with vmas. Or rather,
> > > that we mix vma state (the hmmptr range / attributes) and page-table
> > > state (the regions of the hmmptr that are actually populated), so it
> > > would be a consequence of our current implementation (limitations).
> > >
> > > With the suggestion to separate vma state and pt state, the xe_svm
> > > ranges map to pt state and are managed per hmmptr vma. The vmas
> > > would
> > > then be split mainly as a result of UMD mapping something else (bo)
> > > on
> > > top, or UMD giving new memory attributes for a range (madvise type
> > > of
> > > operations).
> > >
> > > /Thomas
> > >
