Separating xe_vma- and page-table state

Zeng, Oak oak.zeng at intel.com
Wed Mar 13 17:06:30 UTC 2024


Hi Thomas,

For simplicity of the discussion, let's forget about BO vm_bind and forget about memory attributes for a moment... only consider the system allocator. With the scheme below, we have a gigantic xe_vma in the background holding some immutable state, never split. And we have mutable page-table state which is created dynamically on GPU access and destroyed on CPU munmap/invalidation.
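To illustrate the separation, a rough sketch (the struct layouts and field names below are made up for illustration, not the actual xe code):

#include <linux/rbtree.h>
#include <linux/types.h>

/* Immutable for the lifetime of the mapping; never split. */
struct xe_vma {
        u64 start, end;
        struct xe_vm *vm;
        /* memory attributes, pat_index, ... */
};

/* Mutable: created on GPU fault, destroyed on CPU munmap or
 * invalidation. There can be many of these per hmmptr xe_vma (1:N). */
struct xe_pt_state {
        u64 start, end;         /* subrange of the parent vma */
        struct xe_vma *vma;
        struct rb_node node;    /* linked into a per-vm lookup tree */
};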

For the mutable page-table state, you would maintain another RB-tree so you can search it, as I did in my POC, where the tree lives in xe_svm. For the BO driver you don't need this extra tree; the xe_vma tree is enough, since xe_vma has a 1:1 mapping with page-table state for the BO driver...
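Looking up the page-table state for a faulting address would then be the usual RB-tree range walk, something like this (a sketch only; xe_pt_state_find is a hypothetical helper and assumes non-overlapping ranges):

static struct xe_pt_state *xe_pt_state_find(struct rb_root *root, u64 addr)
{
        struct rb_node *node = root->rb_node;

        while (node) {
                struct xe_pt_state *pts =
                        rb_entry(node, struct xe_pt_state, node);

                if (addr < pts->start)
                        node = node->rb_left;
                else if (addr >= pts->end)
                        node = node->rb_right;
                else
                        return pts;     /* addr inside a populated range */
        }
        return NULL;    /* not populated; GPU fault path creates one */
}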

I can see this scheme aligning with my POC...

Mapping this scheme to the userptr "free without vm_unbind" case: when the user frees, we can destroy the page-table state in the mmu notifier callback while keeping the xe_vma. Is this also how you look at it?
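Roughly what I have in mind (a hypothetical sketch; the notifier field and xe_pt_state_destroy_range are made-up names, and real code would need locking against the fault path):

#include <linux/mmu_notifier.h>

static bool xe_hmmptr_invalidate(struct mmu_interval_notifier *mni,
                                 const struct mmu_notifier_range *range,
                                 unsigned long cur_seq)
{
        struct xe_vma *vma = container_of(mni, struct xe_vma, notifier);

        mmu_interval_set_seq(mni, cur_seq);

        /* Zap PTEs and free only the pt state overlapping the
         * invalidated range; the xe_vma itself (start/end, attributes)
         * stays intact. */
        xe_pt_state_destroy_range(vma, range->start, range->end);

        return true;
}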

I should say the "free without vm_unbind" issue only affects our decision temporarily: once the system allocator is ready, UMD won't need the userptr vm_bind anymore, so the problem is solved more cleanly by the system allocator - UMD just removes the vm_bind and things work through the system allocator. I guess what users really need is a system allocator, but we didn't have one at the time, so the userptr technique was used instead. Long term, the system allocator should eventually replace userptr.

One thing I can't picture clearly is how hard it would be to change the current xekmd to separate xe_vma into mutable and immutable parts.

Is the split scheme, with xe_vma holding both mutable and immutable state, simpler? It has no xe_svm concept, no xe_svm_range/page-table state, a single RB tree per gpuvm, and no need to reconstruct xe_vma... Depending on how we want to solve the multiple-device problem, though, the xe_svm concept might come back.

Oak

> -----Original Message-----
> From: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> Sent: Wednesday, March 13, 2024 6:56 AM
> To: Brost, Matthew <matthew.brost at intel.com>; Zeng, Oak
> <oak.zeng at intel.com>
> Cc: intel-xe at lists.freedesktop.org
> Subject: Re: Separating xe_vma- and page-table state
> 
> On Wed, 2024-03-13 at 01:27 +0000, Matthew Brost wrote:
> > On Tue, Mar 12, 2024 at 05:02:20PM -0600, Zeng, Oak wrote:
> > > Hi Thomas,
> >
> >
> 
> ....
> 
> > Thomas:
> >
> > I like the idea of the VMAs in the PT code functions being marked as
> > const and having the xe_pt_state as non-const. It makes ownership
> > very clear.
> >
> > Not sure how that will fit into [1] as that series passes around
> > a "struct xe_vm_ops" which is a list of "struct xe_vma_op". It does
> > this to make "struct xe_vm_ops" a single atomic operation. The VMAs
> > are extracted from either the GPUVM base operation or "struct
> > xe_vma_op". Maybe these can be const? I'll look into that but this
> > might not work out in practice.
> >
> > Agreed, also unsure how the 1:N xe_vma <-> xe_pt_state relationship
> > fits in with hmmptrs. Could you explain your thinking here?
> 
> There is a need for hmmptrs to be sparse. When we fault we create a
> chunk of PTEs that we populate. This chunk could potentially be large,
> covering the whole CPU vma, or it could be limited to, say, 2MiB and
> aligned to allow for large page-table entries. In Oak's POC these
> chunks are called "svm ranges".
> 
> So the question arises, how do we map that to the current vma
> management and page-table code? There are basically two ways:
> 
> 1) Split VMAs so they are either fully populated or unpopulated, each
> svm_range becomes an xe_vma.
> 2) Create xe_pt_range / xe_pt_state (or whatever we call it) with a
> 1:1 mapping with the svm_range and a 1:N mapping with xe_vmas.
> 
> Initially my thinking was that 1) would be the simplest approach with
> the code we have today. I raised that briefly with Sima and he
> answered "And why would we want to do that?", and the answer at hand
> was of course that the page-table code worked with vmas. Or rather
> that we mix vma state (the hmmptr range / attributes) and page-table
> state (the regions of the hmmptr that are actually populated), so it
> would be a consequence of our current implementation (limitations).
> 
> With the suggestion to separate vma state and pt state, the xe_svm
> ranges map to pt state and are managed per hmmptr vma. The vmas would
> then be split mainly as a result of UMD mapping something else (bo) on
> top, or UMD giving new memory attributes for a range (madvise type of
> operations).
> 
> /Thomas
> 


