Separating xe_vma- and page-table state

Thomas Hellström thomas.hellstrom at linux.intel.com
Thu Mar 14 08:57:12 UTC 2024


Hi, Matt,


On Wed, 2024-03-13 at 19:43 +0000, Matthew Brost wrote:
> On Wed, Mar 13, 2024 at 11:56:12AM +0100, Thomas Hellström wrote:
> > On Wed, 2024-03-13 at 01:27 +0000, Matthew Brost wrote:
> > > On Tue, Mar 12, 2024 at 05:02:20PM -0600, Zeng, Oak wrote:
> > > > Hi Thomas,
> > > 
> > > 
> > 
> > ....
> > 
> > > Thomas:
> > > 
> > > I like the idea of the VMAs being marked as const in the PT code
> > > functions and having the xe_pt_state as non-const. It makes
> > > ownership very clear.
> > > 
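> > > A rough sketch of what that separation could look like
> > > (hypothetical names, not the actual Xe interfaces):
> > > 
> > >   #include <linux/rbtree.h>
> > >   #include <linux/types.h>
> > > 
> > >   struct xe_tile;
> > > 
> > >   /* VMA state: owned by the VM layer, read-only for the PT code. */
> > >   struct xe_vma;
> > > 
> > >   /* Page-table state: owned and mutated by the PT code. */
> > >   struct xe_pt_state {
> > >           u64 start, end;         /* populated GPU VA range */
> > >           struct rb_node node;    /* per-VMA lookup */
> > >   };
> > > 
> > >   int xe_pt_bind_range(struct xe_tile *tile,
> > >                        const struct xe_vma *vma,
> > >                        struct xe_pt_state *pts);
> > > 
> > > The const qualifier turns any write to vma state from the PT code
> > > into a compile error, so the ownership split is enforced rather
> > > than implied.
> > > 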
> > > Not sure how that will fit into [1], as that series passes around
> > > a "struct xe_vm_ops", which is a list of "struct xe_vma_op". It
> > > does this to make "struct xe_vm_ops" a single atomic operation.
> > > The VMAs are extracted from either the GPUVM base operation or
> > > "struct xe_vma_op". Maybe these can be const? I'll look into that,
> > > but this might not work out in practice.
> > > 
> > > Agreed, I'm also unsure how a 1:N xe_vma <-> xe_pt_state
> > > relationship fits in with hmmptrs. Could you explain your thinking
> > > here?
> > 
> > There is a need for hmmptrs to be sparse. When we fault we create a
> > chunk of PTEs that we populate. This chunk could potentially be
> > large, covering the whole CPU vma, or it could be limited to, say,
> > 2MiB and aligned to allow for large page-table entries. In Oak's POC
> > these chunks are called "svm ranges".
> > 
> > So the question arises: how do we map that to the current vma
> > management and page-table code? There are basically two ways:
> > 
> > 1) Split VMAs so they are either fully populated or unpopulated;
> > each svm_range becomes an xe_vma.
> > 2) Create an xe_pt_range / xe_pt_state (or whatever) with a 1:1
> > mapping to the svm_range and a 1:N mapping with xe_vmas.
> > 
> > Initially my thinking was that 1) would be the simplest approach
> > with the code we have today. I brought that up briefly with Sima and
> > he answered "And why would we want to do that?", and the answer at
> > hand was of course that the page-table code works with vmas. Or
> > rather that we mix vma state (the hmmptr range / attributes) and
> > page-table state (the regions of the hmmptr that are actually
> > populated), so it would be a consequence of our current
> > implementation (limitations).
> > 
> > With the suggestion to separate vma state and pt state, the xe_svm
> > ranges map to pt state and are managed per hmmptr vma. The vmas
> > would then be split mainly as a result of UMD mapping something else
> > (a bo) on top, or UMD giving new memory attributes for a range
> > (madvise type of operations).
> > 
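> > As a strawman, the separated state could look something like this
> > (struct layout and field names are hypothetical):
> > 
> >   /* One hmmptr vma, covering the whole user-advertised range. */
> >   struct xe_vma {
> >           u64 start, end;
> >           /* ... memory attributes, gpuvm base, ... */
> >           struct rb_root pt_states; /* populated chunks (svm ranges) */
> >   };
> > 
> >   /* One populated chunk: 1:1 with an svm range, N per xe_vma. */
> >   struct xe_pt_state {
> >           u64 start, end;           /* chunk-aligned, e.g. 2MiB */
> >           struct rb_node node;      /* in xe_vma::pt_states */
> >   };
> > 
> > On a GPU fault we'd look up or create the xe_pt_state covering the
> > faulting address inside the single hmmptr vma, instead of splitting
> > the vma itself.
> > 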
> 
> Thanks for the explanation.
> 
> My 2 cents.
> 
> Do an initial implementation based on 1) - split VMAs so they are
> either fully populated or unpopulated, each svm_range becomes an
> xe_vma.
> 
> This is what I did in a quick PoC [A] and got a basic system
> allocator working in ~500 loc. Is it fully baked? No, but it's also
> not all that far off.
> 
> We get the uAPI in place, agree on all the semantics, get IGTs in
> place, and get basic UMD support. We walk before we run...
> 
> Then transform the code to 2). The real benefit of 2) is going to be
> the smaller memory footprint of all the state and avoiding the
> fragmentation of the GPUVM RB tree that comes with splitting VMAs.
> I'll state one obvious benefit: xe_pt_state < xe_vma in size.
> 
> I'll give a quick example of the fragmentation issue too.
> 
> User binds 0-2^46 - 1 xe_vma
> Fault - 3 xe_vma (split into 2 unpopulated xe_vma, 1 populated xe_vma)
> Free - 3 xe_vma (replace 1 populated xe_vma with 1 unpopulated xe_vma)
> 
> Do this a couple of million times and we will have tons of
> fragmentation and unnecessary xe_vmas floating around. So pursuing 2)
> is probably worthwhile.
> 
> 2) is going to be quite a bit more complicated to implement. Let me
> give a fairly simple example.
> 
> User binds 0-2^46 - 1 xe_vma
> Tons of faults occur - 1 xe_vma, tons of xe_pt_states
> User binds a BO - 3 xe_vma, need to split xe_pt_states between 2 new
> xe_vma, possibly unmap some xe_pt_states, yikes...
> 
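> Roughly, that split would have to do something like this (just a
> sketch, all the helpers here are hypothetical):
> 
>   /* Re-parent populated chunks when [start, end) is carved out of
>    * old_vma for the new BO mapping. Hypothetical helpers. */
>   static void xe_pt_state_split(struct xe_vma *old_vma,
>                                 struct xe_vma *lo, struct xe_vma *hi,
>                                 u64 start, u64 end)
>   {
>           struct xe_pt_state *pts, *next;
> 
>           for_each_pt_state_safe(pts, next, old_vma) {
>                   if (pts->end <= start)
>                           xe_pt_state_move(pts, lo);  /* below the BO */
>                   else if (pts->start >= end)
>                           xe_pt_state_move(pts, hi);  /* above the BO */
>                   else
>                           xe_pt_state_unmap(pts);     /* overlaps, yikes */
>           }
>   }
> 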
> While 2) is more complicated, I don't think it is a complete rewrite
> from 1) either. [A] is built on top of [B], which introduces the
> concept of PT OPs (in addition to the existing concepts of GPUVM op /
> xe_vma_op). With a bit more separation I think getting 2) to work
> won't be all that difficult, and the upper layers (IOCTLs, migrate
> code, fault handler, etc.) should largely be unchanged from 1) -> 2).
> 
> With a staging plan of 1) -> 2) I think this is doable. Thoughts?
> 
> [A]
> https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-/commits/system_allocator_poc/?ref_type=heads
> [B] https://patchwork.freedesktop.org/series/125608/ 
> 
> Matt

Without looking at the code above, yes, I think this sounds like a good
plan. Oak, it would be good to hear your comments on this as well. I
think, given 1), implementing 2) as a POC wouldn't be too hard. Then we
also need to figure out how both approaches tie into the possibility of
making GPUVM helpers out of this...

/Thomas




> 
> > /Thomas
> > 


