Separating xe_vma- and page-table state

Matthew Brost matthew.brost at intel.com
Wed Mar 13 19:43:31 UTC 2024


On Wed, Mar 13, 2024 at 11:56:12AM +0100, Thomas Hellström wrote:
> On Wed, 2024-03-13 at 01:27 +0000, Matthew Brost wrote:
> > On Tue, Mar 12, 2024 at 05:02:20PM -0600, Zeng, Oak wrote:
> > > Hi Thomas,
> > 
> > 
> 
> ....
> 
> > Thomas:
> > 
> > I like the idea of VMAs in the PT code function being marked as const
> > and having the xe_pt_state as non const. It makes ownership very
> > clear.
> > 
> > Not sure how that will fit into [1] as that series passes around
> > a "struct xe_vm_ops" which is a list of "struct xe_vma_op". It does
> > this
> > to make "struct xe_vm_ops" a single atomic operation. The VMAs are
> > extracted either the GPUVM base operation or "struct xe_vma_op".
> > Maybe
> > these can be const? I'll look into that but this might not work out
> > in
> > practice.
> > 
> > Agree also unsure how 1:N xe_vma <-> xe_pt_state relationship fits in
> > hmmptrs. Could you explain your thinking here?
> 
> There is a need for hmmptrs to be sparse. When we fault we create a
> chunk of PTEs that we populate. This chunk could potentially be large
> and covering the whole CPU vma or it could be limited to, say 2MiB and
> aligned to allow for large page-table entries. In Oak's POC these
> chunks are called "svm ranges"
> 
> So the question arises, how do we map that to the current vma
> management and page-table code? There are basically two ways:
> 
> 1) Split VMAs so they are either fully populated or unpopulated, each
> svm_range becomes an xe_vma.
> 2) Create xe_pt_range / xe_pt_state or whatever with a 1:1 mapping with
> the svm_range and a 1:N mapping with xe_vmas.
> 
> Initially my thinking was that 1) Would be the simplest approach with
> the code we have today. I raised that briefly with Sima and he answered
> "And why would we want to do that?", and the answer at hand was ofc
> that the page-table code worked with vmas. Or rather that we mix vma
> state (the hmmptr range / attributes) and page-table state (the regions
> of the hmmptr that are actually populated), so it would be a
> consequence of our current implementation (limitations).
> 
> With the suggestion to separate vma state and pt state, the xe_svm
> ranges map to pt state and are managed per hmmptr vma. The vmas would
> then be split mainly as a result of UMD mapping something else (bo) on
> top, or UMD giving new memory attributes for a range (madvise type of
> operations).
> 

Thanks for the explanation.

My 2 cents.

Do an initial implementation based on 1) - split VMAs so they are either
fully populated or unpopulated, with each svm_range becoming an xe_vma.

This is what I did in a quick PoC [A] and got a basic system allocator
working in ~500 loc. Is it fully baked? No, but it's also not all that
far off.
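
For the flavor of it, the shape of 1) is roughly the sketch below. This
is not the actual PoC code, just pseudo-ish C; the helper names
(xe_hmmptr_fault, xe_vm_remap_populated) are invented for illustration.

/*
 * Rough sketch of 1). On a GPU fault inside an hmmptr xe_vma, carve the
 * faulted chunk out as its own populated xe_vma via a normal GPUVM
 * remap, leaving unpopulated xe_vma(s) on either side.
 */
static int xe_hmmptr_fault(struct xe_vm *vm, struct xe_vma *vma, u64 fault_addr)
{
	/* Chunk granularity, e.g. 2MiB aligned to allow large PTEs */
	u64 start = max(ALIGN_DOWN(fault_addr, SZ_2M), xe_vma_start(vma));
	u64 end = min(start + SZ_2M, xe_vma_end(vma));

	/* Splits the vma into: unpopulated | populated chunk | unpopulated */
	return xe_vm_remap_populated(vm, vma, start, end);
}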

We get the uAPI in place, agree on all the semantics, get IGTs in place,
and basic UMD support. We walk before we run...

Then transform the code to 2). The real benefit of 2) is the memory
footprint of all the state and avoiding the fragmentation of the GPUVM RB
tree that comes with splitting VMAs. To state one obvious benefit, an
xe_pt_state is smaller than an xe_vma.
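
To make the size point concrete, the kind of thing I'm imagining for 2)
is below. Purely hypothetical, field names made up, only meant to show
the 1:N shape and why the per-chunk state is small.

struct xe_pt_state {
	struct list_head link;	/* entry in the owning vma's pt_states */
	u64 start;		/* populated GPU VA range [start, end) */
	u64 end;
	/* page-table / residency state for this chunk only */
};

struct xe_vma {
	/* ... existing vma state (gpuva, attributes, etc.) ... */
	struct list_head pt_states;	/* N small xe_pt_states per hmmptr vma */
};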

I'll give a quick example of the fragmentation issue too.

User binds 0-2^46 -> 1 xe_vma
Fault -> 3 xe_vma (split into 2 unpopulated xe_vma, 1 populated xe_vma)
Free -> 3 xe_vma (replace 1 populated xe_vma with 1 unpopulated xe_vma)

Do this a couple of million times and we will have tons of fragmentation
and unnecessary xe_vmas floating around. So pursuing 2) is probably
worthwhile.

Is 2) going to be a quite a bit more complicated implementation? Yes.
Let me give a fairly simple example.

User binds 0-2^46 -> 1 xe_vma
Tons of faults occur -> 1 xe_vma, tons of xe_pt_states
User binds a BO -> 3 xe_vma; need to split xe_pt_states between the 2 new
xe_vmas and possibly unmap some xe_pt_states (see the sketch below), yikes...
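
The messy part of that last step looks something like the below. All
names are invented, and a pt_state straddling a boundary would also need
to be split rather than just unmapped - this is only to show the shape of
the problem.

/*
 * Hand out existing xe_pt_states when a BO bind splits an hmmptr vma
 * into prev | bo | next.
 */
static void xe_pt_states_split(struct xe_vma *old, struct xe_vma *prev,
			       struct xe_vma *next, u64 bo_start, u64 bo_end)
{
	struct xe_pt_state *ps, *tmp;

	list_for_each_entry_safe(ps, tmp, &old->pt_states, link) {
		if (ps->end <= bo_start)
			list_move_tail(&ps->link, &prev->pt_states);
		else if (ps->start >= bo_end)
			list_move_tail(&ps->link, &next->pt_states);
		else
			xe_pt_state_unmap(ps);	/* overlaps the new BO vma */
	}
}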

While 2) is more complicated, I don't think it is a complete rewrite from
1) either. [A] is built on top of [B], which introduces the concept of PT
OPs (in addition to the existing concepts of a GPUVM op / xe_vma_op). With
a bit more separation I think getting 2) to work won't be all that
difficult, and the upper layers (IOCTLs, migrate code, fault handler,
etc.) should be largely unchanged from 1) -> 2).
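
Concretely, the extra separation I have in mind is letting a PT OP carry
the sub-range of the vma it touches rather than implying the whole vma,
so the ops can target individual xe_pt_states. Again a sketch only, with
invented structure and field names:

struct xe_pt_op {
	struct list_head link;
	u64 start, end;		/* GPU VA range within the vma to (un)bind */
	bool unmap;
};

struct xe_vma_op {
	/* ... existing GPUVM op / xe_vma_op state ... */
	struct list_head pt_ops;	/* one entry per affected range */
};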

With a staging plan of 1) -> 2) I think this is doable. Thoughts?

[A] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-/commits/system_allocator_poc/?ref_type=heads
[B] https://patchwork.freedesktop.org/series/125608/ 

Matt

> /Thomas
> 

