Separating xe_vma- and page-table state
Zeng, Oak
oak.zeng at intel.com
Wed Mar 13 02:16:24 UTC 2024
Hi Matt,
> -----Original Message-----
> From: Brost, Matthew <matthew.brost at intel.com>
> Sent: Tuesday, March 12, 2024 9:27 PM
> To: Zeng, Oak <oak.zeng at intel.com>
> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>; intel-
> xe at lists.freedesktop.org
> Subject: Re: Separating xe_vma- and page-table state
>
> On Tue, Mar 12, 2024 at 05:02:20PM -0600, Zeng, Oak wrote:
> > Hi Thomas,
>
> Going to reply to both Thomas and Oak to not split this thread.
>
> >
> > > -----Original Message-----
> > > From: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> > > Sent: Tuesday, March 12, 2024 3:43 AM
> > > To: intel-xe at lists.freedesktop.org
> > > Cc: Brost, Matthew <matthew.brost at intel.com>; Zeng, Oak
> > > <oak.zeng at intel.com>
> > > Subject: Separating xe_vma- and page-table state
> > >
> > > Hi,
> > >
> > > It's IMO become apparent both in the system allocator discussion and in
> > > the patch that enables the presence of invalid vmas
> > > that we need to be
> > > better at separating xe_vma and page-table state, so that xe_vma state
> > > would contain things that are mostly immutable and that the user
> > > requested: PAT index, memory attributes, requested tile presence etc,
> > > whereas the page-table state would contain mutable state like actual
> > > tile presence, invalidation state and MMU notifier.
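
FWIW, here is roughly how I picture the split Thomas describes. This is
only a sketch with made-up names, not the actual driver structs:

/* Immutable, user-requested state; set at bind time, then read-only. */
struct xe_vma {
	u64 start, end;			/* VA range */
	u16 pat_index;			/* user-requested PAT index */
	u32 attr_flags;			/* memory attributes */
	u8 requested_tile_mask;		/* requested tile presence */
	struct list_head pt_states;	/* 1:1 normally, maybe 1:N for hmmptr */
};

/* Mutable state owned exclusively by the page-table code. */
struct xe_pt_state {
	struct list_head link;		/* entry in vma->pt_states */
	u64 start, end;			/* sub-range actually mapped */
	u8 present_tile_mask;		/* actual tile presence */
	bool invalidated;		/* invalidation state */
	struct mmu_interval_notifier notifier;	/* MMU notifier */
};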
>
> Thomas:
>
> 100% agree with the idea. There are 2 distinct parts of xe_vma - what I
> call GPUVM state and PT state. In addition, there is what I call IOCTL
> state (sync objs, etc...) that is updated in various places in xe_vm.c.
>
> Functions to update these respective states should be clearly separated.
> The current implementation is a complete mess in this regard; you
> mention xe_vm_unbind_vma below, and in its current state there is no way
> this function could easily be reused.
>
> My series [1] vastly improves this separation. It still could be split a
> bit better, though. In my next rebase I can look at this series through
> the lens of clearly maintaining separation + likely update [2] to do an
> unbind rather than an invalidation.
>
> [1] https://patchwork.freedesktop.org/series/125608/
> [2] https://patchwork.freedesktop.org/series/130935/
>
> >
> > It is valid reasoning to me... if we want to do what the community
> > wants us to do with the system allocator, and if we want to meet our
> > umd's requirement of "free w/o vm_unbind", yes, we need this "invalid"
> > vma concept.
> >
> > The strange thing is, it seems Matt can still achieve the goal without
> > introducing the invalid vma concept... it doesn't look like he has
> > this concept in his patch....
> >
>
> Oak:
>
> I'm a little confused by what you mean by the "invalid" vma concept.
> Does that mean no GPU backing or page tables? That is essentially what I
> call a system allocator VMA in [3], right?
Thomas also mentioned this "invalid vmas" concept. Here is my understanding:
1) From the system allocator discussion: we introduced a gigantic vma which is invalid because addresses inside it are not valid gpu addresses. When we create a vma in the gpu page fault handler, we carve a vma out of the gigantic vma; this carved-out vma is a valid vma.
Note there is a subtle difference b/t an invalid vma and a vma without backing or page tables. A valid vma can be migrated to system memory, and at that point it loses its gpu backing and gpu page tables, but it is still a valid vma.
In the POC code you gave, you don't have the migration logic, so you don't need the notion of an "invalid vma".
2) From the userptr discussion "free userptr without vm_unbind": in your patch, when the user frees a userptr without vm_unbind, the code invalidates that userptr in the gpu page table and puts it on a repin list. The userptr vma will be repinned on the next exec submission. I guess Thomas calls such a vma (freed, but not vm_unbind'ed) an "invalid vma".
For me this behavior causes unnecessary complication. I would destroy the userptr vma when the user frees it. We can create a new userptr vma on the next umd vm_bind. Is this simpler?
Oak
>
> Also based on what Thomas says below related to xe_vm_unbind_vma and [2],
> we should be able to have a VMA that points to a backing store (either
> userptr or BO) but doesn't have GPU mappings. We do have this for
> faulting VMs before the initial bind of a VMA, but we cannot currently
> dynamically change this (we can invalidate page tables but cannot remove
> them without destroying the VMA). We should fix that.
>
> [3] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-/commit/3fcc83c9b075364a9d83415ca73a9f9625543d7c
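
Agreed. With a state split like the one sketched above, removing the
mappings without destroying the VMA could look something like this
(xe_pt_zap_and_free() is a hypothetical helper):

static void xe_vma_unmap(struct xe_vma *vma)
{
	struct xe_pt_state *pts, *next;

	/* Tear down the mutable PT state; the vma itself stays valid. */
	list_for_each_entry_safe(pts, next, &vma->pt_states, link) {
		xe_pt_zap_and_free(vma, pts);	/* hypothetical */
		list_del(&pts->link);
		kfree(pts);
	}
	/* vma still points at its backing store and can be rebound later. */
}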
>
> >
> > >
> > > So far we have had no reason to separate the two, but with hmmptr we
> > > would likely end up with multiple page-table regions per xe-vma,
> >
> > Can we still maintain a 1 xe-vma : 1 page-table region relationship for simplicity?
> >
> > This requires vma splitting. I.e., if you have a hmmptr vma covering range [0~100],
> > and a fault happens at range [40~60], then we will end up with 3 vmas:
> > [0~40], a dummy hmmptr vma, not mapped to gpu
> > [40~60], hmmptr vma mapped to gpu
> > [60~100], dummy, not mapped to gpu.
> >
> > Does this work for you? Or do you see a benefit of not splitting vmas?
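
To make the splitting above concrete, a rough sketch of the fault-path
split; xe_vma_new(), xe_vma_bind_gpu() and XE_VMA_DUMMY are made up for
illustration, and error unwind is omitted:

static int xe_hmmptr_split_on_fault(struct xe_vm *vm, struct xe_vma *vma,
				    u64 fstart, u64 fend)
{
	/* Left remainder, e.g. [0~40]: dummy hmmptr vma, not mapped to gpu. */
	if (fstart > vma->start &&
	    !xe_vma_new(vm, vma->start, fstart, XE_VMA_DUMMY))
		return -ENOMEM;

	/* Right remainder, e.g. [60~100]: dummy, not mapped to gpu. */
	if (fend < vma->end &&
	    !xe_vma_new(vm, fend, vma->end, XE_VMA_DUMMY))
		return -ENOMEM;

	/* Middle, e.g. [40~60]: shrink the original vma, map it to the gpu. */
	vma->start = fstart;
	vma->end = fend;
	return xe_vma_bind_gpu(vma);
}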
> >
>
> Thomas / Oak:
>
> Agree that for at least the initial implementation a 1:1 relationship
> will be easier. Also unsure what the benefit of not splitting VMAs is.
>
> That being said, I don't think this is something we need to decide now.
>
> >
> >
> > > and
> > > with the patch discussed earlier we could've easily reused
> > > xe_vm_unbind_vma() that only touches the mutable page-table state and
> > > does the correct locking.
> > >
> > > The page-table code would then typically take a const xe_vma *, and
> > > an xe_pt_state *, or whatever we choose to call it. All xe_vmas except
> > > hmmptr ones would have a 1:1 xe_vma <-> xe_pt_state relationship.
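
The const split could then be made explicit in the PT interfaces,
something like the following hypothetical signatures, so the compiler
enforces that the PT code only mutates the pt_state:

/* The PT code may read the vma but only mutate the pt_state. */
int xe_pt_bind(struct xe_tile *tile, const struct xe_vma *vma,
	       struct xe_pt_state *pts);
int xe_pt_unbind(struct xe_tile *tile, const struct xe_vma *vma,
		 struct xe_pt_state *pts);
int xe_pt_invalidate(struct xe_tile *tile, const struct xe_vma *vma,
		     struct xe_pt_state *pts);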
>
> Thomas:
>
> I like the idea of VMAs in the PT code function being marked as const
> and having the xe_pt_state as non const. It makes ownership very clear.
>
> Not sure how that will fit into [1] as that series passes around
> a "struct xe_vm_ops" which is a list of "struct xe_vma_op". It does this
> to make "struct xe_vm_ops" a single atomic operation. The VMAs are
> extracted from either the GPUVM base operation or "struct xe_vma_op".
> Maybe these can be const? I'll look into that, but this might not work
> out in practice.
>
> Agree, and I'm also unsure how a 1:N xe_vma <-> xe_pt_state relationship
> fits in with hmmptrs. Could you explain your thinking here?
>
> > >
> >
> > Matt has POC code which works for me for the system allocator w/o this
> > vma/pt_state splitting: https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-/commits/system_allocator_poc?ref_type=heads
> >
> > Can you take a look also...
> >
>
> Thomas:
>
> It might be helpful to look at that branch too; the last 3 patches [4]
> [5] [6] implement a PoC system allocator design built on top of [1],
> using existing userptrs for now, based on our discussions in the
> system allocator design doc. Coding is sometimes the easiest way to
> convey ideas and this is roughly what I had in mind. Not complete by any
> means and still tons of work to do to make this a working solution.
>
> [4] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-/commit/3fcc83c9b075364a9d83415ca73a9f9625543d7c
> [5] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-/commit/d2371e7c007406fb5e84e6605da238d12caa5c62
> [6] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-/commit/91cf8b38428ae9180c92435d3519193d074e837a
>
> > When I worked on the system allocator HLD, I pictured the invalid vma
> > concept also. But somehow Matt was able to make it work without such a
> > concept....
> >
>
> Oak:
>
> Still unclear on this. Maybe we can chat offline to clear this up.
>
> Matt
>
> > Oak
> >
> > > Thoughts, comments?
> > >
> > > Thanks,
> > > Thomas