Separating xe_vma- and page-table state

Wed Mar 13 03:16:50 UTC 2024

On Tue, Mar 12, 2024 at 08:16:24PM -0600, Zeng, Oak wrote:
> Hi Matt,
> 
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost at intel.com>
> > Sent: Tuesday, March 12, 2024 9:27 PM
> > To: Zeng, Oak <oak.zeng at intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>; intel-
> > xe at lists.freedesktop.org
> > Subject: Re: Separating xe_vma- and page-table state
> > 
> > On Tue, Mar 12, 2024 at 05:02:20PM -0600, Zeng, Oak wrote:
> > > Hi Thomas,
> > 
> > Going to reply to both Thomas and Oak to not split this thread.
> > 
> > >
> > > > -----Original Message-----
> > > > From: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> > > > Sent: Tuesday, March 12, 2024 3:43 AM
> > > > To: intel-xe at lists.freedesktop.org
> > > > Cc: Brost, Matthew <matthew.brost at intel.com>; Zeng, Oak
> > > > <oak.zeng at intel.com>
> > > > Subject: Separating xe_vma- and page-table state
> > > >
> > > > Hi,
> > > >
> > > > It's IMO become apparent both in the system allocator discussion and in
> > > > the patch that enables the presence of invalid vmas
> > > > that we need to be
> > > > better at separating xe_vma and page-table state, so that xe_vma state
> > > > would contain things that are mostly immutable and that the user
> > > > requested: PAT index, memory attributes, requested tile presence etc,
> > > > whereas the page-table state would contain mutable state like actual
> > > > tile presence, invalidaton state and MMU notifier.
> > 
> > Thomas:
> > 
> > 100% agree with the idea. There are 2 distict parts of xe_vma - what I
> > call GPUVM state and PT state. In addition, there is what I call IOCTL
> > state (sync objs, etc...) that is updated in various places in xe_vm.c.
> > 
> > Functions to update these respective states should be clearly seperated.
> > In the current implementaion is a complete mess in this regard, you
> > mention xe_vm_unbind_vma below, in its current state there is no way
> > this function could be easily be reused.
> > 
> > My series [1] vastly improves this seperation. It still could be split a
> > bit better though. In my next rebase I can look at this series through
> > the lens of clearly maintaining seperation + likely update [2] to do an
> > unbind rather than a invalidation.
> > 
> > [1] https://patchwork.freedesktop.org/series/125608/
> > [2] https://patchwork.freedesktop.org/series/130935/
> > 
> > >
> > > It is a valid reasoning to me... if we want to do what community want us to do
> > with system allocator,
> > > And if we want to meet our umd's requirement of "free w/o vm_unbind", yes,
> > we need this "invalid" vma concept.
> > >
> > > The strange thing is, it seems Matt can still achieve the goal without introducing
> > invalid vma concept... it doesn't look like he has
> > > This concept in his patch....
> > >
> > 
> > Oak:
> > 
> > I'm a little confused by what mean by "invalid" vma concept. Does
> > that mean no GPU backing or page tables? That is essentially what I call
> > system allocator VMA in [3], right?
> 
> Thomas also mentioned this "invalid vmas" concept. Here is my understanding:
> 
> 1) from the system allocator discussion: we introduced a gigantic vma which is invalid because address in this gigantic vma are not a valid gpu address.
> When we create a vma in gpu page fault handler, we carve out a vma from gigantic vma, this vma is a valid vma. 
> 
> Note there is subtle difference b/t invalid vma and vma without backing or page table. A valid vma can be migrated to system memory and at that time it lost its gpu backing or gpu page table but it is still a valid vma.
> 
> In the POC code you give, you don't the migration logic, so you don't need the note of "invalid vma".
> 

We have gotten a little off topic from original point of this thread...

Anyways we already have this concept on tip, see xe_vm_invalidate_vma.
When a BO is moved or a userptr is invalidated we call
xe_vm_invalidate_vma which invalidates the page tables. Next GPU access
will result in a fault and page fault handler will rebind the VMA. Any
new migration code would call xe_vm_invalidate_vma too.

> 2) from the userptr discussion "free userptr without vm_unbind": in your patch, when user free userptr without vm_unbind, the code behavior is, you invalidate that userptr from gpu page table and put it into a repin list. Userptr vma will be repined on the next exec submission. I guess Thomas call such vma (freed, but not vm_unbind) as "invalid vma"
> 
> For me this behavior causes unnecessary complication. I would destroy the userptr vma when user free it. We can create a new userptr vma on next umd vm_bind. Is this simpler?

Still not following but let's table this as again this is off topic.

Matt

> 
> Oak
> 
> 
> > 
> > Also based on Thomas say below related to xe_vm_unbind_vma and [2] we
> > should be able to have VMA that points to a backing store (either
> > userptr or BO) but doesn't have GPU mappings. We do have this for
> > faulting VMs before the initial bind of a VMA but we cannot currently
> > dynamically change this (we can invalidate page tables but cannot remove
> > them without destroying the VMA). We should fix that.
> > 
> > [3] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-
> > /commit/3fcc83c9b075364a9d83415ca73a9f9625543d7c
> > 
> > >
> > > >
> > > > So far we have had no reason to separate the two, but with hmmptr we
> > > > would likely end up with multiple page-table regions per xe-vma,
> > >
> > > Can we still maintain 1 xe-vma : 1 page table region relationship for simplicity?
> > >
> > > This requires vma splitting. i.e., if you have a hmmptr vma cover range [0~100],
> > > And fault happens at range [40~60], then we will end up with 3 vmas:
> > > [0~40], a dummy hmmptr vma, not mapped gpu
> > > [40~60], hmmptr vma mapped to gp
> > > [60~100], dummy, not mapped to gpu.
> > >
> > > Does this work for you? Or do you see a benefit of not splitting vma?
> > >
> > 
> > Thomas / Oak:
> > 
> > Agree for at least the initial implementation a 1 to 1 relationship will
> > be easier. Also unsure what the benefit of not splitting VMAs is?
> > 
> > That being said, I don't think this is something we need to decide now.
> > 
> > >
> > >
> > > and
> > > > with the patch discussed earlier we could've easily reused
> > > > xe_vm_unbind_vma() that only touches the mutable page-table state and
> > > > does the correct locking.
> > > >
> > > > The page table could would then typically take a const xe_vma *, and
> > > > and xe_pt_state *, or whatever we choose to call it. All xe_vmas except
> > > > hmmptr ones would have an 1:1 xe_vma <-> xe_pt_state relationship.
> > 
> > Thomas:
> > 
> > I like the idea of VMAs in the PT code function being marked as const
> > and having the xe_pt_state as non const. It makes ownership very clear.
> > 
> > Not sure how that will fit into [1] as that series passes around
> > a "struct xe_vm_ops" which is a list of "struct xe_vma_op". It does this
> > to make "struct xe_vm_ops" a single atomic operation. The VMAs are
> > extracted either the GPUVM base operation or "struct xe_vma_op". Maybe
> > these can be const? I'll look into that but this might not work out in
> > practice.
> > 
> > Agree also unsure how 1:N xe_vma <-> xe_pt_state relationship fits in
> > hmmptrs. Could you explain your thinking here?
> > 
> > > >
> > >
> > > Matt has POC codes which works for me for system allocator w/o this
> > vma/pt_state splitting: https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-
> > system-allocator/-/commits/system_allocator_poc?ref_type=heads
> > >
> > > Can you take a look also...
> > >
> > 
> > Thomas:
> > 
> > It might be helpful to look at that branch too, the last 3 patches [3]
> > [4] [5] implement a PoC system allocator design built on top of [1],
> > using existing userptrs for now, and is based on our discussions in the
> > system allocator design doc. Coding is sometimes the easiest way to
> > convay ideas and this roughly what I had in mind. Not complete by any
> > means and still tons of work to do to make this a working solution.
> > 
> > [4] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-
> > /commit/3fcc83c9b075364a9d83415ca73a9f9625543d7c
> > [5] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-
> > /commit/d2371e7c007406fb5e84e6605da238d12caa5c62
> > [6] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-
> > /commit/91cf8b38428ae9180c92435d3519193d074e837a
> > 
> > > When I worked on system allocator HLD, I pictured the invalid vma concept also.
> > But somehow Matt was able to make it working without such concept....
> > >
> > 
> > Oak:
> > 
> > Still unclear on this. Maybe we can chat offline to clear this up.
> > 
> > Matt
> > 
> > > Oak
> > >
> > > > Thoughts, comments?
> > > >
> > > > Thanks,
> > > > Thomas