[PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
Matthew Brost
matthew.brost at intel.com
Wed Apr 24 02:31:36 UTC 2024
On Tue, Apr 23, 2024 at 03:17:03PM -0600, Zeng, Oak wrote:
> Hi Jason,
>
> Sorry for the late reply. I have been working on a v2 of this series: https://patchwork.freedesktop.org/series/132229/. This version addresses some of your concerns, such as removing the global character device and removing the svm process concept (needs further clean-up per Matt's feedback).
>
> But the main concern you raised is not addressed yet. I need to make sure I fully understand your concerns; see inline.
>
A few extra comments with references below.
>
>
> > -----Original Message-----
> > From: Jason Gunthorpe <jgg at nvidia.com>
> > Sent: Tuesday, April 9, 2024 1:24 PM
> > To: Zeng, Oak <oak.zeng at intel.com>
> > Cc: dri-devel at lists.freedesktop.org; intel-xe at lists.freedesktop.org; Brost, Matthew
> > <matthew.brost at intel.com>; Thomas.Hellstrom at linux.intel.com; Welty, Brian
> > <brian.welty at intel.com>; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray at intel.com>; Bommu, Krishnaiah
> > <krishnaiah.bommu at intel.com>; Vishwanathapura, Niranjana
> > <niranjana.vishwanathapura at intel.com>; Leon Romanovsky <leon at kernel.org>
> > Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from
> > hmm range
> >
> > On Tue, Apr 09, 2024 at 04:45:22PM +0000, Zeng, Oak wrote:
> >
> > > > I saw, I am saying this should not be done. You cannot unmap bits of
> > > > a sgl mapping if an invalidation comes in.
> > >
> > > You are right, if we register a huge mmu interval notifier to cover
> > > the whole address space, then we should use dma map/unmap pages to
> > > map bits of the address space. We will explore this approach.
> > >
> > > Right now, in the xe driver, the mmu interval notifier is dynamically
> > > registered with a small address range. We map/unmap the whole small
> > > address range each time, so functionally it still works, but it
> > > might not be as performant as the method you described.
> >
> > Please don't do this, it is not how hmm_range_fault() should be
> > used.
> >
> > It is intended to be page by page and there is no API surface
> > available to safely try to construct covering ranges. Drivers
> > definitely should not try to invent such a thing.
>
> I need your help to understand this comment. Our gpu mirrors the whole CPU virtual address space. It is the first design pattern in your previous reply (the entire exclusive owner of a single device private page table, fully mirroring the mm page table into the device table).
>
> What do you mean by "page by page"/"no API surface available to safely try to construct covering ranges"? As I understand it, hmm_range_fault takes a virtual address range (defined in the hmm_range struct) and walks the cpu page table in this range. It is a range-based API.
>
> From your previous reply ("So I find it a quite strange that this RFC is creating VMA's, notifiers and ranges on the fly "), it seems you are questioning why we create vmas and register mmu interval notifiers on the fly. Let me try to explain it. Xe_vma is a very fundamental concept in the xe driver. The gpu page table updates and invalidations are all vma-based. This concept existed before this svm work. For svm, we create a 2M vma (the size is user configurable) during the gpu page fault handler and register this 2M range with an mmu interval notifier.
>
> Now I am trying to figure out what we can do if we don't create a vma. We could map one page (the one containing the gpu fault address) into the gpu page table, but that doesn't work for us because the GPU cache and TLB would not be performant with a 4K page each time. One way to think of the vma is as a chunk size that is good for GPU HW performance.
>
> And the mmu notifier... if we don't register the mmu notifier on the fly, do we register one mmu notifier to cover the whole CPU virtual address space (which would be huge, e.g., 0~2^56 on a 57-bit machine with a half/half user-space/kernel-space split)? That wouldn't be performant either, because for any address range that is unmapped from the cpu program, even if it is never touched by the GPU, the gpu driver still gets an invalidation callback. In our approach, we only register an mmu notifier for an address range that we know the gpu will touch.
>
AMD seems to register notifiers on demand for parts of the address space
[1], and I think Nvidia's open source driver does this too (I can look this
up if needed). We (Intel) also do this in Xe and in the i915 for userptrs
(explicitly binding a user address via IOCTL), and it seems to work
quite well.
[1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c#L130
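For reference, on-demand registration of a sub-range notifier is roughly the
following shape (a minimal sketch only; the xe_hmm_* names are placeholders,
not actual Xe code):

static bool xe_hmm_invalidate(struct mmu_interval_notifier *mni,
			      const struct mmu_notifier_range *range,
			      unsigned long cur_seq)
{
	/*
	 * Zap the GPU mappings covered by [range->start, range->end) here
	 * (honoring mmu_notifier_range_blockable() if a sleeping lock is
	 * needed) and record cur_seq so the next fault revalidates.
	 */
	mmu_interval_set_seq(mni, cur_seq);
	return true;
}

static const struct mmu_interval_notifier_ops xe_hmm_notifier_ops = {
	.invalidate = xe_hmm_invalidate,
};

/* Called on demand, e.g. from the GPU fault handler, for only the
 * sub-range the GPU is about to touch. */
static int xe_hmm_register_range(struct mmu_interval_notifier *mni,
				 struct mm_struct *mm,
				 unsigned long start, unsigned long length)
{
	return mmu_interval_notifier_insert(mni, mm, start, length,
					    &xe_hmm_notifier_ops);
}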
> >
> > > > Please don't use sg list at all for this.
> > >
> > > As explained, we use an sg list for device private pages so we can
> > > re-use the gpu page table update code.
> >
> > I'm asking you not to use SGL lists for that too. SGL lists are not
> > generic data structures to hold DMA lists.
>
> Matt mentioned to use drm_buddy_block. I will see how that works out.
>
Now that I think about this more, we should probably build an iterator
(xe_res_cursor) for the device pages returned from hmm_range_fault.
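i.e. walk the hmm_pfns array directly, something like the below (rough
sketch; 'range' is the struct hmm_range passed to hmm_range_fault, and the
comments mark where an xe_res_cursor-style / drm_buddy lookup would slot in):

	unsigned long i, npages = (range->end - range->start) >> PAGE_SHIFT;

	for (i = 0; i < npages; ++i) {
		unsigned long pfn = range->hmm_pfns[i];
		struct page *page;

		if (!(pfn & HMM_PFN_VALID))
			continue;

		page = hmm_pfn_to_page(pfn);
		if (is_device_private_page(page)) {
			/* Device page: resolve to a VRAM address via the
			 * driver's own allocator state (drm_buddy block). */
		} else {
			/* System page: dma_map_page() it for the GPU. */
		}
	}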
> >
> > > > This is not what I'm talking about. The GPU VMA is bound to a specific
> > > > MM VA, it should not be created on demand.
> > >
> > > Today we have two places where we create gpu vma: 1) create gpu vma
> > > during a vm_bind ioctl 2) create gpu vma during a page fault of the
> > > system allocator range (this will be in v2 of this series).
> >
> > Don't do 2.
You have to create something, actually 2 things, on a GPU page fault:
something to track the page table state and something to track the VRAM
memory allocation. Both AMD's and Nvidia's open source drivers do this.
In AMD's driver the page table state is svm_range [2] and the VRAM state is
svm_range_bo [3].
Nvidia's open source driver also does something similar (again, I can track
down a ref if needed).
Conceptually Xe will do something similar; these are trending towards
an xe_vma and an xe_bo respectively. The exact details are TBD but the
concept is solid.
[2] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdkfd/kfd_svm.h#L109
[3] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdkfd/kfd_svm.h#L42
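Very roughly, the split between the two pieces of state would look like this
(hypothetical names and fields, purely illustrative of the concept):

struct xe_svm_range {			/* page table state, ala svm_range */
	struct xe_vm *vm;
	u64 start, end;			/* mirrored GPU VA range */
	struct mmu_interval_notifier notifier;
	unsigned long notifier_seq;
	struct xe_svm_range_bo *vram;	/* NULL while backed by sysmem */
};

struct xe_svm_range_bo {		/* VRAM state, ala svm_range_bo */
	struct kref refcount;
	struct xe_bo *bo;		/* or a drm_buddy allocation */
	struct xe_svm_range *range;
};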
>
> As said, we will try the approach of one gigantic gpu vma with N page table states. We will create page table states in page fault handling. But this is only planned for stage 2.
>
> >
> > > I suspect something dynamic is still necessary, either a vma or a
> > > page table state (1 vma but many page table states created
> > > dynamically, as planned in our stage 2).
> >
> > I'm expecting you'd populate the page table memory on demand.
>
> We do populate the gpu page table on demand. When the gpu accesses a virtual address, we populate the gpu page table.
>
>
> >
> > > The reason is, we still need some corresponding gpu structure to
> > > match the cpu vm_area_struct.
> >
> > Definitely not.
>
> See explanation above.
>
Agree the GPU doesn't need to match the vm_area_struct, but the allocation
must be a subset of (or equal to) a vm_area_struct. Again, other drivers do
this too.
e.g. you can't allocate a 2MB chunk if the vm_area_struct looked up is
only 64k.
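Something along these lines (sketch only; fault_addr and mm are placeholders,
and the 2M chunk size is just the example from the discussion above):

	struct vm_area_struct *vas;
	u64 start = ALIGN_DOWN(fault_addr, SZ_2M);
	u64 end = start + SZ_2M;

	mmap_read_lock(mm);
	vas = vma_lookup(mm, fault_addr);
	if (!vas) {
		mmap_read_unlock(mm);
		return -EFAULT;
	}
	/* Clamp the chunk so it never spans outside the CPU VMA. */
	start = max_t(u64, start, vas->vm_start);
	end = min_t(u64, end, vas->vm_end);
	mmap_read_unlock(mm);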
> >
> > > For example, when a gpu page fault happens, you look
> > > up the cpu vm_area_struct for the fault address and create a
> > > corresponding state/struct. And people can have as many cpu
> > > vm_area_structs as they want.
> >
> > No you don't.
Yes you do. See below.
>
> See the explanations above. I need your help to understand how we can do it without a vma (or chunk). One way the GPU driver differs from an RDMA driver is that RDMA doesn't have device private memory, so there is no migration. It only needs to dma-map the system memory pages and use them to fill the RDMA page table, so RDMA doesn't need another memory manager such as our buddy allocator. RDMA deals only with system memory, which is completely struct-page-based management. Page by page makes 100% sense for RDMA.
>
> But for the gpu, we need a way to use device local memory efficiently. This is the main reason we have the vma/chunk concept.
>
> Thanks,
> Oak
>
>
> >
> > You call hmm_range_fault() and it does everything for you. A driver
> > should never touch CPU VMAs and must not be aware of them in any way.
> >
struct vm_area_struct is an argument to the migrate_vma* functions [4], so
yes, drivers need to be aware of CPU VMAs.
Again, AMD [5], Nouveau [6], and Nvidia's open source driver (again no
ref, but I can dig one up) all look up CPU VMAs on a GPU page fault or SVM
bind IOCTL.
[4] https://elixir.bootlin.com/linux/latest/source/include/linux/migrate.h#L186
[5] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c#L522
[6] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/nouveau/nouveau_svm.c#L182
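For completeness, this is the shape of the migration path that forces the
VMA lookup (sketch; vas/start/end/src_pfns/dst_pfns/owner are placeholders,
and allocating the destination pages, the copy itself, and error handling
are omitted):

	struct migrate_vma migrate = {
		.vma		= vas,	/* the looked-up CPU VMA */
		.start		= start,
		.end		= end,
		.src		= src_pfns,
		.dst		= dst_pfns,
		.pgmap_owner	= owner,
		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
	};
	int ret;

	ret = migrate_vma_setup(&migrate);
	if (ret || !migrate.cpages)
		return ret ? ret : -EBUSY;

	/* Allocate VRAM pages, fill migrate.dst, copy the data, then: */
	migrate_vma_pages(&migrate);
	migrate_vma_finalize(&migrate);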
Matt
> > Jason