[PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr

Zeng, Oak oak.zeng at intel.com
Sat Mar 16 01:35:15 UTC 2024



> -----Original Message-----
> From: Brost, Matthew <matthew.brost at intel.com>
> Sent: Thursday, March 14, 2024 4:25 PM
> To: Zeng, Oak <oak.zeng at intel.com>
> Cc: intel-xe at lists.freedesktop.org; Hellstrom, Thomas
> <thomas.hellstrom at intel.com>; airlied at gmail.com; Welty, Brian
> <brian.welty at intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray at intel.com>
> Subject: Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
> 
> On Wed, Mar 13, 2024 at 11:35:52PM -0400, Oak Zeng wrote:
> > Add a helper function xe_hmm_populate_range to populate
> > a a userptr or hmmptr range. This functions calls hmm_range_fault
> > to read CPU page tables and populate all pfns/pages of this
> > virtual address range.
> >
> > If the populated page is system memory page, dma-mapping is performed
> > to get a dma-address which can be used later for GPU to access pages.
> >
> > If the populated page is device private page, we calculate the dpa (
> > device physical address) of the page.
> >
> > The dma-address or dpa is then saved in userptr's sg table. This is
> > prepare work to replace the get_user_pages_fast code in userptr code
> > path. The helper function will also be used to populate hmmptr later.
> >
> > Signed-off-by: Oak Zeng <oak.zeng at intel.com>
> > Co-developed-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura at intel.com>
> > Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura at intel.com>
> > Cc: Matthew Brost <matthew.brost at intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom at intel.com>
> > Cc: Brian Welty <brian.welty at intel.com>
> > ---
> >  drivers/gpu/drm/xe/Makefile |   3 +-
> >  drivers/gpu/drm/xe/xe_hmm.c | 213
> ++++++++++++++++++++++++++++++++++++
> >  drivers/gpu/drm/xe/xe_hmm.h |  12 ++
> >  3 files changed, 227 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
> >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.h
> >
> > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > index 840467080e59..29dcbc938b01 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -143,7 +143,8 @@ xe-y += xe_bb.o \
> >  	xe_wait_user_fence.o \
> >  	xe_wa.o \
> >  	xe_wopcm.o \
> > -	xe_svm_devmem.o
> > +	xe_svm_devmem.o \
> > +	xe_hmm.o
> >
> >  # graphics hardware monitoring (HWMON) support
> >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
> > new file mode 100644
> > index 000000000000..c45c2447d386
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_hmm.c
> > @@ -0,0 +1,213 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#include <linux/mmu_notifier.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/memremap.h>
> > +#include <linux/swap.h>
> > +#include <linux/mm.h>
> > +#include "xe_hmm.h"
> > +#include "xe_vm.h"
> > +
> > +/**
> > + * mark_range_accessed() - mark a range is accessed, so core mm
> > + * have such information for memory eviction or write back to
> > + * hard disk
> > + *
> > + * @range: the range to mark
> > + * @write: if write to this range, we mark pages in this range
> > + * as dirty
> > + */
> > +static void mark_range_accessed(struct hmm_range *range, bool write)
> > +{
> > +	struct page *page;
> > +	u64 i, npages;
> > +
> > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >>
> PAGE_SHIFT) + 1;
> > +	for (i = 0; i < npages; i++) {
> > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > +		if (write) {
> > +			lock_page(page);
> > +			set_page_dirty(page);
> > +			unlock_page(page);
> > +		}
> > +		mark_page_accessed(page);
> > +	}
> > +}
> > +
> > +/**
> > + * build_sg() - build a scatter gather table for all the physical pages/pfn
> > + * in a hmm_range. dma-address is save in sg table and will be used to
> program
> > + * GPU page table later.
> > + *
> > + * @xe: the xe device who will access the dma-address in sg table
> > + * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
> > + * has the pfn numbers of pages that back up this hmm address range.
> > + * @st: pointer to the sg table.
> > + * @write: whether we write to this range. This decides dma map direction
> > + * for system pages. If write we map it bi-diretional; otherwise
> > + * DMA_TO_DEVICE
> > + *
> > + * All the contiguous pfns will be collapsed into one entry in
> > + * the scatter gather table. This is for the convenience of
> > + * later on operations to bind address range to GPU page table.
> > + *
> > + * The dma_address in the sg table will later be used by GPU to
> > + * access memory. So if the memory is system memory, we need to
> > + * do a dma-mapping so it can be accessed by GPU/DMA. If the memory
> > + * is GPU local memory (of the GPU who is going to access memory),
> > + * we need gpu dpa (device physical address), and there is no need
> > + * of dma-mapping.
> > + *
> > + * FIXME: dma-mapping for peer gpu device to access remote gpu's
> > + * memory. Add this when you support p2p
> > + *
> > + * This function allocates the storage of the sg table. It is
> > + * caller's responsibility to free it calling sg_free_table.
> > + *
> > + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> > + */
> > +static int build_sg(struct xe_device *xe, struct hmm_range *range,
> 
> xe is unused.

It is used on the line below:
> 
> > +			     struct sg_table *st, bool write)
> > +{
> > +	struct device *dev = xe->drm.dev;

Used here

> > +	struct scatterlist *sg;
> > +	u64 i, npages;
> > +
> > +	sg = NULL;
> > +	st->nents = 0;
> > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >>
> PAGE_SHIFT) + 1;
> > +
> > +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> > +		return -ENOMEM;
> > +
> > +	for (i = 0; i < npages; i++) {
> > +		struct page *page;
> > +		unsigned long addr;
> > +		struct xe_mem_region *mr;
> > +
> > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > +		if (is_device_private_page(page)) {
> > +			mr = page_to_mem_region(page);
> 
> Not seeing where page_to_mem_region is defined.


Yeah, I forgot to pick up the patch that introduces page_to_mem_region. Will pick it up.

> 
> > +			addr = vram_pfn_to_dpa(mr, range->hmm_pfns[i]);
> > +		} else {
> > +			addr = dma_map_page(dev, page, 0, PAGE_SIZE,
> > +					write ? DMA_BIDIRECTIONAL :
> DMA_TO_DEVICE);
> > +		}
> > +
> > +		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
> > +			sg->length += PAGE_SIZE;
> > +			sg_dma_len(sg) += PAGE_SIZE;
> > +			continue;
> > +		}
> > +
> > +		sg =  sg ? sg_next(sg) : st->sgl;
> > +		sg_dma_address(sg) = addr;
> > +		sg_dma_len(sg) = PAGE_SIZE;
> > +		sg->length = PAGE_SIZE;
> > +		st->nents++;
> > +	}
> > +
> > +	sg_mark_end(sg);
> > +	return 0;
> > +}
> > +
> 
> Hmm, this looks way to open coded to me.
> 
> Can't we do something like this:
> 
> struct page **pages = convert from range->hmm_pfns
> sg_alloc_table_from_pages_segment
> if (is_device_private_page())
>         populatue sg table via vram_pfn_to_dpa
> else
>         dma_map_sgtable
> free(pages)
> 
> This assume we are not mixing is_device_private_page & system memory
> pages in a single struct hmm_range.



That is exactly what I pictured. We actually can mix vram and system memory pages in a single range. The reason is that migration of an hmm range from system memory to vram can fail for whatever reason, leaving the range only partially migrated. Migration is best effort in that case, and we simply map the system memory part of the range to the GPU. This will come up in the upcoming system allocator patches.

This case was found during i915 testing.

I also checked AMD's code. They make the same assumption; take a look at the function svm_range_dma_map_dev.

I agree the code is not nice, but I don't have a better approach given the assumption above. A rough sketch of the suggested flow is below.
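For reference, a minimal sketch (assumed usage, not tested) of the suggested sg_alloc_table_from_pages_segment + dma_map_sgtable flow; it only covers the case where the whole range is system memory, which we cannot assume after a partial migration:

static int build_sg_system_only(struct device *dev, struct hmm_range *range,
				struct sg_table *st, u64 npages, bool write)
{
	struct page **pages;
	u64 i;
	int ret;

	pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* only valid if no pfn in the range is a device private page */
	for (i = 0; i < npages; i++)
		pages[i] = hmm_pfn_to_page(range->hmm_pfns[i]);

	ret = sg_alloc_table_from_pages_segment(st, pages, npages, 0,
						npages << PAGE_SHIFT,
						UINT_MAX, GFP_KERNEL);
	if (ret)
		goto free_pages;

	ret = dma_map_sgtable(dev, st, write ? DMA_BIDIRECTIONAL :
			      DMA_TO_DEVICE, 0);
	if (ret)
		sg_free_table(st);

free_pages:
	kvfree(pages);
	return ret;
}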

> 
> 
> > +/**
> > + * xe_hmm_populate_range() - Populate physical pages of a virtual
> > + * address range
> > + *
> > + * @vma: vma has information of the range to populate. only vma
> > + * of userptr and hmmptr type can be populated.
> > + * @hmm_range: pointer to hmm_range struct. hmm_rang->hmm_pfns
> > + * will hold the populated pfns.
> > + * @write: populate pages with write permission
> > + *
> > + * This function populate the physical pages of a virtual
> > + * address range. The populated physical pages is saved in
> > + * userptr's sg table. It is similar to get_user_pages but call
> > + * hmm_range_fault.
> > + *
> > + * This function also read mmu notifier sequence # (
> > + * mmu_interval_read_begin), for the purpose of later
> > + * comparison (through mmu_interval_read_retry).
> > + *
> > + * This must be called with mmap read or write lock held.
> > + *
> > + * This function allocates the storage of the userptr sg table.
> > + * It is caller's responsibility to free it calling sg_free_table.
> 
> I'd add a helper to free the sg_free_table & unmap the dma pages if
> needed.

Ok. Due to the mixed vram/system memory case explained above there will be a small complication, so I will do it in a separate patch. A rough sketch of such a helper is below.
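A placeholder sketch (name and signature are hypothetical). It assumes every entry in the table was dma-mapped system memory; device-private (vram) entries carry a dpa rather than a dma address, so a mixed range needs per-entry bookkeeping before unmapping, which is the complication mentioned above:

static void xe_hmm_free_sg(struct xe_device *xe, struct xe_userptr *userptr,
			   bool write)
{
	/* only correct when every entry was dma-mapped system memory */
	dma_unmap_sgtable(xe->drm.dev, &userptr->sgt,
			  write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE, 0);
	sg_free_table(&userptr->sgt);
	userptr->sg = NULL;
}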
> 
> > + *
> > + * returns: 0 for succuss; negative error no on failure
> > + */
> > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> *hmm_range,
> > +						bool write)
> > +{
> 
> The layering is all wrong here, we shouldn't be touching struct xe_vma
> directly in hmm layer.

I have to admit we don't have a clear layering here. This function is supposed to be a shared helper used by both hmmptr and userptr.

You may have gotten the impression from my POC series that we don't have xe_vma in the system allocator code. That was true then, but after the design discussion we decided to unify the userptr and hmmptr code, so xe_vma will also show up in the system allocator code. Basically, xe_vma will replace the xe_svm_range concept from my earlier series.

Of course, per Thomas, we can further optimize by splitting xe_vma into mutable and immutable parts, but that should be step 2.

> 
> Pass in a populated hmm_range and sgt. Or alternatively pass in
> arguments and then populate a hmm_range locally.

xe_vma is the input parameter here.

I will remove hmm_range from the function parameters and make it a local. I figured I don't need it as an output anymore; everything we need is already in the sg table.

> 
> > +	unsigned long timeout =
> > +		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > +	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
> > +	struct xe_userptr_vma *userptr_vma;
> > +	struct xe_userptr *userptr;
> > +	u64 start = vma->gpuva.va.addr;
> > +	u64 end = start + vma->gpuva.va.range;
> 
> We have helper - xe_vma_start and xe_vma_end but I think either of these
> are correct in this case.
> 
> xe_vma_start is the address which this bound to the GPU, we want the
> userptr address.
> 
> So I think it would be:
> 
> start = xe_vma_userptr()
> end = xe_vma_userptr() + xe_vma_size()

You are correct. Will fix. I mixed these up because, in the system allocator, the CPU address is always equal to the GPU address.
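I.e., roughly, using the helpers you mention (a sketch of the fix):

	/* userptr CPU address, not the address the range is bound to on the GPU */
	u64 start = xe_vma_userptr(vma);
	u64 end = start + xe_vma_size(vma);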
> 
> 
> > +	struct xe_vm *vm = xe_vma_vm(vma);
> > +	u64 npages;
> > +	int ret;
> > +
> > +	userptr_vma = to_userptr_vma(vma);
> > +	userptr = &userptr_vma->userptr;
> > +	mmap_assert_locked(userptr->notifier.mm);
> > +
> > +	npages = ((end - 1) >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
> 
> This math is done above, if you need this math in next rev add a helper.


Will do.
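Something like this shared helper (a sketch; name is tentative):

static u64 xe_npages_in_range(unsigned long start, unsigned long end)
{
	return ((end - 1) >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
}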
> 
> > +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > +	if (unlikely(!pfns))
> > +		return -ENOMEM;
> > +
> > +	if (write)
> > +		flags |= HMM_PFN_REQ_WRITE;
> > +
> > +	memset64((u64 *)pfns, (u64)flags, npages);
> 
> Why is this needed, can't we just set hmm_range->default_flags?

In this case, setting default_flags also works.

This is code I inherited from Niranjana; per my earlier testing it works.

Basically there are two ways to control the flags:
default_flags is the coarse grained way, applying one set of flags to all pfns;
what I am using here is a per-pfn, fine grained flag setting. It also works. A sketch of both is below.
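Roughly (a sketch; as I understand it, hmm_range_fault ORs default_flags with the per-pfn request flags after masking them with pfn_flags_mask):

	/* coarse grained: one set of required flags for every pfn in the range */
	hmm_range->default_flags = HMM_PFN_REQ_FAULT |
				   (write ? HMM_PFN_REQ_WRITE : 0);

	/* fine grained: per-pfn request flags, honoured via pfn_flags_mask */
	memset64((u64 *)pfns, (u64)flags, npages);
	hmm_range->hmm_pfns = pfns;
	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE;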

> 
> > +	hmm_range->hmm_pfns = pfns;
> > +	hmm_range->notifier_seq = mmu_interval_read_begin(&userptr->notifier);
> 
> We need a userptr->notifier == userptr->notifier_seq check that just
> returns, right?

Yes, this sequence number is used to decide whether we need to retry. See the function xe_pt_userptr_pre_commit.
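The retry check in the caller looks roughly like this (a sketch):

	/* e.g. in xe_pt_userptr_pre_commit, under the appropriate lock */
	if (mmu_interval_read_retry(&userptr->notifier,
				    userptr->notifier_seq)) {
		/* the range was invalidated after mmu_interval_read_begin(),
		 * so repopulate before binding */
	}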
> 
> > +	hmm_range->notifier = &userptr->notifier;
> > +	hmm_range->start = start;
> > +	hmm_range->end = end;
> > +	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT |
> HMM_PFN_REQ_WRITE;
> 
> Is this needed? AMD and Nouveau do not set this argument.


As explained above, amd and nouveau only use the coarse grained setting.

> 
> > +	/**
> > +	 * FIXME:
> > +	 * Set the the dev_private_owner can prevent hmm_range_fault to fault
> > +	 * in the device private pages owned by caller. See function
> > +	 * hmm_vma_handle_pte. In multiple GPU case, this should be set to the
> > +	 * device owner of the best migration destination. e.g., device0/vm0
> > +	 * has a page fault, but we have determined the best placement of
> > +	 * the fault address should be on device1, we should set below to
> > +	 * device1 instead of device0.
> > +	 */
> > +	hmm_range->dev_private_owner = vm->xe->drm.dev;
> > +
> > +	while (true) {
> 
> mmap_read_lock(mm);
> 
> > +		ret = hmm_range_fault(hmm_range);
> 
> mmap_read_unlock(mm);


This needs to be done in the caller. The reason is that, in the system allocator code path, the mmap lock is already held before calling into this helper.

I will check patch 5 for this purpose.
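For the userptr path, the caller would take the lock around the helper, roughly (a sketch):

	mmap_read_lock(userptr->notifier.mm);
	ret = xe_hmm_populate_range(vma, &hmm_range, write);
	mmap_read_unlock(userptr->notifier.mm);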

> 
> > +		if (time_after(jiffies, timeout))
> > +			break;
> > +
> > +		if (ret == -EBUSY)
> > +			continue;
> > +		break;
> > +	}
> > +
> > +	if (ret)
> > +		goto free_pfns;
> > +
> > +	ret = build_sg(vm->xe, hmm_range, &userptr->sgt, write);
> > +	if (ret)
> > +		goto free_pfns;
> > +
> > +	mark_range_accessed(hmm_range, write);
> > +	userptr->sg = &userptr->sgt;
> 
> Again this should be set in caller after this function return.

Why can't we set it here? It is shared between userptr and hmmptr.


> 
> > +	userptr->notifier_seq = hmm_range->notifier_seq;
> 
> This is could be a pass by reference I guess and set here.

Sorry, I don't understand this comment.

> 
> > +
> > +free_pfns:
> > +	kvfree(pfns);
> > +	return ret;
> > +}
> > +
> > diff --git a/drivers/gpu/drm/xe/xe_hmm.h b/drivers/gpu/drm/xe/xe_hmm.h
> > new file mode 100644
> > index 000000000000..960f3f6d36ae
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_hmm.h
> > @@ -0,0 +1,12 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#include <linux/types.h>
> > +#include <linux/hmm.h>
> > +#include "xe_vm_types.h"
> > +#include "xe_svm.h"
> 
> As per the previous patches no need to xe_*.h files, just forward
> declare any arguments.


Will do.

Oak
> 
> Matt
> 
> > +
> > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range
> *hmm_range,
> > +						bool write);
> > --
> > 2.26.3
> >

