[v2 27/31] drm/xe/svm: Handle CPU page fault

Zeng, Oak oak.zeng at intel.com
Fri Jun 7 04:30:28 UTC 2024



> -----Original Message-----
> From: Brost, Matthew <matthew.brost at intel.com>
> Sent: Wednesday, April 10, 2024 10:07 PM
> To: Zeng, Oak <oak.zeng at intel.com>
> Cc: intel-xe at lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray at intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu at intel.com>; Thomas.Hellstrom at linux.intel.com; Welty,
> Brian <brian.welty at intel.com>
> Subject: Re: [v2 27/31] drm/xe/svm: Handle CPU page fault
> 
> On Tue, Apr 09, 2024 at 04:17:38PM -0400, Oak Zeng wrote:
> > Under the picture of svm, CPU and GPU program share one same
> > virtual address space. The backing store of this virtual address
> > space can be either in system memory or device memory. Since GPU
> > device memory is remaped as DEVICE_PRIVATE, CPU can't access it.
> > Any CPU access to device memory causes a page fault. Implement
> > a page fault handler to migrate memory back to system memory and
> > map it to CPU page table so the CPU program can proceed.
> >
> > Also unbind this page from GPU side, and free the original GPU
> > device page
> >
> > Signed-off-by: Oak Zeng <oak.zeng at intel.com>
> > Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura at intel.com>
> > Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura at intel.com>
> > Cc: Matthew Brost <matthew.brost at intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom at intel.com>
> > Cc: Brian Welty <brian.welty at intel.com>
> > ---
> >  drivers/gpu/drm/xe/Makefile         |   1 +
> >  drivers/gpu/drm/xe/xe_svm.h         |   8 +-
> >  drivers/gpu/drm/xe/xe_svm_devmem.c  |   7 +-
> >  drivers/gpu/drm/xe/xe_svm_migrate.c | 222 ++++++++++++++++++++++++++++
> >  4 files changed, 230 insertions(+), 8 deletions(-)
> >  create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c
> >
> > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > index f89d77b6d654..65289acdd563 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -131,6 +131,7 @@ xe-y += xe_bb.o \
> >  	xe_step.o \
> >  	xe_svm.o \
> >  	xe_svm_devmem.o \
> > +	xe_svm_migrate.o \
> 
> See comments about file org, same thing applies here. Let's put all of
> the svm implementation in a single file.

Did this in v3.

> 
> >  	xe_sync.o \
> >  	xe_tile.o \
> >  	xe_tile_sysfs.o \
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > index f601dffe3fc1..c9e4239c44b4 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -7,11 +7,11 @@
> >  #define __XE_SVM_H
> >
> >  #include <linux/mm_types.h>
> > +#include <linux/mm.h>
> >  #include "xe_device_types.h"
> >  #include "xe_device.h"
> >  #include "xe_assert.h"
> > -
> > -struct xe_vm;
> > +#include "xe_vm_types.h"
> >
> >  /**
> >   * struct xe_svm - data structure to represent a shared
> > @@ -31,6 +31,9 @@ struct xe_svm {
> >  	struct list_head vm_list;
> >  };
> >
> > +#define xe_svm_for_each_vm(svm, vm)				\
> > +		list_for_each_entry(vm, &svm->vm_list, svm_link)
> > +
> 
> Don't think this is need, see below.
> 
> >  extern struct xe_svm *xe_create_svm(void);
> >  void xe_destroy_svm(struct xe_svm *svm);
> >  extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
> > @@ -79,4 +82,5 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> >
> >  void xe_devm_free_blocks(struct list_head *blocks);
> >  void xe_devm_page_free(struct page *page);
> > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > index 088ac209ad80..32ada458f1dd 100644
> > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > @@ -37,11 +37,6 @@ struct xe_svm_block_meta {
> >  	unsigned long bitmap[];
> >  };
> >
> > -static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > -{
> > -	return 0;
> > -}
> > -
> >  static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> >  {
> >  	/** DRM buddy's block offset is 0-based*/
> > @@ -168,7 +163,7 @@ void xe_devm_free_blocks(struct list_head *blocks)
> >
> >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> >  	.page_free = xe_devm_page_free,
> > -	.migrate_to_ram = xe_devm_migrate_to_ram,
> > +	.migrate_to_ram = xe_svm_migrate_to_sram,
> 
> Again single file so this will be static function, no reason to export
> this.

Agreed.

> 
> >  };
> >
> >  /**
> > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > new file mode 100644
> > index 000000000000..0db831af098e
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > @@ -0,0 +1,222 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2023 Intel Corporation
> > + */
> > +
> > +#include <linux/gfp.h>
> > +#include <linux/migrate.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/dma-fence.h>
> > +#include <linux/bitops.h>
> > +#include <linux/bitmap.h>
> > +#include <linux/kernel.h>
> > +#include <linux/slab.h>
> > +#include <drm/drm_buddy.h>
> > +#include "xe_device_types.h"
> > +#include "xe_device.h"
> > +#include "xe_trace.h"
> > +#include "xe_migrate.h"
> > +#include "xe_ttm_vram_mgr_types.h"
> > +#include "xe_assert.h"
> > +#include "xe_pt.h"
> > +#include "xe_svm.h"
> > +#include "xe_vm.h"
> > +
> > +
> > +/**
> > + * alloc_host_page() - allocate one host page for the fault vma
> > + *
> > + * @dev: (GPU) device that will access the allocated page
> > + * @vma: the fault vma that we need allocate page for
> > + * @addr: the fault address. The allocated page is for this address
> > + * @dma_addr: used to output the dma address of the allocated page.
> > + * This dma address will be used for gpu to access this page. GPU
> > + * access host page through a dma mapped address.
> > + * @pfn: used to output the pfn of the allocated page.
> > + *
> > + * This function allocate one host page for the specified vma. It
> > + * also does some prepare work for GPU to access this page, such
> > + * as map this page to iommu (by calling dma_map_page).
> > + *
> > + * When this function returns, the page is locked.
> > + *
> > + * Return struct page pointer when success
> > + * NULL otherwise
> > + */
> > +static struct page *alloc_host_page(struct device *dev,
> > +							 struct vm_area_struct *vma,
> > +							 unsigned long addr,
> > +							 dma_addr_t *dma_addr,
> > +							 unsigned long *pfn)
> 
> Weird alignment, also I don't think we are want to allocate a page at
> time...
> 
> Beyond that, can't say I'm a fan of 2 arguments being return and
> populated here either (dma_addr_t *dma_addr, unsigned long *pfn). I
> haven't seen a lot that style function in Linux.
> 
> Probably makes more sense to have a function which allocates pages,
> locks them, and populates the pfn array (migrate_pfn) rather than doing
> this a page at a time.


Agreed. In v3, I adopted Nvidia's new dma-mapping API, which also requires a two-step approach: 1) allocate pages, 2) dma-map them.
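
A rough sketch of the two steps as I picture them (helper names are placeholders, I'm using the classic dma_map_page here rather than the new API, and error unwind of already-allocated pages is omitted):

static int svm_alloc_sram_pages(struct vm_area_struct *vma, unsigned long start,
				unsigned long npages, unsigned long *dst_pfns)
{
	unsigned long i;

	/* step 1: allocate and lock all destination pages, fill migrate dst array */
	for (i = 0; i < npages; i++) {
		struct page *page = alloc_page_vma(GFP_HIGHUSER, vma,
						   start + i * PAGE_SIZE);

		if (!page)
			return -ENOMEM;
		lock_page(page); /* hmm.rst requires the pages to be locked */
		dst_pfns[i] = migrate_pfn(page_to_pfn(page));
	}
	return 0;
}

static int svm_dma_map_sram_pages(struct device *dev, unsigned long *dst_pfns,
				  dma_addr_t *dma_addrs, unsigned long npages)
{
	unsigned long i;

	/* step 2: dma-map the already allocated pages for the blitter */
	for (i = 0; i < npages; i++) {
		struct page *page = migrate_pfn_to_page(dst_pfns[i]);

		dma_addrs[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
					    DMA_FROM_DEVICE);
		if (dma_mapping_error(dev, dma_addrs[i]))
			return -EFAULT;
	}
	return 0;
}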

> 
> > +{
> > +	struct page *page;
> > +
> > +	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
> > +	if (unlikely(!page))
> > +		return NULL;
> > +
> > +	/**Lock page per hmm requirement, see hmm.rst*/
> > +	lock_page(page);
> > +	*dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
> 
> The device is writing to these pages so I think DMA_BIDIRECTIONAL is
> needed, right? 

We are copying from device memory to system memory here. Yes, the device writes to those pages, but the direction parameter describes the direction of the data transfer for this mapping (here, from the device into system memory), not a general read/write permission. I think DMA_FROM_DEVICE is still correct.
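
To illustrate how I read the direction parameter (variable names are just placeholders):

	/* vram -> sram migration: the blitter writes into the sram page */
	dma_addr = dma_map_page(dev, sram_page, 0, PAGE_SIZE, DMA_FROM_DEVICE);

	/* sram -> vram migration: the blitter only reads the sram page */
	dma_addr = dma_map_page(dev, sram_page, 0, PAGE_SIZE, DMA_TO_DEVICE);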


> As mentioned above I think this should be broken out into
> a different step too.


Agreed.

> 
> > +	if (unlikely(dma_mapping_error(dev, *dma_addr))) {
> > +		unlock_page(page);
> > +		__free_page(page);
> > +		return NULL;
> > +	}
> > +
> > +	*pfn = migrate_pfn(page_to_pfn(page));
> > +	return page;
> > +}
> > +
> > +static void free_host_page(struct page *page)
> > +{
> > +	unlock_page(page);
> > +	put_page(page);
> > +}
> > +
> > +/**
> > + * migrate_page_vram_to_ram() - migrate one page from vram to ram
> > + *
> > + * @vma: The vma that the page is mapped to
> > + * @addr: The virtual address that the page is mapped to
> > + * @src_pfn: src page's page frame number
> > + * @dst_pfn: used to return dstination page (in system ram)'s pfn
> > + *
> > + * Allocate one page in system ram and copy memory from device memory
> > + * to system ram.
> > + *
> > + * Return: 0 if this page is already in sram (no need to migrate)
> > + * 1: successfully migrated this page from vram to sram.
> > + * error code otherwise
> > + */
> > +static int migrate_page_vram_to_ram(struct vm_area_struct *vma, unsigned long addr,
> > +						unsigned long src_pfn, unsigned long *dst_pfn)
> > +{
> 
> We definitely don't want to copy 1 page at time. I touch on this in [1].
> Basically this going to perform poorly unless we use larger copies, the
> migrate code supports non-contigous sram addresses, and vram addresses
> will likely be contigous due to the buddy allocator.

Totally agreed, we need to use one blitter command to copy many pages. I was aware of this when I wrote v2, and it is addressed in v3.

But also be aware that we might end up migrating multiple pages from system memory to device memory on each GPU page fault, whereas on a CPU page fault we only want to migrate one page, to cover the fault address. I know that in this v2 I migrated the whole CPU VMA; I realized this is not correct after reading more core mm and hmm code. The main reason is that the core mm only programs one pte entry (covering 4K) even if we migrate multiple pages; see the logic in handle_pte_fault and do_swap_page. So I switched to a one-page scheme in v3. At least this is my current understanding.
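
So for the CPU fault path in v3 the migrate_vma window shrinks to the single faulting page, roughly (a sketch, not the final code):

	unsigned long src_pfn = 0, dst_pfn = 0;
	struct migrate_vma migrate = {
		.vma		= vmf->vma,
		/* vmf->address is already masked to a page boundary */
		.start		= vmf->address,
		.end		= vmf->address + PAGE_SIZE,
		.src		= &src_pfn,
		.dst		= &dst_pfn,
		.pgmap_owner	= xe,
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
		.fault_page	= vmf->page,
	};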

Regarding vram contiguity, I think that if we don't pass DRM_BUDDY_CONTIGUOUS_ALLOCATION we can still run into non-contiguous vram. Even if the original allocation is contiguous, a portion of that contiguous vram can be migrated to system memory (and freed on the device side), then migrated back to vram. We end up with non-contiguous vram in that case.

I am thinking of not using an identity mapping for the vram ppgtt mapping in this case, just like what we do for the system memory ppgtt mapping. That way we don't care whether the vram is contiguous or not; a one-shot blitter command can do the work. Otherwise, we will still have to split the copy task per contiguous vram run. This is marked as a FIXME in v3. What do you think?
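
Without that, the splitting I mean looks roughly like below, where xe_svm_copy_chunk() is a hypothetical helper that copies one physically contiguous vram run to the (possibly scattered) sram pages; migrate.src/dma_addrs are the arrays set up earlier:

	for (i = 0; i < npages; ) {
		unsigned long run = 1;

		/* grow the run while the device pages stay physically contiguous */
		while (i + run < npages &&
		       migrate_pfn_to_page(migrate.src[i + run]) ==
		       migrate_pfn_to_page(migrate.src[i + run - 1]) + 1)
			run++;

		err = xe_svm_copy_chunk(tile, &migrate.src[i], &dma_addrs[i], run);
		if (err)
			break;
		i += run;
	}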


> 
> [1] https://patchwork.freedesktop.org/patch/588548/?series=132229&rev=1
> 
> > +	struct xe_mem_region *mr;
> > +	struct xe_tile *tile;
> > +	struct xe_device *xe;
> > +	struct device *dev;
> > +	dma_addr_t dma_addr = 0;
> > +	struct dma_fence *fence;
> > +	struct page *host_page;
> > +	struct page *src_page;
> > +	u64 src_dpa;
> > +
> > +	src_page = migrate_pfn_to_page(src_pfn);
> > +	if (unlikely(!src_page || !(src_pfn & MIGRATE_PFN_MIGRATE)))
> 
> I'm going to say this is a bug if !src_page ||
> !is_zone_device_page(src_page) || !(src_pfn & MIGRATE_PFN_MIGRATE) and
> we return -EFAULT (or another error code if that makes more sense). We
> are migrating from the device where we know we have backing store from
> the original fault.

Agreed, in the context of this v2.

In v3, I am moving to a more page-centric scheme. Think of this scenario: a virtual address range, let's say 2M, is migrated to gpu device memory. Now the CPU accesses one page in the middle of this 2M range, so this single 4K page is migrated back to system memory, and HMM triggers an invalidation callback to the driver for this 4K range. Now the gpu accesses this 4K range. Say our GPU migration granularity is 2M (the chunk concept in your series) and the migration policy says we need to migrate back to device memory: migrate_vma_setup will be set up to cover a 2M range, but hmm will tell us that only the one 4K page in system memory needs to be migrated while all the rest (2M - 4K) needs no migration. In that scenario MIGRATE_PFN_MIGRATE is not set for the 2M - 4K remainder. Of course this scenario doesn't apply to the vram-to-sram migration above. See the sketch below.
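
So the sram-to-vram collection loop in v3 has to tolerate holes, something like this (a sketch; vram_pages[] is a placeholder for whatever the vram allocation produced):

	for (i = 0; i < npages; i++) {
		/* hmm/migrate marked this entry as not needing migration */
		if (!(migrate.src[i] & MIGRATE_PFN_MIGRATE)) {
			migrate.dst[i] = 0;
			continue;
		}
		migrate.dst[i] = migrate_pfn(page_to_pfn(vram_pages[i]));
	}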

> 
> > +		return 0;
> > +
> > +	mr = xe_page_to_mem_region(src_page);
> > +	tile = xe_mem_region_to_tile(mr);
> > +	xe = tile_to_xe(tile);
> > +	dev = xe->drm.dev;
> > +
> > +	src_dpa = xe_mem_region_pfn_to_dpa(mr, src_pfn);
> > +	host_page = alloc_host_page(dev, vma, addr, &dma_addr, dst_pfn);
> > +	if (!host_page)
> > +		return -ENOMEM;
> > +
> > +	fence = xe_migrate_pa(tile->migrate, src_dpa, true,
> > +						dma_addr, false, PAGE_SIZE);
> > +	if (IS_ERR(fence)) {
> > +		dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
> > +		free_host_page(host_page);
> > +		return PTR_ERR(fence);
> > +	}
> > +
> > +	dma_fence_wait(fence, false);
> 
> Even if we did want to migrate a page at a time, we only need to wait on
> the last fence due to the ordered nature of exec queues.

Sure. As said, we migrate all pages in one shot in v3, so the above comment doesn't apply anymore.
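
For the case where the one-shot copy still ends up split into several blits (one per contiguous vram run), the last-fence pattern you describe would look roughly like this (blit_one_run() is a placeholder for the per-run copy):

	struct dma_fence *last_fence = NULL;

	for (i = 0; i < nruns; i++) {
		fence = blit_one_run(tile, run[i]);
		if (IS_ERR(fence))
			break;
		/* the exec queue is ordered, so only the newest fence matters */
		dma_fence_put(last_fence);
		last_fence = fence;
	}

	if (last_fence) {
		dma_fence_wait(last_fence, false);
		dma_fence_put(last_fence);
	}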

> 
> > +	dma_fence_put(fence);
> > +	dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
> 
> With above, will likely unmap all dma pages in a single function once
> the last fence is signaled.

Yes, this was handled properly in v3.

> 
> > +	return 1;
> > +}
> > +
> > +/**
> > + * xe_svm_migrate_to_sram() - Migrate memory back to sram on CPU page fault
> > + *
> > + * @vmf: cpu vm fault structure, contains fault information such as vma etc.
> > + *
> > + * Note, this is in CPU's vm fault handler, caller holds the mmap read lock.
> > + *
> > + * This function migrate one gpu vma which contains the fault address to sram.
> > + * We try to maintain a 1:1 mapping b/t the CPU vma and gpu vma (i.e., create one
> > + * gpu vma for one cpu vma initially and try not to split it). So this scheme end
> > + * up migrate at the vma granularity. This might not be the best performant scheme
> > + *
> > + * This can be tunned with a migration granularity for  performance, for example,
> > + * migration 2M for each CPU page fault, or let user specify how much to migrate.
> > + * This is more complex due to vma splitting.
> > + *
> > + * This function should also update GPU page table, so the fault virtual address
> > + * points to the same sram location from GPU side. This is TBD.
> > + *
> > + * Return:
> > + * 0 on success
> > + * VM_FAULT_SIGBUS: failed to migrate page to system memory, application
> > + * will be signaled a SIGBUG
> > + */
> > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
> > +{
> > +	struct xe_mem_region *mr = xe_page_to_mem_region(vmf->page);
> > +	struct xe_tile *tile = xe_mem_region_to_tile(mr);
> > +	struct xe_device *xe = tile_to_xe(tile);
> > +	struct vm_area_struct *vma = vmf->vma;
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct xe_svm *svm = xe_lookup_svm_by_mm(mm);
> 
> I don't think this is needed... More below.

Yes. I removed the xe_svm concept in v3. 

In my v3, we have a mm_struct pointer in the gpuvm. If a gpuvm participates in svm, the mm_struct pointer is set. This is from your PoC and I think it is the simplest implementation.

Depending on how people view it, we might still bring back the xe_svm or drm_gpusvm (as in your new series) though. It's not a huge deal. Let's see how things go.

> 
> > +	unsigned long addr = vma->vm_start;
> > +	u64 npages = vma_pages(vma);
> > +	struct xe_vma *xe_vma;
> > +	vm_fault_t ret = 0;
> > +	struct xe_vm *vm;
> > +	void *buf;
> > +	int i;
> > +
> > +	struct migrate_vma migrate_vma = {
> > +		.vma		= vmf->vma,
> > +		.start		= vma->vm_start,
> > +		.end		= vma->vm_end,
> 
> So I know in my PoC I had the fault user pointer (xe_vma) == struct
> vm_area_struct of the GPU fault. That is definitely wrong. 

Can you explain why this is definitely wrong? I still think creating an xe_vma/fault userptr covering the whole struct vm_area_struct range is not a bad idea. Our mmu notifier is xe_vma based, so basically for each vm_area_struct/xe_vma we have one mmu interval notifier.

The migration and gpu page table update can cover any sub-range of the xe_vma, as long as the sub-range boundaries are page aligned.

> We likely
> want to allocate sub-range of vm_area_struct for the xe_vma, we can call
> this a chunk size. Logical chunks sizes would be aligned 2MB, 64k, and
> finally 4k in sizes trying the largest first... The chunk sizes are
> trivial as we likely can just have table with values, the key here is
> the vm_area_struct vm_start / vm_end are not what we want to use here
> rather xe_vma_start(vma) and xe_vma_end(vma). 

This is the cpu page fault handler, and I think it is fine to use vm_area_struct.vm_start/vm_end here. In my opinion, referring to a gpu structure's xe_vma_start/end in the cpu page fault handler is a little strange.

As said, I have moved to a one-page scheme in the cpu page fault handling. But even if we migrate the whole vma, I don't see a problem with the above code. Can you explain why you want to use the xe_vma here? My guess is that, in your picture, you want one xe_vma to be backed either by system memory or by gpu device memory, but not a mixture. I am moving away from that scheme and adopting a page-centric design where an xe_vma can be backed by a mixture of sram and vram.


> I think we get the
> xe_vma
> from the faulting page vmf->page->zone_device_data field unless you have
> another use that field...

I know this scheme works for you, but it is a design choice of how to use zone_device_data. In my v3, it is still used to hold the buddy block. I am not saying we can't do what you did, just that we can do it differently.

Struct page is all about physical memory management. Coupling it with a virtual memory structure such as the xe_vma is not a good sign to me.

> 
> I also comment on my patch with my suggestion implement chunk sizes too.

I did look at your patch, and basically you are creating an svm_range at chunk size...

In my v3, I am staying with the same concept as v2 (create one xe_vma per cpu vma), but I have changed the migration/gpu pt mapping to be roughly chunk-size based.


> 
> > +		.pgmap_owner	= xe,
> 
> Again helper for this.


Addressed in v3
> 
> > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > +		.fault_page = vmf->page,
> > +	};
> > +
> > +	buf = kvcalloc(npages, 2* sizeof(*migrate_vma.src), GFP_KERNEL);
> > +	migrate_vma.src = buf;
> > +	migrate_vma.dst = buf + npages;
> > +	if (migrate_vma_setup(&migrate_vma) < 0) {
> > +		ret = VM_FAULT_SIGBUS;
> > +		goto free_buf;
> > +	}
> > +
> > +	if (!migrate_vma.cpages)
> 
> This is an error, need to set a return value.

Agreed

> 
> > +		goto free_buf;
> > +
> 
> We probably should check migrate.cpages != npages too as I also think
> this is an error.

In the scheme of "no mixed placement within a vma", yes. As said, I am moving away from that scheme; let's see how it goes.

> 
> > +	for (i = 0; i < npages; i++) {
> > +		ret = migrate_page_vram_to_ram(vma, addr, migrate_vma.src[i],
> > +							migrate_vma.dst + i);
> > +		if (ret < 0) {
> > +			ret = VM_FAULT_SIGBUS;
> > +			break;
> > +		}
> > +
> > +		/** Migration has been successful, free source page */
> > +		if (ret == 1) {
> > +			struct page *src_page = migrate_pfn_to_page(migrate_vma.src[i]);
> > +
> > +			xe_devm_page_free(src_page);
> > +		}
> > +
> > +		addr += PAGE_SIZE;
> > +	}
> 
> I touch on this above, this should be reworked to roughly:
> 
> - alloc pages and populate migrate_vma.dst
> - dma map sram pages
> - migrate a chunk of contigous vram addresses at a time
> - wait on last dma fence from migrate
> - unmap dma pages
> - unlock and free all pages

In v3, I roughly have the above process, but a lot of things have changed; let's review it once it is ready. A rough sketch of the flow is below.
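
To give an idea, the overall flow looks roughly like this (a sketch only; helper names are placeholders, a couple reused from the sketches earlier in this mail, and the unwind of partially allocated pages is omitted for brevity):

static vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
{
	struct xe_mem_region *mr = xe_page_to_mem_region(vmf->page);
	struct xe_tile *tile = xe_mem_region_to_tile(mr);
	struct device *dev = tile_to_xe(tile)->drm.dev;
	unsigned long npages = 1; /* one page per CPU fault in v3 */
	struct dma_fence *fence = NULL;
	dma_addr_t *dma_addrs;
	unsigned long *buf;
	int err = 0;
	struct migrate_vma migrate = {
		.vma		= vmf->vma,
		.start		= vmf->address,
		.end		= vmf->address + PAGE_SIZE,
		.pgmap_owner	= tile_to_xe(tile),
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
		.fault_page	= vmf->page,
	};

	buf = kvcalloc(npages, 2 * sizeof(*migrate.src), GFP_KERNEL);
	dma_addrs = kvcalloc(npages, sizeof(*dma_addrs), GFP_KERNEL);
	if (!buf || !dma_addrs) {
		err = -ENOMEM;
		goto out_free;
	}
	migrate.src = buf;
	migrate.dst = buf + npages;

	if (migrate_vma_setup(&migrate) < 0 || !migrate.cpages) {
		err = -EFAULT;
		goto out_free;
	}

	/* 1) allocate + lock sram pages, 2) dma-map them */
	err = svm_alloc_sram_pages(vmf->vma, migrate.start, npages, migrate.dst);
	if (!err)
		err = svm_dma_map_sram_pages(dev, migrate.dst, dma_addrs, npages);

	/* 3) copy contiguous vram runs, 4) wait only on the last fence */
	if (!err)
		fence = svm_copy_vram_to_sram(tile, &migrate, dma_addrs, npages);
	if (IS_ERR(fence)) {
		err = PTR_ERR(fence);
	} else if (fence) {
		dma_fence_wait(fence, false);
		dma_fence_put(fence);
	}

	/* 5) unmap, commit on success / revert on error, unlock + free pages */
	svm_dma_unmap_sram_pages(dev, dma_addrs, npages);
	if (!err)
		migrate_vma_pages(&migrate);
	migrate_vma_finalize(&migrate);

out_free:
	kvfree(buf);
	kvfree(dma_addrs);
	return err ? VM_FAULT_SIGBUS : 0;
}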


> 
> > +
> > +	xe_svm_for_each_vm(svm, vm) {
> > +		xe_assert(xe, vm->mm == mm);
> > +		xe_vma = xe_vm_lookup_vma(vm, vmf->address);
> > +		if (xe_vma)
> > +			xe_vm_invalidate_vma(xe_vma);
> > +	}
> 
> I've touched on why this isn't needed... I think one of these
> migrate_vma_* functions will trigger all MMU notifiers registered for
> the range. The notifier owns the invalidate then.
> 
> Beyond this, maybe I'm confused about a few things and how this fits all
> together. Doesn't every user process have its own unique mm, fd, and vm
> (e.g. own address space)? If a user want a shared address space then use
> threads with a single mm, fd, and vm.
> 
> So even if we had to resolve the xe_vma's here and do an invalidate here
> very confused what this is doing. This is this the case with multiple
> devices and each VM points to a different device? Again so that case I
> don't think a xe_svm structure would be needed, on GPU fault we should
> be to detect from the faulting page zone_device_data and pgmap owner
> if the fault already has a physical backing on another GPU and resolve
> how to map it into GPU with a fault... Jason suggests this in the
> following thread [2] and I think I agree with him.
> 
> [2] https://lore.kernel.org/all/5495090e-dee1-4c8e-91bc-240632fd3e35 at amd.com/T/

Thanks, I agree with the above analysis. Yes, the invalidation is already triggered during migrate_vma_setup, and the above code is deleted in v3.

> 
> > +	migrate_vma_pages(&migrate_vma);
> 
> This logic is going to change but ...
> 
> On an error I think we only want to call migrate_vma_finalize to revert
> pages back to the original state (i.e. migrate_vma_pages commits the
> page changes which we don't want to do on an error).

Agreed. I found this logic in your series. I will fix it in v3. 

> 
> > +	migrate_vma_finalize(&migrate_vma);
> > +free_buf:
> > +	kvfree(buf);
> > +	return 0;
> 
> I don't think 0 should blindly be return here, if there is an error
> return VM_FAULT_SIGBUS. We likely want a high level error message too.

That is correct. Fixed in v3.

Regards,
Oak
> 
> Matt
> 
> > +}
> > --
> > 2.26.3
> >

