[v2 27/31] drm/xe/svm: Handle CPU page fault

Matthew Brost matthew.brost at intel.com
Thu Apr 11 02:07:27 UTC 2024


On Tue, Apr 09, 2024 at 04:17:38PM -0400, Oak Zeng wrote:
> Under the picture of svm, CPU and GPU program share one same
> virtual address space. The backing store of this virtual address
> space can be either in system memory or device memory. Since GPU
> device memory is remaped as DEVICE_PRIVATE, CPU can't access it.
> Any CPU access to device memory causes a page fault. Implement
> a page fault handler to migrate memory back to system memory and
> map it to CPU page table so the CPU program can proceed.
> 
> Also unbind this page from GPU side, and free the original GPU
> device page
> 
> Signed-off-by: Oak Zeng <oak.zeng at intel.com>
> Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura at intel.com>
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura at intel.com>
> Cc: Matthew Brost <matthew.brost at intel.com>
> Cc: Thomas Hellström <thomas.hellstrom at intel.com>
> Cc: Brian Welty <brian.welty at intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile         |   1 +
>  drivers/gpu/drm/xe/xe_svm.h         |   8 +-
>  drivers/gpu/drm/xe/xe_svm_devmem.c  |   7 +-
>  drivers/gpu/drm/xe/xe_svm_migrate.c | 222 ++++++++++++++++++++++++++++
>  4 files changed, 230 insertions(+), 8 deletions(-)
>  create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index f89d77b6d654..65289acdd563 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -131,6 +131,7 @@ xe-y += xe_bb.o \
>  	xe_step.o \
>  	xe_svm.o \
>  	xe_svm_devmem.o \
> +	xe_svm_migrate.o \

See my earlier comments about file organization; the same thing applies
here. Let's put all of the SVM implementation in a single file.

>  	xe_sync.o \
>  	xe_tile.o \
>  	xe_tile_sysfs.o \
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index f601dffe3fc1..c9e4239c44b4 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -7,11 +7,11 @@
>  #define __XE_SVM_H
>  
>  #include <linux/mm_types.h>
> +#include <linux/mm.h>
>  #include "xe_device_types.h"
>  #include "xe_device.h"
>  #include "xe_assert.h"
> -
> -struct xe_vm;
> +#include "xe_vm_types.h"
>  
>  /**
>   * struct xe_svm - data structure to represent a shared
> @@ -31,6 +31,9 @@ struct xe_svm {
>  	struct list_head vm_list;
>  };
>  
> +#define xe_svm_for_each_vm(svm, vm)					\
> +		list_for_each_entry(vm, &svm->vm_list, svm_link)
> +

I don't think this is needed, see below.

>  extern struct xe_svm *xe_create_svm(void);
>  void xe_destroy_svm(struct xe_svm *svm);
>  extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
> @@ -79,4 +82,5 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
>  
>  void xe_devm_free_blocks(struct list_head *blocks);
>  void xe_devm_page_free(struct page *page);
> +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c b/drivers/gpu/drm/xe/xe_svm_devmem.c
> index 088ac209ad80..32ada458f1dd 100644
> --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> @@ -37,11 +37,6 @@ struct xe_svm_block_meta {
>  	unsigned long bitmap[];
>  };
>  
> -static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> -{
> -	return 0;
> -}
> -
>  static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
>  {
>  	/** DRM buddy's block offset is 0-based*/
> @@ -168,7 +163,7 @@ void xe_devm_free_blocks(struct list_head *blocks)
>  
>  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
>  	.page_free = xe_devm_page_free,
> -	.migrate_to_ram = xe_devm_migrate_to_ram,
> +	.migrate_to_ram = xe_svm_migrate_to_sram,

Again, with a single file this becomes a static function, so there is no
reason to export it.

>  };
>  
>  /**
> diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c b/drivers/gpu/drm/xe/xe_svm_migrate.c
> new file mode 100644
> index 000000000000..0db831af098e
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> @@ -0,0 +1,222 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#include <linux/gfp.h>
> +#include <linux/migrate.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/dma-fence.h>
> +#include <linux/bitops.h>
> +#include <linux/bitmap.h>
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <drm/drm_buddy.h>
> +#include "xe_device_types.h"
> +#include "xe_device.h"
> +#include "xe_trace.h"
> +#include "xe_migrate.h"
> +#include "xe_ttm_vram_mgr_types.h"
> +#include "xe_assert.h"
> +#include "xe_pt.h"
> +#include "xe_svm.h"
> +#include "xe_vm.h"
> +
> +
> +/**
> + * alloc_host_page() - allocate one host page for the fault vma
> + *
> + * @dev: (GPU) device that will access the allocated page
> + * @vma: the fault vma that we need allocate page for
> + * @addr: the fault address. The allocated page is for this address
> + * @dma_addr: used to output the dma address of the allocated page.
> + * This dma address will be used for gpu to access this page. GPU
> + * access host page through a dma mapped address.
> + * @pfn: used to output the pfn of the allocated page.
> + *
> + * This function allocate one host page for the specified vma. It
> + * also does some prepare work for GPU to access this page, such
> + * as map this page to iommu (by calling dma_map_page).
> + *
> + * When this function returns, the page is locked.
> + *
> + * Return struct page pointer when success
> + * NULL otherwise
> + */
> +static struct page *alloc_host_page(struct device *dev,
> +							 struct vm_area_struct *vma,
> +							 unsigned long addr,
> +							 dma_addr_t *dma_addr,
> +							 unsigned long *pfn)

Weird alignment here, and I don't think we want to allocate one page at
a time...

Beyond that, I can't say I'm a fan of two arguments being returned /
populated here either (dma_addr_t *dma_addr, unsigned long *pfn). I
haven't seen many functions in that style in Linux.

It probably makes more sense to have a function which allocates all the
pages, locks them, and populates the pfn array (via migrate_pfn())
rather than doing this one page at a time.
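
Something along these lines is what I have in mind; a rough sketch only,
with a made-up helper name and simplified error handling:

static int xe_svm_alloc_host_pages(struct vm_area_struct *vma,
				   unsigned long start,
				   unsigned long npages,
				   unsigned long *dst_pfns)
{
	unsigned long addr = start;
	unsigned long i;

	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
		/* Allocate a host page for this fault address */
		struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, addr);

		if (!page)
			goto err_free;

		/* Lock page per hmm requirement, see hmm.rst */
		lock_page(page);
		dst_pfns[i] = migrate_pfn(page_to_pfn(page));
	}

	return 0;

err_free:
	while (i--) {
		struct page *page = migrate_pfn_to_page(dst_pfns[i]);

		unlock_page(page);
		__free_page(page);
	}

	return -ENOMEM;
}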

> +{
> +	struct page *page;
> +
> +	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
> +	if (unlikely(!page))
> +		return NULL;
> +
> +	/**Lock page per hmm requirement, see hmm.rst*/
> +	lock_page(page);
> +	*dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);

The device is writing to these pages, so I think DMA_BIDIRECTIONAL is
needed, right? As mentioned above, I think this should be broken out
into a separate step too.
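
The dma mapping could then be its own pass over the freshly allocated
pages. Again just a sketch; the helper name is made up and
DMA_BIDIRECTIONAL is my assumption:

static int xe_svm_dma_map_pages(struct device *dev,
				unsigned long *dst_pfns,
				dma_addr_t *dma_addrs,
				unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; ++i) {
		struct page *page = migrate_pfn_to_page(dst_pfns[i]);

		dma_addrs[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
					    DMA_BIDIRECTIONAL);
		if (dma_mapping_error(dev, dma_addrs[i]))
			return -EFAULT;	/* caller unwinds partial maps */
	}

	return 0;
}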

> +	if (unlikely(dma_mapping_error(dev, *dma_addr))) {
> +		unlock_page(page);
> +		__free_page(page);
> +		return NULL;
> +	}
> +
> +	*pfn = migrate_pfn(page_to_pfn(page));
> +	return page;
> +}
> +
> +static void free_host_page(struct page *page)
> +{
> +	unlock_page(page);
> +	put_page(page);
> +}
> +
> +/**
> + * migrate_page_vram_to_ram() - migrate one page from vram to ram
> + *
> + * @vma: The vma that the page is mapped to
> + * @addr: The virtual address that the page is mapped to
> + * @src_pfn: src page's page frame number
> + * @dst_pfn: used to return dstination page (in system ram)'s pfn
> + *
> + * Allocate one page in system ram and copy memory from device memory
> + * to system ram.
> + *
> + * Return: 0 if this page is already in sram (no need to migrate)
> + * 1: successfully migrated this page from vram to sram.
> + * error code otherwise
> + */
> +static int migrate_page_vram_to_ram(struct vm_area_struct *vma, unsigned long addr,
> +						unsigned long src_pfn, unsigned long *dst_pfn)
> +{

We definitely don't want to copy one page at a time; I touch on this in
[1]. Basically this is going to perform poorly unless we use larger
copies: the migrate code supports non-contiguous SRAM addresses, and the
VRAM addresses will likely be contiguous due to the buddy allocator.
 
[1] https://patchwork.freedesktop.org/patch/588548/?series=132229&rev=1
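
To illustrate what I mean, roughly; copy_run_to_sram() below is a
hypothetical stand-in for the interface in [1], and error unwinding is
elided:

static struct dma_fence *
xe_svm_copy_vram_to_sram(struct xe_tile *tile, unsigned long *src_pfns,
			 dma_addr_t *dma_addrs, unsigned long npages)
{
	struct dma_fence *fence = NULL;
	unsigned long i, run = 0;

	for (i = 1; i <= npages; ++i) {
		/* Keep extending the run while the vram pages are contiguous */
		if (i < npages &&
		    migrate_pfn_to_page(src_pfns[i]) ==
		    migrate_pfn_to_page(src_pfns[i - 1]) + 1)
			continue;

		/* Exec queues are ordered, only the last fence matters */
		dma_fence_put(fence);
		fence = copy_run_to_sram(tile, src_pfns + run,
					 dma_addrs + run, i - run);
		if (IS_ERR(fence))
			return fence;

		run = i;
	}

	return fence;
}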

> +	struct xe_mem_region *mr;
> +	struct xe_tile *tile;
> +	struct xe_device *xe;
> +	struct device *dev;
> +	dma_addr_t dma_addr = 0;
> +	struct dma_fence *fence;
> +	struct page *host_page;
> +	struct page *src_page;
> +	u64 src_dpa;
> +
> +	src_page = migrate_pfn_to_page(src_pfn);
> +	if (unlikely(!src_page || !(src_pfn & MIGRATE_PFN_MIGRATE)))

I'm going to say it is a bug if !src_page ||
!is_zone_device_page(src_page) || !(src_pfn & MIGRATE_PFN_MIGRATE), and
we should return -EFAULT (or another error code if that makes more
sense). We are migrating from the device, where we know we have backing
store from the original fault.
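
i.e. something like:

	if (!src_page || !is_zone_device_page(src_page) ||
	    !(src_pfn & MIGRATE_PFN_MIGRATE))
		return -EFAULT;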

> +		return 0;
> +
> +	mr = xe_page_to_mem_region(src_page);
> +	tile = xe_mem_region_to_tile(mr);
> +	xe = tile_to_xe(tile);
> +	dev = xe->drm.dev;
> +
> +	src_dpa = xe_mem_region_pfn_to_dpa(mr, src_pfn);
> +	host_page = alloc_host_page(dev, vma, addr, &dma_addr, dst_pfn);
> +	if (!host_page)
> +		return -ENOMEM;
> +
> +	fence = xe_migrate_pa(tile->migrate, src_dpa, true,
> +						dma_addr, false, PAGE_SIZE);
> +	if (IS_ERR(fence)) {
> +		dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
> +		free_host_page(host_page);
> +		return PTR_ERR(fence);
> +	}
> +
> +	dma_fence_wait(fence, false);

Even if we did want to migrate a page at a time, we only need to wait on
the last fence due to the ordered nature of exec queues.

> +	dma_fence_put(fence);
> +	dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);

With the above, we would likely unmap all of the DMA pages in a single
function once the last fence is signaled.
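
Something like this (name made up), called after the wait on the last
fence:

static void xe_svm_dma_unmap_pages(struct device *dev,
				   dma_addr_t *dma_addrs,
				   unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; ++i)
		dma_unmap_page(dev, dma_addrs[i], PAGE_SIZE,
			       DMA_BIDIRECTIONAL);
}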

> +	return 1;
> +}
> +
> +/**
> + * xe_svm_migrate_to_sram() - Migrate memory back to sram on CPU page fault
> + *
> + * @vmf: cpu vm fault structure, contains fault information such as vma etc.
> + *
> + * Note, this is in CPU's vm fault handler, caller holds the mmap read lock.
> + *
> + * This function migrate one gpu vma which contains the fault address to sram.
> + * We try to maintain a 1:1 mapping b/t the CPU vma and gpu vma (i.e., create one
> + * gpu vma for one cpu vma initially and try not to split it). So this scheme end
> + * up migrate at the vma granularity. This might not be the best performant scheme
> + *
> + * This can be tunned with a migration granularity for  performance, for example,
> + * migration 2M for each CPU page fault, or let user specify how much to migrate.
> + * This is more complex due to vma splitting.
> + *
> + * This function should also update GPU page table, so the fault virtual address
> + * points to the same sram location from GPU side. This is TBD.
> + *
> + * Return:
> + * 0 on success
> + * VM_FAULT_SIGBUS: failed to migrate page to system memory, application
> + * will be signaled a SIGBUG
> + */
> +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
> +{
> +	struct xe_mem_region *mr = xe_page_to_mem_region(vmf->page);
> +	struct xe_tile *tile = xe_mem_region_to_tile(mr);
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct xe_svm *svm = xe_lookup_svm_by_mm(mm);

I don't think this is needed... More below.

> +	unsigned long addr = vma->vm_start;
> +	u64 npages = vma_pages(vma);
> +	struct xe_vma *xe_vma;
> +	vm_fault_t ret = 0;
> +	struct xe_vm *vm;
> +	void *buf;
> +	int i;
> +
> +	struct migrate_vma migrate_vma = {
> +		.vma		= vmf->vma,
> +		.start		= vma->vm_start,
> +		.end		= vma->vm_end,

So I know in my PoC I had the faulting user pointer (xe_vma) == the
struct vm_area_struct of the GPU fault. That is definitely wrong. We
likely want to allocate a sub-range of the vm_area_struct for the
xe_vma; we can call this a chunk size. Logical chunk sizes would be
aligned 2M, 64K, and finally 4K, trying the largest first... The chunk
sizes themselves are trivial as we can likely just have a table of
values; the key here is that the vm_area_struct's vm_start / vm_end are
not what we want to use here but rather xe_vma_start(vma) and
xe_vma_end(vma). I think we get the xe_vma from the faulting page's
vmf->page->zone_device_data field, unless you have another use for that
field...

I also commented on my patch with my suggestion to implement chunk
sizes too.
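
To sketch the chunk selection; the helper name and table are just an
example of what I mean:

static const unsigned long xe_svm_chunk_sizes[] = { SZ_2M, SZ_64K, SZ_4K };

static void xe_svm_fault_chunk(struct xe_vma *vma, unsigned long fault_addr,
			       unsigned long *start, unsigned long *end)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(xe_svm_chunk_sizes); ++i) {
		unsigned long s = ALIGN_DOWN(fault_addr, xe_svm_chunk_sizes[i]);
		unsigned long e = s + xe_svm_chunk_sizes[i];

		/* Take the largest aligned chunk that fits in the xe_vma */
		if (s >= xe_vma_start(vma) && e <= xe_vma_end(vma)) {
			*start = s;
			*end = e;
			return;
		}
	}

	/* Fall back to a single page around the fault */
	*start = ALIGN_DOWN(fault_addr, SZ_4K);
	*end = *start + SZ_4K;
}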

> +		.pgmap_owner	= xe,

Again, use a helper for this.

> +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> +		.fault_page = vmf->page,
> +	};
> +
> +	buf = kvcalloc(npages, 2* sizeof(*migrate_vma.src), GFP_KERNEL);
> +	migrate_vma.src = buf;
> +	migrate_vma.dst = buf + npages;
> +	if (migrate_vma_setup(&migrate_vma) < 0) {
> +		ret = VM_FAULT_SIGBUS;
> +		goto free_buf;
> +	}
> +
> +	if (!migrate_vma.cpages)

This is an error; we need to set a return value here.

> +		goto free_buf;
> +

We should probably check migrate_vma.cpages != npages too, as I think
that is also an error.
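
i.e. roughly:

	if (migrate_vma.cpages != npages) {
		ret = VM_FAULT_SIGBUS;
		goto free_buf;
	}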

> +	for (i = 0; i < npages; i++) {
> +		ret = migrate_page_vram_to_ram(vma, addr, migrate_vma.src[i],
> +							migrate_vma.dst + i);
> +		if (ret < 0) {
> +			ret = VM_FAULT_SIGBUS;
> +			break;
> +		}
> +
> +		/** Migration has been successful, free source page */
> +		if (ret == 1) {
> +			struct page *src_page = migrate_pfn_to_page(migrate_vma.src[i]);
> +
> +			xe_devm_page_free(src_page);
> +		}
> +
> +		addr += PAGE_SIZE;
> +	}

I touched on this above; this should be reworked roughly as follows
(see the sketch after the list):

- alloc pages and populate migrate_vma.dst
- dma map sram pages
- migrate a chunk of contigous vram addresses at a time
- wait on last dma fence from migrate
- unmap dma pages
- unlock and free all pages
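
Roughly, inside xe_svm_migrate_to_sram() after migrate_vma_setup(),
something like this fragment; helper names are made up and the locals /
partial unwinding are elided:

	/* 1. alloc + lock sram pages, populate migrate_vma.dst */
	err = xe_svm_alloc_host_pages(vma, migrate_vma.start, npages,
				      migrate_vma.dst);
	if (err)
		goto err_finalize;

	/* 2. dma map the sram pages */
	err = xe_svm_dma_map_pages(dev, migrate_vma.dst, dma_addrs, npages);
	if (err)
		goto err_finalize;

	/* 3. copy contiguous vram runs, keeping only the last fence */
	fence = xe_svm_copy_vram_to_sram(tile, migrate_vma.src, dma_addrs,
					 npages);
	if (IS_ERR(fence)) {
		err = PTR_ERR(fence);
		goto err_finalize;
	}

	/* 4. wait on the last fence, then unmap all the dma pages */
	dma_fence_wait(fence, false);
	dma_fence_put(fence);
	xe_svm_dma_unmap_pages(dev, dma_addrs, npages);

	/* 5. commit the migration on success only */
	migrate_vma_pages(&migrate_vma);

err_finalize:
	/* on error migrate_vma_pages() is skipped so the pages revert */
	migrate_vma_finalize(&migrate_vma);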

> +
> +	xe_svm_for_each_vm(svm, vm) {
> +		xe_assert(xe, vm->mm == mm);
> +		xe_vma = xe_vm_lookup_vma(vm, vmf->address);
> +		if (xe_vma)
> +			xe_vm_invalidate_vma(xe_vma);
> +	}

I've touched on why this isn't needed... I think one of the
migrate_vma_* functions will trigger all MMU notifiers registered for
the range. The notifier then owns the invalidation.

Beyond this, maybe I'm confused about a few things and how this all
fits together. Doesn't every user process have its own unique mm, fd,
and vm (i.e. its own address space)? If a user wants a shared address
space, they use threads with a single mm, fd, and vm.

So even if we had to resolve the xe_vmas here and do an invalidate, I'm
very confused about what this is doing. Is this the case with multiple
devices where each VM points to a different device? Again, in that case
I don't think an xe_svm structure would be needed; on a GPU fault we
should be able to detect from the faulting page's zone_device_data and
pgmap owner whether the fault already has physical backing on another
GPU, and resolve how to map it into this GPU from the fault... Jason
suggests this in the following thread [2] and I think I agree with him.

[2] https://lore.kernel.org/all/5495090e-dee1-4c8e-91bc-240632fd3e35@amd.com/T/

> +	migrate_vma_pages(&migrate_vma);

This logic is going to change, but...

On an error I think we only want to call migrate_vma_finalize() to
revert the pages back to their original state (i.e. migrate_vma_pages()
commits the page changes, which we don't want to do on an error).

> +	migrate_vma_finalize(&migrate_vma);
> +free_buf:
> +	kvfree(buf);
> +	return 0;

I don't think 0 should blindly be returned here; if there was an error,
return VM_FAULT_SIGBUS. We likely want a high-level error message too.
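
e.g. the tail could look something like this, with ret carrying the
fault status as in the current code:

free_buf:
	kvfree(buf);
	if (ret)
		drm_err(&xe->drm, "CPU page fault migration failed\n");

	return ret;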

Matt

> +}
> -- 
> 2.26.3
> 

