[v2 27/31] drm/xe/svm: Handle CPU page fault

Zeng, Oak oak.zeng at intel.com
Fri Apr 12 18:39:20 UTC 2024



> -----Original Message-----
> From: Brost, Matthew <matthew.brost at intel.com>
> Sent: Friday, April 12, 2024 2:11 PM
> To: Zeng, Oak <oak.zeng at intel.com>
> Cc: intel-xe at lists.freedesktop.org; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray at intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu at intel.com>; Thomas.Hellstrom at linux.intel.com; Welty,
> Brian <brian.welty at intel.com>
> Subject: Re: [v2 27/31] drm/xe/svm: Handle CPU page fault
> 
> On Fri, Apr 12, 2024 at 11:24:06AM -0600, Zeng, Oak wrote:
> >
> >
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost at intel.com>
> > > Sent: Wednesday, April 10, 2024 10:07 PM
> > > To: Zeng, Oak <oak.zeng at intel.com>
> > > Cc: intel-xe at lists.freedesktop.org; Ghimiray, Himal Prasad
> > > <himal.prasad.ghimiray at intel.com>; Bommu, Krishnaiah
> > > <krishnaiah.bommu at intel.com>; Thomas.Hellstrom at linux.intel.com; Welty,
> > > Brian <brian.welty at intel.com>
> > > Subject: Re: [v2 27/31] drm/xe/svm: Handle CPU page fault
> > >
> > > On Tue, Apr 09, 2024 at 04:17:38PM -0400, Oak Zeng wrote:
> > > > Under SVM, the CPU and GPU programs share one virtual address
> > > > space. The backing store of this virtual address space can be
> > > > either in system memory or device memory. Since GPU device memory
> > > > is remapped as DEVICE_PRIVATE, the CPU can't access it. Any CPU
> > > > access to device memory causes a page fault. Implement a page
> > > > fault handler to migrate memory back to system memory and map it
> > > > to the CPU page table so the CPU program can proceed.
> > > >
> > > > Also unbind this page from the GPU side, and free the original GPU
> > > > device page.
> > > >
> > > > Signed-off-by: Oak Zeng <oak.zeng at intel.com>
> > > > Co-developed-by: Niranjana Vishwanathapura
> > > <niranjana.vishwanathapura at intel.com>
> > > > Signed-off-by: Niranjana Vishwanathapura
> > > <niranjana.vishwanathapura at intel.com>
> > > > Cc: Matthew Brost <matthew.brost at intel.com>
> > > > Cc: Thomas Hellström <thomas.hellstrom at intel.com>
> > > > Cc: Brian Welty <brian.welty at intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/Makefile         |   1 +
> > > >  drivers/gpu/drm/xe/xe_svm.h         |   8 +-
> > > >  drivers/gpu/drm/xe/xe_svm_devmem.c  |   7 +-
> > > >  drivers/gpu/drm/xe/xe_svm_migrate.c | 222
> > > ++++++++++++++++++++++++++++
> > > >  4 files changed, 230 insertions(+), 8 deletions(-)
> > > >  create mode 100644 drivers/gpu/drm/xe/xe_svm_migrate.c
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > > > index f89d77b6d654..65289acdd563 100644
> > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > @@ -131,6 +131,7 @@ xe-y += xe_bb.o \
> > > >  	xe_step.o \
> > > >  	xe_svm.o \
> > > >  	xe_svm_devmem.o \
> > > > +	xe_svm_migrate.o \
> > >
> > > See comments about file org, same thing applies here. Let's put all of
> > > the svm implementation in a single file.
> > >
> > > >  	xe_sync.o \
> > > >  	xe_tile.o \
> > > >  	xe_tile_sysfs.o \
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > > > index f601dffe3fc1..c9e4239c44b4 100644
> > > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > @@ -7,11 +7,11 @@
> > > >  #define __XE_SVM_H
> > > >
> > > >  #include <linux/mm_types.h>
> > > > +#include <linux/mm.h>
> > > >  #include "xe_device_types.h"
> > > >  #include "xe_device.h"
> > > >  #include "xe_assert.h"
> > > > -
> > > > -struct xe_vm;
> > > > +#include "xe_vm_types.h"
> > > >
> > > >  /**
> > > >   * struct xe_svm - data structure to represent a shared
> > > > @@ -31,6 +31,9 @@ struct xe_svm {
> > > >  	struct list_head vm_list;
> > > >  };
> > > >
> > > > +#define xe_svm_for_each_vm(svm, vm)
> > > 	\
> > > > +		list_for_each_entry(vm, &svm->vm_list, svm_link)
> > > > +
> > >
> > > Don't think this is needed, see below.
> > >
> > > >  extern struct xe_svm *xe_create_svm(void);
> > > >  void xe_destroy_svm(struct xe_svm *svm);
> > > >  extern struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
> > > > @@ -79,4 +82,5 @@ int xe_devm_alloc_pages(struct xe_tile *tile,
> > > >
> > > >  void xe_devm_free_blocks(struct list_head *blocks);
> > > >  void xe_devm_page_free(struct page *page);
> > > > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf);
> > > >  #endif
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > index 088ac209ad80..32ada458f1dd 100644
> > > > --- a/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > +++ b/drivers/gpu/drm/xe/xe_svm_devmem.c
> > > > @@ -37,11 +37,6 @@ struct xe_svm_block_meta {
> > > >  	unsigned long bitmap[];
> > > >  };
> > > >
> > > > -static vm_fault_t xe_devm_migrate_to_ram(struct vm_fault *vmf)
> > > > -{
> > > > -	return 0;
> > > > -}
> > > > -
> > > >  static u64 block_offset_to_pfn(struct xe_mem_region *mr, u64 offset)
> > > >  {
> > > >  	/** DRM buddy's block offset is 0-based*/
> > > > @@ -168,7 +163,7 @@ void xe_devm_free_blocks(struct list_head
> *blocks)
> > > >
> > > >  static const struct dev_pagemap_ops xe_devm_pagemap_ops = {
> > > >  	.page_free = xe_devm_page_free,
> > > > -	.migrate_to_ram = xe_devm_migrate_to_ram,
> > > > +	.migrate_to_ram = xe_svm_migrate_to_sram,
> > >
> > > Again, single file, so this will be a static function; no reason to
> > > export this.
> > >
> > > >  };
> > > >
> > > >  /**
> > > > diff --git a/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > new file mode 100644
> > > > index 000000000000..0db831af098e
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_svm_migrate.c
> > > > @@ -0,0 +1,222 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2023 Intel Corporation
> > > > + */
> > > > +
> > > > +#include <linux/gfp.h>
> > > > +#include <linux/migrate.h>
> > > > +#include <linux/dma-mapping.h>
> > > > +#include <linux/dma-fence.h>
> > > > +#include <linux/bitops.h>
> > > > +#include <linux/bitmap.h>
> > > > +#include <linux/kernel.h>
> > > > +#include <linux/slab.h>
> > > > +#include <drm/drm_buddy.h>
> > > > +#include "xe_device_types.h"
> > > > +#include "xe_device.h"
> > > > +#include "xe_trace.h"
> > > > +#include "xe_migrate.h"
> > > > +#include "xe_ttm_vram_mgr_types.h"
> > > > +#include "xe_assert.h"
> > > > +#include "xe_pt.h"
> > > > +#include "xe_svm.h"
> > > > +#include "xe_vm.h"
> > > > +
> > > > +
> > > > +/**
> > > > + * alloc_host_page() - allocate one host page for the fault vma
> > > > + *
> > > > + * @dev: (GPU) device that will access the allocated page
> > > > + * @vma: the fault vma that we need allocate page for
> > > > + * @addr: the fault address. The allocated page is for this address
> > > > + * @dma_addr: used to output the dma address of the allocated page.
> > > > + * This dma address will be used for gpu to access this page. GPU
> > > > + * access host page through a dma mapped address.
> > > > + * @pfn: used to output the pfn of the allocated page.
> > > > + *
> > > > + * This function allocate one host page for the specified vma. It
> > > > + * also does some prepare work for GPU to access this page, such
> > > > + * as map this page to iommu (by calling dma_map_page).
> > > > + *
> > > > + * When this function returns, the page is locked.
> > > > + *
> > > > + * Return struct page pointer when success
> > > > + * NULL otherwise
> > > > + */
> > > > +static struct page *alloc_host_page(struct device *dev,
> > > > +							 struct vm_area_struct
> > > *vma,
> > > > +							 unsigned long addr,
> > > > +							 dma_addr_t
> > > *dma_addr,
> > > > +							 unsigned long *pfn)
> > >
> > > Weird alignment, also I don't think we want to allocate a page at a
> > > time...
> > >
> > > Beyond that, can't say I'm a fan of 2 arguments being returned and
> > > populated here either (dma_addr_t *dma_addr, unsigned long *pfn). I
> > > haven't seen a lot of functions in that style in Linux.
> > >
> > > Probably makes more sense to have a function which allocates pages,
> > > locks them, and populates the pfn array (migrate_pfn) rather than doing
> > > this a page at a time.
> > >
> > > > +{
> > > > +	struct page *page;
> > > > +
> > > > +	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
> > > > +	if (unlikely(!page))
> > > > +		return NULL;
> > > > +
> > > > +	/**Lock page per hmm requirement, see hmm.rst*/
> > > > +	lock_page(page);
> > > > +	*dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE,
> > > DMA_FROM_DEVICE);
> > >
> > > The device is writing to these pages so I think DMA_BIDIRECTIONAL is
> > > needed, right? As mentioned above I think this should be broken out into
> > > a different step too.
> > >
> > > > +	if (unlikely(dma_mapping_error(dev, *dma_addr))) {
> > > > +		unlock_page(page);
> > > > +		__free_page(page);
> > > > +		return NULL;
> > > > +	}
> > > > +
> > > > +	*pfn = migrate_pfn(page_to_pfn(page));
> > > > +	return page;
> > > > +}
> > > > +
> > > > +static void free_host_page(struct page *page)
> > > > +{
> > > > +	unlock_page(page);
> > > > +	put_page(page);
> > > > +}
> > > > +
> > > > +/**
> > > > + * migrate_page_vram_to_ram() - migrate one page from vram to ram
> > > > + *
> > > > + * @vma: The vma that the page is mapped to
> > > > + * @addr: The virtual address that the page is mapped to
> > > > + * @src_pfn: src page's page frame number
> > > > + * @dst_pfn: used to return dstination page (in system ram)'s pfn
> > > > + *
> > > > + * Allocate one page in system ram and copy memory from device
> memory
> > > > + * to system ram.
> > > > + *
> > > > + * Return: 0 if this page is already in sram (no need to migrate)
> > > > + * 1: successfully migrated this page from vram to sram.
> > > > + * error code otherwise
> > > > + */
> > > > +static int migrate_page_vram_to_ram(struct vm_area_struct *vma,
> > > unsigned long addr,
> > > > +						unsigned long src_pfn,
> > > unsigned long *dst_pfn)
> > > > +{
> > >
> > > We definitely don't want to copy 1 page at a time. I touch on this in [1].
> > > Basically this is going to perform poorly unless we use larger copies; the
> > > migrate code supports non-contiguous sram addresses, and vram addresses
> > > will likely be contiguous due to the buddy allocator.
> > >
> > > [1] https://patchwork.freedesktop.org/patch/588548/?series=132229&rev=1
> > >
> > > > +	struct xe_mem_region *mr;
> > > > +	struct xe_tile *tile;
> > > > +	struct xe_device *xe;
> > > > +	struct device *dev;
> > > > +	dma_addr_t dma_addr = 0;
> > > > +	struct dma_fence *fence;
> > > > +	struct page *host_page;
> > > > +	struct page *src_page;
> > > > +	u64 src_dpa;
> > > > +
> > > > +	src_page = migrate_pfn_to_page(src_pfn);
> > > > +	if (unlikely(!src_page || !(src_pfn & MIGRATE_PFN_MIGRATE)))
> > >
> > > I'm going to say this is a bug if !src_page ||
> > > !is_zone_device_page(src_page) || !(src_pfn & MIGRATE_PFN_MIGRATE) and
> > > we return -EFAULT (or another error code if that makes more sense). We
> > > are migrating from the device where we know we have backing store from
> > > the original fault.
> > >
> > > > +		return 0;
> > > > +
> > > > +	mr = xe_page_to_mem_region(src_page);
> > > > +	tile = xe_mem_region_to_tile(mr);
> > > > +	xe = tile_to_xe(tile);
> > > > +	dev = xe->drm.dev;
> > > > +
> > > > +	src_dpa = xe_mem_region_pfn_to_dpa(mr, src_pfn);
> > > > +	host_page = alloc_host_page(dev, vma, addr, &dma_addr, dst_pfn);
> > > > +	if (!host_page)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	fence = xe_migrate_pa(tile->migrate, src_dpa, true,
> > > > +						dma_addr, false, PAGE_SIZE);
> > > > +	if (IS_ERR(fence)) {
> > > > +		dma_unmap_page(dev, dma_addr, PAGE_SIZE,
> > > DMA_FROM_DEVICE);
> > > > +		free_host_page(host_page);
> > > > +		return PTR_ERR(fence);
> > > > +	}
> > > > +
> > > > +	dma_fence_wait(fence, false);
> > >
> > > Even if we did want to migrate a page at a time, we only need to wait on
> > > the last fence due to the ordered nature of exec queues.
> > >
> > > > +	dma_fence_put(fence);
> > > > +	dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_FROM_DEVICE);
> > >
> > > With the above, we will likely unmap all dma pages in a single function
> > > once the last fence is signaled.
> > >
> > > > +	return 1;
> > > > +}
> > > > +
> > > > +/**
> > > > + * xe_svm_migrate_to_sram() - Migrate memory back to sram on CPU
> page
> > > fault
> > > > + *
> > > > + * @vmf: cpu vm fault structure, contains fault information such as vma
> etc.
> > > > + *
> > > > + * Note, this is in CPU's vm fault handler, caller holds the mmap read
> lock.
> > > > + *
> > > > + * This function migrate one gpu vma which contains the fault address to
> > > sram.
> > > > + * We try to maintain a 1:1 mapping b/t the CPU vma and gpu vma (i.e.,
> > > create one
> > > > + * gpu vma for one cpu vma initially and try not to split it). So this
> scheme
> > > end
> > > > + * up migrate at the vma granularity. This might not be the best
> performant
> > > scheme
> > > > + *
> > > > + * This can be tunned with a migration granularity for  performance, for
> > > example,
> > > > + * migration 2M for each CPU page fault, or let user specify how much
> to
> > > migrate.
> > > > + * This is more complex due to vma splitting.
> > > > + *
> > > > + * This function should also update GPU page table, so the fault virtual
> > > address
> > > > + * points to the same sram location from GPU side. This is TBD.
> > > > + *
> > > > + * Return:
> > > > + * 0 on success
> > > > + * VM_FAULT_SIGBUS: failed to migrate page to system memory,
> > > application
> > > > + * will be signaled a SIGBUG
> > > > + */
> > > > +vm_fault_t xe_svm_migrate_to_sram(struct vm_fault *vmf)
> > > > +{
> > > > +	struct xe_mem_region *mr = xe_page_to_mem_region(vmf->page);
> > > > +	struct xe_tile *tile = xe_mem_region_to_tile(mr);
> > > > +	struct xe_device *xe = tile_to_xe(tile);
> > > > +	struct vm_area_struct *vma = vmf->vma;
> > > > +	struct mm_struct *mm = vma->vm_mm;
> > > > +	struct xe_svm *svm = xe_lookup_svm_by_mm(mm);
> > >
> > > I don't think this is needed... More below.
> > >
> > > > +	unsigned long addr = vma->vm_start;
> > > > +	u64 npages = vma_pages(vma);
> > > > +	struct xe_vma *xe_vma;
> > > > +	vm_fault_t ret = 0;
> > > > +	struct xe_vm *vm;
> > > > +	void *buf;
> > > > +	int i;
> > > > +
> > > > +	struct migrate_vma migrate_vma = {
> > > > +		.vma		= vmf->vma,
> > > > +		.start		= vma->vm_start,
> > > > +		.end		= vma->vm_end,
> > >
> > > So I know in my PoC I had the fault user pointer (xe_vma) == the struct
> > > vm_area_struct of the GPU fault. That is definitely wrong. We likely
> > > want to allocate a sub-range of the vm_area_struct for the xe_vma; we can
> > > call this a chunk size. Logical chunk sizes would be aligned 2MB, 64K, and
> > > finally 4K, trying the largest first... The chunk sizes are trivial as we
> > > can likely just have a table with values. The key here is that the
> > > vm_area_struct vm_start / vm_end are not what we want to use here, but
> > > rather xe_vma_start(vma) and xe_vma_end(vma). I think we get the
> > > xe_vma
> 
> After I typed this, I realized I made a mistake here...
> 
> s/xe_vma_start/xe_vma_userptr/
> s/xe_vma_end/xe_vma_userptr + xe_vma_size/
> 
> But you get the idea - the zone_device_data points to Xe-specific chunk
> data (currently xe_vma, could be xe_pt_state or something later if we
> switch to 1:N).
> 
> Check AMD's + Nvidia's drivers and they both use this field in a similar
> way.
> 
> > > from the faulting page vmf->page->zone_device_data field unless you have
> > > another use for that field...
> >
> > You are right. Such work is planned in the memory attributes part that Himal
> > is working on. We have a migration_granularity attribute which allows the
> > user to set the chunk size.
> >
> 
> That makes sense. The chunk size is always just a hint though, right?


I believe we should have a default chunk size, such as 2M, and the user can override it through hints.
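
To make that concrete, the selection could look roughly like the sketch
below. The helper name and the hint plumbing are placeholders, not the
actual attribute interface Himal is adding; the chunk data itself would
hang off page->zone_device_data as you describe:

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/sizes.h>

/* largest first, per your suggestion */
static const unsigned long xe_svm_chunk_sizes[] = {
	SZ_2M, SZ_64K, SZ_4K,
};

static unsigned long xe_svm_chunk_size(struct vm_area_struct *vma,
				       unsigned long fault_addr,
				       unsigned long hint_size)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(xe_svm_chunk_sizes); ++i) {
		unsigned long size = xe_svm_chunk_sizes[i];
		unsigned long start = ALIGN_DOWN(fault_addr, size);

		/* honour a smaller user hint if one was given */
		if (hint_size && size > hint_size)
			continue;

		/* the aligned chunk must lie entirely inside the cpu vma */
		if (start >= vma->vm_start && start + size <= vma->vm_end)
			return size;
	}

	return SZ_4K;
}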

> 
> > >
> > > I also comment on my patch with my suggestion implement chunk sizes too.
> > >
> > > > +		.pgmap_owner	= xe,
> > >
> > > Again helper for this.
> > >
> > > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > > +		.fault_page = vmf->page,
> > > > +	};
> > > > +
> > > > +	buf = kvcalloc(npages, 2* sizeof(*migrate_vma.src), GFP_KERNEL);
> > > > +	migrate_vma.src = buf;
> > > > +	migrate_vma.dst = buf + npages;
> > > > +	if (migrate_vma_setup(&migrate_vma) < 0) {
> > > > +		ret = VM_FAULT_SIGBUS;
> > > > +		goto free_buf;
> > > > +	}
> > > > +
> > > > +	if (!migrate_vma.cpages)
> > >
> > > This is an error, need to set a return value.
> > >
> > > > +		goto free_buf;
> > > > +
> > >
> > > We probably should check migrate.cpages != npages too as I also think
> > > this is an error.
> > >
> > > > +	for (i = 0; i < npages; i++) {
> > > > +		ret = migrate_page_vram_to_ram(vma, addr,
> > > migrate_vma.src[i],
> > > > +							migrate_vma.dst + i);
> > > > +		if (ret < 0) {
> > > > +			ret = VM_FAULT_SIGBUS;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		/** Migration has been successful, free source page */
> > > > +		if (ret == 1) {
> > > > +			struct page *src_page =
> > > migrate_pfn_to_page(migrate_vma.src[i]);
> > > > +
> > > > +			xe_devm_page_free(src_page);
> > > > +		}
> > > > +
> > > > +		addr += PAGE_SIZE;
> > > > +	}
> > >
> > > I touch on this above, this should be reworked to roughly:
> > >
> > > - alloc pages and populate migrate_vma.dst
> > > - dma map sram pages
> > > - migrate a chunk of contigous vram addresses at a time
> > > - wait on last dma fence from migrate
> > > - unmap dma pages
> > > - unlock and free all pages
> > >
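Agreed, I will rework it along those lines. Building on
xe_mem_region_pfn_to_dpa() from this patch, the copy step could then look
roughly like the sketch below. xe_migrate_from_vram() is a stand-in name
for the copy helper from your series [1], assumed to take one contiguous
VRAM device physical address plus an array of SRAM dma addresses; the
MIGRATE_PFN_MIGRATE / zone device page sanity checks are elided here:

static int xe_svm_copy_to_sram(struct xe_tile *tile, struct xe_mem_region *mr,
			       unsigned long *src_pfns, dma_addr_t *dma_addrs,
			       unsigned long npages)
{
	struct dma_fence *fence = NULL;
	unsigned long i, j;

	for (i = 0; i < npages; i = j) {
		u64 vram_dpa = xe_mem_region_pfn_to_dpa(mr, src_pfns[i]);

		/* extend the run while the vram pfns stay contiguous */
		for (j = i + 1; j < npages; ++j)
			if (page_to_pfn(migrate_pfn_to_page(src_pfns[j])) !=
			    page_to_pfn(migrate_pfn_to_page(src_pfns[j - 1])) + 1)
				break;

		/* previous fence is no longer needed, the queue is ordered */
		dma_fence_put(fence);
		fence = xe_migrate_from_vram(tile->migrate, vram_dpa,
					     dma_addrs + i, j - i);
		if (IS_ERR(fence))
			return PTR_ERR(fence);
	}

	/* the exec queue is ordered, so the last fence covers every copy */
	if (fence) {
		dma_fence_wait(fence, false);
		dma_fence_put(fence);
	}

	return 0;
}

After that the caller unmaps all the dma pages and commits with
migrate_vma_pages()/migrate_vma_finalize().
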
> > > > +
> > > > +	xe_svm_for_each_vm(svm, vm) {
> > > > +		xe_assert(xe, vm->mm == mm);
> > > > +		xe_vma = xe_vm_lookup_vma(vm, vmf->address);
> > > > +		if (xe_vma)
> > > > +			xe_vm_invalidate_vma(xe_vma);
> > > > +	}
> > >
> > > I've touched on why this isn't needed... I think one of these
> > > migrate_vma_* functions will trigger all MMU notifiers registered for
> > > the range. The notifier owns the invalidate then.
> >
> > Very good point. Yes, after reading the migrate_vma_setup function, I agree
> > this function will call mmu notifiers with the MMU_NOTIFY_MIGRATE event.
> > Today we invalidate the vma on this event. So yes, the above code is not needed.
> >
> > I did identify one potential improvement: when the mmu notifier is called
> > back with the MMU_NOTIFY_MIGRATE event, and migrate_vma_setup was called
> > from the gpu page fault path, we can skip the gpu vma invalidation since we
> > will update the gpu page table after migration anyway. A page table
> > invalidation is not needed in that case. But this should be just a minor
> > improvement.
> >
> 
> We skip invalidations if the initial_bind flag is clear, which should
> cover the initial GPU fault.
> / optimizations in the MMU notifier though, it is kinda messy right now
> too. IMO work like this can be done once the basic design is working +
> tests in place to verify changes / optimizations.
> 

agreed
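
For when we get back to it, the check I have in mind in the notifier would
look roughly like this. It is only a sketch of the idea, not the current xe
callback, and xe_svm_notifier_to_xe() is a hypothetical helper:

static bool xe_svm_invalidate(struct mmu_interval_notifier *mni,
			      const struct mmu_notifier_range *range,
			      unsigned long cur_seq)
{
	struct xe_device *xe = xe_svm_notifier_to_xe(mni); /* hypothetical */

	mmu_interval_set_seq(mni, cur_seq);

	/*
	 * migrate_vma_setup() passes pgmap_owner (the xe device) as
	 * range->owner with a MMU_NOTIFY_MIGRATE event.  For migrations we
	 * started ourselves the caller updates or invalidates the gpu page
	 * table itself, so the notifier-side invalidation can be skipped.
	 */
	if (range->event == MMU_NOTIFY_MIGRATE && range->owner == xe)
		return true;

	/* otherwise invalidate the gpu mapping covering this range ... */
	return true;
}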

> >
> > >
> > > Beyond this, maybe I'm confused about a few things and how this fits all
> > > together. Doesn't every user process have its own unique mm, fd, and vm
> > > (e.g. its own address space)? If a user wants a shared address space, then
> > > use threads with a single mm, fd, and vm.
> >
> > Yes, this is also my understanding. Each user process has its own mm struct
> > and device fds.
> >
> > In the shared address space case, such as when multiple pthreads are created
> > through pthread_create in one process, those pthreads have different kernel
> > task_structs, but all of those task_structs (say we get one from the "current"
> > macro) share the same mm struct, which means they all live inside one cpu
> > address space.
> >
> > Now with this work, we are basically extending this shared cpu address space
> > to the gpu program, so the cpu program and the gpu program can share one
> > address space.
> >
> > Since we allow the user to create multiple gpu vms for one device (let's
> > focus on one device for now), each shared address space can have multiple gpu
> > vms... Each gpuvm should be able to register its own mmu notifier with core
> > mm, even if those notifiers cover the same address range. But I will have to
> > test this out. If all of this works, the above code is not needed. If
> > different gpuvms can't register mmu notifiers for the same address range,
> > then we would need a fix....
> >
> 
> The mmu notifier code is implemented with an interval tree which
> supports overlapping ranges (i.e. we can have multiple VMs register
> notifiers with the same address range in a single MM).

Ok, that is great. I will delete the xe_svm struct.
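
For my own notes, each gpu vm registering its own interval notifier over the
same range would look roughly like the sketch below, reusing an invalidate
callback like the one sketched above. The mmu_interval_notifier_insert()
part is the real core API; the structure and names are just illustrative:

#include <linux/mmu_notifier.h>

struct xe_svm_notifier {
	struct mmu_interval_notifier notifier;
	/* back pointer to the owning gpu vm would live here */
};

static const struct mmu_interval_notifier_ops xe_svm_notifier_ops = {
	.invalidate = xe_svm_invalidate,
};

/* two different gpu vms may call this with an identical [start, start + size) */
static int xe_svm_register_notifier(struct xe_svm_notifier *sn,
				    struct mm_struct *mm,
				    unsigned long start, unsigned long size)
{
	return mmu_interval_notifier_insert(&sn->notifier, mm, start, size,
					    &xe_svm_notifier_ops);
}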

Oak

> 
> >
> > >
> > > So even if we had to resolve the xe_vma's here and do an invalidate here,
> > > I'm very confused what this is doing. Is this the case of multiple
> > > devices where each VM points to a different device?
> >
> > Right now I only focus on a single device. See above. This is to solve the
> > one gpu device, multiple gpu vm case. But as said above, for now I don't
> > think this is needed. I need to test the mmu notifier behavior more: whether
> > it allows us to insert two notifiers for the same range on one mm....
> >
> 
> Agree that our focus should be on a single device now. If that design is
> well thought out, I don't think extending this to multiple devices will
> be a huge change either.
> 
> Matt
> 
> > Oak
> >
> > > Again, in that case I don't think an xe_svm structure would be needed; on
> > > a GPU fault we should be able to detect from the faulting page's
> > > zone_device_data and pgmap owner whether the fault already has physical
> > > backing on another GPU, and resolve how to map it into the GPU on a
> > > fault... Jason suggests this in the following thread [2] and I think I
> > > agree with him.
> > >
> > > [2] https://lore.kernel.org/all/5495090e-dee1-4c8e-91bc-240632fd3e35 at amd.com/T/
> > >
> > > > +	migrate_vma_pages(&migrate_vma);
> > >
> > > This logic is going to change but ...
> > >
> > > On an error I think we only want to call migrate_vma_finalize to revert
> > > pages back to the original state (i.e. migrate_vma_pages commits the
> > > page changes which we don't want to do on an error).
> > >
> > > > +	migrate_vma_finalize(&migrate_vma);
> > > > +free_buf:
> > > > +	kvfree(buf);
> > > > +	return 0;
> > >
> > > I don't think 0 should blindly be returned here; if there is an error,
> > > return VM_FAULT_SIGBUS. We likely want a high-level error message too.
> > >
> > > Matt
> > >
> > > > +}
> > > > --
> > > > 2.26.3
> > > >

