[PATCH 2/3] drm/xe: Clear scratch page on vm_bind

Matthew Brost matthew.brost at intel.com
Thu Feb 20 21:09:46 UTC 2025


On Wed, Feb 19, 2025 at 12:46:52PM -0800, Matthew Brost wrote:
> On Wed, Feb 19, 2025 at 01:19:10PM -0700, Zeng, Oak wrote:
> > 
> > 
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost at intel.com>
> > > Sent: February 19, 2025 12:47 PM
> > > To: Zeng, Oak <oak.zeng at intel.com>
> > > Cc: intel-xe at lists.freedesktop.org;
> > > Thomas.Hellstrom at linux.intel.com; Cavitt, Jonathan
> > > <jonathan.cavitt at intel.com>
> > > Subject: Re: [PATCH 2/3] drm/xe: Clear scratch page on vm_bind
> > > 
> > > On Wed, Feb 12, 2025 at 09:23:30PM -0500, Oak Zeng wrote:
> > > > When a vm runs under fault mode, if scratch pages are enabled, we need
> > > > to clear the scratch page mapping on vm_bind for the vm_bind address
> > > > range. Under fault mode, we depend on recoverable page faults to
> > > > establish the mapping in the page table. If the scratch page is not
> > > > cleared, GPU access of the address won't cause a page fault because it
> > > > always hits the existing scratch page mapping.
> > > >
> > > > When vm_bind is called with the IMMEDIATE flag, there is no need for
> > > > clearing as the immediate bind can overwrite the scratch page mapping.
> > > >
> > > > So far only xe2 and xe3 products are allowed to enable scratch pages
> > > > under fault mode. On other platforms we don't allow scratch pages under
> > > > fault mode, so there is no need for such clearing.
> > > >
> > > > v2: Rework vm_bind pipeline to clear scratch page mapping. This is
> > > > similar to a map operation, with the exception that PTEs are cleared
> > > > instead of pointing to valid physical pages. (Matt, Thomas)
> > > >
> > > > TLB invalidation is needed after clearing the scratch page mapping, as
> > > > a larger scratch page mapping could be backed by a physical page and
> > > > cached in the TLB. (Matt, Thomas)
> > > >
> > > > v3: Fix the case of clearing huge pte (Thomas)
> > > >
> > > > Improve commit message (Thomas)
> > > >
> > > > v4: TLB invalidation on all LR cases, not only the clear on bind
> > > > cases (Thomas)
> > > >
> > > > Signed-off-by: Oak Zeng <oak.zeng at intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/xe_pt.c       | 53 ++++++++++++++++++++++++--------
> > > >  drivers/gpu/drm/xe/xe_pt_types.h |  2 ++
> > > >  drivers/gpu/drm/xe/xe_vm.c       | 29 ++++++++++++++---
> > > >  drivers/gpu/drm/xe/xe_vm_types.h |  2 ++
> > > >  4 files changed, 70 insertions(+), 16 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> > > > index 1ddcc7e79a93..319769056893 100644
> > > > --- a/drivers/gpu/drm/xe/xe_pt.c
> > > > +++ b/drivers/gpu/drm/xe/xe_pt.c
> > > > @@ -268,6 +268,8 @@ struct xe_pt_stage_bind_walk {
> > > >  	 * granularity.
> > > >  	 */
> > > >  	bool needs_64K;
> > > > +	/* @clear_pt: clear page table entries during the bind walk */
> > > > +	bool clear_pt;
> > > >  	/**
> > > >  	 * @vma: VMA being mapped
> > > >  	 */
> > > > @@ -415,6 +417,10 @@ static bool xe_pt_hugepte_possible(u64 addr, u64 next, unsigned int level,
> > > >  	if (xe_vma_is_null(xe_walk->vma))
> > > >  		return true;
> > > >
> > > > +	/* if we are clearing page table, no dma addresses*/
> > > > +	if (xe_walk->clear_pt)
> > > > +		return true;
> > > > +
> > > >  	/* Is the DMA address huge PTE size aligned? */
> > > >  	size = next - addr;
> > > >  	dma = addr - xe_walk->va_curs_start + xe_res_dma(xe_walk->curs);
> > > > @@ -497,16 +503,19 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
> > > >
> > > >  		XE_WARN_ON(xe_walk->va_curs_start != addr);
> > > >
> > > > -		pte = vm->pt_ops->pte_encode_vma(is_null ? 0 :
> > > > -						 xe_res_dma(curs) + xe_walk->dma_offset,
> > > > -						 xe_walk->vma, pat_index, level);
> > > > +		pte = vm->pt_ops->pte_encode_vma(is_null || xe_walk->clear_pt ?
> > > > +				0 : xe_res_dma(curs) + xe_walk->dma_offset,
> > > > +				xe_walk->vma, pat_index, level);
> > > >  		pte |= xe_walk->default_pte;
> > > >
> > > > +		if (xe_walk->clear_pt)
> > > > +			pte = 0;
> > > 
> > > Is there a way to move this clear to later in the pipeline? e.g. at the
> > > step which actually programs the page tables via a GPU job or the CPU?
> > 
> > Isn't the design here to generate all the page table entries in this
> > stage_bind stage, while the actual programming of the page table using the
> > CPU/GPU just focuses on writing those entries to the actual page table?
> > From this perspective, it is more natural to set the entries to 0 here for
> > the clear_pt case.
> > 
> 
> Yes, but for userptr an invalidation can occur *after* the stage_bind step.
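
For reference, the split described above roughly looks like the following
minimal sketch (hypothetical names and encoding, not the actual xe code or
this patch): stage_bind computes the PTE values and the later CPU/GPU
programming step only writes whatever was staged, so zeroing for clear_pt
naturally happens at staging time.

/* Illustrative sketch only -- hypothetical types/helpers, not the xe API. */
#include <stdbool.h>
#include <stdint.h>

struct bind_walk {
	bool clear_pt;       /* clear-on-bind: stage zeroed entries */
	uint64_t dma_offset; /* offset added to the cursor's DMA address */
};

/* Pretend PTE encoding: pack the address with a few attribute bits. */
static uint64_t pte_encode(uint64_t dma, uint16_t pat_index, unsigned int level)
{
	return dma | ((uint64_t)pat_index << 3) | ((uint64_t)level << 1) | 0x1;
}

/*
 * Phase 1: stage one entry. For clear_pt the staged value is simply 0.
 * Phase 2 (a GPU job or the CPU) only copies staged values into the live
 * page tables and never re-derives them.
 */
static uint64_t stage_pte(const struct bind_walk *walk, uint64_t dma,
			  uint16_t pat_index, unsigned int level)
{
	if (walk->clear_pt)
		return 0;

	return pte_encode(dma + walk->dma_offset, pat_index, level);
}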
> 
> > Of course you can do it later, but it comes at the cost of sacrificing the design.
> > 
> > 
> > > I ask as I think our userptr code is broken for invalidations, which
> > > could hook into clearing to fix this.
> > 
> > Can you elaborate on how the userptr invalidation is broken? Is it broken by this patch?
> 
> The existing code is broken. One option being discussed to fix this is to
> convert a bind to an invalidation at the very last stage of the bind.
> Checking here to see if this is viable. Hopefully Thomas and I agree on a
> plan in the next day or so.
> 
> > I don't follow here. As I understand it, the userptr invalidation doesn't
> > go to this stage_bind function. It goes to xe_pt_zap_ptes. So I don't
> > follow how userptr invalidation is broken...
> >
> 
> It is about the case where you have a large array of binds and a single
> userptr is invalidated after xe_userptr_pin: rather than restarting the
> entire array of binds, just convert the single bind to an invalidation and
> fix up the pages in the rebind worker, exec, or page fault. The code tries
> to do this, but it is wrong from a security standpoint and may have a
> possible UAF.
> 
> Anyway, let me get back to you once we figure out the direction we want to
> go to fix this - we may land on just restarting the entire array of binds
> when this occurs, making my comment here irrelevant.
> 

After some further thought, I will be fixing the userptr issue by converting
userptr invalidations that raced with the bind into PTE clears at a later
step in the bind process. However, this is a rather costly operation as it
has to search the OPs list to do so, so I think having your solution in this
patch for expected PTE clears also makes sense. Sorry for the noise.
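
Roughly, the shape of that conversion (a minimal sketch with hypothetical
structures and helper names, not the actual fix): after staging, walk the ops
list and, for any op whose userptr was invalidated in the meantime, rewrite
its staged entries to zero and request a TLB invalidation - the ops-list walk
is where the extra cost comes from.

/* Illustrative sketch only -- hypothetical types, not the real xe structures. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct staged_op {
	bool userptr_invalidated;  /* set by the MMU notifier after staging */
	size_t num_entries;
	uint64_t *entries;         /* staged PTE values for this op */
};

/*
 * Convert any bind whose userptr raced with an invalidation into a PTE
 * clear before the staged entries are written out. Returns true if a TLB
 * invalidation is required because existing mappings were downgraded.
 */
static bool downgrade_raced_binds(struct staged_op *ops, size_t num_ops)
{
	bool needs_tlb_inval = false;

	for (size_t i = 0; i < num_ops; i++) {
		if (!ops[i].userptr_invalidated)
			continue;

		/* The bind becomes a clear: zero every staged entry. */
		for (size_t j = 0; j < ops[i].num_entries; j++)
			ops[i].entries[j] = 0;

		needs_tlb_inval = true;
	}

	return needs_tlb_inval;
}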

Matt

> Matt
> 
> > Oak  
> > 
> > 
> > > Also, I think prefetch for SVM could make use of this too.
> > > 
> > > I believe for this use case this patch is correct though.
> > > 
> > > Matt
> > > 
> > > > +
> > > >  		/*
> > > >  		 * Set the XE_PTE_PS64 hint if possible, otherwise if
> > > >  		 * this device *requires* 64K PTE size for VRAM, fail.
> > > >  		 */
> > > > -		if (level == 0 && !xe_parent->is_compact) {
> > > > +		if (level == 0 && !xe_parent->is_compact && !xe_walk->clear_pt) {
> > > >  			if (xe_pt_is_pte_ps64K(addr, next, xe_walk)) {
> > > >  				xe_walk->vma->gpuva.flags |= XE_VMA_PTE_64K;
> > > >  				pte |= XE_PTE_PS64;
> > > > @@ -519,7 +528,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
> > > >  		if (unlikely(ret))
> > > >  			return ret;
> > > >
> > > > -		if (!is_null)
> > > > +		if (!is_null && !xe_walk->clear_pt)
> > > >  			xe_res_next(curs, next - addr);
> > > >  		xe_walk->va_curs_start = next;
> > > >  		xe_walk->vma->gpuva.flags |= (XE_VMA_PTE_4K << level);
> > > > @@ -589,6 +598,7 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = {
> > > >   * @vma: The vma indicating the address range.
> > > >   * @entries: Storage for the update entries used for connecting the tree to
> > > >   * the main tree at commit time.
> > > > + * @clear_pt: Clear the page table entries.
> > > >   * @num_entries: On output contains the number of @entries used.
> > > >   *
> > > >   * This function builds a disconnected page-table tree for a given address
> > > > @@ -602,7 +612,8 @@ static const struct xe_pt_walk_ops xe_pt_stage_bind_ops = {
> > > >   */
> > > >  static int
> > > >  xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
> > > > -		 struct xe_vm_pgtable_update *entries, u32 *num_entries)
> > > > +		 struct xe_vm_pgtable_update *entries,
> > > > +		 bool clear_pt, u32 *num_entries)
> > > >  {
> > > >  	struct xe_device *xe = tile_to_xe(tile);
> > > >  	struct xe_bo *bo = xe_vma_bo(vma);
> > > > @@ -622,10 +633,19 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
> > > >  		.vma = vma,
> > > >  		.wupd.entries = entries,
> > > >  		.needs_64K = (xe_vma_vm(vma)->flags & XE_VM_FLAG_64K) && is_devmem,
> > > > +		.clear_pt = clear_pt,
> > > >  	};
> > > >  	struct xe_pt *pt = xe_vma_vm(vma)->pt_root[tile->id];
> > > >  	int ret;
> > > >
> > > > +	if (clear_pt) {
> > > > +		ret = xe_pt_walk_range(&pt->base, pt->level, xe_vma_start(vma),
> > > > +				       xe_vma_end(vma), &xe_walk.base);
> > > > +
> > > > +		*num_entries = xe_walk.wupd.num_used_entries;
> > > > +		return ret;
> > > > +	}
> > > > +
> > > >  	/**
> > > >  	 * Default atomic expectations for different allocation scenarios are as follows:
> > > >  	 *
> > > > @@ -981,12 +1001,14 @@ static void xe_pt_free_bind(struct xe_vm_pgtable_update *entries,
> > > >
> > > >  static int
> > > >  xe_pt_prepare_bind(struct xe_tile *tile, struct xe_vma *vma,
> > > > -		   struct xe_vm_pgtable_update *entries, u32 *num_entries)
> > > > +		   struct xe_vm_pgtable_update *entries,
> > > > +		   bool invalidate_on_bind, u32 *num_entries)
> > > >  {
> > > >  	int err;
> > > >
> > > >  	*num_entries = 0;
> > > > -	err = xe_pt_stage_bind(tile, vma, entries, num_entries);
> > > > +	err = xe_pt_stage_bind(tile, vma, entries, invalidate_on_bind,
> > > > +			       num_entries);
> > > >  	if (!err)
> > > >  		xe_tile_assert(tile, *num_entries);
> > > >
> > > > @@ -1661,6 +1683,7 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
> > > >  		return err;
> > > >
> > > >  	err = xe_pt_prepare_bind(tile, vma, pt_op->entries,
> > > > +				 pt_update_ops->invalidate_on_bind,
> > > >  				 &pt_op->num_entries);
> > > >  	if (!err) {
> > > >  		xe_tile_assert(tile, pt_op->num_entries <=
> > > > @@ -1681,11 +1704,11 @@ static int bind_op_prepare(struct xe_vm *vm, struct xe_tile *tile,
> > > >  		 * If !rebind, and scratch enabled VMs, there is a chance the scratch
> > > >  		 * PTE is already cached in the TLB so it needs to be invalidated.
> > > >  		 * On !LR VMs this is done in the ring ops preceding a batch, but on
> > > > -		 * non-faulting LR, in particular on user-space batch buffer chaining,
> > > > -		 * it needs to be done here.
> > > > +		 * LR, in particular on user-space batch buffer chaining, it needs to
> > > > +		 * be done here.
> > > >  		 */
> > > >  		if ((!pt_op->rebind && xe_vm_has_scratch(vm) &&
> > > > -		     xe_vm_in_preempt_fence_mode(vm)))
> > > > +		     xe_vm_in_lr_mode(vm)))
> > > >  			pt_update_ops->needs_invalidation = true;
> > > >  		else if (pt_op->rebind && !xe_vm_in_lr_mode(vm))
> > > >  			/* We bump also if batch_invalidate_tlb is true */
> > > > @@ -1759,9 +1782,13 @@ static int op_prepare(struct xe_vm *vm,
> > > >
> > > >  	switch (op->base.op) {
> > > >  	case DRM_GPUVA_OP_MAP:
> > > > -		if (!op->map.immediate && xe_vm_in_fault_mode(vm))
> > > > +		if (!op->map.immediate && xe_vm_in_fault_mode(vm) &&
> > > > +		    !op->map.invalidate_on_bind)
> > > >  			break;
> > > >
> > > > +		if (op->map.invalidate_on_bind)
> > > > +			pt_update_ops->invalidate_on_bind = true;
> > > > +
> > > >  		err = bind_op_prepare(vm, tile, pt_update_ops, op->map.vma);
> > > >  		pt_update_ops->wait_vm_kernel = true;
> > > >  		break;
> > > > @@ -1871,6 +1898,8 @@ static void bind_op_commit(struct xe_vm *vm, struct xe_tile *tile,
> > > >  	}
> > > >  	vma->tile_present |= BIT(tile->id);
> > > >  	vma->tile_staged &= ~BIT(tile->id);
> > > > +	if (pt_update_ops->invalidate_on_bind)
> > > > +		vma->tile_invalidated |= BIT(tile->id);
> > > >  	if (xe_vma_is_userptr(vma)) {
> > > >  		lockdep_assert_held_read(&vm->userptr.notifier_lock);
> > > >  		to_userptr_vma(vma)->userptr.initial_bind = true;
> > > > diff --git a/drivers/gpu/drm/xe/xe_pt_types.h b/drivers/gpu/drm/xe/xe_pt_types.h
> > > > index 384cc04de719..3d0aa2a5102e 100644
> > > > --- a/drivers/gpu/drm/xe/xe_pt_types.h
> > > > +++ b/drivers/gpu/drm/xe/xe_pt_types.h
> > > > @@ -108,6 +108,8 @@ struct xe_vm_pgtable_update_ops {
> > > >  	bool needs_userptr_lock;
> > > >  	/** @needs_invalidation: Needs invalidation */
> > > >  	bool needs_invalidation;
> > > > +	/** @invalidate_on_bind: Invalidate the range before bind */
> > > > +	bool invalidate_on_bind;
> > > >  	/**
> > > >  	 * @wait_vm_bookkeep: PT operations need to wait until VM is idle
> > > >  	 * (bookkeep dma-resv slots are idle) and stage all future VM activity
> > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > > > index d664f2e418b2..813d893d9b63 100644
> > > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > > @@ -1921,6 +1921,23 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
> > > >  }
> > > >  #endif
> > > >
> > > > +static bool __xe_vm_needs_clear_scratch_pages(struct xe_vm *vm, u32 bind_flags)
> > > > +{
> > > > +	if (!xe_vm_in_fault_mode(vm))
> > > > +		return false;
> > > > +
> > > > +	if (!NEEDS_SCRATCH(vm->xe))
> > > > +		return false;
> > > > +
> > > > +	if (!xe_vm_has_scratch(vm))
> > > > +		return false;
> > > > +
> > > > +	if (bind_flags & DRM_XE_VM_BIND_FLAG_IMMEDIATE)
> > > > +		return false;
> > > > +
> > > > +	return true;
> > > > +}
> > > > +
> > > >  /*
> > > >   * Create operations list from IOCTL arguments, setup operations fields so parse
> > > >   * and commit steps are decoupled from IOCTL arguments. This step can fail.
> > > > @@ -1991,6 +2008,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> > > >  			op->map.is_null = flags & DRM_XE_VM_BIND_FLAG_NULL;
> > > >  			op->map.dumpable = flags & DRM_XE_VM_BIND_FLAG_DUMPABLE;
> > > >  			op->map.pat_index = pat_index;
> > > > +			op->map.invalidate_on_bind =
> > > > +				__xe_vm_needs_clear_scratch_pages(vm, flags);
> > > >  		} else if (__op->op == DRM_GPUVA_OP_PREFETCH) {
> > > >  			op->prefetch.region = prefetch_region;
> > > >  		}
> > > > @@ -2188,7 +2207,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
> > > >  				return PTR_ERR(vma);
> > > >
> > > >  			op->map.vma = vma;
> > > > -			if (op->map.immediate || !xe_vm_in_fault_mode(vm))
> > > > +			if (op->map.immediate || !xe_vm_in_fault_mode(vm) ||
> > > > +			    op->map.invalidate_on_bind)
> > > >  				xe_vma_ops_incr_pt_update_ops(vops,
> > > >  							      op->tile_mask);
> > > >  			break;
> > > > @@ -2416,9 +2436,10 @@ static int op_lock_and_prep(struct drm_exec *exec, struct xe_vm *vm,
> > > >
> > > >  	switch (op->base.op) {
> > > >  	case DRM_GPUVA_OP_MAP:
> > > > -		err = vma_lock_and_validate(exec, op->map.vma,
> > > > -					    !xe_vm_in_fault_mode(vm) ||
> > > > -					    op->map.immediate);
> > > > +		if (!op->map.invalidate_on_bind)
> > > > +			err = vma_lock_and_validate(exec, op->map.vma,
> > > > +						    !xe_vm_in_fault_mode(vm) ||
> > > > +						    op->map.immediate);
> > > >  		break;
> > > >  	case DRM_GPUVA_OP_REMAP:
> > > >  		err = check_ufence(gpuva_to_vma(op->base.remap.unmap->va));
> > > > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> > > > index 52467b9b5348..dace04f4ea5e 100644
> > > > --- a/drivers/gpu/drm/xe/xe_vm_types.h
> > > > +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> > > > @@ -297,6 +297,8 @@ struct xe_vma_op_map {
> > > >  	bool is_null;
> > > >  	/** @dumpable: whether BO is dumped on GPU hang */
> > > >  	bool dumpable;
> > > > +	/** @invalidate: invalidate the VMA before bind */
> > > > +	bool invalidate_on_bind;
> > > >  	/** @pat_index: The pat index to use for this operation. */
> > > >  	u16 pat_index;
> > > >  };
> > > > --
> > > > 2.26.3
> > > >

