[PATCH] drm/xe: Implement clear VRAM on free
Thomas Hellström
thomas.hellstrom at linux.intel.com
Mon Jun 16 07:40:05 UTC 2025
On Fri, 2025-06-13 at 13:02 -0700, Matthew Brost wrote:
> On Fri, Jun 13, 2025 at 09:21:36AM -0700, Matthew Brost wrote:
> > On Fri, Jun 13, 2025 at 10:07:17AM +0200, Thomas Hellström wrote:
> > > On Thu, 2025-06-12 at 10:11 -0700, Matthew Brost wrote:
> > > > On Thu, Jun 12, 2025 at 02:53:16PM +0200, Thomas Hellström
> > > > wrote:
> > > > > On Tue, 2025-06-10 at 22:42 -0700, Matthew Brost wrote:
> > > > > > Clearing on free should hide the latency of BO clears on new
> > > > > > user BO allocations.
> > > > > >
> > > > > > Implemented by calling xe_migrate_clear in release notify and
> > > > > > updating the iterator in xe_migrate_clear to skip cleared buddy
> > > > > > blocks. Only user BOs are cleared in release notify, as kernel
> > > > > > BOs could still be in use (e.g., PT BOs need to wait for
> > > > > > dma-resv to be idle).
> > > > >
> > > > > Wouldn't it be fully possible for a user to do (deep
> > > > > pipelining 3d case)
> > > > >
> > > > > create_bo();
> > > > > map_write_unmap_bo();
> > > > > bind_bo();
> > > > > submit_job_touching_bo();
> > > > > unbind_bo();
> > > > > free_bo();
> > > > >
> > > > > Where free_bo() and release_notify() are called long before
> > > > > the job we submitted even started?
> > > > >
> > > > > So that would mean the clear needs to await any previous
> > > > > fences, and that dependency addition seems to have been removed
> > > > > from xe_migrate_clear.
> > > > >
> > > >
> > > > I think we are actually ok here. xe_vma_destroy is called on
> > > > unbind with the out fence from the bind IOCTL, so we don't get to
> > > > xe_vma_destroy_late until that fence signals, and
> > > > xe_vma_destroy_late (possibly) does the final BO put. Whether this
> > > > flow makes sense is a bit questionable - this was very early code
> > > > I wrote in Xe, and if I rewrote it today I suspect it would look
> > > > different.
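For readers following along, the deferral described above is roughly the
following pattern (a sketch with illustrative names, not the exact Xe
code): chain the final object put to the unbind out-fence, so
release_notify only runs once that fence has signaled.

#include <linux/dma-fence.h>
#include <linux/slab.h>
#include <linux/workqueue.h>
#include "xe_bo.h"	/* xe_bo_put(), from the driver */

struct deferred_put {
	struct dma_fence_cb cb;
	struct work_struct work;
	struct xe_bo *bo;
};

static void deferred_put_work(struct work_struct *work)
{
	struct deferred_put *dp = container_of(work, struct deferred_put, work);

	xe_bo_put(dp->bo);	/* may be the final put -> release_notify */
	kfree(dp);
}

static void deferred_put_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
	struct deferred_put *dp = container_of(cb, struct deferred_put, cb);

	/* Fence callbacks may run in irq context, so punt to a worker. */
	queue_work(system_unbound_wq, &dp->work);
}

static int defer_bo_put_until(struct xe_bo *bo, struct dma_fence *fence)
{
	struct deferred_put *dp = kzalloc(sizeof(*dp), GFP_KERNEL);

	if (!dp)
		return -ENOMEM;

	dp->bo = bo;
	INIT_WORK(&dp->work, deferred_put_work);
	if (dma_fence_add_callback(fence, &dp->cb, deferred_put_cb))
		queue_work(system_unbound_wq, &dp->work);	/* already signaled */
	return 0;
}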
> > >
> > > Hmm, yeah you're right. So the unbind kernel fence should indeed
> > > be the last fence we need to wait for here.
> > >
> >
> > It should actually be signaled too. I think we could avoid any
> > dma-resv wait in this case. Kernel operations are typically (maybe
> > always) on the migration queue, though, so we'd be waiting on those
> > operations via queue ordering anyway.
Hm. Should we really be keeping the vma and xe_bo around until the unbind
fence has signaled? We didn't always do that, except for userptr where we
needed a mechanism to keep the dma-mappings around. So if we mistakenly
introduced something that needs to keep them around, the above would only
make that harder to fix? And it sounds like this completely bypasses the
TTM delayed-delete mechanism?
If the remaining operations are (maybe always) on the migration queue,
the scheduler will skip them as dependencies for the clearing operation
anyway, right?
I think the safest and, looking forward, least error-prone thing to do
here is to wait for all fences. If we can restore the original behaviour
of dropping the bo reference at unbind time rather than at unbind fence
signal time, the bo will be immediately individualized and no new
unnecessary fences can be added.
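Just to illustrate the intent, a minimal sketch using only the generic
dma-resv helpers (not necessarily how the series would wire it up; a real
version would presumably add these fences as dependencies of the clear
job rather than blocking in release_notify):

#include <drm/ttm/ttm_bo.h>
#include <linux/dma-resv.h>
#include <linux/sched.h>	/* MAX_SCHEDULE_TIMEOUT */

/* Order the clear after everything on the now-individualized resv. */
static int wait_for_all_fences(struct ttm_buffer_object *ttm_bo)
{
	long ret;

	ret = dma_resv_wait_timeout(&ttm_bo->base._resv,
				    DMA_RESV_USAGE_BOOKKEEP, false,
				    MAX_SCHEDULE_TIMEOUT);
	return ret < 0 ? ret : 0;
}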
> >
> > > >
> > > > We could make this 'safer' by waiting on DMA_RESV_USAGE_BOOKKEEP
> > > > in xe_migrate_clear for calls from release notify, but for
> > > > VM-private BOs we'd risk the clear getting stuck behind newly
> > > > submitted (i.e., submitted after the unbind) exec IOCTLs or binds.
> > >
> > > Yeah, although at this point the individualization has already
> > > taken place, so at least there should be no starvation, since the
> > > only unnecessary waits would be for execs submitted between the
> > > unbind and the individualization. So doable, but I'll leave it up
> > > to you.
> > >
> >
> > The individualization is done by the final put (likely assuming the
> > BO is closed and unmapped in user space) in the worker mentioned
> > above. If an exec or bind IOCTL is issued in the interim, we'd be
> > waiting on those.
> >
> > > > > Another side-effect I think this will have is that bos that are
> > > > > deleted are not subject to asynchronous eviction. I think if
> > > > > such a bo is hit during an lru walk while clearing, TTM will
> > > > > just sync wait for it to become idle and then free the memory.
> > > > > I think the reason that could not be fixed in TTM is that TTM
> > > > > needs all resource manager fences to be ordered, but with a
> > > > > check for ordered fences, which I think here requires that the
> > > > > eviction exec_queue is the same as the clearing one, that could
> > > > > be fixed in TTM.
> > > > >
> > > >
> > > > I think async eviction is still controlled by no_wait_gpu,
> > > > right? See ttm_bo_wait_ctx: if a deleted BO is found and
> > > > no_wait_gpu is clear, the eviction process moves on, right? So
> > > > the exec IOCTL can still be pipelined, albeit not with deleted
> > > > BOs that have pending clears. We also clear no_wait_gpu in Xe
> > > > FWIW.
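For reference, the no_wait_gpu handling being discussed is roughly the
following (a simplified sketch of ttm_bo_wait_ctx(), not the exact TTM
code):

#include <drm/ttm/ttm_bo.h>
#include <linux/dma-resv.h>

static int wait_ctx_sketch(struct ttm_buffer_object *bo,
			   struct ttm_operation_ctx *ctx)
{
	long ret;

	if (ctx->no_wait_gpu) {
		/* Busy BO: fail with -EBUSY so the LRU walk can move on. */
		if (dma_resv_test_signaled(bo->base.resv,
					   DMA_RESV_USAGE_BOOKKEEP))
			return 0;
		return -EBUSY;
	}

	/* Otherwise block until the BO is idle. */
	ret = dma_resv_wait_timeout(bo->base.resv, DMA_RESV_USAGE_BOOKKEEP,
				    ctx->interruptible, 15 * HZ);
	if (ret < 0)
		return ret;
	return ret ? 0 : -EBUSY;
}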
> > >
> > > Yes, this is a rather complex problem, further complicated by the
> > > fact that since we can end up waiting for fences under dma_resv
> > > locks, for a true no_wait_gpu exec to succeed we're only allowed
> > > to do dma_resv_trylock.
> > >
> > > Better to try to fix this in TTM rather than worry too much about
> > > it here.
> > >
> >
> > +1.
> >
> > > >
> > > > > Otherwise, this could also cause newly introduced sync waits in
> > > > > the exec() and vm_bind paths where we previously performed the
> > > > > eviction and the subsequent clearing async.
> > > > >
> > > > > Some additional stuff below:
> > > > >
> > > > >
> > > > > >
> > > > > > Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/xe/xe_bo.c           | 47 ++++++++++++++++++++++++++++
> > > > > >  drivers/gpu/drm/xe/xe_migrate.c      | 14 ++++++---
> > > > > >  drivers/gpu/drm/xe/xe_migrate.h      |  1 +
> > > > > >  drivers/gpu/drm/xe/xe_res_cursor.h   | 26 +++++++++++++++
> > > > > >  drivers/gpu/drm/xe/xe_ttm_vram_mgr.c |  5 ++-
> > > > > >  drivers/gpu/drm/xe/xe_ttm_vram_mgr.h |  6 ++++
> > > > > >  6 files changed, 94 insertions(+), 5 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > index 4e39188a021a..74470f4d418d 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > @@ -1434,6 +1434,51 @@ static bool xe_ttm_bo_lock_in_destructor(struct ttm_buffer_object *ttm_bo)
> > > > > >  	return locked;
> > > > > >  }
> > > > > >
> > > > > > +static void xe_ttm_bo_release_clear(struct ttm_buffer_object *ttm_bo)
> > > > > > +{
> > > > > > +	struct xe_device *xe = ttm_to_xe_device(ttm_bo->bdev);
> > > > > > +	struct dma_fence *fence;
> > > > > > +	int err, idx;
> > > > > > +
> > > > > > +	xe_bo_assert_held(ttm_to_xe_bo(ttm_bo));
> > > > > > +
> > > > > > +	if (ttm_bo->type != ttm_bo_type_device)
> > > > > > +		return;
> > > > > > +
> > > > > > +	if (xe_device_wedged(xe))
> > > > > > +		return;
> > > > > > +
> > > > > > +	if (!ttm_bo->resource || !mem_type_is_vram(ttm_bo->resource->mem_type))
> > > > > > +		return;
> > > > > > +
> > > > > > +	if (!drm_dev_enter(&xe->drm, &idx))
> > > > > > +		return;
> > > > > > +
> > > > > > +	if (!xe_pm_runtime_get_if_active(xe))
> > > > > > +		goto unbind;
> > > > > > +
> > > > > > +	err = dma_resv_reserve_fences(&ttm_bo->base._resv, 1);
> > > > > > +	if (err)
> > > > > > +		goto put_pm;
> > > > > > +
> > > > > > +	fence = xe_migrate_clear(mem_type_to_migrate(xe, ttm_bo->resource->mem_type),
> > > > > > +				 ttm_to_xe_bo(ttm_bo), ttm_bo->resource,
> > > > >
> > > > > We should be very careful with passing the xe_bo here because
> > > > > the gem refcount is currently zero, so any caller deeper down
> > > > > in the call chain might try to do an xe_bo_get() and blow up.
> > > > >
> > > > > Ideally we'd make xe_migrate_clear() operate only on the
> > > > > ttm_bo for this to be safe.
> > > > >
> > > >
> > > > It looks like bo->size and xe_bo_sg() are the two uses of the Xe
> > > > BO in xe_migrate_clear(). Let me see if I can refactor the
> > > > arguments to avoid these and add some kernel doc.
> > >
> > > Thanks,
> > > Thomas
> > >
> >
> > So I'll just respin the next rev with refactored xe_migrate_clear
> > arguments.
> >
>
> Actually, xe_migrate_clear sets the bo->ccs_cleared field, so we kinda
> need the Xe BO. I guess I'll leave it as is.
>
> Yes, if a caller does xe_bo_get, that will blow up, but no one is
> doing that, and we'd immediately get a kernel splat if someone tried
> to change this, so I think we are good. Thoughts?
I think we should, again to be robust against future errors, either
1) ensure and document that the migrate layer is completely safe for
gem-refcount-0 bos, or
2) only dereference gem-refcount-0 bos directly in the TTM callbacks.
To me 2) seems simplest, meaning we'd need to pass the ccs_cleared field
into the migrate layer function.
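A rough sketch of what 2) could look like; the migrate-layer helper name
and the bool out-parameter below are purely illustrative, not existing
API, and the runtime-pm / error handling from the patch is elided:

/* Hypothetical migrate-layer entry point that never touches the xe_bo. */
struct dma_fence *xe_migrate_clear_resource(struct xe_migrate *m,
					    struct ttm_resource *res,
					    u64 size, u32 clear_flags,
					    bool *ccs_cleared);

/* Caller side stays in the TTM callback, where dereferencing the
 * gem-refcount-0 bo is still fine.
 */
static void release_clear_sketch(struct ttm_buffer_object *ttm_bo)
{
	struct xe_bo *bo = ttm_to_xe_bo(ttm_bo);
	struct xe_device *xe = ttm_to_xe_device(ttm_bo->bdev);
	bool ccs_cleared = false;
	struct dma_fence *fence;

	fence = xe_migrate_clear_resource(mem_type_to_migrate(xe, ttm_bo->resource->mem_type),
					  ttm_bo->resource, ttm_bo->base.size,
					  XE_MIGRATE_CLEAR_FLAG_FULL |
					  XE_MIGRATE_CLEAR_NON_DIRTY,
					  &ccs_cleared);
	if (!IS_ERR(fence)) {
		bo->ccs_cleared = ccs_cleared;
		dma_resv_add_fence(&ttm_bo->base._resv, fence,
				   DMA_RESV_USAGE_KERNEL);
		dma_fence_put(fence);
	}
}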
Thanks,
Thomas
>
> Matt
>
> > Matt
> >
> > >
> > > >
> > > > Matt
> > > >
> > > > > /Thomas
> > > > >
> > > > >
> > > > > > +				 XE_MIGRATE_CLEAR_FLAG_FULL |
> > > > > > +				 XE_MIGRATE_CLEAR_NON_DIRTY);
> > > > > > +	if (XE_WARN_ON(IS_ERR(fence)))
> > > > > > +		goto put_pm;
> > > > > > +
> > > > > > +	xe_ttm_vram_mgr_resource_set_cleared(ttm_bo->resource);
> > > > > > +	dma_resv_add_fence(&ttm_bo->base._resv, fence,
> > > > > > +			   DMA_RESV_USAGE_KERNEL);
> > > > > > +	dma_fence_put(fence);
> > > > > > +
> > > > > > +put_pm:
> > > > > > +	xe_pm_runtime_put(xe);
> > > > > > +unbind:
> > > > > > +	drm_dev_exit(idx);
> > > > > > +}
> > > > > > +
> > > > > >  static void xe_ttm_bo_release_notify(struct ttm_buffer_object *ttm_bo)
> > > > > >  {
> > > > > >  	struct dma_resv_iter cursor;
> > > > > > @@ -1478,6 +1523,8 @@ static void xe_ttm_bo_release_notify(struct ttm_buffer_object *ttm_bo)
> > > > > >  	}
> > > > > >  	dma_fence_put(replacement);
> > > > > >
> > > > > > +	xe_ttm_bo_release_clear(ttm_bo);
> > > > > > +
> > > > > >  	dma_resv_unlock(ttm_bo->base.resv);
> > > > > >  }
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > index 8f8e9fdfb2a8..39d7200cb366 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > @@ -1063,7 +1063,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
> > > > > >  	struct xe_gt *gt = m->tile->primary_gt;
> > > > > >  	struct xe_device *xe = gt_to_xe(gt);
> > > > > >  	bool clear_only_system_ccs = false;
> > > > > > -	struct dma_fence *fence = NULL;
> > > > > > +	struct dma_fence *fence = dma_fence_get_stub();
> > > > > >  	u64 size = bo->size;
> > > > > >  	struct xe_res_cursor src_it;
> > > > > >  	struct ttm_resource *src = dst;
> > > > > > @@ -1075,10 +1075,13 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
> > > > > >  	if (!clear_bo_data && clear_ccs && !IS_DGFX(xe))
> > > > > >  		clear_only_system_ccs = true;
> > > > > >
> > > > > > -	if (!clear_vram)
> > > > > > +	if (!clear_vram) {
> > > > > >  		xe_res_first_sg(xe_bo_sg(bo), 0, bo->size, &src_it);
> > > > > > -	else
> > > > > > +	} else {
> > > > > >  		xe_res_first(src, 0, bo->size, &src_it);
> > > > > > +		if (!(clear_flags & XE_MIGRATE_CLEAR_NON_DIRTY))
> > > > > > +			size -= xe_res_next_dirty(&src_it);
> > > > > > +	}
> > > > > >
> > > > > >  	while (size) {
> > > > > >  		u64 clear_L0_ofs;
> > > > > > @@ -1125,6 +1128,9 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
> > > > > >  		emit_pte(m, bb, clear_L0_pt, clear_vram, clear_only_system_ccs,
> > > > > >  			 &src_it, clear_L0, dst);
> > > > > >
> > > > > > +		if (clear_vram && !(clear_flags & XE_MIGRATE_CLEAR_NON_DIRTY))
> > > > > > +			size -= xe_res_next_dirty(&src_it);
> > > > > > +
> > > > > >  		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
> > > > > >  		update_idx = bb->len;
> > > > > >
> > > > > > @@ -1146,7 +1152,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
> > > > > >  		}
> > > > > >
> > > > > >  		xe_sched_job_add_migrate_flush(job, flush_flags);
> > > > > > -		if (!fence) {
> > > > > > +		if (fence == dma_fence_get_stub()) {
> > > > > >  			/*
> > > > > >  			 * There can't be anything userspace related at this
> > > > > >  			 * point, so we just need to respect any potential move
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
> > > > > > index fb9839c1bae0..58a7b747ef11 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_migrate.h
> > > > > > +++ b/drivers/gpu/drm/xe/xe_migrate.h
> > > > > > @@ -118,6 +118,7 @@ int xe_migrate_access_memory(struct xe_migrate *m, struct xe_bo *bo,
> > > > > >
> > > > > >  #define XE_MIGRATE_CLEAR_FLAG_BO_DATA	BIT(0)
> > > > > >  #define XE_MIGRATE_CLEAR_FLAG_CCS_DATA	BIT(1)
> > > > > > +#define XE_MIGRATE_CLEAR_NON_DIRTY	BIT(2)
> > > > > >  #define XE_MIGRATE_CLEAR_FLAG_FULL	(XE_MIGRATE_CLEAR_FLAG_BO_DATA | \
> > > > > >  					 XE_MIGRATE_CLEAR_FLAG_CCS_DATA)
> > > > > >  struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_res_cursor.h b/drivers/gpu/drm/xe/xe_res_cursor.h
> > > > > > index d1a403cfb628..630082e809ba 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_res_cursor.h
> > > > > > +++ b/drivers/gpu/drm/xe/xe_res_cursor.h
> > > > > > @@ -315,6 +315,32 @@ static inline void xe_res_next(struct xe_res_cursor *cur, u64 size)
> > > > > >  	}
> > > > > >  }
> > > > > >
> > > > > > +/**
> > > > > > + * xe_res_next_dirty - advance the cursor to the next dirty buddy block
> > > > > > + *
> > > > > > + * @cur: the cursor to advance
> > > > > > + *
> > > > > > + * Move the cursor until a dirty buddy block is found.
> > > > > > + *
> > > > > > + * Return: Number of bytes the cursor has been advanced
> > > > > > + */
> > > > > > +static inline u64 xe_res_next_dirty(struct xe_res_cursor *cur)
> > > > > > +{
> > > > > > +	struct drm_buddy_block *block = cur->node;
> > > > > > +	u64 bytes = 0;
> > > > > > +
> > > > > > +	XE_WARN_ON(cur->mem_type != XE_PL_VRAM0 &&
> > > > > > +		   cur->mem_type != XE_PL_VRAM1);
> > > > > > +
> > > > > > +	while (cur->remaining && drm_buddy_block_is_clear(block)) {
> > > > > > +		bytes += cur->size;
> > > > > > +		xe_res_next(cur, cur->size);
> > > > > > +		block = cur->node;
> > > > > > +	}
> > > > > > +
> > > > > > +	return bytes;
> > > > > > +}
> > > > > > +
> > > > > >  /**
> > > > > >   * xe_res_dma - return dma address of cursor at current position
> > > > > >   *
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > > > index 9e375a40aee9..120046941c1e 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > > > @@ -84,6 +84,9 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager *man,
> > > > > >  	if (place->fpfn || lpfn != man->size >> PAGE_SHIFT)
> > > > > >  		vres->flags |= DRM_BUDDY_RANGE_ALLOCATION;
> > > > > >
> > > > > > +	if (tbo->type == ttm_bo_type_device)
> > > > > > +		vres->flags |= DRM_BUDDY_CLEAR_ALLOCATION;
> > > > > > +
> > > > > >  	if (WARN_ON(!vres->base.size)) {
> > > > > >  		err = -EINVAL;
> > > > > >  		goto error_fini;
> > > > > > @@ -187,7 +190,7 @@ static void xe_ttm_vram_mgr_del(struct ttm_resource_manager *man,
> > > > > >  	struct drm_buddy *mm = &mgr->mm;
> > > > > >
> > > > > >  	mutex_lock(&mgr->lock);
> > > > > > -	drm_buddy_free_list(mm, &vres->blocks, 0);
> > > > > > +	drm_buddy_free_list(mm, &vres->blocks, vres->flags);
> > > > > >  	mgr->visible_avail += vres->used_visible_size;
> > > > > >  	mutex_unlock(&mgr->lock);
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > > > > index cc76050e376d..dfc0e6890b3c 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > > > > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > > > > @@ -36,6 +36,12 @@ to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
> > > > > >  	return container_of(res, struct xe_ttm_vram_mgr_resource, base);
> > > > > >  }
> > > > > >
> > > > > > +static inline void
> > > > > > +xe_ttm_vram_mgr_resource_set_cleared(struct ttm_resource *res)
> > > > > > +{
> > > > > > +	to_xe_ttm_vram_mgr_resource(res)->flags |= DRM_BUDDY_CLEARED;
> > > > > > +}
> > > > > > +
> > > > > >  static inline struct xe_ttm_vram_mgr *
> > > > > >  to_xe_ttm_vram_mgr(struct ttm_resource_manager *man)
> > > > > >  {
> > > > >
> > >