[PATCH] drm/xe: Implement clear VRAM on free

Wed Jun 11 18:20:30 UTC 2025

On Wed, Jun 11, 2025 at 12:05:36PM -0600, Summers, Stuart wrote:
> On Wed, 2025-06-11 at 10:57 -0700, Matthew Brost wrote:
> > On Wed, Jun 11, 2025 at 11:04:06AM -0600, Summers, Stuart wrote:
> > > On Wed, 2025-06-11 at 09:46 -0700, Matthew Brost wrote:
> > > > On Wed, Jun 11, 2025 at 10:26:44AM -0600, Summers, Stuart wrote:
> > > > > On Tue, 2025-06-10 at 22:42 -0700, Matthew Brost wrote:
> > > > > > Clearing on free should hide latency of BO clears on new user
> > > > > > BO
> > > > > > allocations.
> > > > > 
> > > > > Do you have any test results showing the latency improvement
> > > > > here
> > > > > on
> > > > > high volume submission tests? Also how does this impact at very
> > > > > large
> > > > 
> > > > No performance data yet available. Something we definitely should
> > > > look
> > > > into and would be curious about the results. FWIW, AMDGPU
> > > > implements
> > > > clear on free in a very similar way to this series. 
> > > 
> > > I had a few other comments towards the end of the patch. Basically
> > > I
> > 
> > I should slow down and read... Repling to all comments here.
> > 
> > > think it would be nice to be able to configure this. There was
> > > significant testing done here for Aurora in i915 and the
> > > performance
> > > benefits were potentially different based on the types of workloads
> > > being used. Having user hints for instance might allow us to take
> > > advantage of those workload specific use cases.
> > > 
> > 
> > We can wire this mode to the uAPI for BO creation if needed - this
> > patch
> > does clear-on-free blindly for any user BO. Ofc with uAPI we need a
> > UMD
> > user to upstream. We may just want to start with this patch and do
> > that
> > in a follow up if needed as it a minor tweak from a code PoV.
> 
> Actually it would be great if you can add exactly this to the commit
> message. Basically we are intentionally doing clear on free here
> blindly because we don't currently have a user for some kind of more
> complicated (from the user perspective) hint system. (just kind of
> repeating what you said)
> 
> This way we have some documentation of why we chose this route other
> than just because amd is doing that.
> 

Sure.

> > 
> > > > 
> > > > > buffer size submissions and mix of large and small buffer
> > > > > submissions
> > > > > with eviction between different processes?
> > > > > 
> > > > 
> > > > Eviction somewhat orthogonal to this. If an eviction occurs and
> > > > it is
> > > > a
> > > > new allocation a clear would still have be issued ahead handing
> > > > the
> > > > memory over to the new BO, if eviction occurs and paged in BO has
> > > > backing store (it was previously evicted) we'd just copy the
> > > > contents
> > > > into the allocated memory.
> > > 
> > > Good point. I still think we should consider the scenarios of
> > > having a
> > > large buffer workload running and then clearing, then we have a
> > > small
> > > buffer workload coming in which now needs to wait on the clear from
> > > that large workload. Whereas on clear-on-alloc for the small
> > > workload,
> > > we can get that submission in quickly and not wait for that larger
> > > one.
> > > 
> > 
> > The buddy allocator tries to allocate cleared VRAM first, then falls
> > back to dirty VRAM. Dirty VRAM will still be cleared upon allocation
> > in
> > this patch.
> > 
> > The senerio you describe could result in eviction actually. If a
> > large
> > BO has pending clear - it won't be in either buddy allocator pool -
> > it
> > will be TTM as ghost object (pending free). If the buddy allocator
> > fails
> > an allocation, eviction is triggered with the ghost objects in TTM
> > being
> > tried to allocate first (I think, if I'm misstaing TTM internals my
> > bad). This would as you suggest result in the BO allocation waiting
> > on
> > potentially a large clear to finish even though the allocation is
> > small
> > in size.
> 
> So we have had customers in the past who have wanted explicit disabling
> of eviction. I agree in the current implementation that might be the

Explicitly disabling eviction is not an upstream option, downstream in
i915 you can do whatever you want as there are no rules. We have
discussed a pin accounting controler vis cgroups or something upstream
which may be acceptable. Different discussion though.

> case, but in the future we might want such a capability, so there is
> still a chance of blocking. But again, maybe this can be a future
> enhancement as you said.
> 
> > 
> > > I don't think there is necessarily a one-size-fits-all solution
> > > here
> > > which is why I think the hints are important.
> > > 
> > 
> > Jumping to new uAPI + hints immediately might not be the best
> > approach,
> > but as I stated this is minor thing to add if needed.
> > 
> > > Thanks,
> > > Stuart
> > > 
> > > > 
> > > > Matt
> > > > 
> > > > > > 
> > > > > > Implemented via calling xe_migrate_clear in release notify
> > > > > > and
> > > > > > updating
> > > > > > iterator in xe_migrate_clear to skip cleared buddy blocks.
> > > > > > Only
> > > > > > user
> > > > > > BOs
> > > > > > cleared in release notify as kernel BOs could still be in use
> > > > > > (e.g.,
> > > > > > PT
> > > > > > BOs need to wait for dma-resv to be idle).
> > > > > > 
> > > > > > Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/xe/xe_bo.c           | 47
> > > > > > ++++++++++++++++++++++++++++
> > > > > >  drivers/gpu/drm/xe/xe_migrate.c      | 14 ++++++---
> > > > > >  drivers/gpu/drm/xe/xe_migrate.h      |  1 +
> > > > > >  drivers/gpu/drm/xe/xe_res_cursor.h   | 26 +++++++++++++++
> > > > > >  drivers/gpu/drm/xe/xe_ttm_vram_mgr.c |  5 ++-
> > > > > >  drivers/gpu/drm/xe/xe_ttm_vram_mgr.h |  6 ++++
> > > > > >  6 files changed, 94 insertions(+), 5 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > index 4e39188a021a..74470f4d418d 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > @@ -1434,6 +1434,51 @@ static bool
> > > > > > xe_ttm_bo_lock_in_destructor(struct ttm_buffer_object
> > > > > > *ttm_bo)
> > > > > >         return locked;
> > > > > >  }
> > > > > >  
> > > > > > +static void xe_ttm_bo_release_clear(struct ttm_buffer_object
> > > > > > *ttm_bo)
> > > > > > +{
> > > > > > +       struct xe_device *xe = ttm_to_xe_device(ttm_bo-
> > > > > > >bdev);
> > > > > > +       struct dma_fence *fence;
> > > > > > +       int err, idx;
> > > > > > +
> > > > > > +       xe_bo_assert_held(ttm_to_xe_bo(ttm_bo));
> > > > > > +
> > > > > > +       if (ttm_bo->type != ttm_bo_type_device)
> > > > > > +               return;
> > > > > > +
> > > > > > +       if (xe_device_wedged(xe))
> > > > > > +               return;
> > > > > > +
> > > > > > +       if (!ttm_bo->resource || !mem_type_is_vram(ttm_bo-
> > > > > > > resource-
> > > > > > > mem_type))
> > > > > > +               return;
> > > > > > +
> > > > > > +       if (!drm_dev_enter(&xe->drm, &idx))
> > > > > > +               return;
> > > > > > +
> > > > > > +       if (!xe_pm_runtime_get_if_active(xe))
> > > > > > +               goto unbind;
> > > > > > +
> > > > > > +       err = dma_resv_reserve_fences(&ttm_bo->base._resv,
> > > > > > 1);
> > > > > > +       if (err)
> > > > > > +               goto put_pm;
> > > > > > +
> > > > > > +       fence = xe_migrate_clear(mem_type_to_migrate(xe,
> > > > > > ttm_bo-
> > > > > > > resource->mem_type),
> > > > > > +                                ttm_to_xe_bo(ttm_bo),
> > > > > > ttm_bo-
> > > > > > > resource,
> > > > > > +                                XE_MIGRATE_CLEAR_FLAG_FULL |
> > > > > > +                                XE_MIGRATE_CLEAR_NON_DIRTY);
> > > > > > +       if (XE_WARN_ON(IS_ERR(fence)))
> > > > > > +               goto put_pm;
> > > > > > +
> > > > > > +       xe_ttm_vram_mgr_resource_set_cleared(ttm_bo-
> > > > > > >resource);
> > > > > > +       dma_resv_add_fence(&ttm_bo->base._resv, fence,
> > > > > > +                          DMA_RESV_USAGE_KERNEL);
> > > > > > +       dma_fence_put(fence);
> > > > > > +
> > > > > > +put_pm:
> > > > > > +       xe_pm_runtime_put(xe);
> > > > > > +unbind:
> > > > > > +       drm_dev_exit(idx);
> > > > > > +}
> > > > > > +
> > > > > >  static void xe_ttm_bo_release_notify(struct
> > > > > > ttm_buffer_object
> > > > > > *ttm_bo)
> > > > > >  {
> > > > > >         struct dma_resv_iter cursor;
> > > > > > @@ -1478,6 +1523,8 @@ static void
> > > > > > xe_ttm_bo_release_notify(struct
> > > > > > ttm_buffer_object *ttm_bo)
> > > > > >         }
> > > > > >         dma_fence_put(replacement);
> > > > > >  
> > > > > > +       xe_ttm_bo_release_clear(ttm_bo);
> > > > > > +
> > > > > >         dma_resv_unlock(ttm_bo->base.resv);
> > > > > >  }
> > > > > >  
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > b/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > index 8f8e9fdfb2a8..39d7200cb366 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > @@ -1063,7 +1063,7 @@ struct dma_fence
> > > > > > *xe_migrate_clear(struct
> > > > > > xe_migrate *m,
> > > > > >         struct xe_gt *gt = m->tile->primary_gt;
> > > > > >         struct xe_device *xe = gt_to_xe(gt);
> > > > > >         bool clear_only_system_ccs = false;
> > > > > > -       struct dma_fence *fence = NULL;
> > > > > > +       struct dma_fence *fence = dma_fence_get_stub();
> > > > > >         u64 size = bo->size;
> > > > > >         struct xe_res_cursor src_it;
> > > > > >         struct ttm_resource *src = dst;
> > > > > > @@ -1075,10 +1075,13 @@ struct dma_fence
> > > > > > *xe_migrate_clear(struct
> > > > > > xe_migrate *m,
> > > > > >         if (!clear_bo_data && clear_ccs && !IS_DGFX(xe))
> > > > > >                 clear_only_system_ccs = true;
> > > > > >  
> > > > > > -       if (!clear_vram)
> > > > > > +       if (!clear_vram) {
> > > > > >                 xe_res_first_sg(xe_bo_sg(bo), 0, bo->size,
> > > > > > &src_it);
> > > > > > -       else
> > > > > > +       } else {
> > > > > >                 xe_res_first(src, 0, bo->size, &src_it);
> > > > > > +               if (!(clear_flags &
> > > > > > XE_MIGRATE_CLEAR_NON_DIRTY))
> > > > > > +                       size -= xe_res_next_dirty(&src_it);
> > > > > > +       }
> > > > > >  
> > > > > >         while (size) {
> > > > > >                 u64 clear_L0_ofs;
> > > > > > @@ -1125,6 +1128,9 @@ struct dma_fence
> > > > > > *xe_migrate_clear(struct
> > > > > > xe_migrate *m,
> > > > > >                         emit_pte(m, bb, clear_L0_pt,
> > > > > > clear_vram,
> > > > > > clear_only_system_ccs,
> > > > > >                                  &src_it, clear_L0, dst);
> > > > > >  
> > > > > > +               if (clear_vram && !(clear_flags &
> > > > > > XE_MIGRATE_CLEAR_NON_DIRTY))
> > > > > > +                       size -= xe_res_next_dirty(&src_it);
> > > > > > +
> > > > > >                 bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
> > > > > >                 update_idx = bb->len;
> > > > > >  
> > > > > > @@ -1146,7 +1152,7 @@ struct dma_fence
> > > > > > *xe_migrate_clear(struct
> > > > > > xe_migrate *m,
> > > > > >                 }
> > > > > >  
> > > > > >                 xe_sched_job_add_migrate_flush(job,
> > > > > > flush_flags);
> > > > > > -               if (!fence) {
> > > > > > +               if (fence == dma_fence_get_stub()) {
> > > > > >                         /*
> > > > > >                          * There can't be anything userspace
> > > > > > related
> > > > > > at this
> > > > > >                          * point, so we just need to respect
> > > > > > any
> > > > > > potential move
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_migrate.h
> > > > > > b/drivers/gpu/drm/xe/xe_migrate.h
> > > > > > index fb9839c1bae0..58a7b747ef11 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_migrate.h
> > > > > > +++ b/drivers/gpu/drm/xe/xe_migrate.h
> > > > > > @@ -118,6 +118,7 @@ int xe_migrate_access_memory(struct
> > > > > > xe_migrate
> > > > > > *m, struct xe_bo *bo,
> > > > > >  
> > > > > >  #define XE_MIGRATE_CLEAR_FLAG_BO_DATA          BIT(0)
> > > > > >  #define XE_MIGRATE_CLEAR_FLAG_CCS_DATA         BIT(1)
> > > > > > +#define XE_MIGRATE_CLEAR_NON_DIRTY             BIT(2)
> > > > > >  #define
> > > > > > XE_MIGRATE_CLEAR_FLAG_FULL     (XE_MIGRATE_CLEAR_FLAG_BO_DATA
> > > > > > | \
> > > > > >                                         XE_MIGRATE_CLEAR_FLAG
> > > > > > _CCS
> > > > > > _DAT
> > > > > > A)
> > > > > >  struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_res_cursor.h
> > > > > > b/drivers/gpu/drm/xe/xe_res_cursor.h
> > > > > > index d1a403cfb628..630082e809ba 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_res_cursor.h
> > > > > > +++ b/drivers/gpu/drm/xe/xe_res_cursor.h
> > > > > > @@ -315,6 +315,32 @@ static inline void xe_res_next(struct
> > > > > > xe_res_cursor *cur, u64 size)
> > > > > >         }
> > > > > >  }
> > > > > >  
> > > > > > +/**
> > > > > > + * xe_res_next_dirty - advance the cursor to next dirty
> > > > > > buddy
> > > > > > block
> > > > > > + *
> > > > > > + * @cur: the cursor to advance
> > > > > > + *
> > > > > > + * Move the cursor until dirty buddy block is found.
> > > > > > + *
> > > > > > + * Return: Number of bytes cursor has been advanced
> > > > > > + */
> > > > > > +static inline u64 xe_res_next_dirty(struct xe_res_cursor
> > > > > > *cur)
> > > > > > +{
> > > > > > +       struct drm_buddy_block *block = cur->node;
> > > > > > +       u64 bytes = 0;
> > > > > > +
> > > > > > +       XE_WARN_ON(cur->mem_type != XE_PL_VRAM0 &&
> > > > > > +                  cur->mem_type != XE_PL_VRAM1);
> > > > > 
> > > > > What if we have more than just these two? Maybe check against
> > > > > the
> > > > > mask
> > > > > instead.
> > 
> > Sure. I think we do this test in a couple of other places in the
> > driver too
> > and clean this up in seperate patch first.
> 
> No problem, just wanted to call it out. We can do this later..
> 
> > 
> > > > > 
> > > > > > +
> > > > > > +       while (cur->remaining &&
> > > > > > drm_buddy_block_is_clear(block))
> > > > > > {
> > > > > > +               bytes += cur->size;
> > > > > > +               xe_res_next(cur, cur->size);
> > > > > > +               block = cur->node;
> > > > > > +       }
> > > > > > +
> > > > > > +       return bytes;
> > > > > > +}
> > > > > > +
> > > > > >  /**
> > > > > >   * xe_res_dma - return dma address of cursor at current
> > > > > > position
> > > > > >   *
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > > > b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > > > index 9e375a40aee9..120046941c1e 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > > > @@ -84,6 +84,9 @@ static int xe_ttm_vram_mgr_new(struct
> > > > > > ttm_resource_manager *man,
> > > > > >         if (place->fpfn || lpfn != man->size >> PAGE_SHIFT)
> > > > > >                 vres->flags |= DRM_BUDDY_RANGE_ALLOCATION;
> > > > > >  
> > > > > > +       if (tbo->type == ttm_bo_type_device)
> > > > > > +               vres->flags |= DRM_BUDDY_CLEAR_ALLOCATION;
> > > > > > +
> > > > > >         if (WARN_ON(!vres->base.size)) {
> > > > > >                 err = -EINVAL;
> > > > > >                 goto error_fini;
> > > > > > @@ -187,7 +190,7 @@ static void xe_ttm_vram_mgr_del(struct
> > > > > > ttm_resource_manager *man,
> > > > > >         struct drm_buddy *mm = &mgr->mm;
> > > > > >  
> > > > > >         mutex_lock(&mgr->lock);
> > > > > > -       drm_buddy_free_list(mm, &vres->blocks, 0);
> > > > > > +       drm_buddy_free_list(mm, &vres->blocks, vres->flags);
> > > > > >         mgr->visible_avail += vres->used_visible_size;
> > > > > >         mutex_unlock(&mgr->lock);
> > > > > >  
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > > > > b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > > > > index cc76050e376d..dfc0e6890b3c 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > > > > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > > > > @@ -36,6 +36,12 @@ to_xe_ttm_vram_mgr_resource(struct
> > > > > > ttm_resource
> > > > > > *res)
> > > > > >         return container_of(res, struct
> > > > > > xe_ttm_vram_mgr_resource,
> > > > > > base);
> > > > > >  }
> > > > > >  
> > > > > > +static inline void
> > > > > > +xe_ttm_vram_mgr_resource_set_cleared(struct ttm_resource
> > > > > > *res)
> > > > > > +{
> > > > > > +       to_xe_ttm_vram_mgr_resource(res)->flags |=
> > > > > > DRM_BUDDY_CLEARED;
> > > > > 
> > > > > I see the amd driver is doing the same thing here. Maybe we can
> > > > > pull
> > > > > this in based on the flags given at the resource initialization
> > > > > so
> > > > > we
> > > > > can potentially tweak this for different scenarios (when to
> > > > > free)
> > > > > and
> > > > > converge what we're doing here and what amd is doing into the
> > > > > common
> > > > > code.
> > > > > 
> > 
> > AMD has a flag, which is the same test in this series as 'type ==
> > ttm_bo_type_device'. AMD blindly sets this flag on user BO creation
> > but
> > as I stated above we could wire this to uAPI if needed.
> > 
> > Also I think there are some kernel BOs which clear-on-free is safe
> > too
> > (e.g., I think only PT BOs need contents to stick around after final
> > put
> > until dma-resv is clean of fences). Would have to fully to test for
> > definitive answer though.
> 
> In most cases the kernel operations are going to be lower priority than

It is the inverse of that - kernel operations are a always higher
priority as without them a user app can't function (think any page table
memory allocation or memory allocation for a LRC). This is one of
reasons we just pin kernel memory in Xe compared evictable kernel memory
in the i915 (also evicting kernel memory is a huge pain as we have idle
the GuC state which could use that memory).

> the user operations (with the exception of maybe page faults). We had
> looked at things in i915 like preempting kernel operations when a user
> op comes in (the user clear-on-alloc preempts the kernel clear-on-
> free/alloc). This probably falls in the same optimization category you

We definitely won't do that as this would require multiple migration
queues and preemption via the GuC is so expensive I'd doubt you'd ever
win. I guess we could have a convolved algorithm in the migration queue
to jump to higher priortiy jobs but would need some really strong data
indicating this is needed. Also with dma-resv basically the migration
queue needs to be idle before anything from the user can run.

Matt

> mentioned, although it doesn't necessarily need a user hint.
> 
> Thanks,
> Stuart
> 
> > 
> > Matt
> > 
> > > > > Thanks,
> > > > > Stuart
> > > > > 
> > > > > > +}
> > > > > > +
> > > > > >  static inline struct xe_ttm_vram_mgr *
> > > > > >  to_xe_ttm_vram_mgr(struct ttm_resource_manager *man)
> > > > > >  {
> > > > > 
> > > 
>