[PATCH] drm/xe: Implement clear VRAM on free

Summers, Stuart stuart.summers at intel.com
Wed Jun 11 23:12:25 UTC 2025


On Wed, 2025-06-11 at 14:23 -0700, Matthew Brost wrote:
> On Wed, Jun 11, 2025 at 01:12:13PM -0600, Summers, Stuart wrote:
> > On Wed, 2025-06-11 at 12:03 -0700, Matthew Brost wrote:
> > > On Wed, Jun 11, 2025 at 12:01:10PM -0700, Matthew Brost wrote:
> > > 
> > > One correction...
> > > 
> > > > On Wed, Jun 11, 2025 at 12:23:13PM -0600, Summers, Stuart
> > > > wrote:
> > > > > On Wed, 2025-06-11 at 11:20 -0700, Matthew Brost wrote:
> > > > > > On Wed, Jun 11, 2025 at 12:05:36PM -0600, Summers, Stuart
> > > > > > wrote:
> > > > > > > On Wed, 2025-06-11 at 10:57 -0700, Matthew Brost wrote:
> > > > > > > > On Wed, Jun 11, 2025 at 11:04:06AM -0600, Summers,
> > > > > > > > Stuart
> > > > > > > > wrote:
> > > > > > > > > On Wed, 2025-06-11 at 09:46 -0700, Matthew Brost
> > > > > > > > > wrote:
> > > > > > > > > > On Wed, Jun 11, 2025 at 10:26:44AM -0600, Summers,
> > > > > > > > > > Stuart
> > > > > > > > > > wrote:
> > > > > > > > > > > On Tue, 2025-06-10 at 22:42 -0700, Matthew Brost
> > > > > > > > > > > wrote:
> > > > > > > > > > > > Clearing on free should hide latency of BO
> > > > > > > > > > > > clears on new user BO allocations.
> > > > > > > > > > > 
> > > > > > > > > > > Do you have any test results showing the latency
> > > > > > > > > > > improvement here on high volume submission tests?
> > > > > > > > > > > Also how does this impact at very large
> > > > > > > > > > 
> > > > > > > > > > No performance data yet available. Something we
> > > > > > > > > > definitely should look into and would be curious
> > > > > > > > > > about the results. FWIW, AMDGPU implements clear on
> > > > > > > > > > free in a very similar way to this series.
> > > > > 
> > > > > So back to this one, what is the motivation for this change
> > > > > if we don't have data?
> > > > > 
> > > > 
> > > > It is the assumption that in typical cases (no memory pressure)
> > > > the buddy allocator pool can find clean VRAM upon allocation,
> > > > resulting in immediate use of the BO rather than issuing a
> > > > clear which could delay a bind IOCTL, an exec IOCTL, or even
> > > > mmaping the BO (dma-resv enforces this ordering btw).
> > > > 
> > > > Even if there is memory pressure, it is fairly easy to reason
> > > > that clear-on-free creates a pipeline in which the clear is
> > > > started earlier than it would be upon allocation, thus reducing
> > > > the overall delay of the BO usage after allocation.
> > > > 
> > > > Yes, there might be an odd corner case of a large clear-on-free
> > > > followed by a small allocation where it might be slightly
> > > > worse, but overall it seems to be an easy win - both AMD /
> > > > Intel have done work to make clear-on-free possible.
> > 
> > Ok so basically we think the baseline is going to be better this
> > way (I agree with what you have here) and any corner cases will be
> > addressed in the future as they come up (i.e. uAPI).
> > 
> > It does seem like even if we don't have a uAPI we could have a
> > solution that lets the small workload preempt the large workload
> > clear to do its own clear-on-alloc, allowing the small workload to
> > move on to the render/compute engine while the large clearing
> > workload does its clear using the blitter (just as an example).
> > It's an optimization though, I agree. And without an explicit user
> > (yet for Xe...) we might just kick that can down the road.
> > 
> 
> All of this depends on getting memory - there isn't a concept of
> large or small BOs anywhere in the DRM subsystem, or special clears
> attached to BOs - there are just BOs with fences in dma-resv, and
> once all the fences signal the memory is returned to the buddy
> allocator. Either memory is available or it isn't, and eviction is
> triggered.

Appreciate the explanation here and that makes sense to me.

> 
> What you suggest sounds like an i915ism that is wildly complex,
> likely not possible with a shared code base, and would never be
> accepted by our maintainers or anyone in the DRM subsystem.

I think I had mentioned above, but to be clear here, I agree with the
uAPI approach. We should let the user decide how they want to manage
memory for their specific workloads. I also agree this isn't something
we need to do here.

> 
> > > > 
> > > > > > > > > 
> > > > > > > > > I had a few other comments towards the end of the
> > > > > > > > > patch.
> > > > > > > > > Basically
> > > > > > > > > I
> > > > > > > > 
> > > > > > > > I should slow down and read... Replying to all comments
> > > > > > > > here.
> > > > > > > > 
> > > > > > > > > think it would be nice to be able to configure this.
> > > > > > > > > There was significant testing done here for Aurora in
> > > > > > > > > i915 and the performance benefits were potentially
> > > > > > > > > different based on the types of workloads being used.
> > > > > > > > > Having user hints for instance might allow us to take
> > > > > > > > > advantage of those workload specific use cases.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > We can wire this mode to the uAPI for BO creation if
> > > > > > > > needed - this patch does clear-on-free blindly for any
> > > > > > > > user BO. Ofc with uAPI we need a UMD user to upstream.
> > > > > > > > We may just want to start with this patch and do that
> > > > > > > > in a follow up if needed as it is a minor tweak from a
> > > > > > > > code PoV.
> > > > > > > 
> > > > > > > Actually it would be great if you can add exactly this to
> > > > > > > the commit message. Basically we are intentionally doing
> > > > > > > clear on free here blindly because we don't currently
> > > > > > > have a user for some kind of more complicated (from the
> > > > > > > user perspective) hint system. (just kind of repeating
> > > > > > > what you said)
> > > > > > > 
> > > > > > > This way we have some documentation of why we chose this
> > > > > > > route other than just because amd is doing that.
> > > > > > > 
> > > > > > 
> > > > > > Sure.
> > > > > > 
> > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > buffer size submissions and mix of large and small
> > > > > > > > > > > buffer submissions with eviction between different
> > > > > > > > > > > processes?
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Eviction is somewhat orthogonal to this. If an
> > > > > > > > > > eviction occurs and it is a new allocation, a clear
> > > > > > > > > > would still have to be issued ahead of handing the
> > > > > > > > > > memory over to the new BO; if eviction occurs and
> > > > > > > > > > the paged-in BO has a backing store (it was
> > > > > > > > > > previously evicted) we'd just copy the contents
> > > > > > > > > > into the allocated memory.
> > > > > > > > > 
> > > > > > > > > Good point. I still think we should consider the
> > > > > > > > > scenario of having a large buffer workload running
> > > > > > > > > and then clearing, then a small buffer workload
> > > > > > > > > coming in which now needs to wait on the clear from
> > > > > > > > > that large workload. Whereas with clear-on-alloc for
> > > > > > > > > the small workload, we can get that submission in
> > > > > > > > > quickly and not wait for that larger one.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > The buddy allocator tries to allocate cleared VRAM
> > > > > > > > first, then falls back to dirty VRAM. Dirty VRAM will
> > > > > > > > still be cleared upon allocation in this patch.
> > > > > > > > 
> > > > > > > > The scenario you describe could result in eviction
> > > > > > > > actually. If a large BO has a pending clear - it won't
> > > > > > > > be in either buddy allocator pool - it will be in TTM
> > > > > > > > as a ghost object (pending free). If the buddy
> > > > > > > > allocator fails an allocation, eviction is triggered
> > > > > > > > with the ghost objects in TTM being tried first (I
> > > > > > > > think, if I'm misstating TTM internals my bad). This
> > > > > > > > would, as you suggest, result in the BO allocation
> > > > > > > > waiting on potentially a large clear to finish even
> > > > > > > > though the allocation is small in size.
> > > > > > > 
> > > > > > > So we have had customers in the past who have wanted
> > > > > > > explicit disabling of eviction. I agree in the current
> > > > > > > implementation that might be the
> > > > > > 
> > > > > > Explicitly disabling eviction is not an upstream option;
> > > > > > downstream in i915 you can do whatever you want as there
> > > > > > are no rules. We have discussed a pin accounting controller
> > > > > > via cgroups or something upstream which may be acceptable.
> > > > > > Different discussion though.
> > > > > > 
> > > > > > > case, but in the future we might want such a capability,
> > > > > > > so there is still a chance of blocking. But again, maybe
> > > > > > > this can be a future enhancement as you said.
> > > > > > > 
> > > > > > > > 
> > > > > > > > > I don't think there is necessarily a
> > > > > > > > > one-size-fits-all solution here which is why I think
> > > > > > > > > the hints are important.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Jumping to new uAPI + hints immediately might not be
> > > > > > > > the best approach, but as I stated this is a minor
> > > > > > > > thing to add if needed.
> > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > Stuart
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Matt
> > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Implemented via calling xe_migrate_clear in
> > > > > > > > > > > > release notify and updating the iterator in
> > > > > > > > > > > > xe_migrate_clear to skip cleared buddy blocks.
> > > > > > > > > > > > Only user BOs are cleared in release notify, as
> > > > > > > > > > > > kernel BOs could still be in use (e.g., PT BOs
> > > > > > > > > > > > need to wait for dma-resv to be idle).
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > >  drivers/gpu/drm/xe/xe_bo.c           | 47 ++++++++++++++++++++++++++++
> > > > > > > > > > > >  drivers/gpu/drm/xe/xe_migrate.c      | 14 ++++++---
> > > > > > > > > > > >  drivers/gpu/drm/xe/xe_migrate.h      |  1 +
> > > > > > > > > > > >  drivers/gpu/drm/xe/xe_res_cursor.h   | 26 +++++++++++++++
> > > > > > > > > > > >  drivers/gpu/drm/xe/xe_ttm_vram_mgr.c |  5 ++-
> > > > > > > > > > > >  drivers/gpu/drm/xe/xe_ttm_vram_mgr.h |  6 ++++
> > > > > > > > > > > >  6 files changed, 94 insertions(+), 5 deletions(-)
> > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > > index 4e39188a021a..74470f4d418d 100644
> > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > > @@ -1434,6 +1434,51 @@ static bool xe_ttm_bo_lock_in_destructor(struct ttm_buffer_object *ttm_bo)
> > > > > > > > > > > >         return locked;
> > > > > > > > > > > >  }
> > > > > > > > > > > >  
> > > > > > > > > > > > +static void xe_ttm_bo_release_clear(struct ttm_buffer_object *ttm_bo)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +       struct xe_device *xe = ttm_to_xe_device(ttm_bo->bdev);
> > > > > > > > > > > > +       struct dma_fence *fence;
> > > > > > > > > > > > +       int err, idx;
> > > > > > > > > > > > +
> > > > > > > > > > > > +       xe_bo_assert_held(ttm_to_xe_bo(ttm_bo));
> > > > > > > > > > > > +
> > > > > > > > > > > > +       if (ttm_bo->type != ttm_bo_type_device)
> > > > > > > > > > > > +               return;
> > > > > > > > > > > > +
> > > > > > > > > > > > +       if (xe_device_wedged(xe))
> > > > > > > > > > > > +               return;
> > > > > > > > > > > > +
> > > > > > > > > > > > +       if (!ttm_bo->resource || !mem_type_is_vram(ttm_bo->resource->mem_type))
> > > > > > > > > > > > +               return;
> > > > > > > > > > > > +
> > > > > > > > > > > > +       if (!drm_dev_enter(&xe->drm, &idx))
> > > > > > > > > > > > +               return;
> > > > > > > > > > > > +
> > > > > > > > > > > > +       if (!xe_pm_runtime_get_if_active(xe))
> > > > > > > > > > > > +               goto unbind;
> > > > > > > > > > > > +
> > > > > > > > > > > > +       err = dma_resv_reserve_fences(&ttm_bo->base._resv, 1);
> > > > > > > > > > > > +       if (err)
> > > > > > > > > > > > +               goto put_pm;
> > > > > > > > > > > > +
> > > > > > > > > > > > +       fence = xe_migrate_clear(mem_type_to_migrate(xe, ttm_bo->resource->mem_type),
> > > > > > > > > > > > +                                ttm_to_xe_bo(ttm_bo), ttm_bo->resource,
> > > > > > > > > > > > +                                XE_MIGRATE_CLEAR_FLAG_FULL |
> > > > > > > > > > > > +                                XE_MIGRATE_CLEAR_NON_DIRTY);
> > > > > > > > > > > > +       if (XE_WARN_ON(IS_ERR(fence)))
> > > > > > > > > > > > +               goto put_pm;
> > > > > > > > > > > > +
> > > > > > > > > > > > +       xe_ttm_vram_mgr_resource_set_cleared(ttm_bo->resource);
> > > > > > > > > > > > +       dma_resv_add_fence(&ttm_bo->base._resv, fence,
> > > > > > > > > > > > +                          DMA_RESV_USAGE_KERNEL);
> > > > > > > > > > > > +       dma_fence_put(fence);
> > > > > > > > > > > > +
> > > > > > > > > > > > +put_pm:
> > > > > > > > > > > > +       xe_pm_runtime_put(xe);
> > > > > > > > > > > > +unbind:
> > > > > > > > > > > > +       drm_dev_exit(idx);
> > > > > > > > > > > > +}
> > > > > > > > > > > > +
> > > > > > > > > > > >  static void xe_ttm_bo_release_notify(struct ttm_buffer_object *ttm_bo)
> > > > > > > > > > > >  {
> > > > > > > > > > > >         struct dma_resv_iter cursor;
> > > > > > > > > > > > @@ -1478,6 +1523,8 @@ static void xe_ttm_bo_release_notify(struct ttm_buffer_object *ttm_bo)
> > > > > > > > > > > >         }
> > > > > > > > > > > >         dma_fence_put(replacement);
> > > > > > > > > > > >  
> > > > > > > > > > > > +       xe_ttm_bo_release_clear(ttm_bo);
> > > > > > > > > > > > +
> > > > > > > > > > > >         dma_resv_unlock(ttm_bo->base.resv);
> > > > > > > > > > > >  }
> > > > > > > > > > > >  
> > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > > > > > > > index 8f8e9fdfb2a8..39d7200cb366 100644
> > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > > > > > > > @@ -1063,7 +1063,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
> > > > > > > > > > > >         struct xe_gt *gt = m->tile->primary_gt;
> > > > > > > > > > > >         struct xe_device *xe = gt_to_xe(gt);
> > > > > > > > > > > >         bool clear_only_system_ccs = false;
> > > > > > > > > > > > -       struct dma_fence *fence = NULL;
> > > > > > > > > > > > +       struct dma_fence *fence = dma_fence_get_stub();
> > > > > > > > > > > >         u64 size = bo->size;
> > > > > > > > > > > >         struct xe_res_cursor src_it;
> > > > > > > > > > > >         struct ttm_resource *src = dst;
> > > > > > > > > > > > @@ -1075,10 +1075,13 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
> > > > > > > > > > > >         if (!clear_bo_data && clear_ccs && !IS_DGFX(xe))
> > > > > > > > > > > >                 clear_only_system_ccs = true;
> > > > > > > > > > > >  
> > > > > > > > > > > > -       if (!clear_vram)
> > > > > > > > > > > > +       if (!clear_vram) {
> > > > > > > > > > > >                 xe_res_first_sg(xe_bo_sg(bo), 0, bo->size, &src_it);
> > > > > > > > > > > > -       else
> > > > > > > > > > > > +       } else {
> > > > > > > > > > > >                 xe_res_first(src, 0, bo->size, &src_it);
> > > > > > > > > > > > +               if (!(clear_flags & XE_MIGRATE_CLEAR_NON_DIRTY))
> > > > > > > > > > > > +                       size -= xe_res_next_dirty(&src_it);
> > > > > > > > > > > > +       }
> > > > > > > > > > > >  
> > > > > > > > > > > >         while (size) {
> > > > > > > > > > > >                 u64 clear_L0_ofs;
> > > > > > > > > > > > @@ -1125,6 +1128,9 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
> > > > > > > > > > > >                         emit_pte(m, bb, clear_L0_pt, clear_vram, clear_only_system_ccs,
> > > > > > > > > > > >                                  &src_it, clear_L0, dst);
> > > > > > > > > > > >  
> > > > > > > > > > > > +               if (clear_vram && !(clear_flags & XE_MIGRATE_CLEAR_NON_DIRTY))
> > > > > > > > > > > > +                       size -= xe_res_next_dirty(&src_it);
> > > > > > > > > > > > +
> > > > > > > > > > > >                 bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
> > > > > > > > > > > >                 update_idx = bb->len;
> > > > > > > > > > > >  
> > > > > > > > > > > > @@ -1146,7 +1152,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
> > > > > > > > > > > >                 }
> > > > > > > > > > > >  
> > > > > > > > > > > >                 xe_sched_job_add_migrate_flush(job, flush_flags);
> > > > > > > > > > > > -               if (!fence) {
> > > > > > > > > > > > +               if (fence == dma_fence_get_stub()) {
> > > > > > > > > > > >                         /*
> > > > > > > > > > > >                          * There can't be anything userspace related at this
> > > > > > > > > > > >                          * point, so we just need to respect any potential move
> > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_migrate.h b/drivers/gpu/drm/xe/xe_migrate.h
> > > > > > > > > > > > index fb9839c1bae0..58a7b747ef11 100644
> > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_migrate.h
> > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_migrate.h
> > > > > > > > > > > > @@ -118,6 +118,7 @@ int xe_migrate_access_memory(struct xe_migrate *m, struct xe_bo *bo,
> > > > > > > > > > > >  
> > > > > > > > > > > >  #define XE_MIGRATE_CLEAR_FLAG_BO_DATA          BIT(0)
> > > > > > > > > > > >  #define XE_MIGRATE_CLEAR_FLAG_CCS_DATA         BIT(1)
> > > > > > > > > > > > +#define XE_MIGRATE_CLEAR_NON_DIRTY             BIT(2)
> > > > > > > > > > > >  #define XE_MIGRATE_CLEAR_FLAG_FULL     (XE_MIGRATE_CLEAR_FLAG_BO_DATA | \
> > > > > > > > > > > >                                         XE_MIGRATE_CLEAR_FLAG_CCS_DATA)
> > > > > > > > > > > >  struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
> > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_res_cursor.h b/drivers/gpu/drm/xe/xe_res_cursor.h
> > > > > > > > > > > > index d1a403cfb628..630082e809ba 100644
> > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_res_cursor.h
> > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_res_cursor.h
> > > > > > > > > > > > @@ -315,6 +315,32 @@ static inline void xe_res_next(struct xe_res_cursor *cur, u64 size)
> > > > > > > > > > > >         }
> > > > > > > > > > > >  }
> > > > > > > > > > > >  
> > > > > > > > > > > > +/**
> > > > > > > > > > > > + * xe_res_next_dirty - advance the cursor to next dirty buddy block
> > > > > > > > > > > > + *
> > > > > > > > > > > > + * @cur: the cursor to advance
> > > > > > > > > > > > + *
> > > > > > > > > > > > + * Move the cursor until dirty buddy block is found.
> > > > > > > > > > > > + *
> > > > > > > > > > > > + * Return: Number of bytes cursor has been advanced
> > > > > > > > > > > > + */
> > > > > > > > > > > > +static inline u64 xe_res_next_dirty(struct xe_res_cursor *cur)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +       struct drm_buddy_block *block = cur->node;
> > > > > > > > > > > > +       u64 bytes = 0;
> > > > > > > > > > > > +
> > > > > > > > > > > > +       XE_WARN_ON(cur->mem_type != XE_PL_VRAM0 &&
> > > > > > > > > > > > +                  cur->mem_type != XE_PL_VRAM1);
> > > > > > > > > > > 
> > > > > > > > > > > What if we have more than just these two? Maybe
> > > > > > > > > > > check against the mask instead.
> > > > > > > > 
> > > > > > > > Sure. I think we do this test in a couple of other
> > > > > > > > places in the driver too; we can clean this up in a
> > > > > > > > separate patch first.
> > > > > > > 
> > > > > > > No problem, just wanted to call it out. We can do this
> > > > > > > later..
> > > > > > > 
> > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > +
> > > > > > > > > > > > +       while (cur->remaining && drm_buddy_block_is_clear(block)) {
> > > > > > > > > > > > +               bytes += cur->size;
> > > > > > > > > > > > +               xe_res_next(cur, cur->size);
> > > > > > > > > > > > +               block = cur->node;
> > > > > > > > > > > > +       }
> > > > > > > > > > > > +
> > > > > > > > > > > > +       return bytes;
> > > > > > > > > > > > +}
> > > > > > > > > > > > +
> > > > > > > > > > > >  /**
> > > > > > > > > > > >   * xe_res_dma - return dma address of cursor at current position
> > > > > > > > > > > >   *
> > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > > > > > > > > > index 9e375a40aee9..120046941c1e 100644
> > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > > > > > > > > > @@ -84,6 +84,9 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager *man,
> > > > > > > > > > > >         if (place->fpfn || lpfn != man->size >> PAGE_SHIFT)
> > > > > > > > > > > >                 vres->flags |= DRM_BUDDY_RANGE_ALLOCATION;
> > > > > > > > > > > >  
> > > > > > > > > > > > +       if (tbo->type == ttm_bo_type_device)
> > > > > > > > > > > > +               vres->flags |= DRM_BUDDY_CLEAR_ALLOCATION;
> > > > > > > > > > > > +
> > > > > > > > > > > >         if (WARN_ON(!vres->base.size)) {
> > > > > > > > > > > >                 err = -EINVAL;
> > > > > > > > > > > >                 goto error_fini;
> > > > > > > > > > > > @@ -187,7 +190,7 @@ static void xe_ttm_vram_mgr_del(struct ttm_resource_manager *man,
> > > > > > > > > > > >         struct drm_buddy *mm = &mgr->mm;
> > > > > > > > > > > >  
> > > > > > > > > > > >         mutex_lock(&mgr->lock);
> > > > > > > > > > > > -       drm_buddy_free_list(mm, &vres->blocks, 0);
> > > > > > > > > > > > +       drm_buddy_free_list(mm, &vres->blocks, vres->flags);
> > > > > > > > > > > >         mgr->visible_avail += vres->used_visible_size;
> > > > > > > > > > > >         mutex_unlock(&mgr->lock);
> > > > > > > > > > > >  
> > > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > > > > > > > > > > index cc76050e376d..dfc0e6890b3c 100644
> > > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > > > > > > > > > > @@ -36,6 +36,12 @@ to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
> > > > > > > > > > > >         return container_of(res, struct xe_ttm_vram_mgr_resource, base);
> > > > > > > > > > > >  }
> > > > > > > > > > > >  
> > > > > > > > > > > > +static inline void
> > > > > > > > > > > > +xe_ttm_vram_mgr_resource_set_cleared(struct ttm_resource *res)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +       to_xe_ttm_vram_mgr_resource(res)->flags |= DRM_BUDDY_CLEARED;
> > > > > > > > > > > 
> > > > > > > > > > > I see the amd driver is doing the same thing here.
> > > > > > > > > > > Maybe we can pull this in based on the flags given
> > > > > > > > > > > at the resource initialization so we can
> > > > > > > > > > > potentially tweak this for different scenarios
> > > > > > > > > > > (when to free) and converge what we're doing here
> > > > > > > > > > > and what amd is doing into the common code.
> > > > > > > > > > > 
> > > > > > > > 
> > > > > > > > AMD has a flag, which is the same test in this series
> > > > > > > > as 'type == ttm_bo_type_device'. AMD blindly sets this
> > > > > > > > flag on user BO creation but as I stated above we could
> > > > > > > > wire this to uAPI if needed.
> > 
> > Sorry I didn't fully understand the response here. So I was saying
> > we
> 
> Ok, my bad, I think get what you are suggesting now.
> 
> > have a common routine on the drm side that takes a flag
> > (CLEAR_ON_ALLOC vs CLEAR_ON_FREE) and performs the operation based
> > on the flag. Then
> 
> 'performs the operation based on the flag'
> 
> Do you mean the clear - that is a device specific operation, hence in
> the TTM vfunc release notify.
> 
> > both users on this side would just CLEAR_ON_FREE. Or we can even
> > drop that flag and just assume everyone will always CLEAR_ON_FREE
> > to avoid the extra code of clear-on-alloc. Then if we do decide to
> > do some uAPI
> 
> 'avoid the extra code of clear-on-alloc' - The clear on alloc is
> skipped if buddy blocks are marked as clear. There is no guarantee
> they will be clear given we skip clear-on-free for kernel memory,
> drm buddy does not initialize blocks to clear, or for whatever reason
> the clear-on-free fails in the TTM release notify vfunc (e.g., a
> kmalloc fails there).
> 
> > in the future we add the extra flag. But it still lets us
> > consolidate the amd and intel routines.
> > 
> > 
> 
> You mean xe_ttm_vram_mgr_resource_set_cleared /
> amdgpu_vram_mgr_set_cleared?
> 
> Both of those operate on a driver specific structure which
> encapsulates a TTM resource. TTM allocates memory via a TTM resource
> manager abstraction /w driver vfuncs, making no assumptions about
> memory allocation - AMD and Intel happen to use the DRM buddy
> allocator for VRAM but TTM again has no idea how the memory is being
> allocated. So sticking a buddy allocation specific flag into TTM is a
> no go.

Ok this makes a lot of sense to me - basically we don't want to
pollute the TTM layer with buddy allocator semantics, so we push that
code down to the drivers. So even though there is code duplication
here between amd and xe, it still aligns with the common layer
implementation as it is today.
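For anyone skimming the thread, the driver-side flow being described
could be sketched roughly as follows. This is a simplified,
hypothetical model, not the actual xe/drm_buddy code: the struct and
function names here are invented stand-ins, and the real driver
operates on drm_buddy blocks via the TTM release-notify vfunc.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a drm_buddy block; the real allocator tracks a
 * per-block "cleared" state. */
struct block {
	bool cleared;
};

/* Sketch of the driver's release-notify path: for user BOs, clear
 * the backing VRAM and, only if the clear succeeded, mark the freed
 * blocks as clear. Kernel BOs and failed clears leave the blocks
 * dirty, which is why clear-on-alloc can never be removed entirely. */
static void release_notify(struct block *blocks, size_t n,
			   bool is_user_bo, bool clear_succeeded)
{
	size_t i;

	if (!is_user_bo || !clear_succeeded)
		return;		/* blocks stay dirty */
	for (i = 0; i < n; i++)
		blocks[i].cleared = true;
}

/* Sketch of the allocation path: clear-on-alloc is skipped only when
 * every buddy block backing the new resource is already clear. */
static bool needs_clear_on_alloc(const struct block *blocks, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!blocks[i].cleared)
			return true;
	return false;
}
```

The point of the sketch is that the "cleared" state lives with the
driver's (buddy) allocation, not with TTM, matching the discussion
above.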

I'm good with what you have here then. With the commit message updated
to describe the intention as discussed above:
Reviewed-by: Stuart Summers <stuart.summers at intel.com>

> 
> We could add a specific TTM flag to the TTM resource saying the
> resource is clear and convert it to a buddy flag upon block list
> free, but we don't have a flags field in the TTM resource, whereas
> on the driver side we do. Also, have you ever tried changing TTM?
> It is not a whole lot of fun, and I pretty much guarantee it will
> get nacked for one reason or another. That is not something I would
> personally even attempt to post because I know better.
> 
> > The type == ttm_bo_type_device seems like something we can extract
> > out
> > as device-specific.
> > 
> 
> What do you mean by extract? A driver-side helper? I thought about
> that but struggled a bit with naming, so I left it out. I could
> include that in the next rev I suppose.

My comment here was specifically in relation to moving functionality to
a common layer. If we don't do that, I don't know if it's worth
implementing a wrapper here for this patch specifically. No issue on my
side with what you have.
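For illustration only, the kind of driver-side wrapper being discussed
might look something like the sketch below. The helper name is
invented, and the simplified enum/struct stand in for TTM's real
definitions in ttm_bo.h.

```c
#include <stdbool.h>

/* Simplified stand-ins for TTM's BO type enum and buffer object;
 * real code would use the definitions from ttm_bo.h. */
enum ttm_bo_type {
	ttm_bo_type_device,	/* user-visible BO */
	ttm_bo_type_kernel,	/* kernel-internal BO */
	ttm_bo_type_sg
};

struct ttm_buffer_object {
	enum ttm_bo_type type;
};

/* Hypothetical helper wrapping the type check: only user-visible
 * device BOs are cleared on free; kernel and sg BOs are skipped. */
static bool bo_wants_clear_on_free(const struct ttm_buffer_object *bo)
{
	return bo->type == ttm_bo_type_device;
}
```

Such a wrapper would mostly buy a self-documenting name for the check,
which is why it only seems worth it if the logic moves to a common
layer.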

Thanks,
Stuart

> 
> Matt
> 
> > Thanks,
> > Stuart
> > 
> > > > > > > > 
> > > > > > > > Also I think there are some kernel BOs for which
> > > > > > > > clear-on-free is safe too (e.g., I think only PT BOs
> > > > > > > > need their contents to stick around after the final
> > > > > > > > put until the dma-resv is clean of fences). Would have
> > > > > > > > to fully test for a definitive answer though.
> > > > > > > 
> > > > > > > In most cases the kernel operations are going to be lower
> > > > > > > priority
> > > > > > > than
> > > > > > 
> > > > > > It is the inverse of that - kernel operations are always
> > > > > > higher priority, as without them a user app can't function
> > > > > > (think any page table memory allocation, or memory
> > > > > > allocation for an LRC). This is one of the reasons we just
> > > > > > pin kernel memory in Xe, compared to evictable kernel
> > > > > > memory in i915 (also, evicting kernel memory is a huge
> > > > > > pain, as we have to idle the GuC state which could use
> > > > > > that memory).
> > > > > 
> > > > > We're talking about kernel operations attached to different
> > > > > processes
> > > > > though. The incoming kernel work would be for a new user
> > > > > process
> > > > > (prepping the buffer for submission). The existing user
> > > > > process
> > > > > IMO
> > > > > should have higher priority.
> > > > > 
> > > > 
> > > > You still have to wait on the user operation to complete in
> > > > dma-resv mode before stealing memory.
> > > > 
> > > > In LR preempt fence mode, we'd wait until the LR VM is
> > > > preempted off the hardware.
> > > > 
> > > > In LR fault mode, we can steal the memory immediately.
> > > > 
> > > 
> > > s/immediately/after TLB invalidation completes
> > > 
> > > Matt
> > > 
> > > > There is no difference though whether we are trying to get
> > > > kernel memory or user memory - this is just how eviction
> > > > works.
> > > > 
> > > > Also, if we are in LR mode, we have no idea how long the user
> > > > processes which we are stealing memory from will take to
> > > > complete (e.g., we do not install fences in the dma-resv
> > > > attached to jobs, as the fences could run forever, breaking
> > > > the rules of fences, or in other words breaking reclaim).
> > > > 
> > > > This is where pinning of user memory could come into play
> > > > with some accounting controller, likely cgroups. LR mode could
> > > > pin some memory to ensure fast forward progress, or pin
> > > > buffers it really doesn't want to be moved around (e.g.,
> > > > buffers exported over a fast network interconnect).
> > > > 
> > > > Matt
> > > > 
> > > > > Thanks,
> > > > > Stuart
> > > > > 
> > > > > > 
> > > > > > > the user operations (with the exception of maybe page
> > > > > > > faults). We had looked at things in i915 like preempting
> > > > > > > kernel operations when a user op comes in (the user
> > > > > > > clear-on-alloc preempts the kernel clear-on-free/alloc).
> > > > > > > This probably falls in the same optimization category
> > > > > > > you
> > > > > > 
> > > > > > We definitely won't do that, as this would require
> > > > > > multiple migration queues, and preemption via the GuC is
> > > > > > so expensive I doubt you'd ever win. I guess we could have
> > > > > > a convoluted algorithm in the migration queue to jump to
> > > > > > higher priority jobs, but that would need some really
> > > > > > strong data indicating it is needed. Also, with dma-resv,
> > > > > > basically the migration queue needs to be idle before
> > > > > > anything from the user can run.
> > > > > > 
> > > > > > Matt
> > > > > > 
> > > > > > > mentioned, although it doesn't necessarily need a user
> > > > > > > hint.
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Stuart
> > > > > > > 
> > > > > > > > 
> > > > > > > > Matt
> > > > > > > > 
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Stuart
> > > > > > > > > > > 
> > > > > > > > > > > > +}
> > > > > > > > > > > > +
> > > > > > > > > > > >  static inline struct xe_ttm_vram_mgr *
> > > > > > > > > > > >  to_xe_ttm_vram_mgr(struct ttm_resource_manager
> > > > > > > > > > > > *man)
> > > > > > > > > > > >  {
> > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > 
> > > > > 
> > 