[PATCH] drm/xe: Implement clear VRAM on free

Wed Jun 11 21:23:44 UTC 2025

On Wed, Jun 11, 2025 at 01:12:13PM -0600, Summers, Stuart wrote:
> On Wed, 2025-06-11 at 12:03 -0700, Matthew Brost wrote:
> > On Wed, Jun 11, 2025 at 12:01:10PM -0700, Matthew Brost wrote:
> > 
> > One correction...
> > 
> > > On Wed, Jun 11, 2025 at 12:23:13PM -0600, Summers, Stuart wrote:
> > > > On Wed, 2025-06-11 at 11:20 -0700, Matthew Brost wrote:
> > > > > On Wed, Jun 11, 2025 at 12:05:36PM -0600, Summers, Stuart
> > > > > wrote:
> > > > > > On Wed, 2025-06-11 at 10:57 -0700, Matthew Brost wrote:
> > > > > > > On Wed, Jun 11, 2025 at 11:04:06AM -0600, Summers, Stuart
> > > > > > > wrote:
> > > > > > > > On Wed, 2025-06-11 at 09:46 -0700, Matthew Brost wrote:
> > > > > > > > > On Wed, Jun 11, 2025 at 10:26:44AM -0600, Summers,
> > > > > > > > > Stuart
> > > > > > > > > wrote:
> > > > > > > > > > On Tue, 2025-06-10 at 22:42 -0700, Matthew Brost
> > > > > > > > > > wrote:
> > > > > > > > > > > Clearing on free should hide latency of BO clears
> > > > > > > > > > > on new
> > > > > > > > > > > user
> > > > > > > > > > > BO
> > > > > > > > > > > allocations.
> > > > > > > > > > 
> > > > > > > > > > Do you have any test results showing the latency
> > > > > > > > > > improvement
> > > > > > > > > > here
> > > > > > > > > > on
> > > > > > > > > > high volume submission tests? Also how does this
> > > > > > > > > > impact at
> > > > > > > > > > very
> > > > > > > > > > large
> > > > > > > > > 
> > > > > > > > > No performance data yet available. Something we
> > > > > > > > > definitely
> > > > > > > > > should
> > > > > > > > > look
> > > > > > > > > into and would be curious about the results. FWIW,
> > > > > > > > > AMDGPU
> > > > > > > > > implements
> > > > > > > > > clear on free in a very similar way to this series. 
> > > > 
> > > > So back to this one, what is the motivation for this change if we
> > > > don't
> > > > have data?
> > > > 
> > > 
> > > It is the assumption that in typical cases (no memory pressure) the
> > > buddy allocator pool can find a clean VRAM upon allocation
> > > resulting in
> > > immediate use of the BO rather than issuing a clear which could
> > > delay a
> > > bind IOCTL, an exec IOCTL, or even mmaping the BO (dma-resv
> > > enforces
> > > this ordering btw).
> > > 
> > > Even if there is memory pressure, it is fairly easy to reason that
> > > clear-on-free creates a pipeline in which the clear is started
> > > earlier
> > > than it would be upon allocation thus reducing the overall delay of
> > > the
> > > BO usage after allocation.
> > > 
> > > Yes, there might be odd corner of large clear-on-free followed by
> > > small
> > > allocation where it might be slightly worse, but overall it seems
> > > to be
> > > an easy win - both AMD / Intel have done work to make clear-on-free
> > > possible.
> 
> Ok so basically we think the baseline is going to be better this way (I
> agree with what you have here) and any corner case will be addressed in
> the future as they come up (i.e. uAPI).
> 
> It does seem like even if we don't have a uAPI we could have a solution
> that lets the small workload preempt the large workload clear to do its
> own clear-on-alloc, allowing the small workload to move on to the
> render/compute engine while the large clearing workload does its clear
> using blitter (just as an example). It's an optimization though I
> agree. And without an explicit user (yet for Xe...) we might just kick
> that can down the road.
> 

All of this depends on getting memory - there isn't a concept of large
or small BOs anywhere in the DRM subsystem, or of clears specially
attached to BOs - it is just BOs with fences in dma-resv, and once all
the fences signal, the memory is returned to the buddy allocator. Either
memory is available, or it isn't and eviction is triggered.
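
To make that flow concrete, here is a rough toy model in plain C. It is
only a sketch: every name in it is made up for illustration (nothing
here is Xe, TTM, or drm_buddy API), and it ignores fragmentation and the
case where one allocation mixes clean and dirty blocks. It only shows
the order in which an allocation is satisfied: cleared memory first,
then dirty memory (which still needs a clear on allocation), otherwise
eviction.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: these names are hypothetical, not Xe/TTM/drm_buddy API. */
enum toy_alloc_result {
	TOY_ALLOC_CLEAN,	/* served from cleared VRAM, usable immediately */
	TOY_ALLOC_DIRTY,	/* served from dirty VRAM, clear-on-alloc needed */
	TOY_ALLOC_EVICT,	/* nothing free, eviction has to be triggered */
};

struct toy_vram {
	uint64_t clean_bytes;	/* free memory already cleared on free */
	uint64_t dirty_bytes;	/* free memory that still holds old contents */
};

/* Prefer cleared memory, fall back to dirty memory, otherwise evict. */
static enum toy_alloc_result toy_alloc(struct toy_vram *v, uint64_t size)
{
	if (v->clean_bytes >= size) {
		v->clean_bytes -= size;
		return TOY_ALLOC_CLEAN;
	}
	if (v->dirty_bytes >= size) {
		v->dirty_bytes -= size;
		return TOY_ALLOC_DIRTY;
	}
	return TOY_ALLOC_EVICT;
}

/* Called once all fences on a freed BO have signaled. */
static void toy_free(struct toy_vram *v, uint64_t size, int cleared)
{
	if (cleared)
		v->clean_bytes += size;
	else
		v->dirty_bytes += size;
}
```

Note there is no size class or priority anywhere in this model, which is
the point: the only question is whether memory is available.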

What you suggest sounds like an i915ism that is wildly complex, likely
not possible with a shared code base, and would never be accepted by our
maintainers or anyone in the DRM subsystem.

> > > 
> > > > > > > > 
> > > > > > > > I had a few other comments towards the end of the patch.
> > > > > > > > Basically
> > > > > > > > I
> > > > > > > 
> > > > > > > > I should slow down and read... Replying to all comments
> > > > > > > here.
> > > > > > > 
> > > > > > > > think it would be nice to be able to configure this.
> > > > > > > > There was
> > > > > > > > significant testing done here for Aurora in i915 and the
> > > > > > > > performance
> > > > > > > > benefits were potentially different based on the types of
> > > > > > > > workloads
> > > > > > > > being used. Having user hints for instance might allow us
> > > > > > > > to
> > > > > > > > take
> > > > > > > > advantage of those workload specific use cases.
> > > > > > > > 
> > > > > > > 
> > > > > > > We can wire this mode to the uAPI for BO creation if needed
> > > > > > > -
> > > > > > > this
> > > > > > > patch
> > > > > > > does clear-on-free blindly for any user BO. Ofc with uAPI
> > > > > > > we need
> > > > > > > a
> > > > > > > UMD
> > > > > > > user to upstream. We may just want to start with this patch
> > > > > > > and
> > > > > > > do
> > > > > > > that
> > > > > > > in a follow up if needed as it a minor tweak from a code
> > > > > > > PoV.
> > > > > > 
> > > > > > Actually it would be great if you can add exactly this to the
> > > > > > commit
> > > > > > message. Basically we are intentionally doing clear on free
> > > > > > here
> > > > > > blindly because we don't currently have a user for some kind
> > > > > > of
> > > > > > more
> > > > > > complicated (from the user perspective) hint system. (just
> > > > > > kind of
> > > > > > repeating what you said)
> > > > > > 
> > > > > > This way we have some documentation of why we chose this
> > > > > > route
> > > > > > other
> > > > > > than just because amd is doing that.
> > > > > > 
> > > > > 
> > > > > Sure.
> > > > > 
> > > > > > > 
> > > > > > > > > 
> > > > > > > > > > buffer size submissions and mix of large and small
> > > > > > > > > > buffer
> > > > > > > > > > submissions
> > > > > > > > > > with eviction between different processes?
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Eviction somewhat orthogonal to this. If an eviction
> > > > > > > > > occurs
> > > > > > > > > and
> > > > > > > > > it is
> > > > > > > > > a
> > > > > > > > > new allocation a clear would still have be issued ahead
> > > > > > > > > handing
> > > > > > > > > the
> > > > > > > > > memory over to the new BO, if eviction occurs and paged
> > > > > > > > > in BO
> > > > > > > > > has
> > > > > > > > > backing store (it was previously evicted) we'd just
> > > > > > > > > copy the
> > > > > > > > > contents
> > > > > > > > > into the allocated memory.
> > > > > > > > 
> > > > > > > > Good point. I still think we should consider the
> > > > > > > > scenarios of
> > > > > > > > having a
> > > > > > > > large buffer workload running and then clearing, then we
> > > > > > > > have a
> > > > > > > > small
> > > > > > > > buffer workload coming in which now needs to wait on the
> > > > > > > > clear
> > > > > > > > from
> > > > > > > > that large workload. Whereas on clear-on-alloc for the
> > > > > > > > small
> > > > > > > > workload,
> > > > > > > > we can get that submission in quickly and not wait for
> > > > > > > > that
> > > > > > > > larger
> > > > > > > > one.
> > > > > > > > 
> > > > > > > 
> > > > > > > The buddy allocator tries to allocate cleared VRAM first,
> > > > > > > then
> > > > > > > falls
> > > > > > > back to dirty VRAM. Dirty VRAM will still be cleared upon
> > > > > > > allocation
> > > > > > > in
> > > > > > > this patch.
> > > > > > > 
> > > > > > > > The scenario you describe could result in eviction actually.
> > > > > > > If a
> > > > > > > large
> > > > > > > BO has pending clear - it won't be in either buddy
> > > > > > > allocator pool
> > > > > > > -
> > > > > > > it
> > > > > > > will be TTM as ghost object (pending free). If the buddy
> > > > > > > allocator
> > > > > > > fails
> > > > > > > an allocation, eviction is triggered with the ghost objects
> > > > > > > in
> > > > > > > TTM
> > > > > > > being
> > > > > > > > tried to allocate first (I think, if I'm misstating TTM
> > > > > > > internals
> > > > > > > my
> > > > > > > bad). This would as you suggest result in the BO allocation
> > > > > > > waiting
> > > > > > > on
> > > > > > > potentially a large clear to finish even though the
> > > > > > > allocation is
> > > > > > > small
> > > > > > > in size.
> > > > > > 
> > > > > > So we have had customers in the past who have wanted explicit
> > > > > > disabling
> > > > > > of eviction. I agree in the current implementation that might
> > > > > > be
> > > > > > the
> > > > > 
> > > > > Explicitly disabling eviction is not an upstream option,
> > > > > downstream
> > > > > in
> > > > > i915 you can do whatever you want as there are no rules. We
> > > > > have
> > > > > > discussed a pin accounting controller via cgroups or something
> > > > > upstream
> > > > > which may be acceptable. Different discussion though.
> > > > > 
> > > > > > case, but in the future we might want such a capability, so
> > > > > > there
> > > > > > is
> > > > > > still a chance of blocking. But again, maybe this can be a
> > > > > > future
> > > > > > enhancement as you said.
> > > > > > 
> > > > > > > 
> > > > > > > > I don't think there is necessarily a one-size-fits-all
> > > > > > > > solution
> > > > > > > > here
> > > > > > > > which is why I think the hints are important.
> > > > > > > > 
> > > > > > > 
> > > > > > > Jumping to new uAPI + hints immediately might not be the
> > > > > > > best
> > > > > > > approach,
> > > > > > > but as I stated this is minor thing to add if needed.
> > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > Stuart
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Matt
> > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Implemented via calling xe_migrate_clear in release
> > > > > > > > > > > notify
> > > > > > > > > > > and
> > > > > > > > > > > updating
> > > > > > > > > > > iterator in xe_migrate_clear to skip cleared buddy
> > > > > > > > > > > blocks.
> > > > > > > > > > > Only
> > > > > > > > > > > user
> > > > > > > > > > > BOs
> > > > > > > > > > > cleared in release notify as kernel BOs could still
> > > > > > > > > > > be in
> > > > > > > > > > > use
> > > > > > > > > > > (e.g.,
> > > > > > > > > > > PT
> > > > > > > > > > > BOs need to wait for dma-resv to be idle).
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: Matthew Brost
> > > > > > > > > > > <matthew.brost at intel.com>
> > > > > > > > > > > ---
> > > > > > > > > > >  drivers/gpu/drm/xe/xe_bo.c           | 47
> > > > > > > > > > > ++++++++++++++++++++++++++++
> > > > > > > > > > >  drivers/gpu/drm/xe/xe_migrate.c      | 14 ++++++--
> > > > > > > > > > > -
> > > > > > > > > > >  drivers/gpu/drm/xe/xe_migrate.h      |  1 +
> > > > > > > > > > >  drivers/gpu/drm/xe/xe_res_cursor.h   | 26
> > > > > > > > > > > +++++++++++++++
> > > > > > > > > > >  drivers/gpu/drm/xe/xe_ttm_vram_mgr.c |  5 ++-
> > > > > > > > > > >  drivers/gpu/drm/xe/xe_ttm_vram_mgr.h |  6 ++++
> > > > > > > > > > >  6 files changed, 94 insertions(+), 5 deletions(-)
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > index 4e39188a021a..74470f4d418d 100644
> > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > > > @@ -1434,6 +1434,51 @@ static bool
> > > > > > > > > > > xe_ttm_bo_lock_in_destructor(struct
> > > > > > > > > > > ttm_buffer_object
> > > > > > > > > > > *ttm_bo)
> > > > > > > > > > >         return locked;
> > > > > > > > > > >  }
> > > > > > > > > > >  
> > > > > > > > > > > +static void xe_ttm_bo_release_clear(struct
> > > > > > > > > > > ttm_buffer_object
> > > > > > > > > > > *ttm_bo)
> > > > > > > > > > > +{
> > > > > > > > > > > +       struct xe_device *xe =
> > > > > > > > > > > ttm_to_xe_device(ttm_bo-
> > > > > > > > > > > > bdev);
> > > > > > > > > > > +       struct dma_fence *fence;
> > > > > > > > > > > +       int err, idx;
> > > > > > > > > > > +
> > > > > > > > > > > +       xe_bo_assert_held(ttm_to_xe_bo(ttm_bo));
> > > > > > > > > > > +
> > > > > > > > > > > +       if (ttm_bo->type != ttm_bo_type_device)
> > > > > > > > > > > +               return;
> > > > > > > > > > > +
> > > > > > > > > > > +       if (xe_device_wedged(xe))
> > > > > > > > > > > +               return;
> > > > > > > > > > > +
> > > > > > > > > > > +       if (!ttm_bo->resource ||
> > > > > > > > > > > !mem_type_is_vram(ttm_bo-
> > > > > > > > > > > > resource-
> > > > > > > > > > > > mem_type))
> > > > > > > > > > > +               return;
> > > > > > > > > > > +
> > > > > > > > > > > +       if (!drm_dev_enter(&xe->drm, &idx))
> > > > > > > > > > > +               return;
> > > > > > > > > > > +
> > > > > > > > > > > +       if (!xe_pm_runtime_get_if_active(xe))
> > > > > > > > > > > +               goto unbind;
> > > > > > > > > > > +
> > > > > > > > > > > +       err = dma_resv_reserve_fences(&ttm_bo-
> > > > > > > > > > > > base._resv,
> > > > > > > > > > > 1);
> > > > > > > > > > > +       if (err)
> > > > > > > > > > > +               goto put_pm;
> > > > > > > > > > > +
> > > > > > > > > > > +       fence =
> > > > > > > > > > > xe_migrate_clear(mem_type_to_migrate(xe,
> > > > > > > > > > > ttm_bo-
> > > > > > > > > > > > resource->mem_type),
> > > > > > > > > > > +                               
> > > > > > > > > > > ttm_to_xe_bo(ttm_bo),
> > > > > > > > > > > ttm_bo-
> > > > > > > > > > > > resource,
> > > > > > > > > > > +                               
> > > > > > > > > > > XE_MIGRATE_CLEAR_FLAG_FULL |
> > > > > > > > > > > +                               
> > > > > > > > > > > XE_MIGRATE_CLEAR_NON_DIRTY);
> > > > > > > > > > > +       if (XE_WARN_ON(IS_ERR(fence)))
> > > > > > > > > > > +               goto put_pm;
> > > > > > > > > > > +
> > > > > > > > > > > +       xe_ttm_vram_mgr_resource_set_cleared(ttm_bo
> > > > > > > > > > > -
> > > > > > > > > > > > resource);
> > > > > > > > > > > +       dma_resv_add_fence(&ttm_bo->base._resv,
> > > > > > > > > > > fence,
> > > > > > > > > > > +                          DMA_RESV_USAGE_KERNEL);
> > > > > > > > > > > +       dma_fence_put(fence);
> > > > > > > > > > > +
> > > > > > > > > > > +put_pm:
> > > > > > > > > > > +       xe_pm_runtime_put(xe);
> > > > > > > > > > > +unbind:
> > > > > > > > > > > +       drm_dev_exit(idx);
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > >  static void xe_ttm_bo_release_notify(struct
> > > > > > > > > > > ttm_buffer_object
> > > > > > > > > > > *ttm_bo)
> > > > > > > > > > >  {
> > > > > > > > > > >         struct dma_resv_iter cursor;
> > > > > > > > > > > @@ -1478,6 +1523,8 @@ static void
> > > > > > > > > > > xe_ttm_bo_release_notify(struct
> > > > > > > > > > > ttm_buffer_object *ttm_bo)
> > > > > > > > > > >         }
> > > > > > > > > > >         dma_fence_put(replacement);
> > > > > > > > > > >  
> > > > > > > > > > > +       xe_ttm_bo_release_clear(ttm_bo);
> > > > > > > > > > > +
> > > > > > > > > > >         dma_resv_unlock(ttm_bo->base.resv);
> > > > > > > > > > >  }
> > > > > > > > > > >  
> > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > > > > > > b/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > > > > > > index 8f8e9fdfb2a8..39d7200cb366 100644
> > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > > > > > > > > > > @@ -1063,7 +1063,7 @@ struct dma_fence
> > > > > > > > > > > *xe_migrate_clear(struct
> > > > > > > > > > > xe_migrate *m,
> > > > > > > > > > >         struct xe_gt *gt = m->tile->primary_gt;
> > > > > > > > > > >         struct xe_device *xe = gt_to_xe(gt);
> > > > > > > > > > >         bool clear_only_system_ccs = false;
> > > > > > > > > > > -       struct dma_fence *fence = NULL;
> > > > > > > > > > > +       struct dma_fence *fence =
> > > > > > > > > > > dma_fence_get_stub();
> > > > > > > > > > >         u64 size = bo->size;
> > > > > > > > > > >         struct xe_res_cursor src_it;
> > > > > > > > > > >         struct ttm_resource *src = dst;
> > > > > > > > > > > @@ -1075,10 +1075,13 @@ struct dma_fence
> > > > > > > > > > > *xe_migrate_clear(struct
> > > > > > > > > > > xe_migrate *m,
> > > > > > > > > > >         if (!clear_bo_data && clear_ccs &&
> > > > > > > > > > > !IS_DGFX(xe))
> > > > > > > > > > >                 clear_only_system_ccs = true;
> > > > > > > > > > >  
> > > > > > > > > > > -       if (!clear_vram)
> > > > > > > > > > > +       if (!clear_vram) {
> > > > > > > > > > >                 xe_res_first_sg(xe_bo_sg(bo), 0,
> > > > > > > > > > > bo-
> > > > > > > > > > > > size,
> > > > > > > > > > > &src_it);
> > > > > > > > > > > -       else
> > > > > > > > > > > +       } else {
> > > > > > > > > > >                 xe_res_first(src, 0, bo->size,
> > > > > > > > > > > &src_it);
> > > > > > > > > > > +               if (!(clear_flags &
> > > > > > > > > > > XE_MIGRATE_CLEAR_NON_DIRTY))
> > > > > > > > > > > +                       size -=
> > > > > > > > > > > xe_res_next_dirty(&src_it);
> > > > > > > > > > > +       }
> > > > > > > > > > >  
> > > > > > > > > > >         while (size) {
> > > > > > > > > > >                 u64 clear_L0_ofs;
> > > > > > > > > > > @@ -1125,6 +1128,9 @@ struct dma_fence
> > > > > > > > > > > *xe_migrate_clear(struct
> > > > > > > > > > > xe_migrate *m,
> > > > > > > > > > >                         emit_pte(m, bb,
> > > > > > > > > > > clear_L0_pt,
> > > > > > > > > > > clear_vram,
> > > > > > > > > > > clear_only_system_ccs,
> > > > > > > > > > >                                  &src_it, clear_L0,
> > > > > > > > > > > dst);
> > > > > > > > > > >  
> > > > > > > > > > > +               if (clear_vram && !(clear_flags &
> > > > > > > > > > > XE_MIGRATE_CLEAR_NON_DIRTY))
> > > > > > > > > > > +                       size -=
> > > > > > > > > > > xe_res_next_dirty(&src_it);
> > > > > > > > > > > +
> > > > > > > > > > >                 bb->cs[bb->len++] =
> > > > > > > > > > > MI_BATCH_BUFFER_END;
> > > > > > > > > > >                 update_idx = bb->len;
> > > > > > > > > > >  
> > > > > > > > > > > @@ -1146,7 +1152,7 @@ struct dma_fence
> > > > > > > > > > > *xe_migrate_clear(struct
> > > > > > > > > > > xe_migrate *m,
> > > > > > > > > > >                 }
> > > > > > > > > > >  
> > > > > > > > > > >                 xe_sched_job_add_migrate_flush(job,
> > > > > > > > > > > flush_flags);
> > > > > > > > > > > -               if (!fence) {
> > > > > > > > > > > +               if (fence == dma_fence_get_stub())
> > > > > > > > > > > {
> > > > > > > > > > >                         /*
> > > > > > > > > > >                          * There can't be anything
> > > > > > > > > > > userspace
> > > > > > > > > > > related
> > > > > > > > > > > at this
> > > > > > > > > > >                          * point, so we just need
> > > > > > > > > > > to
> > > > > > > > > > > respect
> > > > > > > > > > > any
> > > > > > > > > > > potential move
> > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_migrate.h
> > > > > > > > > > > b/drivers/gpu/drm/xe/xe_migrate.h
> > > > > > > > > > > index fb9839c1bae0..58a7b747ef11 100644
> > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_migrate.h
> > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_migrate.h
> > > > > > > > > > > @@ -118,6 +118,7 @@ int
> > > > > > > > > > > xe_migrate_access_memory(struct
> > > > > > > > > > > xe_migrate
> > > > > > > > > > > *m, struct xe_bo *bo,
> > > > > > > > > > >  
> > > > > > > > > > >  #define
> > > > > > > > > > > XE_MIGRATE_CLEAR_FLAG_BO_DATA          BIT(0)
> > > > > > > > > > >  #define
> > > > > > > > > > > XE_MIGRATE_CLEAR_FLAG_CCS_DATA         BIT(1)
> > > > > > > > > > > +#define
> > > > > > > > > > > XE_MIGRATE_CLEAR_NON_DIRTY             BIT(2)
> > > > > > > > > > >  #define
> > > > > > > > > > > XE_MIGRATE_CLEAR_FLAG_FULL     (XE_MIGRATE_CLEAR_FL
> > > > > > > > > > > AG_BO_
> > > > > > > > > > > DATA
> > > > > > > > > > > > \
> > > > > > > > > > >                                         XE_MIGRATE_
> > > > > > > > > > > CLEAR_
> > > > > > > > > > > FLAG
> > > > > > > > > > > _CCS
> > > > > > > > > > > _DAT
> > > > > > > > > > > A)
> > > > > > > > > > >  struct dma_fence *xe_migrate_clear(struct
> > > > > > > > > > > xe_migrate *m,
> > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_res_cursor.h
> > > > > > > > > > > b/drivers/gpu/drm/xe/xe_res_cursor.h
> > > > > > > > > > > index d1a403cfb628..630082e809ba 100644
> > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_res_cursor.h
> > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_res_cursor.h
> > > > > > > > > > > @@ -315,6 +315,32 @@ static inline void
> > > > > > > > > > > xe_res_next(struct
> > > > > > > > > > > xe_res_cursor *cur, u64 size)
> > > > > > > > > > >         }
> > > > > > > > > > >  }
> > > > > > > > > > >  
> > > > > > > > > > > +/**
> > > > > > > > > > > + * xe_res_next_dirty - advance the cursor to next
> > > > > > > > > > > dirty
> > > > > > > > > > > buddy
> > > > > > > > > > > block
> > > > > > > > > > > + *
> > > > > > > > > > > + * @cur: the cursor to advance
> > > > > > > > > > > + *
> > > > > > > > > > > + * Move the cursor until dirty buddy block is
> > > > > > > > > > > found.
> > > > > > > > > > > + *
> > > > > > > > > > > + * Return: Number of bytes cursor has been
> > > > > > > > > > > advanced
> > > > > > > > > > > + */
> > > > > > > > > > > +static inline u64 xe_res_next_dirty(struct
> > > > > > > > > > > xe_res_cursor
> > > > > > > > > > > *cur)
> > > > > > > > > > > +{
> > > > > > > > > > > +       struct drm_buddy_block *block = cur->node;
> > > > > > > > > > > +       u64 bytes = 0;
> > > > > > > > > > > +
> > > > > > > > > > > +       XE_WARN_ON(cur->mem_type != XE_PL_VRAM0 &&
> > > > > > > > > > > +                  cur->mem_type != XE_PL_VRAM1);
> > > > > > > > > > 
> > > > > > > > > > What if we have more than just these two? Maybe check
> > > > > > > > > > against
> > > > > > > > > > the
> > > > > > > > > > mask
> > > > > > > > > > instead.
> > > > > > > 
> > > > > > > Sure. I think we do this test in a couple of other places
> > > > > > > in the
> > > > > > > driver too
> > > > > > > > and clean this up in a separate patch first.
> > > > > > 
> > > > > > No problem, just wanted to call it out. We can do this
> > > > > > later..
> > > > > > 
> > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > +
> > > > > > > > > > > +       while (cur->remaining &&
> > > > > > > > > > > drm_buddy_block_is_clear(block))
> > > > > > > > > > > {
> > > > > > > > > > > +               bytes += cur->size;
> > > > > > > > > > > +               xe_res_next(cur, cur->size);
> > > > > > > > > > > +               block = cur->node;
> > > > > > > > > > > +       }
> > > > > > > > > > > +
> > > > > > > > > > > +       return bytes;
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > >  /**
> > > > > > > > > > >   * xe_res_dma - return dma address of cursor at
> > > > > > > > > > > current
> > > > > > > > > > > position
> > > > > > > > > > >   *
> > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > > > > > > > > b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > > > > > > > > index 9e375a40aee9..120046941c1e 100644
> > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > > > > > > > > @@ -84,6 +84,9 @@ static int
> > > > > > > > > > > xe_ttm_vram_mgr_new(struct
> > > > > > > > > > > ttm_resource_manager *man,
> > > > > > > > > > >         if (place->fpfn || lpfn != man->size >>
> > > > > > > > > > > PAGE_SHIFT)
> > > > > > > > > > >                 vres->flags |=
> > > > > > > > > > > DRM_BUDDY_RANGE_ALLOCATION;
> > > > > > > > > > >  
> > > > > > > > > > > +       if (tbo->type == ttm_bo_type_device)
> > > > > > > > > > > +               vres->flags |=
> > > > > > > > > > > DRM_BUDDY_CLEAR_ALLOCATION;
> > > > > > > > > > > +
> > > > > > > > > > >         if (WARN_ON(!vres->base.size)) {
> > > > > > > > > > >                 err = -EINVAL;
> > > > > > > > > > >                 goto error_fini;
> > > > > > > > > > > @@ -187,7 +190,7 @@ static void
> > > > > > > > > > > xe_ttm_vram_mgr_del(struct
> > > > > > > > > > > ttm_resource_manager *man,
> > > > > > > > > > >         struct drm_buddy *mm = &mgr->mm;
> > > > > > > > > > >  
> > > > > > > > > > >         mutex_lock(&mgr->lock);
> > > > > > > > > > > -       drm_buddy_free_list(mm, &vres->blocks, 0);
> > > > > > > > > > > +       drm_buddy_free_list(mm, &vres->blocks,
> > > > > > > > > > > vres-
> > > > > > > > > > > > flags);
> > > > > > > > > > >         mgr->visible_avail += vres-
> > > > > > > > > > > >used_visible_size;
> > > > > > > > > > >         mutex_unlock(&mgr->lock);
> > > > > > > > > > >  
> > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > > > > > > > > > b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > > > > > > > > > index cc76050e376d..dfc0e6890b3c 100644
> > > > > > > > > > > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > > > > > > > > > @@ -36,6 +36,12 @@
> > > > > > > > > > > to_xe_ttm_vram_mgr_resource(struct
> > > > > > > > > > > ttm_resource
> > > > > > > > > > > *res)
> > > > > > > > > > >         return container_of(res, struct
> > > > > > > > > > > xe_ttm_vram_mgr_resource,
> > > > > > > > > > > base);
> > > > > > > > > > >  }
> > > > > > > > > > >  
> > > > > > > > > > > +static inline void
> > > > > > > > > > > +xe_ttm_vram_mgr_resource_set_cleared(struct
> > > > > > > > > > > ttm_resource
> > > > > > > > > > > *res)
> > > > > > > > > > > +{
> > > > > > > > > > > +       to_xe_ttm_vram_mgr_resource(res)->flags |=
> > > > > > > > > > > DRM_BUDDY_CLEARED;
> > > > > > > > > > 
> > > > > > > > > > I see the amd driver is doing the same thing here.
> > > > > > > > > > Maybe we
> > > > > > > > > > can
> > > > > > > > > > pull
> > > > > > > > > > this in based on the flags given at the resource
> > > > > > > > > > initialization
> > > > > > > > > > so
> > > > > > > > > > we
> > > > > > > > > > can potentially tweak this for different scenarios
> > > > > > > > > > (when to
> > > > > > > > > > free)
> > > > > > > > > > and
> > > > > > > > > > converge what we're doing here and what amd is doing
> > > > > > > > > > into
> > > > > > > > > > the
> > > > > > > > > > common
> > > > > > > > > > code.
> > > > > > > > > > 
> > > > > > > 
> > > > > > > AMD has a flag, which is the same test in this series as
> > > > > > > 'type ==
> > > > > > > ttm_bo_type_device'. AMD blindly sets this flag on user BO
> > > > > > > creation
> > > > > > > but
> > > > > > > as I stated above we could wire this to uAPI if needed.
> 
> Sorry I didn't fully understand the response here. So I was saying we

Ok, my bad, I think I get what you are suggesting now.

> have a common routine on the drm side that takes a flag (CLEAR_ON_ALLOC
> vs CLEAR_ON_FREE) and performs the operation based on the flag. Then

'performs the operation based on the flag'

Do you mean the clear? That is a device-specific operation, hence it
lives in the TTM release notify vfunc.

> both users on this side would just CLEAR_ON_FREE. Or we can even drop
> that flag and just assume everyone will always CLEAR_ON_FREE to avoid
> the extra code of clear-on-alloc. Then if we do decide to do some uAPI

'avoid the extra code of clear-on-alloc' - The clear-on-alloc is skipped
if the buddy blocks are marked as clear. There is no guarantee they will
be clear, given that we skip clear-on-free for kernel memory, DRM buddy
does not initialize blocks as clear, and the clear-on-free can fail for
whatever reason in the TTM release notify vfunc (e.g., a kmalloc fails
there).
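
As a rough illustration of why the clear-on-alloc path has to stay, here
is a toy sketch in plain C. The struct and function names are made up
(this is not the drm_buddy structure or the cursor code from the patch);
it only mimics the idea of walking the blocks backing an allocation and
skipping the ones already marked clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a buddy block; not the drm_buddy structure. */
struct toy_block {
	uint64_t size;
	bool clear;	/* set only if a clear-on-free actually completed */
};

/* Sum the bytes a clear-on-alloc pass would still have to touch. */
static uint64_t toy_bytes_to_clear(const struct toy_block *blocks, size_t n)
{
	uint64_t bytes = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (blocks[i].clear)
			continue;	/* already cleared on free, skip */
		bytes += blocks[i].size;
	}
	return bytes;
}
```

If every block is clear the result is zero and the allocation is usable
right away; any dirty block (kernel memory, never-cleared memory, a
failed clear-on-free) still has to be cleared on allocation.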

> in the future we add the extra flag. But it still lets us consolidate
> the amd and intel routines.
> 

You mean xe_ttm_vram_mgr_resource_set_cleared /
amdgpu_vram_mgr_set_cleared?

Both of those operate on a driver-specific structure which encapsulates
a TTM resource. TTM allocates memory via a TTM resource manager
abstraction with driver vfuncs, making no assumptions about memory
allocation - AMD and Intel happen to use the DRM buddy allocator for
VRAM, but TTM again has no idea how the memory is being allocated. So
sticking a buddy-allocator-specific flag into TTM is a no go.
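
A minimal sketch of that layering, in plain C with made-up names (the
only real thing here is the container_of pattern; none of these types or
flags exist in TTM or Xe): the common layer hands around a bare
resource, and only the driver knows it is embedded in a larger structure
that carries allocator flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; only the container_of pattern is real. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define TOY_RES_CLEARED (1UL << 0)	/* stand-in for an allocator flag */

/* What the common layer sees: no flags, no idea which allocator is used. */
struct toy_resource {
	size_t size;
};

/* What the driver allocates: the common resource plus allocator state. */
struct toy_vram_resource {
	struct toy_resource base;
	unsigned long flags;
};

static struct toy_vram_resource *to_toy_vram_resource(struct toy_resource *res)
{
	return toy_container_of(res, struct toy_vram_resource, base);
}

/* Driver-side helper: mark the wrapped resource as cleared. */
static void toy_resource_set_cleared(struct toy_resource *res)
{
	to_toy_vram_resource(res)->flags |= TOY_RES_CLEARED;
}
```

Moving the flag into the common structure would mean the common layer
has to know about one particular allocator, which is the part I'd expect
to get pushback on.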

We could add a specific TTM flag to the TTM resource saying the resource
is clear and convert it to the buddy flag upon block list free, but we
don't have a flags field in the TTM resource, whereas on the driver side
we do. Also, have you ever tried changing TTM? It is not a whole lot of
fun and I pretty much guarantee it will get nacked for one reason or
another. That is not something I would personally even attempt to post
because I know better.

> The type == ttm_bo_type_device seems like something we can extract out
> as device-specific.
> 

What do you mean by extract? A driver-side helper? I thought about that
but struggled a bit with naming, so I left it out. Could include that in
the next rev I suppose.

Matt

> Thanks,
> Stuart
> 
> > > > > > > 
> > > > > > > Also I think there are some kernel BOs which clear-on-free
> > > > > > > is
> > > > > > > safe
> > > > > > > too
> > > > > > > (e.g., I think only PT BOs need contents to stick around
> > > > > > > after
> > > > > > > final
> > > > > > > put
> > > > > > > until dma-resv is clean of fences). Would have to fully to
> > > > > > > test
> > > > > > > for
> > > > > > > definitive answer though.
> > > > > > 
> > > > > > In most cases the kernel operations are going to be lower
> > > > > > priority
> > > > > > than
> > > > > 
> > > > > > It is the inverse of that - kernel operations are always
> > > > > higher
> > > > > priority as without them a user app can't function (think any
> > > > > page
> > > > > table
> > > > > memory allocation or memory allocation for a LRC). This is one
> > > > > of
> > > > > reasons we just pin kernel memory in Xe compared evictable
> > > > > kernel
> > > > > memory
> > > > > in the i915 (also evicting kernel memory is a huge pain as we
> > > > > have
> > > > > idle
> > > > > the GuC state which could use that memory).
> > > > 
> > > > We're talking about kernel operations attached to different
> > > > processes
> > > > though. The incoming kernel work would be for a new user process
> > > > (prepping the buffer for submission). The existing user process
> > > > IMO
> > > > should have higher priority.
> > > > 
> > > 
> > > You still have to wait on the user operation complete in dma-resv
> > > mode
> > > before stealing memory.
> > > 
> > > In LR preempt fence mode, we'd wait until LR VM is preempted off
> > > the
> > > hardware.
> > > 
> > > In LR fault mode, we can steal the memory immediately.
> > > 
> > 
> > s/immediately/after TLB invalidation completes
> > 
> > Matt
> > 
> > > There is no difference though if we are trying to get kernel memory
> > > or
> > > user memory - this is just how eviction works.
> > > 
> > > Also if we are in LR mode, we have no idea how long the user
> > > processes
> > > which we stealing memory from will take to complete (e.g., we do
> > > not
> > > install fences in dma-resv attached to jobs as the fences could run
> > > forever breaking the rules of fence or in other word breaking
> > > reclaim).
> > > 
> > > This is where pinning of user memory could come in play /w some
> > > accounting controller, likely cgroups. LR mode could pin some memory
> > > to
> > > ensure fast forward progress or buffers it really doesn't want to
> > > be
> > > moved around (e.g., buffers exported over a fast network
> > > interconnect).
> > > 
> > > Matt
> > > 
> > > > Thanks,
> > > > Stuart
> > > > 
> > > > > 
> > > > > > the user operations (with the exception of maybe page
> > > > > > faults). We
> > > > > > had
> > > > > > looked at things in i915 like preempting kernel operations
> > > > > > when a
> > > > > > user
> > > > > > op comes in (the user clear-on-alloc preempts the kernel
> > > > > > clear-on-
> > > > > > free/alloc). This probably falls in the same optimization
> > > > > > category
> > > > > > you
> > > > > 
> > > > > We definitely won't do that as this would require multiple
> > > > > migration
> > > > > queues and preemption via the GuC is so expensive I'd doubt
> > > > > you'd
> > > > > ever
> > > > > win. I guess we could have a convolved algorithm in the
> > > > > migration
> > > > > queue
> > > > > > to jump to higher priority jobs but would need some really
> > > > > strong
> > > > > data
> > > > > indicating this is needed. Also with dma-resv basically the
> > > > > migration
> > > > > queue needs to be idle before anything from the user can run.
> > > > > 
> > > > > Matt
> > > > > 
> > > > > > mentioned, although it doesn't necessarily need a user hint.
> > > > > > 
> > > > > > Thanks,
> > > > > > Stuart
> > > > > > 
> > > > > > > 
> > > > > > > Matt
> > > > > > > 
> > > > > > > > > > Thanks,
> > > > > > > > > > Stuart
> > > > > > > > > > 
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > >  static inline struct xe_ttm_vram_mgr *
> > > > > > > > > > >  to_xe_ttm_vram_mgr(struct ttm_resource_manager
> > > > > > > > > > > *man)
> > > > > > > > > > >  {
> > > > > > > > > > 
> > > > > > > > 
> > > > > > 
> > > > 
>