[PATCH] drm/xe: Use ttm_uncached for BO with NEEDS_UC flag

Wed Jun 19 11:40:45 UTC 2024

On Wed, 2024-06-19 at 10:44 +0100, Matthew Auld wrote:
> On 18/06/2024 19:54, Matt Roper wrote:
> > On Tue, Jun 18, 2024 at 08:29:24PM +0200, Thomas Hellström wrote:
> > > Hi, Matt
> > > 
> > > On Tue, 2024-06-18 at 09:43 -0700, Matt Roper wrote:
> > > > On Tue, Jun 18, 2024 at 02:38:01PM +0200, Thomas Hellström
> > > > wrote:
> > > > > Hi,
> > > > > 
> > > > > On Mon, 2024-06-17 at 13:28 -0700, Matt Roper wrote:
> > > > > > On Wed, Jun 12, 2024 at 08:03:24PM +0200, Michal Wajdeczko
> > > > > > wrote:
> > > > > > > Hi Thomas,
> > > > > > > 
> > > > > > > On 11.06.2024 14:47, Thomas Hellström wrote:
> > > > > > > > Hi, Michal,
> > > > > > > > 
> > > > > > > > On Thu, 2024-06-06 at 21:56 +0200, Michal Wajdeczko
> > > > > > > > wrote:
> > > > > > > > > We should honor requested uncached mode also at the
> > > > > > > > > TTM
> > > > > > > > > layer.
> > > > > > > > > Otherwise, we risk losing updates to the memory based
> > > > > > > > > interrupts
> > > > > > > > > source or status vectors, as those require uncached
> > > > > > > > > memory.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Michal Wajdeczko
> > > > > > > > > <michal.wajdeczko at intel.com>
> > > > > > > > > Cc: Thomas Hellström
> > > > > > > > > <thomas.hellstrom at linux.intel.com>
> > > > > > > > > Cc: Matt Roper <matthew.d.roper at intel.com>
> > > > > > > > > ---
> > > > > > > > >   drivers/gpu/drm/xe/xe_bo.c | 3 +++
> > > > > > > > >   1 file changed, 3 insertions(+)
> > > > > > > > > 
> > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > index 2bae01ce4e5b..2573cc118f29 100644
> > > > > > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > > > > @@ -378,6 +378,9 @@ static struct ttm_tt
> > > > > > > > > *xe_ttm_tt_create(struct
> > > > > > > > > ttm_buffer_object *ttm_bo,
> > > > > > > > >   	    (xe->info.graphics_verx100 >= 1270 &&
> > > > > > > > > bo-
> > > > > > > > > > flags &
> > > > > > > > > XE_BO_FLAG_PAGETABLE))
> > > > > > > > >   		caching = ttm_write_combined;
> > > > > > > > >   
> > > > > > > > > +	if (bo->flags & XE_BO_FLAG_NEEDS_UC)
> > > > > > > > > +		caching = ttm_uncached;
> > > > > > > > > +
> > > > > > > > >   	err = ttm_tt_init(&tt->ttm, &bo->ttm,
> > > > > > > > > page_flags,
> > > > > > > > > caching,
> > > > > > > > > extra_pages);
> > > > > > > > >   	if (err) {
> > > > > > > > >   		kfree(tt);
> > > > > > > > 
> > > > > > > > To me the preferred method is to teach bo->cpu_caching
> > > > > > > > about
> > > > > > > > the
> > > > > > > > uncached mode and then include it in the switch
> > > > > > > > statement
> > > > > > > > above.
> > > > > > > > 
> > > > > > > 
> > > > > > > but bo->cpu_caching is currently documented as:
> > > > > > > 
> > > > > > > /**
> > > > > > >   * @cpu_caching: CPU caching mode. Currently only used
> > > > > > > for
> > > > > > > userspace
> > > > > > >   * objects.
> > > > > > >   */
> > > > > > > 
> > > > > > > and value 0 is implicitly reserved as kind of default, so
> > > > > > > 'teaching'
> > > > > > > would likely mean either extending uapi with something
> > > > > > > like:
> > > > > > > 
> > > > > > >    #define DRM_XE_GEM_CPU_CACHING_WB                     
> > > > > > > 1
> > > > > > >    #define DRM_XE_GEM_CPU_CACHING_WC                     
> > > > > > > 2
> > > > > > > + #define DRM_XE_GEM_CPU_CACHING_UC                     
> > > > > > > 3
> > > > > > > 
> > > > > > > which will introduce lot of undesired right now code
> > > > > > > changes,
> > > > > > > or we
> > > > > > > will
> > > > > > > introduce internal only flag:
> > > > > > > 
> > > > > > > + #define XE_CPU_CACHING_UC                     
> > > > > > > ((u16)~0)
> > > > > > > 
> > > > > > > but that doesn't look like a clean solution.
> > > > > > > 
> > > > > > > 
> > > > > > > OTOH, just above this new diff chunk, there is already a
> > > > > > > code
> > > > > > > that
> > > > > > > updates caching mode outside the "switch statement
> > > > > > > above":
> > > > > > > 
> > > > > > > 	if ((!bo->cpu_caching && bo->flags &
> > > > > > > XE_BO_FLAG_SCANOUT)
> > > > > > > > > 
> > > > > > > 	    (xe->info.graphics_verx100 >= 1270 &&
> > > > > > > 	     bo->flags & XE_BO_FLAG_PAGETABLE))
> > > > > > > 		caching = ttm_write_combined;
> > > > > > > 
> > > > > > > so maybe as a short term solution we can keep this patch
> > > > > > > as
> > > > > > > it's
> > > > > > > doing
> > > > > > > similar last resort stuff and return to 'preferred way'
> > > > > > > later:
> > > > > > > 
> > > > > > > 	if (!bo->cpu_caching && bo->flags &
> > > > > > > XE_BO_FLAG_NEEDS_UC)
> > > > > > > 		caching = ttm_uncached;
> > > > > > 
> > > > > > Yeah, cpu_caching is a "uapi only" thing at the moment (and
> > > > > > even
> > > > > > then
> > > > > > is
> > > > > > only set in some situations).  Given the current design and
> > > > > > assumptions
> > > > > > of the code, maybe it would be more clear to add an
> > > > > > assertion
> > > > > > like
> > > > > > this
> > > > > > to help document why this is special?
> > > > > > 
> > > > > >          if (bo->flags & XE_BO_FLAG_NEEDS_UC) {
> > > > > >                  /*
> > > > > >                   * Valid only for internally-created
> > > > > > buffers
> > > > > > only,
> > > > > > for
> > > > > >                   * which cpu_caching is never initialized.
> > > > > >                   */
> > > > > >                  xe_assert(xe, bo->cpu_caching == 0);
> > > > > >                  caching = ttm_uncached;
> > > > > >          }
> > > > > > 
> > > > > > If we decide we want a more general redesign of cpu_caching
> > > > > > behavior,
> > > > > > that would probably be a separate change from the direct
> > > > > > functional
> > > > > > fix
> > > > > > here.
> > > > > 
> > > > > I do think the change should have actually been done before
> > > > > the
> > > > > scanout
> > > > > caching hack. We shouldn't be building special cases like
> > > > > this, but
> > > > > rather fix what's missing.
> > > > 
> > > > I think things happened the other way around.  The scanout
> > > > caching
> > > > adjustment pre-dates the existence of bo->cpu_caching in the
> > > > driver.
> > > > When bo->cpu_caching got added in commit 622f709ca629
> > > > ("drm/xe/uapi:
> > > > Add
> > > > support for CPU caching mode"), it intentionally left kernel
> > > > objects
> > > > set
> > > > to 0 by design.
> > > 
> > > I don't really see a discussion around the kernel objects in the
> > > commit
> > > message, it mentions "currently" in the caching mode doc, but
> > > that
> > > sounds more like a documentation of current status than a
> > > guideline.
> > > Ofc, it might be in the review discussion but I haven't looked
> > > too
> > > closely TBH.
> > > 
> > > Hmm. This commit raises some questions. Do you remember the
> > > reason for
> > > leaving out the kernel bos? Was it because kernel bos relies on
> > > implicit caching mode selection whereas the user-space bo caching
> > > mode
> > > selection is now explicit? Otherwise it's pretty standard in the
> > > driver
> > > to map the DRM_XE user-space flags / enums to XE_ internal ones.
> 
> Yeah, no real reason IIRC. I don't remember exactly, but I don't
> think 
> there was a huge need for kmd internal objects needing explicit
> caching 
> at the time, and didn't seem worth adding just for that scanout case.
> 
> > 
> > Unfortunately I don't remember the details here very well myself; I
> > know
> > this patch went through a lot of revisions and morphed a fair bit
> > during
> > the review cycles.  Adding a couple extra Cc's of people who were
> > more
> > actively involved in the review and may have a clearer memory of
> > how we
> > initially settled on the userspace-only design.
> > 
> > +cc Jose, Oak
> > 
> > 
> > Matt
> > 
> > > 
> > > (As a complete side note it looks like the system pages for VRAM-
> > > only
> > > user-bo eviction are now forced to be write-combined by the
> > > uAPI?)
> 
> Yeah, originally we had smem_cpu_caching, where it had to be left as 
> zero for vram objects. vram caching was then implicit and always wc.
> I 
> think internally kmd would use wb if it was evicted and vram-only,
> idea 
> was also to potentially allow umd to also control this with 
> smem_cpu_caching.
> 
> There was some discussion here: 
> https://lore.kernel.org/intel-xe/30454ab26b8df746cfcdc4c1a0548e7c1e676ac0.camel@intel.com/
> 
> Feedback was rather to have single cpu_caching value which would
> always 
> be respected when mmapping the object. If the cpu_caching is wc then
> the 
> mmap is always wc, and not sometimes wc and sometimes wb, depending
> on 
> if it's evicted or not which userspace is unaware of. The downside is
> that wc for evicted vram is going to be common and maybe not ideal. 
> Maybe this can be revisited?

Yes we will need to do that. System WC is really for integrated only as
not all architectures (any beyond x86?) supports it.

And in addition it comes with performance implications.

We've tried to ensure that discrete never binds system incoherent (even
with incoherent PAT settings), so WB is safe from that POW.

Also when we considered this on i915 we verified with UMD that the
change-of-caching mode under the hood was safe from WC promoted to WB
and back. If that has changed so we *must* have the same caching mode,
then we need to consider to read back stuff from the fault handler...

/Thomas

> 
> > > 
> > > > 
> > > > We can certainly change the design now if everyone agrees that
> > > > it
> > > > makes
> > > > the code cleaner,
> > > 
> > > This is probably a bigger change than I originally though and
> > > requires
> > > some additional consideration of the above.
> > > 
> > > >   but I think that the general refactoring and
> > > > repurposing of bo->cpu_caching is orthogonal to the functional
> > > > fix
> > > > that
> > > > Michal is providing here.
> > > 
> > > Generally I think that if something is missing to be able to
> > > cleanly
> > > add a fix, then we should definitely try our best to add that
> > > something
> > > _first_. In this case it turns out, though, that it requires some
> > > additional afterthought.
> > > 
> > > So for the patch
> > > 
> > > Acked-by: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> > > 
> > > 
> > > 
> > > > 
> > > > +Cc Pallavi, Matt Auld as the authors of the original design in
> > > > case
> > > > they have any thoughts on extending the usage of bo-
> > > > >cpu_caching.
> > > > 
> > > > 
> > > > Matt
> > > > 
> > > > > 
> > > > > Can't we make bo->cpu_caching valid also for kernel bos with
> > > > > a new
> > > > > enum
> > > > > and do the translation in the ioctl?
> > > > > 
> > > > > /Thomas
> > > > > 
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > Matt
> > > > > > 
> > > > > > > 
> > > > > > > Michal
> > > > > > 
> > > > > 
> > > > 
> > > 
> >