[Intel-xe] [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind

Zhang, Carl carl.zhang at intel.com
Wed Aug 30 15:27:22 UTC 2023


Several questions:
1. Does the pat_index given at vm_bind override the setting from bo_create?
If so, how can the value chosen at bo_create be kept unchanged?
2. There is no UC/WB/WC (CPU cacheability) selection in drm_xe_gem_mmap_offset; is that handled by the KMD automatically?
For example, with 1-way coherency the GPU can snoop the CPU cache, so the mmap offset could be WB,
while with COHERENCY_NONE only UC could be used. Is all of that logic handled by the KMD automatically?
3. About "For imported dma-buf (from a different device) the coherency mode is also implicit
and must also be either 1WAY or 2WAY":
does this mean it must be 1WAY or 2WAY, and that the UMD does not need to set it?
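
To make sure I follow the intended flow, here is a rough sketch of how I
currently read it from the UMD side (untested; the coh_mode field at
gem_create and the pat_index field at vm_bind come from this RFC series and
may still change, and the concrete pat_index value is only a placeholder
since the real value is platform specific):

#include <stdint.h>
#include <string.h>
#include <drm/xe_drm.h>

/*
 * Fill one bind op for an object that was created with
 * coh_mode = XE_GEM_COHERENCY_1WAY. The pat_index passed in must resolve
 * to the same coherency mode (e.g. a WB 1-way entry of the platform PAT
 * table), otherwise the KMD rejects the bind. The filled op would then go
 * into the binds of a drm_xe_vm_bind and be submitted with
 * DRM_IOCTL_XE_VM_BIND.
 */
static void fill_bind_op(struct drm_xe_vm_bind_op *op, uint32_t bo_handle,
			 uint64_t bo_size, uint64_t gpu_va,
			 uint32_t wb_1way_pat_index)
{
	memset(op, 0, sizeof(*op));
	op->obj = bo_handle;
	op->obj_offset = 0;
	op->range = bo_size;
	op->addr = gpu_va;
	op->op = XE_VM_BIND_OP_MAP;
	op->pat_index = wb_1way_pat_index;	/* new field in this patch */
}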


> -----Original Message-----
> From: Auld, Matthew <matthew.auld at intel.com>
> Sent: Wednesday, August 30, 2023 7:28 PM
> To: Roper, Matthew D <matthew.d.roper at intel.com>
> Cc: intel-xe at lists.freedesktop.org; Mishra, Pallavi <pallavi.mishra at intel.com>;
> Thomas Hellström <thomas.hellstrom at linux.intel.com>; Joonas Lahtinen
> <joonas.lahtinen at linux.intel.com>; De Marchi, Lucas
> <lucas.demarchi at intel.com>; Souza, Jose <jose.souza at intel.com>; Hazubski,
> Filip <filip.hazubski at intel.com>; Zhang, Carl <carl.zhang at intel.com>; Yu, Effie
> <effie.yu at intel.com>
> Subject: Re: [RFC 5/5] drm/xe/uapi: support pat_index selection with vm_bind
> 
> On 29/08/2023 22:36, Matt Roper wrote:
> > On Tue, Aug 29, 2023 at 05:28:46PM +0100, Matthew Auld wrote:
> >> Allow userspace to directly control the pat_index for a given vm
> >> binding. This should allow directly controlling the coherency,
> >> caching and potentially other stuff in the future for the ppGTT binding.
> >>
> >> The exact meaning behind the pat_index is very platform specific (see
> >> BSpec or PRMs) but effectively maps to some predefined memory
> >> attributes. From the KMD pov we only care about the coherency that is
> >> provided by the pat_index, which falls into either NONE, 1WAY or 2WAY.
> >> The vm_bind coherency mode for the given pat_index needs to match the
> >> given coh_mode that was set at object creation. For platforms that
> >> lack
> >
> > Is it actually important to match the coherency mode?  I think one of
> > the main goals was to know up front if userspace might be using a
> > non-snooping PAT setting that would let it bypass the CPU cache (and
> > potentially read old, stale data from a different process if the
> > buffer's clear value is still sitting in cache and hasn't landed in
> > memory yet).
> >
> > If that's the only concern, then I think it should still be fine to
> > map with a non-matching PAT as long as it's more coherent than the one
> > specified at creation, right?  E.g., if the buffer was created with
> > 1-way coherency, it would be fine to map it with 2-way because
> > userspace still can't use that to observe the previous contents of the
> > buffer. Or
> 
> Yeah, I guess we could in theory do something like that.
> 
> > if the buffer was created with "non-coherent" then we've already done
> > the necessary clflushing in kernel before handing to buffer over to
> > userspace to ensure the clear value landed in memory, so any valid PAT
> > index should be safe (from a security POV) after that, right?  Any
> > other problems that arise from mismatched coherency would just be
> > contained to the app possibly shooting itself in the foot, which isn't
> > really our concern.
> 
> That is also my understanding, at least from the KMD security pov. If you
> allocate as wb then you must use at least 1way, since there is no flushing for
> clearing or swap-in. For uc/wc you could in theory use whatever you want.
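
Just to check that I read this relaxed rule correctly: is the suggestion
roughly the check below, i.e. allow the bind whenever the pat_index
coherency is at least as strong as the coh_mode given at creation? Purely
an illustrative sketch, not code from the series, and it assumes the
coh_mode values are ordered NONE < 1WAY < 2WAY:

static bool coh_mode_compatible(u16 bo_coh_mode, u16 pat_coh_mode)
{
	/*
	 * Nothing was promised at creation (uc/wc object) and the kernel
	 * already flushed after clearing, so any pat_index is fine from a
	 * security pov.
	 */
	if (bo_coh_mode == XE_GEM_COHERENCY_NONE)
		return true;

	/*
	 * Otherwise (wb object) the mapping must snoop at least as much as
	 * promised at creation: 1WAY is enough for a 1WAY object, 2WAY is
	 * also acceptable.
	 */
	return pat_coh_mode >= bo_coh_mode;
}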
> 
> >
> >
> >> the explicit coherency mode, we treat UC/WT/WC as NONE and WB as 2WAY.
> >>
> >> For userptr mappings we lack a corresponding gem object, so the expected
> >> coherency mode is instead implicit and must fall into either 1WAY or
> >> 2WAY. Trying to use NONE will be rejected by the kernel. For imported
> >> dma-buf (from a different device) the coherency mode is also implicit
> >> and must also be either 1WAY or 2WAY.
> >>
> >> As part of adding pat_index support with vm_bind we also need stop using
> >> xe_cache_level and instead use the pat_index in various places. We still
> >> make use of xe_cache_level, but only as a convenience for kernel
> >> internal objects (internally it maps to some reasonable pat_index). For
> >
> > Maybe we should kill xe_cache_level completely and just assign
> > xe_gt->pat_cached / xe_gt->pat_uncached at init that can be used in
> > appropriate places, similar to what we do with MOCS (gt->mocs.uc_index,
> > gt->mocs.wb_index)?
> 
> OK, seems reasonable to me.
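
Just to picture it, something roughly like the below at PAT init time,
mirroring gt->mocs.uc_index/wb_index? The gt->pat struct and its field
names here are made up for illustration; only xe_pat_get_index() and the
existing cache levels are meant to be real:

static void xe_pat_init_gt_indexes(struct xe_gt *gt)
{
	struct xe_device *xe = gt_to_xe(gt);

	/*
	 * Resolve the few kernel-internal PAT choices once at init so the
	 * rest of the driver can stop passing enum xe_cache_level around.
	 */
	gt->pat.cached_index = xe_pat_get_index(xe, XE_CACHE_WB);
	gt->pat.uncached_index = xe_pat_get_index(xe, XE_CACHE_NONE);
}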
> 
> >
> >> now this is just a 1:1 conversion of the existing code, however for
> >> platforms like MTL+ we might need to give more control through bo_create
> >> or stop using WB on the CPU side if we need CPU access.
> >>
> >> Bspec: 45101, 44235 #xe
> >> Bspec: 70552, 71582, 59400 #xe2
> >> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
> >> Cc: Pallavi Mishra <pallavi.mishra at intel.com>
> >> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> >> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> >> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
> >> Cc: Matt Roper <matthew.d.roper at intel.com>
> >> Cc: José Roberto de Souza <jose.souza at intel.com>
> >> Cc: Filip Hazubski <filip.hazubski at intel.com>
> >> Cc: Carl Zhang <carl.zhang at intel.com>
> >> Cc: Effie Yu <effie.yu at intel.com>
> >> ---
> >>   drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +-
> >>   drivers/gpu/drm/xe/tests/xe_migrate.c |  2 +-
> >>   drivers/gpu/drm/xe/xe_ggtt.c          |  7 ++-
> >>   drivers/gpu/drm/xe/xe_ggtt_types.h    |  2 +-
> >>   drivers/gpu/drm/xe/xe_migrate.c       | 14 ++---
> >>   drivers/gpu/drm/xe/xe_pt.c            | 32 +++++-------
> >>   drivers/gpu/drm/xe/xe_pt.h            |  6 +--
> >>   drivers/gpu/drm/xe/xe_vm.c            | 73 +++++++++++++++++++++------
> >>   drivers/gpu/drm/xe/xe_vm_types.h      | 13 +++--
> >>   include/uapi/drm/xe_drm.h             | 41 ++++++++++++++-
> >>   10 files changed, 134 insertions(+), 58 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >> index 4d6296cdbcfd..bb4c182048c3 100644
> >> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> >> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> >> @@ -302,7 +302,7 @@ struct i915_address_space {
> >>   		(*alloc_scratch_dma)(struct i915_address_space *vm, int sz);
> >>
> >>   	u64 (*pte_encode)(dma_addr_t addr,
> >> -			  unsigned int pat_index,
> >> +			  u32 pat_index,
> >>   			  u32 flags); /* Create a valid PTE */
> >>   #define PTE_READ_ONLY	BIT(0)
> >>   #define PTE_LM		BIT(1)
> >> diff --git a/drivers/gpu/drm/xe/tests/xe_migrate.c b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >> index 5c8d5e78d9bc..7a128fd20a29 100644
> >> --- a/drivers/gpu/drm/xe/tests/xe_migrate.c
> >> +++ b/drivers/gpu/drm/xe/tests/xe_migrate.c
> >> @@ -301,7 +301,7 @@ static void xe_migrate_sanity_test(struct xe_migrate *m, struct kunit *test)
> >>   	/* First part of the test, are we updating our pagetable bo with a new entry? */
> >>   	xe_map_wr(xe, &bo->vmap, XE_PAGE_SIZE * (NUM_KERNEL_PDE - 1), u64,
> >>   		  0xdeaddeadbeefbeef);
> >> -	expected = xe_pte_encode(pt, 0, XE_CACHE_WB, 0);
> >> +	expected = xe_pte_encode(pt, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
> >>   	if (m->q->vm->flags & XE_VM_FLAG_64K)
> >>   		expected |= XE_PTE_PS64;
> >>   	if (xe_bo_is_vram(pt))
> >> diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
> >> index 209fa053d9fb..4134c26150a5 100644
> >> --- a/drivers/gpu/drm/xe/xe_ggtt.c
> >> +++ b/drivers/gpu/drm/xe/xe_ggtt.c
> >> @@ -41,7 +41,8 @@ u64 xe_ggtt_pte_encode(struct xe_bo *bo, u64 bo_offset)
> >>   		pte |= XE_GGTT_PTE_DM;
> >>
> >>   	if ((ggtt->pat_encode).pte_encode)
> >> -		pte = (ggtt->pat_encode).pte_encode(xe, pte, XE_CACHE_WB_1_WAY);
> >> +		pte = (ggtt->pat_encode).pte_encode(xe, pte,
> >> +						    xe_pat_get_index(xe, XE_CACHE_WB_1_WAY));
> >>
> >>   	return pte;
> >>   }
> >> @@ -102,10 +103,8 @@ static void primelockdep(struct xe_ggtt *ggtt)
> >>   }
> >>
> >>   static u64 xelpg_ggtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> >> -						enum xe_cache_level cache)
> >> +				     u32 pat_index)
> >>   {
> >> -	u32 pat_index = xe_pat_get_index(xe, cache);
> >> -
> >>   	pte_pat &= ~(XELPG_GGTT_PTE_PAT_MASK);
> >>
> >>   	if (pat_index & BIT(0))
> >> diff --git a/drivers/gpu/drm/xe/xe_ggtt_types.h b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >> index 7e55fac1a8a9..0bc40cb072e3 100644
> >> --- a/drivers/gpu/drm/xe/xe_ggtt_types.h
> >> +++ b/drivers/gpu/drm/xe/xe_ggtt_types.h
> >> @@ -31,7 +31,7 @@ struct xe_ggtt {
> >>
> >>   	struct {
> >>   		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >> -						enum xe_cache_level cache);
> >> +				  u32 pat_index);
> >>   	} pat_encode;
> >>   };
> >>
> >> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> >> index a782ea282cb6..54585e98452a 100644
> >> --- a/drivers/gpu/drm/xe/xe_migrate.c
> >> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> >> @@ -24,6 +24,7 @@
> >>   #include "xe_lrc.h"
> >>   #include "xe_map.h"
> >>   #include "xe_mocs.h"
> >> +#include "xe_pat.h"
> >>   #include "xe_pt.h"
> >>   #include "xe_res_cursor.h"
> >>   #include "xe_sched_job.h"
> >> @@ -162,6 +163,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> >>   	u32 num_entries = NUM_PT_SLOTS, num_level = vm->pt_root[id]->level;
> >>   	u32 map_ofs, level, i;
> >>   	struct xe_bo *bo, *batch = tile->mem.kernel_bb_pool->bo;
> >> +	u32 pat_index = xe_pat_get_index(xe, XE_CACHE_WB);
> >>   	u64 entry;
> >>   	int ret;
> >>
> >> @@ -189,14 +191,14 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> >>   		return ret;
> >>   	}
> >>
> >> -	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, XE_CACHE_WB);
> >> +	entry = xe_pde_encode(bo, bo->size - XE_PAGE_SIZE, pat_index);
> >>   	xe_pt_write(xe, &vm->pt_root[id]->bo->vmap, 0, entry);
> >>
> >>   	map_ofs = (num_entries - num_level) * XE_PAGE_SIZE;
> >>
> >>   	/* Map the entire BO in our level 0 pt */
> >>   	for (i = 0, level = 0; i < num_entries; level++) {
> >> -		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, XE_CACHE_WB, 0);
> >> +		entry = xe_pte_encode(bo, i * XE_PAGE_SIZE, pat_index, 0);
> >>
> >>   		xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64, entry);
> >>
> >> @@ -214,7 +216,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> >>   		for (i = 0; i < batch->size;
> >>   		     i += vm->flags & XE_VM_FLAG_64K ? XE_64K_PAGE_SIZE :
> >>   		     XE_PAGE_SIZE) {
> >> -			entry = xe_pte_encode(batch, i, XE_CACHE_WB, 0);
> >> +			entry = xe_pte_encode(batch, i, pat_index, 0);
> >>
> >>   			xe_map_wr(xe, &bo->vmap, map_ofs + level * 8, u64,
> >>   				  entry);
> >> @@ -239,7 +241,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> >>   			flags = XE_PDE_64K;
> >>
> >>   		entry = xe_pde_encode(bo, map_ofs + (level - 1) *
> >> -					XE_PAGE_SIZE, XE_CACHE_WB);
> >> +					XE_PAGE_SIZE, pat_index);
> >>   		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE * level, u64,
> >>   			  entry | flags);
> >>   	}
> >> @@ -247,7 +249,7 @@ static int xe_migrate_prepare_vm(struct xe_tile *tile, struct xe_migrate *m,
> >>   	/* Write PDE's that point to our BO. */
> >>   	for (i = 0; i < num_entries - num_level; i++) {
> >>   		entry = xe_pde_encode(bo, i * XE_PAGE_SIZE,
> >> -				      XE_CACHE_WB);
> >> +				      pat_index);
> >>
> >>   		xe_map_wr(xe, &bo->vmap, map_ofs + XE_PAGE_SIZE +
> >>   			  (i + 1) * 8, u64, entry);
> >> @@ -1256,7 +1258,7 @@ xe_migrate_update_pgtables(struct xe_migrate *m,
> >>
> >>   			XE_WARN_ON(pt_bo->size != SZ_4K);
> >>
> >> -			addr = xe_pte_encode(pt_bo, 0, XE_CACHE_WB, 0);
> >> +			addr = xe_pte_encode(pt_bo, 0, xe_pat_get_index(xe, XE_CACHE_WB), 0);
> >>   			bb->cs[bb->len++] = lower_32_bits(addr);
> >>   			bb->cs[bb->len++] = upper_32_bits(addr);
> >>   		}
> >> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
> >> index 64713f400d94..019af2920078 100644
> >> --- a/drivers/gpu/drm/xe/xe_pt.c
> >> +++ b/drivers/gpu/drm/xe/xe_pt.c
> >> @@ -10,6 +10,7 @@
> >>   #include "xe_gt.h"
> >>   #include "xe_gt_tlb_invalidation.h"
> >>   #include "xe_migrate.h"
> >> +#include "xe_pat.h"
> >>   #include "xe_pt_types.h"
> >>   #include "xe_pt_walk.h"
> >>   #include "xe_res_cursor.h"
> >> @@ -57,24 +58,22 @@ static struct xe_pt *xe_pt_entry(struct xe_pt_dir *pt_dir, unsigned int index)
> >>    *
> >>    * Return: An encoded page directory entry. No errors.
> >>    */
> >> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >> -		  const enum xe_cache_level cache)
> >> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index)
> >>   {
> >>   	u64 pde;
> >>   	struct xe_vm *vm = bo->vm;
> >>   	struct xe_device *xe = vm->xe;
> >>
> >> -
> >>   	pde = xe_bo_addr(bo, bo_offset, XE_PAGE_SIZE);
> >>   	pde |= XE_PAGE_PRESENT | XE_PAGE_RW;
> >>
> >>   	if ((vm->pat_encode).pde_encode)
> >> -		pde = (vm->pat_encode).pde_encode(xe, pde, cache);
> >> +		pde = (vm->pat_encode).pde_encode(xe, pde, pat_index);
> >>
> >>   	return pde;
> >>   }
> >>
> >> -static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> >> +static u64 __pte_encode(u64 pte, u32 pat_index,
> >>   			struct xe_vma *vma, u32 pt_level)
> >>   {
> >>   	struct xe_vm *vm = xe_vma_vm(vma);
> >> @@ -89,7 +88,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> >>   		pte |= XE_PTE_NULL;
> >>
> >>   	if ((vm->pat_encode).pte_encode)
> >> -		pte = (vm->pat_encode).pte_encode(xe, pte, cache);
> >> +		pte = (vm->pat_encode).pte_encode(xe, pte, pat_index);
> >>
> >>   	if (pt_level == 1)
> >>   		pte |= XE_PDE_PS_2M;
> >> @@ -112,7 +111,7 @@ static u64 __pte_encode(u64 pte, enum xe_cache_level cache,
> >>    *
> >>    * Return: An encoded page-table entry. No errors.
> >>    */
> >> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> >> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index,
> >>   		  u32 pt_level)
> >>   {
> >>   	u64 pte;
> >> @@ -121,7 +120,7 @@ u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> >>   	if (xe_bo_is_vram(bo) || xe_bo_is_stolen_devmem(bo))
> >>   		pte |= XE_PPGTT_PTE_DM;
> >>
> >> -	return __pte_encode(pte, cache, NULL, pt_level);
> >> +	return __pte_encode(pte, pat_index, NULL, pt_level);
> >>   }
> >>
> >>   static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> >> @@ -134,12 +133,12 @@ static u64 __xe_pt_empty_pte(struct xe_tile *tile, struct xe_vm *vm,
> >>
> >>   	if (level == 0) {
> >>   		u64 empty = xe_pte_encode(vm->scratch_bo[id], 0,
> >> -					  XE_CACHE_WB, 0);
> >> +					  xe_pat_get_index(vm->xe, XE_CACHE_WB), 0);
> >>
> >>   		return empty;
> >>   	} else {
> >>   		return xe_pde_encode(vm->scratch_pt[id][level - 1]->bo, 0,
> >> -				     XE_CACHE_WB);
> >> +				     xe_pat_get_index(vm->xe, XE_CACHE_WB));
> >>   	}
> >>   }
> >>
> >> @@ -368,8 +367,6 @@ struct xe_pt_stage_bind_walk {
> >>   	struct xe_vm *vm;
> >>   	/** @tile: The tile we're building for. */
> >>   	struct xe_tile *tile;
> >> -	/** @cache: Desired cache level for the ptes */
> >> -	enum xe_cache_level cache;
> >>   	/** @default_pte: PTE flag only template. No address is associated */
> >>   	u64 default_pte;
> >>   	/** @dma_offset: DMA offset to add to the PTE. */
> >> @@ -604,7 +601,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
> >>
> >>   		pte = __pte_encode(is_null ? 0 :
> >>   				   xe_res_dma(curs) + xe_walk->dma_offset,
> >> -				   xe_walk->cache, xe_walk->vma, level);
> >> +				   xe_walk->vma->pat_index, xe_walk->vma, level);
> >>   		pte |= xe_walk->default_pte;
> >>
> >>   		/*
> >> @@ -669,7 +666,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,
> >>   			xe_child->is_compact = true;
> >>   		}
> >>
> >> -		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->cache) | flags;
> >> +		pte = xe_pde_encode(xe_child->bo, 0, xe_walk->vma->pat_index) | flags;
> >>   		ret = xe_pt_insert_entry(xe_walk, xe_parent, offset, xe_child,
> >>   					 pte);
> >>   	}
> >> @@ -730,13 +727,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma *vma,
> >>   		if (vma && vma->gpuva.flags & XE_VMA_ATOMIC_PTE_BIT)
> >>   			xe_walk.default_pte |= XE_USM_PPGTT_PTE_AE;
> >>   		xe_walk.dma_offset = vram_region_gpu_offset(bo->ttm.resource);
> >> -		xe_walk.cache = XE_CACHE_WB;
> >> -	} else {
> >> -		if (!xe_vma_has_no_bo(vma) && bo->flags & XE_BO_SCANOUT_BIT)
> >> -			xe_walk.cache = XE_CACHE_WT;
> >> -		else
> >> -			xe_walk.cache = XE_CACHE_WB;
> >>   	}
> >> +
> >>   	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
> >>   		xe_walk.dma_offset = xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
> >>
> >> diff --git a/drivers/gpu/drm/xe/xe_pt.h b/drivers/gpu/drm/xe/xe_pt.h
> >> index 01be7ab08f87..1d433a5a96b4 100644
> >> --- a/drivers/gpu/drm/xe/xe_pt.h
> >> +++ b/drivers/gpu/drm/xe/xe_pt.h
> >> @@ -45,10 +45,8 @@ __xe_pt_unbind_vma(struct xe_tile *tile, struct xe_vma *vma, struct xe_exec_queu
> >>
> >>   bool xe_pt_zap_ptes(struct xe_tile *tile, struct xe_vma *vma);
> >>
> >> -u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset,
> >> -		  const enum xe_cache_level level);
> >> +u64 xe_pde_encode(struct xe_bo *bo, u64 bo_offset, u32 pat_index);
> >>
> >> -u64 xe_pte_encode(struct xe_bo *bo, u64 offset, enum xe_cache_level cache,
> >> -		  u32 pt_level);
> >> +u64 xe_pte_encode(struct xe_bo *bo, u64 offset, u32 pat_index, u32 pt_level);
> >>
> >>   #endif
> >> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> >> index 7eeeed0411f3..34603a7e84b0 100644
> >> --- a/drivers/gpu/drm/xe/xe_vm.c
> >> +++ b/drivers/gpu/drm/xe/xe_vm.c
> >> @@ -6,6 +6,7 @@
> >>   #include "xe_vm.h"
> >>
> >>   #include <linux/dma-fence-array.h>
> >> +#include <linux/nospec.h>
> >>
> >>   #include <drm/drm_print.h>
> >>   #include <drm/ttm/ttm_execbuf_util.h>
> >> @@ -874,7 +875,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> >>   				    u64 start, u64 end,
> >>   				    bool read_only,
> >>   				    bool is_null,
> >> -				    u8 tile_mask)
> >> +				    u8 tile_mask,
> >> +				    u32 pat_index)
> >>   {
> >>   	struct xe_vma *vma;
> >>   	struct xe_tile *tile;
> >> @@ -913,6 +915,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> >>   			vma->tile_mask |= 0x1 << id;
> >>   	}
> >>
> >> +	vma->pat_index = pat_index;
> >> +
> >>   	if (vm->xe->info.platform == XE_PVC)
> >>   		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
> >>
> >> @@ -1194,10 +1198,8 @@ static void xe_vma_op_work_func(struct work_struct *w);
> >>   static void vm_destroy_work_func(struct work_struct *w);
> >>
> >>   static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> >> -						enum xe_cache_level cache)
> >> +				     u32 pat_index)
> >>   {
> >> -	u32 pat_index = xe_pat_get_index(xe, cache);
> >> -
> >>   	pde_pat &= ~(XELP_PDE_PAT_MASK);
> >>
> >>   	if (pat_index & BIT(0))
> >> @@ -1213,10 +1215,8 @@ static u64 xelp_ppgtt_pde_encode_pat(struct xe_device *xe, u64 pde_pat,
> >>   }
> >>
> >>   static u64 xelp_ppgtt_pte_encode_pat(struct xe_device *xe, u64 pte_pat,
> >> -						enum xe_cache_level cache)
> >> +				     u32 pat_index)
> >>   {
> >> -	u32 pat_index = xe_pat_get_index(xe, cache);
> >> -
> >>   	pte_pat &= ~(XELP_PTE_PAT_MASK);
> >>
> >>   	if (pat_index & BIT(0))
> >> @@ -1622,7 +1622,7 @@ struct xe_vm *xe_vm_lookup(struct xe_file *xef, u32 id)
> >>   u64 xe_vm_pdp4_descriptor(struct xe_vm *vm, struct xe_tile *tile)
> >>   {
> >>   	return xe_pde_encode(vm->pt_root[tile->id]->bo, 0,
> >> -			     XE_CACHE_WB);
> >> +			     xe_pat_get_index(vm->xe, XE_CACHE_WB));
> >>   }
> >>
> >>   static struct dma_fence *
> >> @@ -2311,7 +2311,7 @@ static void print_op(struct xe_device *xe, struct drm_gpuva_op *op)
> >>   static struct drm_gpuva_ops *
> >>   vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> >>   			 u64 bo_offset_or_userptr, u64 addr, u64 range,
> >> -			 u32 operation, u8 tile_mask, u32 region)
> >> +			 u32 operation, u8 tile_mask, u32 region, u32 pat_index)
> >>   {
> >>   	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
> >>   	struct ww_acquire_ctx ww;
> >> @@ -2339,6 +2339,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> >>   			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
> >>
> >>   			op->tile_mask = tile_mask;
> >> +			op->pat_index = pat_index;
> >>   			op->map.immediate =
> >>   				operation & XE_VM_BIND_FLAG_IMMEDIATE;
> >>   			op->map.read_only =
> >> @@ -2366,6 +2367,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> >>   			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
> >>
> >>   			op->tile_mask = tile_mask;
> >> +			op->pat_index = pat_index;
> >>   			op->prefetch.region = region;
> >>   		}
> >>   		break;
> >> @@ -2408,7 +2410,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
> >>   }
> >>
> >>   static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> >> -			      u8 tile_mask, bool read_only, bool is_null)
> >> +			      u8 tile_mask, bool read_only, bool is_null,
> >> +			      u32 pat_index)
> >>   {
> >>   	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
> >>   	struct xe_vma *vma;
> >> @@ -2425,7 +2428,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
> >>   	vma = xe_vma_create(vm, bo, op->gem.offset,
> >>   			    op->va.addr, op->va.addr +
> >>   			    op->va.range - 1, read_only, is_null,
> >> -			    tile_mask);
> >> +			    tile_mask, pat_index);
> >>   	if (bo)
> >>   		xe_bo_unlock(bo, &ww);
> >>
> >> @@ -2539,7 +2542,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
> >>
> >>   				vma = new_vma(vm, &op->base.map,
> >>   					      op->tile_mask, op->map.read_only,
> >> -					      op->map.is_null);
> >> +					      op->map.is_null, op->pat_index);
> >>   				if (IS_ERR(vma)) {
> >>   					err = PTR_ERR(vma);
> >>   					goto free_fence;
> >> @@ -2567,7 +2570,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
> >>
> >>   					vma = new_vma(vm, op->base.remap.prev,
> >>   						      op->tile_mask, read_only,
> >> -						      is_null);
> >> +						      is_null, op->pat_index);
> >>   					if (IS_ERR(vma)) {
> >>   						err = PTR_ERR(vma);
> >>   						goto free_fence;
> >> @@ -2603,7 +2606,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct xe_exec_queue *q,
> >>
> >>   					vma = new_vma(vm, op->base.remap.next,
> >>   						      op->tile_mask, read_only,
> >> -						      is_null);
> >> +						      is_null, op->pat_index);
> >>   					if (IS_ERR(vma)) {
> >>   						err = PTR_ERR(vma);
> >>   						goto free_fence;
> >> @@ -3158,8 +3161,14 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
> >>   		u32 obj = (*bind_ops)[i].obj;
> >>   		u64 obj_offset = (*bind_ops)[i].obj_offset;
> >>   		u32 region = (*bind_ops)[i].region;
> >> +		u32 pat_index = (*bind_ops)[i].pat_index;
> >>   		bool is_null = op & XE_VM_BIND_FLAG_NULL;
> >>
> >> +		if (XE_IOCTL_DBG(xe, pat_index >= xe->info.pat_table_n_entries)) {
> >> +			err = -EINVAL;
> >> +			goto free_bind_ops;
> >> +		}
> >> +
> >>   		if (i == 0) {
> >>   			*async = !!(op & XE_VM_BIND_FLAG_ASYNC);
> >>   		} else if (XE_IOCTL_DBG(xe, !*async) ||
> >> @@ -3346,8 +3355,25 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> >>   		struct drm_gem_object *gem_obj;
> >>   		u64 range = bind_ops[i].range;
> >>   		u64 addr = bind_ops[i].addr;
> >> +		u32 op = bind_ops[i].op;
> >>   		u32 obj = bind_ops[i].obj;
> >>   		u64 obj_offset = bind_ops[i].obj_offset;
> >> +		u32 pat_index = bind_ops[i].pat_index;
> >> +		u16 coh_mode;
> >> +
> >> +		pat_index = array_index_nospec(pat_index,
> >> +					       xe->info.pat_table_n_entries);
> >> +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
> >> +		if (XE_IOCTL_DBG(xe, !coh_mode)) {
> >
> > Assuming we drop the unusable entries from the TGL table, this should be
> > impossible, right?  Any index that makes it past the n_entries check at
> > the top of the function should have a valid, non-zero coh_mode value.
> > So this should probably be an assertion (to highlight a KMD bug) rather
> > than just a silent uapi failure return.
> 
> Makes sense.
> 
> >
> >> +			err = -EINVAL;
> >> +			goto put_obj;
> >> +		}
> >> +
> >> +		if (XE_IOCTL_DBG(xe, VM_BIND_OP(op) == XE_VM_BIND_OP_MAP_USERPTR &&
> >> +				 coh_mode == XE_GEM_COHERENCY_NONE)) {
> >> +			err = -EINVAL;
> >> +			goto put_obj;
> >> +		}
> >>
> >>   		if (!obj)
> >>   			continue;
> >> @@ -3375,6 +3401,22 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> >>   				goto put_obj;
> >>   			}
> >>   		}
> >> +
> >> +		if (bos[i]->coh_mode) {
> >> +			if (XE_IOCTL_DBG(xe, bos[i]->coh_mode != coh_mode)) {
> >> +				err = -EINVAL;
> >> +				goto put_obj;
> >> +			}
> >> +		} else if (XE_IOCTL_DBG(xe, coh_mode == XE_GEM_COHERENCY_NONE)) {
> >> +			/*
> >> +			 * Imported dma-buf from a different device should
> >> +			 * require 1way or 2way coherency since we don't know
> >> +			 * how it was mapped on CPU. Just assume it is
> >> +			 * potentially cached on CPU side.
> >> +			 */
> >> +			err = -EINVAL;
> >> +			goto put_obj;
> >> +		}
> >>   	}
> >>
> >>   	if (args->num_syncs) {
> >> @@ -3412,10 +3454,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
> >>   		u64 obj_offset = bind_ops[i].obj_offset;
> >>   		u8 tile_mask = bind_ops[i].tile_mask;
> >>   		u32 region = bind_ops[i].region;
> >> +		u32 pat_index = bind_ops[i].pat_index;
> >>
> >>   		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
> >>   						  addr, range, op, tile_mask,
> >> -						  region);
> >> +						  region, pat_index);
> >>   		if (IS_ERR(ops[i])) {
> >>   			err = PTR_ERR(ops[i]);
> >>   			ops[i] = NULL;
> >> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> >> index 83a1f87b6537..508679ed3c74 100644
> >> --- a/drivers/gpu/drm/xe/xe_vm_types.h
> >> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> >> @@ -111,6 +111,11 @@ struct xe_vma {
> >>   	 */
> >>   	u8 tile_present;
> >>
> >> +	/**
> >> +	 * @pat_index: The pat index to use when encoding the PTEs for this vma.
> >> +	 */
> >> +	u32 pat_index;
> >> +
> >>   	struct {
> >>   		struct list_head rebind_link;
> >>   	} notifier;
> >> @@ -338,10 +343,8 @@ struct xe_vm {
> >>   	bool batch_invalidate_tlb;
> >>
> >>   	struct {
> >> -		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat,
> >> -						enum xe_cache_level cache);
> >> -		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat,
> >> -						enum xe_cache_level cache);
> >> +		u64 (*pte_encode)(struct xe_device *xe, u64 pte_pat, u32 pat_index);
> >> +		u64 (*pde_encode)(struct xe_device *xe, u64 pde_pat, u32 pat_index);
> >>   	} pat_encode;
> >>   };
> >>
> >> @@ -417,6 +420,8 @@ struct xe_vma_op {
> >>   	struct async_op_fence *fence;
> >>   	/** @tile_mask: gt mask for this operation */
> >>   	u8 tile_mask;
> >> +	/** @pat_index: The pat index to use for this operation. */
> >> +	u32 pat_index;
> >>   	/** @flags: operation flags */
> >>   	enum xe_vma_op_flags flags;
> >>
> >> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> >> index 64bc66d4b550..0c15b6f32447 100644
> >> --- a/include/uapi/drm/xe_drm.h
> >> +++ b/include/uapi/drm/xe_drm.h
> >> @@ -600,8 +600,45 @@ struct drm_xe_vm_bind_op {
> >>   	 */
> >>   	__u32 obj;
> >>
> >> -	/** @pad: MBZ */
> >> -	__u32 pad;
> >> +	/**
> >> +	 * @pat_index: The platform defined @pat_index to use for this mapping.
> >> +	 * The index basically maps to some predefined memory attributes,
> >> +	 * including things like caching, coherency and likely other stuff in
> >> +	 * the future.  The exact meaning of the pat_index is platform specific
> >
> > BTW, "other stuff in the future" already includes compression on Xe2, we
> > just haven't landed the patches for the Xe2 table yet.
> 
> Ok, good to know.
> 
> >
> >> +	 * and defined in the Bspec and PRMs.  When the KMD sets up the binding
> >> +	 * the index here is encoded into the ppGTT PTE.
> >> +	 *
> >> +	 * For coherency the @pat_index needs to match the
> >> +	 * drm_xe_gem_create.coh_mode, so either XE_GEM_COHERENCY_NONE,
> >> +	 * XE_GEM_COHERENCY_1WAY or XE_GEM_COHERENCY_2WAY. The KMD will extract
> >> +	 * the coherency mode from the @pat_index and reject if there is a
> >> +	 * mismatch (see note below for pre-MTL platforms).
> >> +	 *
> >> +	 * Note: On pre-MTL platforms there is only a caching mode and no
> >> +	 * explicit coherency mode, but on such hardware there is always a
> >> +	 * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
> >> +	 * CPU caches even with the caching mode set as uncached.  It's only the
> >> +	 * display engine that is incoherent (on dgpu it must be in VRAM which
> >> +	 * is always mapped as WC on the CPU). However to keep the uapi somewhat
> >> +	 * consistent with newer platforms the KMD groups the different cache
> >> +	 * levels into the following coherency buckets on all pre-MTL platforms:
> >> +	 *
> >> +	 *	ppGTT UC -> XE_GEM_COHERENCY_NONE
> >> +	 *	ppGTT WC -> XE_GEM_COHERENCY_NONE
> >> +	 *	ppGTT WT -> XE_GEM_COHERENCY_NONE
> >> +	 *	ppGTT WB -> XE_GEM_COHERENCY_2WAY
> >
> > As noted on the previous patch, it seems like 2-way is appropriate for
> > LLC platforms, but 1-way might be a more accurate description of dGPU
> > behavior.
> >
> >> +	 *
> >> +	 * In practice UC/WC/WT should only ever be used for scanout surfaces on
> >> +	 * such platforms since it is only the display engine that is actually
> >> +	 * incoherent. Everything else should typically use WB given that we
> >
> > What if we're sharing our buffers with some other (non-GPU) device?  Are
> > there cases where that other device wouldn't be coherent with the LLC,
> > so we'd want to use one of these?
> 
> Yeah, I guess there might be cases like that. I'll reword.
> 
> >
> >
> > Matt
> >
> >> +	 * have a shared-LLC.  On MTL+ this completely changes (also potentially
> >> +	 * no shared-LLC) and the HW defines the coherency mode as part of the
> >> +	 * @pat_index.
> >> +	 *
> >> +	 * Note: For userptr and externally imported dma-buf the kernel expects
> >> +	 * either 1WAY or 2WAY for the @pat_index.
> >> +	 */
> >> +	__u32 pat_index;
> >>
> >>   	union {
> >>   		/**
> >> --
> >> 2.41.0
> >>
> >

