[Intel-xe] [PATCH v9 3/3] drm/xe/uapi: support pat_index selection with vm_bind

Matthew Auld matthew.auld at intel.com
Thu Nov 16 08:35:40 UTC 2023


On 16/11/2023 03:59, Zhang, Carl wrote:
>> -----Original Message-----
>> From: Auld, Matthew <matthew.auld at intel.com>
>> Sent: Monday, November 13, 2023 5:17 PM
>>
>> On 13/11/2023 06:16, Zhang, Carl wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Auld, Matthew <matthew.auld at intel.com>
>>>> Sent: Thursday, November 9, 2023 11:42 PM
>>>>
>>>> Hi,
>>>>
>>>> On 09/11/2023 15:03, Zhang, Carl wrote:
>>>>> Hi Matthew,
>>>>> Sorry for the delayed response. Questions about this statement:
>>>>> "For imported dma-buf (from  a different device) the coherency mode
>>>>> is also
>>>> implicit and must also be either 1WAY or 2WAY."
>>>>> 1. How about imported dma-buf from the same device/GPU? Same
>>>>> requirement?
>>>>
>>>> For self-import (i.e. same device) you can also use coh_none; however,
>>>> it only works if it makes sense given the original cpu_caching. See
>>>> the prime_self_import_coh test here:
>>>>
>>>> https://patchwork.freedesktop.org/patch/566785/?series=124667&rev=9
>>>>
>>>> From the KMD PoV there is not much difference between self-import and
>>>> just some normal userspace object.
>>>>
>>>>> 2. How about the exporting side? If it was created as UC, the
>>>>> coherency will be treated as NONE. Then it is exported and imported
>>>>> by another component, where it should be implicit and be 1-way or
>>>>> 2-way. Will the KMD handle this difference?
>>>>
>>>> If it was imported from a different device then the KMD has no concept
>>>> of cpu_caching for that object. For now that means requiring 1way+.
>>>>
>>>> See the prime_external_import_coh test for example, in the same link
>>>> above. It uses vgem to create some object (but in principle it could
>>>> also be i915, amdgpu,
>>>> etc) which Xe then imports and maps using vm_bind
>>>> 1way+.
>>>>
>>>>> 3. About dma-buf: producer and consumer may set different
>>>>> cpu_caching? As you know, the producer side may not know whether
>>>>> it will be exported during creation or
>>>>
>>>> With dma-buf only the producer controls the cpu_caching or equivalent.
>>>> The consumer can only control how they wish to map it on the gpu, i.e.
>>>> the PAT index.
>>>>
>>>
>>> Actually, my question focuses on whether producer and consumer could
>>> set different PAT_index values. From the current discussion:
>>> it is "yes" for the GPU caching part,
>>> but is it "no" for the CPU caching part & coherency part? --- if that is
>>> also "yes", then there
>>
>> The coherency part was removed from the uapi (i.e. gem_create.coh_mode),
>> which just leaves the gem_create.cpu_caching. The cpu_caching is not a PAT
>> index attribute on the GPU; it is just about controlling how the KMD is meant
>> to allocate the memory. The memory is allocated as either wb or wc, which is
>> then immutable.
>>
>>> is no concern at all. Producer and consumer could use totally different
>>> PAT_index values without conflict.
>>
>> Right, producer and consumer could use different PAT index.
>>
>>>
>>> Let me use one example to describe my question (same device):
>>> Media decodes one frame, then shares it with compute to do AI inference.
>>> Media does not know whether compute will do some additional CPU access.
>>>
>>> Media creates the surfaces (CPU caching: None, coherency: None,
>>
>> Not sure what is "CPU caching : None"? Or is this just cpu_caching=wc ?
>> There is no "CPU caching" PAT index attribute AFAICT. You can just set the
>> caching policy for l3/l4, which can be things like: wb, wt, uc, wb-transient.
>>
> I mean "CPU caching: None"  UC
> 
>>> Compression: no, GPU cache: L3 ...) then shares it with compute. Which
>>> fields should be kept the same?
>>
>> KMD only cares about the "coherency:" field. But since this is same-device, in
>> this particular case you can use either none or 1-way on the importer side.
>>
>>> then, export it to a dma-buf fd.
>>> Compute side: import it and create another bo, vm_bind with a different
>>> VM. Then it should be (CPU caching: WB, coherency: 1-way,
>>> compression: no, GPU cache: any value).
>>
>> Not sure what is this "CPU caching: WB", but ignoring that above should work I
>> think.
>>
> 
> The consumer side does not know the CPU caching setting, which is handled by
> the producer during the bo creation stage, but the consumer could always set
> 1-way, because whether it is WB or UC it would still work, maybe with a perf
> impact. Is that the design?
>
> What I am thinking is: if it is no-coherency, could the consumer also know
> that it was "no-coherency" originally, and also call vm_bind as
> "no-coherency" to get better perf?

Yeah, 1way+ will always work for the consumer. You could try coh_none first
if that is preferred on some platforms, and if that is rejected fall back
to 1way+, but that only really works for self-import.
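
Roughly something like this on the consumer side (untested sketch, not part
of the patch; xe_vm_bind_one() is a made-up stand-in for whatever single-op
vm_bind wrapper the UMD already has, assumed to return negative errno, and
the two pat_index values are the platform specific indexes for a coh_none
and an at-least-1way mapping):

#include <errno.h>
#include <stdint.h>

/* Hypothetical UMD helper: issues one vm_bind op, returns negative errno. */
int xe_vm_bind_one(int fd, uint32_t vm_id, uint32_t bo_handle,
		   uint64_t addr, uint64_t range, uint16_t pat_index);

static int bind_imported_bo(int fd, uint32_t vm_id, uint32_t bo_handle,
			    uint64_t addr, uint64_t range,
			    uint16_t pat_coh_none, uint16_t pat_1way)
{
	int err;

	/* Prefer the non-coherent mapping if that is faster on this platform. */
	err = xe_vm_bind_one(fd, vm_id, bo_handle, addr, range, pat_coh_none);
	if (err != -EINVAL)
		return err;

	/*
	 * Rejected: the bo was created with cpu_caching=wb, or it is an
	 * external dma-buf import. Fall back to at-least-1way, which the
	 * KMD always accepts.
	 */
	return xe_vm_bind_one(fd, vm_id, bo_handle, addr, range, pat_1way);
}

But as said, for anything other than self-import just asking for 1way+ from
the start is the simpler option.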

> 
> 
>>>
>>>>> vm_bind; it may be determined by the application after these operations.
>>>>> Question 3 looks a bit like a duplicate of 2.
>>>>>
>>>>> Thanks
>>>>> Carl
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Auld, Matthew <matthew.auld at intel.com>
>>>>>> Sent: Friday, November 3, 2023 11:43 PM
>>>>>> To: intel-xe at lists.freedesktop.org
>>>>>> Cc: Mishra, Pallavi <pallavi.mishra at intel.com>; Thomas Hellström
>>>>>> <thomas.hellstrom at linux.intel.com>; Joonas Lahtinen
>>>>>> <joonas.lahtinen at linux.intel.com>; De Marchi, Lucas
>>>>>> <lucas.demarchi at intel.com>; Roper, Matthew D
>>>>>> <matthew.d.roper at intel.com>; Souza, Jose <jose.souza at intel.com>;
>>>>>> Hazubski, Filip <filip.hazubski at intel.com>; Zhang, Carl
>>>>>> <carl.zhang at intel.com>; Yu, Effie <effie.yu at intel.com>; Xu,
>>>>>> Zhengguo <zhengguo.xu at intel.com>; Dugast, Francois
>>>>>> <francois.dugast at intel.com>
>>>>>> Subject: [PATCH v9 3/3] drm/xe/uapi: support pat_index selection
>>>>>> with vm_bind
>>>>>>
>>>>>> Allow userspace to directly control the pat_index for a given vm
>>>>>> binding. This should allow directly controlling the coherency,
>>>>>> caching behaviour, compression and potentially other stuff in the
>>>>>> future for the ppGTT binding.
>>>>>>
>>>>>> The exact meaning behind the pat_index is very platform specific
>>>>>> (see BSpec or
>>>>>> PRMs) but effectively maps to some predefined memory attributes.
>>>>>> From the KMD pov we only care about the coherency that is provided
>>>>>> by the pat_index, which falls into either NONE, 1WAY or 2WAY.
>>>>>> The vm_bind coherency mode for the given pat_index needs to be at
>>>>>> least 1way coherent when using cpu_caching with
>>>>>> DRM_XE_GEM_CPU_CACHING_WB. For platforms that lack the explicit
>>>>>> coherency mode attribute, we treat UC/WT/WC as NONE and WB as
>>>>>> AT_LEAST_1WAY.
>>>>>>
>>>>>> For userptr mappings we lack a corresponding gem object, so the
>>>>>> expected coherency mode is instead implicit and must fall into
>>>>>> either 1WAY or 2WAY.
>>>>>> Trying to use NONE will be rejected by the kernel. For imported
>>>>>> dma-buf (from a different device) the coherency mode is also
>>>>>> implicit and must also be either 1WAY or 2WAY.
>>>>>>
>>>>>> v2:
>>>>>>      - Undefined coh_mode(pat_index) can now be treated as programmer
>>>>>>        error. (Matt Roper)
>>>>>>      - We now allow gem_create.coh_mode <= coh_mode(pat_index),
>>>>>>        rather than
>>>>>>        having to match exactly. This ensures imported dma-buf can always
>>>>>>        just use 1way (or even 2way), now that we also bundle 1way/2way into
>>>>>>        at_least_1way. We still require 1way/2way for external dma-buf, but
>>>>>>        the policy can now be the same for self-import, if desired.
>>>>>>      - Use u16 for pat_index in uapi. u32 is massive overkill. (José)
>>>>>>      - Move as much of the pat_index validation as we can into
>>>>>>        vm_bind_ioctl_check_args. (José)
>>>>>> v3 (Matt Roper):
>>>>>>      - Split the pte_encode() refactoring into separate patch.
>>>>>> v4:
>>>>>>      - Rebase
>>>>>> v5:
>>>>>>      - Check for and reject !coh_mode which would indicate hw reserved
>>>>>>        pat_index on xe2.
>>>>>> v6:
>>>>>>      - Rebase on removal of coh_mode from uapi. We just need to reject
>>>>>>        cpu_caching=wb + pat_index with coh_none.
>>>>>>
>>>>>> Testcase: igt@xe_pat
>>>>>> Bspec: 45101, 44235 #xe
>>>>>> Bspec: 70552, 71582, 59400 #xe2
>>>>>> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
>>>>>> Cc: Pallavi Mishra <pallavi.mishra at intel.com>
>>>>>> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
>>>>>> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
>>>>>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>>>>>> Cc: Matt Roper <matthew.d.roper at intel.com>
>>>>>> Cc: José Roberto de Souza <jose.souza at intel.com>
>>>>>> Cc: Filip Hazubski <filip.hazubski at intel.com>
>>>>>> Cc: Carl Zhang <carl.zhang at intel.com>
>>>>>> Cc: Effie Yu <effie.yu at intel.com>
>>>>>> Cc: Zhengguo Xu <zhengguo.xu at intel.com>
>>>>>> Cc: Francois Dugast <francois.dugast at intel.com>
>>>>>> Tested-by: José Roberto de Souza <jose.souza at intel.com>
>>>>>> Reviewed-by: José Roberto de Souza <jose.souza at intel.com>
>>>>>> ---
>>>>>>     drivers/gpu/drm/xe/xe_pt.c       | 11 +-----
>>>>>>     drivers/gpu/drm/xe/xe_vm.c       | 68 ++++++++++++++++++++++++++++----
>>>>>>     drivers/gpu/drm/xe/xe_vm_types.h |  7 ++++
>>>>>>     include/uapi/drm/xe_drm.h        | 42 +++++++++++++++++++-
>>>>>>     4 files changed, 110 insertions(+), 18 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_pt.c
>>>>>> b/drivers/gpu/drm/xe/xe_pt.c index
>>>>>> 31afab617b4e..a0f31f991f34 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_pt.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>>>>>> @@ -290,8 +290,6 @@ struct xe_pt_stage_bind_walk {
>>>>>>     	struct xe_vm *vm;
>>>>>>     	/** @tile: The tile we're building for. */
>>>>>>     	struct xe_tile *tile;
>>>>>> -	/** @cache: Desired cache level for the ptes */
>>>>>> -	enum xe_cache_level cache;
>>>>>>     	/** @default_pte: PTE flag only template. No address is associated */
>>>>>>     	u64 default_pte;
>>>>>>     	/** @dma_offset: DMA offset to add to the PTE. */ @@ -511,7
>>>>>> +509,7 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t
>> offset,  {
>>>>>>     	struct xe_pt_stage_bind_walk *xe_walk =
>>>>>>     		container_of(walk, typeof(*xe_walk), base);
>>>>>> -	u16 pat_index = tile_to_xe(xe_walk->tile)->pat.idx[xe_walk->cache];
>>>>>> +	u16 pat_index = xe_walk->vma->pat_index;
>>>>>>     	struct xe_pt *xe_parent = container_of(parent,
>>>>>> typeof(*xe_parent), base);
>>>>>>     	struct xe_vm *vm = xe_walk->vm;
>>>>>>     	struct xe_pt *xe_child;
>>>>>> @@ -657,13 +655,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct
>>>>>> xe_vma *vma,
>>>>>>     	if (is_devmem) {
>>>>>>     		xe_walk.default_pte |= XE_PPGTT_PTE_DM;
>>>>>>     		xe_walk.dma_offset = vram_region_gpu_offset(bo-
>>>>>>> ttm.resource);
>>>>>> -		xe_walk.cache = XE_CACHE_WB;
>>>>>> -	} else {
>>>>>> -		if (!xe_vma_has_no_bo(vma) && bo->flags &
>>>>>> XE_BO_SCANOUT_BIT)
>>>>>> -			xe_walk.cache = XE_CACHE_WT;
>>>>>> -		else
>>>>>> -			xe_walk.cache = XE_CACHE_WB;
>>>>>>     	}
>>>>>> +
>>>>>>     	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>>>>>>     		xe_walk.dma_offset =
>>>>>> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c
>>>>>> b/drivers/gpu/drm/xe/xe_vm.c index
>>>>>> d26c90f0d702..0723974f1fc8 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>>>>>> @@ -6,6 +6,7 @@
>>>>>>     #include "xe_vm.h"
>>>>>>
>>>>>>     #include <linux/dma-fence-array.h>
>>>>>> +#include <linux/nospec.h>
>>>>>>
>>>>>>     #include <drm/drm_exec.h>
>>>>>>     #include <drm/drm_print.h>
>>>>>> @@ -26,6 +27,7 @@
>>>>>>     #include "xe_gt_pagefault.h"
>>>>>>     #include "xe_gt_tlb_invalidation.h"
>>>>>>     #include "xe_migrate.h"
>>>>>> +#include "xe_pat.h"
>>>>>>     #include "xe_pm.h"
>>>>>>     #include "xe_preempt_fence.h"
>>>>>>     #include "xe_pt.h"
>>>>>> @@ -864,7 +866,8 @@ static struct xe_vma *xe_vma_create(struct
>>>>>> xe_vm *vm,
>>>>>>     				    u64 start, u64 end,
>>>>>>     				    bool read_only,
>>>>>>     				    bool is_null,
>>>>>> -				    u8 tile_mask)
>>>>>> +				    u8 tile_mask,
>>>>>> +				    u16 pat_index)
>>>>>>     {
>>>>>>     	struct xe_vma *vma;
>>>>>>     	struct xe_tile *tile;
>>>>>> @@ -906,6 +909,8 @@ static struct xe_vma *xe_vma_create(struct
>>>>>> xe_vm *vm,
>>>>>>     	if (GRAPHICS_VER(vm->xe) >= 20 || vm->xe->info.platform ==
>> XE_PVC)
>>>>>>     		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>>>>>
>>>>>> +	vma->pat_index = pat_index;
>>>>>> +
>>>>>>     	if (bo) {
>>>>>>     		xe_bo_assert_held(bo);
>>>>>>
>>>>>> @@ -2167,7 +2172,8 @@ static void print_op(struct xe_device *xe,
>>>>>> struct drm_gpuva_op *op)  static struct drm_gpuva_ops *
>>>>>> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>>>>>     			 u64 bo_offset_or_userptr, u64 addr, u64 range,
>>>>>> -			 u32 operation, u32 flags, u8 tile_mask, u32 region)
>>>>>> +			 u32 operation, u32 flags, u8 tile_mask, u32 region,
>>>>>> +			 u16 pat_index)
>>>>>>     {
>>>>>>     	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>>>>>>     	struct drm_gpuva_ops *ops;
>>>>>> @@ -2194,6 +2200,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
>>>>>> struct xe_bo *bo,
>>>>>>     			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>>>>>
>>>>>>     			op->tile_mask = tile_mask;
>>>>>> +			op->pat_index = pat_index;
>>>>>>     			op->map.immediate =
>>>>>>     				flags & XE_VM_BIND_FLAG_IMMEDIATE;
>>>>>>     			op->map.read_only =
>>>>>> @@ -2221,6 +2228,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
>>>>>> struct xe_bo *bo,
>>>>>>     			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>>>>>
>>>>>>     			op->tile_mask = tile_mask;
>>>>>> +			op->pat_index = pat_index;
>>>>>>     			op->prefetch.region = region;
>>>>>>     		}
>>>>>>     		break;
>>>>>> @@ -2263,7 +2271,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm,
>>>>>> struct xe_bo *bo,  }
>>>>>>
>>>>>>     static struct xe_vma *new_vma(struct xe_vm *vm, struct
>>>>>> drm_gpuva_op_map *op,
>>>>>> -			      u8 tile_mask, bool read_only, bool is_null)
>>>>>> +			      u8 tile_mask, bool read_only, bool is_null,
>>>>>> +			      u16 pat_index)
>>>>>>     {
>>>>>>     	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) :
>> NULL;
>>>>>>     	struct xe_vma *vma;
>>>>>> @@ -2279,7 +2288,7 @@ static struct xe_vma *new_vma(struct xe_vm
>>>> *vm,
>>>>>> struct drm_gpuva_op_map *op,
>>>>>>     	vma = xe_vma_create(vm, bo, op->gem.offset,
>>>>>>     			    op->va.addr, op->va.addr +
>>>>>>     			    op->va.range - 1, read_only, is_null,
>>>>>> -			    tile_mask);
>>>>>> +			    tile_mask, pat_index);
>>>>>>     	if (bo)
>>>>>>     		xe_bo_unlock(bo);
>>>>>>
>>>>>> @@ -2425,7 +2434,7 @@ static int vm_bind_ioctl_ops_parse(struct
>>>>>> xe_vm *vm, struct xe_exec_queue *q,
>>>>>>
>>>>>>     			vma = new_vma(vm, &op->base.map,
>>>>>>     				      op->tile_mask, op->map.read_only,
>>>>>> -				      op->map.is_null);
>>>>>> +				      op->map.is_null, op->pat_index);
>>>>>>     			if (IS_ERR(vma))
>>>>>>     				return PTR_ERR(vma);
>>>>>>
>>>>>> @@ -2451,7 +2460,7 @@ static int vm_bind_ioctl_ops_parse(struct
>>>>>> xe_vm *vm, struct xe_exec_queue *q,
>>>>>>
>>>>>>     				vma = new_vma(vm, op->base.remap.prev,
>>>>>>     					      op->tile_mask, read_only,
>>>>>> -					      is_null);
>>>>>> +					      is_null, op->pat_index);
>>>>>>     				if (IS_ERR(vma))
>>>>>>     					return PTR_ERR(vma);
>>>>>>
>>>>>> @@ -2485,7 +2494,7 @@ static int vm_bind_ioctl_ops_parse(struct
>>>>>> xe_vm *vm, struct xe_exec_queue *q,
>>>>>>
>>>>>>     				vma = new_vma(vm, op->base.remap.next,
>>>>>>     					      op->tile_mask, read_only,
>>>>>> -					      is_null);
>>>>>> +					      is_null, op->pat_index);
>>>>>>     				if (IS_ERR(vma))
>>>>>>     					return PTR_ERR(vma);
>>>>>>
>>>>>> @@ -2883,6 +2892,26 @@ static int vm_bind_ioctl_check_args(struct
>>>>>> xe_device *xe,
>>>>>>     		u64 obj_offset = (*bind_ops)[i].obj_offset;
>>>>>>     		u32 region = (*bind_ops)[i].region;
>>>>>>     		bool is_null = flags & XE_VM_BIND_FLAG_NULL;
>>>>>> +		u16 pat_index = (*bind_ops)[i].pat_index;
>>>>>> +		u16 coh_mode;
>>>>>> +
>>>>>> +		if (XE_IOCTL_DBG(xe, pat_index >= xe->pat.n_entries)) {
>>>>>> +			err = -EINVAL;
>>>>>> +			goto free_bind_ops;
>>>>>> +		}
>>>>>> +
>>>>>> +		pat_index = array_index_nospec(pat_index, xe-
>>> pat.n_entries);
>>>>>> +		(*bind_ops)[i].pat_index = pat_index;
>>>>>> +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
>>>>>> +		if (XE_IOCTL_DBG(xe, !coh_mode)) { /* hw reserved */
>>>>>> +			err = -EINVAL;
>>>>>> +			goto free_bind_ops;
>>>>>> +		}
>>>>>> +
>>>>>> +		if (XE_WARN_ON(coh_mode > XE_COH_AT_LEAST_1WAY)) {
>>>>>> +			err = -EINVAL;
>>>>>> +			goto free_bind_ops;
>>>>>> +		}
>>>>>>
>>>>>>     		if (i == 0) {
>>>>>>     			*async = !!(flags & XE_VM_BIND_FLAG_ASYNC); @@
>> -
>>>>>> 2913,6 +2942,8 @@ static int vm_bind_ioctl_check_args(struct
>>>>>> xe_device
>>>> *xe,
>>>>>>     				 op == XE_VM_BIND_OP_UNMAP_ALL) ||
>>>>>>     		    XE_IOCTL_DBG(xe, obj &&
>>>>>>     				 op == XE_VM_BIND_OP_MAP_USERPTR) ||
>>>>>> +		    XE_IOCTL_DBG(xe, coh_mode == XE_COH_NONE &&
>>>>>> +				 op == XE_VM_BIND_OP_MAP_USERPTR) ||
>>>>>>     		    XE_IOCTL_DBG(xe, obj &&
>>>>>>     				 op == XE_VM_BIND_OP_PREFETCH) ||
>>>>>>     		    XE_IOCTL_DBG(xe, region && @@ -3046,6 +3077,8 @@ int
>>>>>> xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct
>>>>>> drm_file *file)
>>>>>>     		u64 addr = bind_ops[i].addr;
>>>>>>     		u32 obj = bind_ops[i].obj;
>>>>>>     		u64 obj_offset = bind_ops[i].obj_offset;
>>>>>> +		u16 pat_index = bind_ops[i].pat_index;
>>>>>> +		u16 coh_mode;
>>>>>>
>>>>>>     		if (!obj)
>>>>>>     			continue;
>>>>>> @@ -3073,6 +3106,24 @@ int xe_vm_bind_ioctl(struct drm_device
>> *dev,
>>>>>> void *data, struct drm_file *file)
>>>>>>     				goto put_obj;
>>>>>>     			}
>>>>>>     		}
>>>>>> +
>>>>>> +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
>>>>>> +		if (bos[i]->cpu_caching) {
>>>>>> +			if (XE_IOCTL_DBG(xe, coh_mode == XE_COH_NONE
>>>>>> &&
>>>>>> +					 bos[i]->cpu_caching ==
>>>>>> DRM_XE_GEM_CPU_CACHING_WB)) {
>>>>>> +				err = -EINVAL;
>>>>>> +				goto put_obj;
>>>>>> +			}
>>>>>> +		} else if (XE_IOCTL_DBG(xe, coh_mode == XE_COH_NONE)) {
>>>>>> +			/*
>>>>>> +			 * Imported dma-buf from a different device should
>>>>>> +			 * require 1way or 2way coherency since we don't know
>>>>>> +			 * how it was mapped on the CPU. Just assume it is
>>>>>> +			 * potentially cached on the CPU side.
>>>>>> +			 */
>>>>>> +			err = -EINVAL;
>>>>>> +			goto put_obj;
>>>>>> +		}
>>>>>>     	}
>>>>>>
>>>>>>     	if (args->num_syncs) {
>>>>>> @@ -3100,10 +3151,11 @@ int xe_vm_bind_ioctl(struct drm_device
>>>>>> *dev, void *data, struct drm_file *file)
>>>>>>     		u64 obj_offset = bind_ops[i].obj_offset;
>>>>>>     		u8 tile_mask = bind_ops[i].tile_mask;
>>>>>>     		u32 region = bind_ops[i].region;
>>>>>> +		u16 pat_index = bind_ops[i].pat_index;
>>>>>>
>>>>>>     		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>>>>>>     						  addr, range, op, flags,
>>>>>> -						  tile_mask, region);
>>>>>> +						  tile_mask, region,
>>>>>> pat_index);
>>>>>>     		if (IS_ERR(ops[i])) {
>>>>>>     			err = PTR_ERR(ops[i]);
>>>>>>     			ops[i] = NULL;
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>> b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>> index aaf0c7101019..9ca84e1783ed 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>> @@ -110,6 +110,11 @@ struct xe_vma {
>>>>>>     	 */
>>>>>>     	u8 tile_present;
>>>>>>
>>>>>> +	/**
>>>>>> +	 * @pat_index: The pat index to use when encoding the PTEs for
>>>>>> +this
>>>>>> vma.
>>>>>> +	 */
>>>>>> +	u16 pat_index;
>>>>>> +
>>>>>>     	struct {
>>>>>>     		struct list_head rebind_link;
>>>>>>     	} notifier;
>>>>>> @@ -402,6 +407,8 @@ struct xe_vma_op {
>>>>>>     	struct list_head link;
>>>>>>     	/** @tile_mask: gt mask for this operation */
>>>>>>     	u8 tile_mask;
>>>>>> +	/** @pat_index: The pat index to use for this operation. */
>>>>>> +	u16 pat_index;
>>>>>>     	/** @flags: operation flags */
>>>>>>     	enum xe_vma_op_flags flags;
>>>>>>
>>>>>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>>>>>> index
>>>>>> e064ed475bf6..6f4af6d6c415 100644
>>>>>> --- a/include/uapi/drm/xe_drm.h
>>>>>> +++ b/include/uapi/drm/xe_drm.h
>>>>>> @@ -643,8 +643,48 @@ struct drm_xe_vm_bind_op {
>>>>>>     	 */
>>>>>>     	__u32 obj;
>>>>>>
>>>>>> +	/**
>>>>>> +	 * @pat_index: The platform defined @pat_index to use for this mapping.
>>>>>> +	 * The index basically maps to some predefined memory attributes,
>>>>>> +	 * including things like caching, coherency, compression etc.  The exact
>>>>>> +	 * meaning of the pat_index is platform specific and defined in the
>>>>>> +	 * Bspec and PRMs.  When the KMD sets up the binding the index here is
>>>>>> +	 * encoded into the ppGTT PTE.
>>>>>> +	 *
>>>>>> +	 * For coherency the @pat_index needs to be at least 1way coherent when
>>>>>> +	 * drm_xe_gem_create.cpu_caching is DRM_XE_GEM_CPU_CACHING_WB. The KMD
>>>>>> +	 * will extract the coherency mode from the @pat_index and reject if
>>>>>> +	 * there is a mismatch (see note below for pre-MTL platforms).
>>>>>> +	 *
>>>>>> +	 * Note: On pre-MTL platforms there is only a caching mode and no
>>>>>> +	 * explicit coherency mode, but on such hardware there is always a
>>>>>> +	 * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
>>>>>> +	 * CPU caches even with the caching mode set as uncached.  It's only the
>>>>>> +	 * display engine that is incoherent (on dgpu it must be in VRAM which
>>>>>> +	 * is always mapped as WC on the CPU). However to keep the uapi somewhat
>>>>>> +	 * consistent with newer platforms the KMD groups the different cache
>>>>>> +	 * levels into the following coherency buckets on all pre-MTL platforms:
>>>>>> +	 *
>>>>>> +	 *	ppGTT UC -> COH_NONE
>>>>>> +	 *	ppGTT WC -> COH_NONE
>>>>>> +	 *	ppGTT WT -> COH_NONE
>>>>>> +	 *	ppGTT WB -> COH_AT_LEAST_1WAY
>>>>>> +	 *
>>>>>> +	 * In practice UC/WC/WT should only ever used for scanout surfaces on
>>>>>> +	 * such platforms (or perhaps in general for dma-buf if shared with
>>>>>> +	 * another device) since it is only the display engine that is actually
>>>>>> +	 * incoherent.  Everything else should typically use WB given that we
>>>>>> +	 * have a shared-LLC.  On MTL+ this completely changes and the HW
>>>>>> +	 * defines the coherency mode as part of the @pat_index, where
>>>>>> +	 * incoherent GT access is possible.
>>>>>> +	 *
>>>>>> +	 * Note: For userptr and externally imported dma-buf the kernel expects
>>>>>> +	 * either 1WAY or 2WAY for the @pat_index.
>>>>>> +	 */
>>>>>> +	__u16 pat_index;
>>>>>> +
>>>>>>     	/** @pad: MBZ */
>>>>>> -	__u32 pad;
>>>>>> +	__u16 pad;
>>>>>>
>>>>>>     	union {
>>>>>>     		/**
>>>>>> --
>>>>>> 2.41.0
>>>>>

