[Intel-xe] [PATCH v9 3/3] drm/xe/uapi: support pat_index selection with vm_bind

Thu Nov 9 15:42:23 UTC 2023

Hi,

On 09/11/2023 15:03, Zhang, Carl wrote:
> Hi Matthew,
> Sorry for the delay response, questions about this statement:
> "For imported dma-buf (from  a different device) the coherency mode is also implicit and must also be either 1WAY or 2WAY."
> 1. how about imported dma-buf from a same device/GPU? Same requirement?

For self-import (i.e same device) you can also use coh_none, however it 
only works if it makes sense given the original cpu_caching. See the 
prime_self_import_coh test here:

https://patchwork.freedesktop.org/patch/566785/?series=124667&rev=9

 From KMD pov there is not much difference between self-import and just 
some normal userspace object.

> 2. how about the exporting side? If it was created as a UC, the coherency will be treated as NONE,  then it was exported , and imported by another component, it should be implicit and be 1-way or 2way. KMD will handle these difference?

If it was imported from a difference device then KMD has no concept of 
cpu_caching for that object. For now that means requiring 1way+.

See the prime_external_import_coh test for example, in the same link 
above. It uses vgem to create some object (but in principle it could 
also be i915, amdgpu, etc) which Xe then imports and maps using vm_bind 
1way+.

> 3. about dma-buf, producer and consumer maybe set different cpu_caching?  As you know,  in producer side, may not know whether it will be exported during creation or 

With dma-buf only the producer controls the cpu_caching or equivalent. 
The consumer can only control how they wish to map it on the gpu, i.e 
the PAT index.

vmbind, it may be determined by application after these operations.
> 3 looks a bit duplicate with 2.
> 
> Thanks
> Carl
> 
>> -----Original Message-----
>> From: Auld, Matthew <matthew.auld at intel.com>
>> Sent: Friday, November 3, 2023 11:43 PM
>> To: intel-xe at lists.freedesktop.org
>> Cc: Mishra, Pallavi <pallavi.mishra at intel.com>; Thomas Hellström
>> <thomas.hellstrom at linux.intel.com>; Joonas Lahtinen
>> <joonas.lahtinen at linux.intel.com>; De Marchi, Lucas
>> <lucas.demarchi at intel.com>; Roper, Matthew D
>> <matthew.d.roper at intel.com>; Souza, Jose <jose.souza at intel.com>; Hazubski,
>> Filip <filip.hazubski at intel.com>; Zhang, Carl <carl.zhang at intel.com>; Yu, Effie
>> <effie.yu at intel.com>; Xu, Zhengguo <zhengguo.xu at intel.com>; Dugast,
>> Francois <francois.dugast at intel.com>
>> Subject: [PATCH v9 3/3] drm/xe/uapi: support pat_index selection with
>> vm_bind
>>
>> Allow userspace to directly control the pat_index for a given vm binding. This
>> should allow directly controlling the coherency, caching behaviour,
>> compression and potentially other stuff in the future for the ppGTT binding.
>>
>> The exact meaning behind the pat_index is very platform specific (see BSpec or
>> PRMs) but effectively maps to some predefined memory attributes. From the
>> KMD pov we only care about the coherency that is provided by the pat_index,
>> which falls into either NONE, 1WAY or 2WAY.
>> The vm_bind coherency mode for the given pat_index needs to be at least
>> 1way coherent when using cpu_caching with
>> DRM_XE_GEM_CPU_CACHING_WB. For platforms that lack the explicit
>> coherency mode attribute, we treat UC/WT/WC as NONE and WB as
>> AT_LEAST_1WAY.
>>
>> For userptr mappings we lack a corresponding gem object, so the expected
>> coherency mode is instead implicit and must fall into either 1WAY or 2WAY.
>> Trying to use NONE will be rejected by the kernel. For imported dma-buf (from
>> a different device) the coherency mode is also implicit and must also be either
>> 1WAY or 2WAY.
>>
>> v2:
>>    - Undefined coh_mode(pat_index) can now be treated as programmer
>>      error. (Matt Roper)
>>    - We now allow gem_create.coh_mode <= coh_mode(pat_index), rather than
>>      having to match exactly. This ensures imported dma-buf can always
>>      just use 1way (or even 2way), now that we also bundle 1way/2way into
>>      at_least_1way. We still require 1way/2way for external dma-buf, but
>>      the policy can now be the same for self-import, if desired.
>>    - Use u16 for pat_index in uapi. u32 is massive overkill. (José)
>>    - Move as much of the pat_index validation as we can into
>>      vm_bind_ioctl_check_args. (José)
>> v3 (Matt Roper):
>>    - Split the pte_encode() refactoring into separate patch.
>> v4:
>>    - Rebase
>> v5:
>>    - Check for and reject !coh_mode which would indicate hw reserved
>>      pat_index on xe2.
>> v6:
>>    - Rebase on removal of coh_mode from uapi. We just need to reject
>>      cpu_caching=wb + pat_index with coh_none.
>>
>> Testcase: igt at xe_pat
>> Bspec: 45101, 44235 #xe
>> Bspec: 70552, 71582, 59400 #xe2
>> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
>> Cc: Pallavi Mishra <pallavi.mishra at intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>> Cc: Matt Roper <matthew.d.roper at intel.com>
>> Cc: José Roberto de Souza <jose.souza at intel.com>
>> Cc: Filip Hazubski <filip.hazubski at intel.com>
>> Cc: Carl Zhang <carl.zhang at intel.com>
>> Cc: Effie Yu <effie.yu at intel.com>
>> Cc: Zhengguo Xu <zhengguo.xu at intel.com>
>> Cc: Francois Dugast <francois.dugast at intel.com>
>> Tested-by: José Roberto de Souza <jose.souza at intel.com>
>> Reviewed-by: José Roberto de Souza <jose.souza at intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_pt.c       | 11 +-----
>>   drivers/gpu/drm/xe/xe_vm.c       | 68 ++++++++++++++++++++++++++++----
>>   drivers/gpu/drm/xe/xe_vm_types.h |  7 ++++
>>   include/uapi/drm/xe_drm.h        | 42 +++++++++++++++++++-
>>   4 files changed, 110 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c index
>> 31afab617b4e..a0f31f991f34 100644
>> --- a/drivers/gpu/drm/xe/xe_pt.c
>> +++ b/drivers/gpu/drm/xe/xe_pt.c
>> @@ -290,8 +290,6 @@ struct xe_pt_stage_bind_walk {
>>   	struct xe_vm *vm;
>>   	/** @tile: The tile we're building for. */
>>   	struct xe_tile *tile;
>> -	/** @cache: Desired cache level for the ptes */
>> -	enum xe_cache_level cache;
>>   	/** @default_pte: PTE flag only template. No address is associated */
>>   	u64 default_pte;
>>   	/** @dma_offset: DMA offset to add to the PTE. */ @@ -511,7 +509,7
>> @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t offset,  {
>>   	struct xe_pt_stage_bind_walk *xe_walk =
>>   		container_of(walk, typeof(*xe_walk), base);
>> -	u16 pat_index = tile_to_xe(xe_walk->tile)->pat.idx[xe_walk->cache];
>> +	u16 pat_index = xe_walk->vma->pat_index;
>>   	struct xe_pt *xe_parent = container_of(parent, typeof(*xe_parent),
>> base);
>>   	struct xe_vm *vm = xe_walk->vm;
>>   	struct xe_pt *xe_child;
>> @@ -657,13 +655,8 @@ xe_pt_stage_bind(struct xe_tile *tile, struct xe_vma
>> *vma,
>>   	if (is_devmem) {
>>   		xe_walk.default_pte |= XE_PPGTT_PTE_DM;
>>   		xe_walk.dma_offset = vram_region_gpu_offset(bo-
>>> ttm.resource);
>> -		xe_walk.cache = XE_CACHE_WB;
>> -	} else {
>> -		if (!xe_vma_has_no_bo(vma) && bo->flags &
>> XE_BO_SCANOUT_BIT)
>> -			xe_walk.cache = XE_CACHE_WT;
>> -		else
>> -			xe_walk.cache = XE_CACHE_WB;
>>   	}
>> +
>>   	if (!xe_vma_has_no_bo(vma) && xe_bo_is_stolen(bo))
>>   		xe_walk.dma_offset =
>> xe_ttm_stolen_gpu_offset(xe_bo_device(bo));
>>
>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c index
>> d26c90f0d702..0723974f1fc8 100644
>> --- a/drivers/gpu/drm/xe/xe_vm.c
>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>> @@ -6,6 +6,7 @@
>>   #include "xe_vm.h"
>>
>>   #include <linux/dma-fence-array.h>
>> +#include <linux/nospec.h>
>>
>>   #include <drm/drm_exec.h>
>>   #include <drm/drm_print.h>
>> @@ -26,6 +27,7 @@
>>   #include "xe_gt_pagefault.h"
>>   #include "xe_gt_tlb_invalidation.h"
>>   #include "xe_migrate.h"
>> +#include "xe_pat.h"
>>   #include "xe_pm.h"
>>   #include "xe_preempt_fence.h"
>>   #include "xe_pt.h"
>> @@ -864,7 +866,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm
>> *vm,
>>   				    u64 start, u64 end,
>>   				    bool read_only,
>>   				    bool is_null,
>> -				    u8 tile_mask)
>> +				    u8 tile_mask,
>> +				    u16 pat_index)
>>   {
>>   	struct xe_vma *vma;
>>   	struct xe_tile *tile;
>> @@ -906,6 +909,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm
>> *vm,
>>   	if (GRAPHICS_VER(vm->xe) >= 20 || vm->xe->info.platform == XE_PVC)
>>   		vma->gpuva.flags |= XE_VMA_ATOMIC_PTE_BIT;
>>
>> +	vma->pat_index = pat_index;
>> +
>>   	if (bo) {
>>   		xe_bo_assert_held(bo);
>>
>> @@ -2167,7 +2172,8 @@ static void print_op(struct xe_device *xe, struct
>> drm_gpuva_op *op)  static struct drm_gpuva_ops *
>> vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_bo *bo,
>>   			 u64 bo_offset_or_userptr, u64 addr, u64 range,
>> -			 u32 operation, u32 flags, u8 tile_mask, u32 region)
>> +			 u32 operation, u32 flags, u8 tile_mask, u32 region,
>> +			 u16 pat_index)
>>   {
>>   	struct drm_gem_object *obj = bo ? &bo->ttm.base : NULL;
>>   	struct drm_gpuva_ops *ops;
>> @@ -2194,6 +2200,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct
>> xe_bo *bo,
>>   			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>
>>   			op->tile_mask = tile_mask;
>> +			op->pat_index = pat_index;
>>   			op->map.immediate =
>>   				flags & XE_VM_BIND_FLAG_IMMEDIATE;
>>   			op->map.read_only =
>> @@ -2221,6 +2228,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct
>> xe_bo *bo,
>>   			struct xe_vma_op *op = gpuva_op_to_vma_op(__op);
>>
>>   			op->tile_mask = tile_mask;
>> +			op->pat_index = pat_index;
>>   			op->prefetch.region = region;
>>   		}
>>   		break;
>> @@ -2263,7 +2271,8 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct
>> xe_bo *bo,  }
>>
>>   static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map
>> *op,
>> -			      u8 tile_mask, bool read_only, bool is_null)
>> +			      u8 tile_mask, bool read_only, bool is_null,
>> +			      u16 pat_index)
>>   {
>>   	struct xe_bo *bo = op->gem.obj ? gem_to_xe_bo(op->gem.obj) : NULL;
>>   	struct xe_vma *vma;
>> @@ -2279,7 +2288,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm,
>> struct drm_gpuva_op_map *op,
>>   	vma = xe_vma_create(vm, bo, op->gem.offset,
>>   			    op->va.addr, op->va.addr +
>>   			    op->va.range - 1, read_only, is_null,
>> -			    tile_mask);
>> +			    tile_mask, pat_index);
>>   	if (bo)
>>   		xe_bo_unlock(bo);
>>
>> @@ -2425,7 +2434,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
>> *vm, struct xe_exec_queue *q,
>>
>>   			vma = new_vma(vm, &op->base.map,
>>   				      op->tile_mask, op->map.read_only,
>> -				      op->map.is_null);
>> +				      op->map.is_null, op->pat_index);
>>   			if (IS_ERR(vma))
>>   				return PTR_ERR(vma);
>>
>> @@ -2451,7 +2460,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
>> *vm, struct xe_exec_queue *q,
>>
>>   				vma = new_vma(vm, op->base.remap.prev,
>>   					      op->tile_mask, read_only,
>> -					      is_null);
>> +					      is_null, op->pat_index);
>>   				if (IS_ERR(vma))
>>   					return PTR_ERR(vma);
>>
>> @@ -2485,7 +2494,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm
>> *vm, struct xe_exec_queue *q,
>>
>>   				vma = new_vma(vm, op->base.remap.next,
>>   					      op->tile_mask, read_only,
>> -					      is_null);
>> +					      is_null, op->pat_index);
>>   				if (IS_ERR(vma))
>>   					return PTR_ERR(vma);
>>
>> @@ -2883,6 +2892,26 @@ static int vm_bind_ioctl_check_args(struct
>> xe_device *xe,
>>   		u64 obj_offset = (*bind_ops)[i].obj_offset;
>>   		u32 region = (*bind_ops)[i].region;
>>   		bool is_null = flags & XE_VM_BIND_FLAG_NULL;
>> +		u16 pat_index = (*bind_ops)[i].pat_index;
>> +		u16 coh_mode;
>> +
>> +		if (XE_IOCTL_DBG(xe, pat_index >= xe->pat.n_entries)) {
>> +			err = -EINVAL;
>> +			goto free_bind_ops;
>> +		}
>> +
>> +		pat_index = array_index_nospec(pat_index, xe->pat.n_entries);
>> +		(*bind_ops)[i].pat_index = pat_index;
>> +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
>> +		if (XE_IOCTL_DBG(xe, !coh_mode)) { /* hw reserved */
>> +			err = -EINVAL;
>> +			goto free_bind_ops;
>> +		}
>> +
>> +		if (XE_WARN_ON(coh_mode > XE_COH_AT_LEAST_1WAY)) {
>> +			err = -EINVAL;
>> +			goto free_bind_ops;
>> +		}
>>
>>   		if (i == 0) {
>>   			*async = !!(flags & XE_VM_BIND_FLAG_ASYNC); @@ -
>> 2913,6 +2942,8 @@ static int vm_bind_ioctl_check_args(struct xe_device *xe,
>>   				 op == XE_VM_BIND_OP_UNMAP_ALL) ||
>>   		    XE_IOCTL_DBG(xe, obj &&
>>   				 op == XE_VM_BIND_OP_MAP_USERPTR) ||
>> +		    XE_IOCTL_DBG(xe, coh_mode == XE_COH_NONE &&
>> +				 op == XE_VM_BIND_OP_MAP_USERPTR) ||
>>   		    XE_IOCTL_DBG(xe, obj &&
>>   				 op == XE_VM_BIND_OP_PREFETCH) ||
>>   		    XE_IOCTL_DBG(xe, region &&
>> @@ -3046,6 +3077,8 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void
>> *data, struct drm_file *file)
>>   		u64 addr = bind_ops[i].addr;
>>   		u32 obj = bind_ops[i].obj;
>>   		u64 obj_offset = bind_ops[i].obj_offset;
>> +		u16 pat_index = bind_ops[i].pat_index;
>> +		u16 coh_mode;
>>
>>   		if (!obj)
>>   			continue;
>> @@ -3073,6 +3106,24 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void
>> *data, struct drm_file *file)
>>   				goto put_obj;
>>   			}
>>   		}
>> +
>> +		coh_mode = xe_pat_index_get_coh_mode(xe, pat_index);
>> +		if (bos[i]->cpu_caching) {
>> +			if (XE_IOCTL_DBG(xe, coh_mode == XE_COH_NONE
>> &&
>> +					 bos[i]->cpu_caching ==
>> DRM_XE_GEM_CPU_CACHING_WB)) {
>> +				err = -EINVAL;
>> +				goto put_obj;
>> +			}
>> +		} else if (XE_IOCTL_DBG(xe, coh_mode == XE_COH_NONE)) {
>> +			/*
>> +			 * Imported dma-buf from a different device should
>> +			 * require 1way or 2way coherency since we don't
>> know
>> +			 * how it was mapped on the CPU. Just assume is it
>> +			 * potentially cached on CPU side.
>> +			 */
>> +			err = -EINVAL;
>> +			goto put_obj;
>> +		}
>>   	}
>>
>>   	if (args->num_syncs) {
>> @@ -3100,10 +3151,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void
>> *data, struct drm_file *file)
>>   		u64 obj_offset = bind_ops[i].obj_offset;
>>   		u8 tile_mask = bind_ops[i].tile_mask;
>>   		u32 region = bind_ops[i].region;
>> +		u16 pat_index = bind_ops[i].pat_index;
>>
>>   		ops[i] = vm_bind_ioctl_ops_create(vm, bos[i], obj_offset,
>>   						  addr, range, op, flags,
>> -						  tile_mask, region);
>> +						  tile_mask, region,
>> pat_index);
>>   		if (IS_ERR(ops[i])) {
>>   			err = PTR_ERR(ops[i]);
>>   			ops[i] = NULL;
>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h
>> b/drivers/gpu/drm/xe/xe_vm_types.h
>> index aaf0c7101019..9ca84e1783ed 100644
>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>> @@ -110,6 +110,11 @@ struct xe_vma {
>>   	 */
>>   	u8 tile_present;
>>
>> +	/**
>> +	 * @pat_index: The pat index to use when encoding the PTEs for this
>> vma.
>> +	 */
>> +	u16 pat_index;
>> +
>>   	struct {
>>   		struct list_head rebind_link;
>>   	} notifier;
>> @@ -402,6 +407,8 @@ struct xe_vma_op {
>>   	struct list_head link;
>>   	/** @tile_mask: gt mask for this operation */
>>   	u8 tile_mask;
>> +	/** @pat_index: The pat index to use for this operation. */
>> +	u16 pat_index;
>>   	/** @flags: operation flags */
>>   	enum xe_vma_op_flags flags;
>>
>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h index
>> e064ed475bf6..6f4af6d6c415 100644
>> --- a/include/uapi/drm/xe_drm.h
>> +++ b/include/uapi/drm/xe_drm.h
>> @@ -643,8 +643,48 @@ struct drm_xe_vm_bind_op {
>>   	 */
>>   	__u32 obj;
>>
>> +	/**
>> +	 * @pat_index: The platform defined @pat_index to use for this
>> mapping.
>> +	 * The index basically maps to some predefined memory attributes,
>> +	 * including things like caching, coherency, compression etc.  The exact
>> +	 * meaning of the pat_index is platform specific and defined in the
>> +	 * Bspec and PRMs.  When the KMD sets up the binding the index here
>> is
>> +	 * encoded into the ppGTT PTE.
>> +	 *
>> +	 * For coherency the @pat_index needs to be at least 1way coherent
>> when
>> +	 * drm_xe_gem_create.cpu_caching is
>> DRM_XE_GEM_CPU_CACHING_WB. The KMD
>> +	 * will extract the coherency mode from the @pat_index and reject if
>> +	 * there is a mismatch (see note below for pre-MTL platforms).
>> +	 *
>> +	 * Note: On pre-MTL platforms there is only a caching mode and no
>> +	 * explicit coherency mode, but on such hardware there is always a
>> +	 * shared-LLC (or is dgpu) so all GT memory accesses are coherent with
>> +	 * CPU caches even with the caching mode set as uncached.  It's only
>> the
>> +	 * display engine that is incoherent (on dgpu it must be in VRAM which
>> +	 * is always mapped as WC on the CPU). However to keep the uapi
>> somewhat
>> +	 * consistent with newer platforms the KMD groups the different cache
>> +	 * levels into the following coherency buckets on all pre-MTL platforms:
>> +	 *
>> +	 *	ppGTT UC -> COH_NONE
>> +	 *	ppGTT WC -> COH_NONE
>> +	 *	ppGTT WT -> COH_NONE
>> +	 *	ppGTT WB -> COH_AT_LEAST_1WAY
>> +	 *
>> +	 * In practice UC/WC/WT should only ever used for scanout surfaces on
>> +	 * such platforms (or perhaps in general for dma-buf if shared with
>> +	 * another device) since it is only the display engine that is actually
>> +	 * incoherent.  Everything else should typically use WB given that we
>> +	 * have a shared-LLC.  On MTL+ this completely changes and the HW
>> +	 * defines the coherency mode as part of the @pat_index, where
>> +	 * incoherent GT access is possible.
>> +	 *
>> +	 * Note: For userptr and externally imported dma-buf the kernel
>> expects
>> +	 * either 1WAY or 2WAY for the @pat_index.
>> +	 */
>> +	__u16 pat_index;
>> +
>>   	/** @pad: MBZ */
>> -	__u32 pad;
>> +	__u16 pad;
>>
>>   	union {
>>   		/**
>> --
>> 2.41.0
>