[Intel-gfx] [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
Lionel Landwerlin
lionel.g.landwerlin at intel.com
Mon Jul 9 13:48:26 UTC 2018
On 09/07/18 14:20, Tomasz Lis wrote:
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
>
> Rationale:
>
> The OpenCL driver develpers requested a functionality to control cache
> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. OpenCL driver is planning to
> enable it for a small subset of submissions, when such functionality is
> required. Below are answers to basic question explaining background
> of the functionality and reasoning for the proposed implementation:
>
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
>
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of coherency enable/disable
> switch is to remove overhead of memory coherency when memory coherency is not
> needed.
>
> 2. Why do we need a global coherency switch?
>
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
> ________________
> | NODE1 |
> | uint64_t data |
> +----------------|
> | NODE* | NODE*|
> +--------+-------+
> / \
> ________________/ \________________
> | NODE2 | | NODE3 |
> | uint64_t data | | uint64_t data |
> +----------------| +----------------|
> | NODE* | NODE*| | NODE* | NODE*|
> +--------+-------+ +--------+-------+
>
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
>
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as example with tree-like data structure), OCL
> compiler is not able to determine origin of memory pointed to by an arbitrary
> pointer - i.e. is not able to track given pointer back to a specific
> allocation. As such, it's not able to decide whether coherency is needed or not
> for specific pointer (or for specific I/O instruction). As a result, compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to workaround this (e.g. based on allocations map and pointer bounds
> checking prior to each I/O instruction) but the performance cost of such
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and alternative method is needed.
>
> Such alternative solution is to have a global coherency switch that allows
> disabling coherency for single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)
>
> 3. Will coherency switch be used frequently?
>
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>
> v2: Fixed compilation warning.
> v3: Refactored the patch to add IOCTL instead of exec flag.
> v4: Renamed and documented the API flag. Used strict values.
> Removed redundant GEM_WARN_ON()s. Improved to coding standard.
> Introduced a macro for checking whether hardware supports the feature.
>
> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski at intel.com>
>
> Bspec: 11419
> Signed-off-by: Tomasz Lis <tomasz.lis at intel.com>
> ---
> drivers/gpu/drm/i915/i915_drv.h | 1 +
> drivers/gpu/drm/i915/i915_gem_context.c | 41 +++++++++++++++++++++++++++
> drivers/gpu/drm/i915/i915_gem_context.h | 6 ++++
> drivers/gpu/drm/i915/intel_lrc.c | 49 +++++++++++++++++++++++++++++++++
> drivers/gpu/drm/i915/intel_lrc.h | 4 +++
> include/uapi/drm/i915_drm.h | 6 ++++
> 6 files changed, 107 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 09ab124..7d4bbd5 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private *dev_priv)
> #define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap & EDRAM_ENABLED))
> #define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
> IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
Reading the documentation it seems that the bit you want to set is gone
in ICL/Gen11.
Maybe limit this to >= 9 && < 11?
Cheers,
-
Lionel
>
> #define HWS_NEEDS_PHYSICAL(dev_priv) ((dev_priv)->info.hws_needs_physical)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index b10770c..6db352e 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -711,6 +711,26 @@ static bool client_is_banned(struct drm_i915_file_private *file_priv)
> return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
> }
>
> +static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + int ret;
> +
> + ret = intel_lr_context_modify_data_port_coherency(ctx, true);
> + if (!ret)
> + __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> + return ret;
> +}
> +
> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + int ret;
> +
> + ret = intel_lr_context_modify_data_port_coherency(ctx, false);
> + if (!ret)
> + __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> + return ret;
> +}
> +
> int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
> int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> + struct drm_i915_private *dev_priv = to_i915(dev);
> struct drm_i915_file_private *file_priv = file->driver_priv;
> struct drm_i915_gem_context_param *args = data;
> struct i915_gem_context *ctx;
> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> case I915_CONTEXT_PARAM_PRIORITY:
> args->value = ctx->sched.priority;
> break;
> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> + if (!HAS_DATA_PORT_COHERENCY(dev_priv))
> + ret = -ENODEV;
> + else
> + args->value = i915_gem_context_is_data_port_coherent(ctx);
> + break;
> default:
> ret = -EINVAL;
> break;
> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> + struct drm_i915_private *dev_priv = to_i915(dev);
> struct drm_i915_file_private *file_priv = file->driver_priv;
> struct drm_i915_gem_context_param *args = data;
> struct i915_gem_context *ctx;
> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
> }
> break;
>
> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> + if (args->size)
> + ret = -EINVAL;
> + else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
> + ret = -ENODEV;
> + else if (args->value == 1)
> + ret = i915_gem_context_set_data_port_coherent(ctx);
> + else if (args->value == 0)
> + ret = i915_gem_context_clear_data_port_coherent(ctx);
> + else
> + ret = -EINVAL;
> + break;
> +
> default:
> ret = -EINVAL;
> break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index b116e49..e8ccb70 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -126,6 +126,7 @@ struct i915_gem_context {
> #define CONTEXT_BANNABLE 3
> #define CONTEXT_BANNED 4
> #define CONTEXT_FORCE_SINGLE_SUBMISSION 5
> +#define CONTEXT_DATA_PORT_COHERENT 6
>
> /**
> * @hw_id: - unique identifier for the context
> @@ -257,6 +258,11 @@ static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
> __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
> }
>
> +static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +}
> +
> static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
> {
> return c->user_handle == DEFAULT_CONTEXT_HANDLE;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ab89dab..1f037e3 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
> ce->lrc_desc = desc;
> }
>
> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
> +{
> + u32 *cs;
> + i915_reg_t reg;
> +
> + GEM_BUG_ON(req->engine->class != RENDER_CLASS);
> + GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
> +
> + cs = intel_ring_begin(req, 4);
> + if (IS_ERR(cs))
> + return PTR_ERR(cs);
> +
> + if (INTEL_GEN(req->i915) >= 10)
> + reg = CNL_HDC_CHICKEN0;
> + else
> + reg = HDC_CHICKEN0;
> +
> + *cs++ = MI_LOAD_REGISTER_IMM(1);
> + *cs++ = i915_mmio_reg_offset(reg);
> + /* Enabling coherency means disabling the bit which forces it off */
> + if (enable)
> + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> + else
> + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> + *cs++ = MI_NOOP;
> +
> + intel_ring_advance(req, cs);
> +
> + return 0;
> +}
> +
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> + bool enable)
> +{
> + struct i915_request *req;
> + int ret;
> +
> + req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
> + if (IS_ERR(req))
> + return PTR_ERR(req);
> +
> + ret = emit_set_data_port_coherency(req, enable);
> +
> + i915_request_add(req);
> +
> + return ret;
> +}
> +
> static struct i915_priolist *
> lookup_priolist(struct intel_engine_cs *engine, int prio)
> {
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 1593194..f6965ae 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -104,4 +104,8 @@ struct i915_gem_context;
>
> void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> + bool enable);
> +
> #endif /* _INTEL_LRC_H_ */
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 7f5634c..e677bea 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
> #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
> #define I915_CONTEXT_DEFAULT_PRIORITY 0
> #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU will update memory
> + * buffers shared with CPU, by forcing internal cache units to send memory
> + * writes to real RAM faster. Keeping such coherency has performance cost.
> + */
> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
> __u64 value;
> };
>
More information about the Intel-gfx
mailing list