[Intel-gfx] [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
Lis, Tomasz
tomasz.lis at intel.com
Mon Jul 9 14:03:11 UTC 2018
On 2018-07-09 15:48, Lionel Landwerlin wrote:
> On 09/07/18 14:20, Tomasz Lis wrote:
>> The patch adds a parameter to control the data port coherency
>> functionality
>> on a per-context level. When the IOCTL is called, a command to switch
>> data
>> port coherency state is added to the ordered list. All prior requests
>> are
>> executed on old coherency settings, and all exec requests after the
>> IOCTL
>> will use new settings.
>>
>> Rationale:
>>
>> The OpenCL driver develpers requested a functionality to control cache
>> coherency at data port level. Keeping the coherency at that level is
>> disabled
>> by default due to its performance costs. OpenCL driver is planning to
>> enable it for a small subset of submissions, when such functionality is
>> required. Below are answers to basic question explaining background
>> of the functionality and reasoning for the proposed implementation:
>>
>> 1. Why do we need a coherency enable/disable switch for memory that
>> is shared
>> between CPU and GEN (GPU)?
>>
>> Memory coherency between CPU and GEN, while being a great feature
>> that enables
>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN
>> architecture, adds
>> overhead related to tracking (snooping) memory inside different cache
>> units
>> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence
>> require
>> memory coherency between CPU and GPU). The goal of coherency
>> enable/disable
>> switch is to remove overhead of memory coherency when memory
>> coherency is not
>> needed.
>>
>> 2. Why do we need a global coherency switch?
>>
>> In order to support I/O commands from within EUs (Execution Units),
>> Intel GEN
>> ISA (GEN Instruction Set Assembly) contains dedicated "send"
>> instructions.
>> These send instructions provide several addressing models. One of these
>> addressing models (named "stateless") provides most flexible I/O
>> using plain
>> virtual addresses (as opposed to buffer_handle+offset models). This
>> "stateless"
>> model is similar to regular memory load/store operations available on
>> typical
>> CPUs. Since this model provides I/O using arbitrary virtual
>> addresses, it
>> enables algorithmic designs that are based on pointer-to-pointer
>> (e.g. buffer
>> of pointers) concepts. For instance, it allows creating tree-like data
>> structures such as:
>> ________________
>> | NODE1 |
>> | uint64_t data |
>> +----------------|
>> | NODE* | NODE*|
>> +--------+-------+
>> / \
>> ________________/ \________________
>> | NODE2 | | NODE3 |
>> | uint64_t data | | uint64_t data |
>> +----------------| +----------------|
>> | NODE* | NODE*| | NODE* | NODE*|
>> +--------+-------+ +--------+-------+
>>
>> Please note that pointers inside such structures can point to memory
>> locations
>> in different OCL allocations - e.g. NODE1 and NODE2 can reside in
>> one OCL
>> allocation while NODE3 resides in a completely separate OCL allocation.
>> Additionally, such pointers can be shared with CPU (i.e. using SVM -
>> Shared
>> Virtual Memory feature). Using pointers from different allocations
>> doesn't
>> affect the stateless addressing model which even allows scattered
>> reading from
>> different allocations at the same time (i.e. by utilizing SIMD-nature
>> of send
>> instructions).
>>
>> When it comes to coherency programming, send instructions in
>> stateless model
>> can be encoded (at ISA level) to either use or disable coherency.
>> However, for
>> generic OCL applications (such as example with tree-like data
>> structure), OCL
>> compiler is not able to determine origin of memory pointed to by an
>> arbitrary
>> pointer - i.e. is not able to track given pointer back to a specific
>> allocation. As such, it's not able to decide whether coherency is
>> needed or not
>> for specific pointer (or for specific I/O instruction). As a result,
>> compiler
>> encodes all stateless sends as coherent (doing otherwise would lead to
>> functional issues resulting from data corruption). Please note that
>> it would be
>> possible to workaround this (e.g. based on allocations map and
>> pointer bounds
>> checking prior to each I/O instruction) but the performance cost of such
>> workaround would be many times greater than the cost of keeping
>> coherency
>> always enabled. As such, enabling/disabling memory coherency at GEN
>> ISA level
>> is not feasible and alternative method is needed.
>>
>> Such alternative solution is to have a global coherency switch that
>> allows
>> disabling coherency for single (though entire) GPU submission. This is
>> beneficial because this way we:
>> * can enable (and pay for) coherency only in submissions that
>> actually need
>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
>> * don't care about coherency at GEN ISA granularity (no performance
>> impact)
>>
>> 3. Will coherency switch be used frequently?
>>
>> There are scenarios that will require frequent toggling of the coherency
>> switch.
>> E.g. an application has two OCL compute kernels: kern_master and
>> kern_worker.
>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>> computational work that needs to be executed. kern_master analyzes
>> incoming
>> work descriptors and populates a plain OCL buffer (non-fine-grain)
>> with payload
>> for kern_worker. Once kern_master is done, kern_worker kicks-in and
>> processes
>> the payload that kern_master produced. These two kernels work in a
>> loop, one
>> after another. Since only kern_master requires coherency, kern_worker
>> should
>> not be forced to pay for it. This means that we need to have the
>> ability to
>> toggle coherency switch on or off per each GPU submission:
>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker ->
>> (ENABLE
>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>
>> v2: Fixed compilation warning.
>> v3: Refactored the patch to add IOCTL instead of exec flag.
>> v4: Renamed and documented the API flag. Used strict values.
>> Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>> Introduced a macro for checking whether hardware supports the
>> feature.
>>
>> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>> Cc: Michal Winiarski <michal.winiarski at intel.com>
>>
>> Bspec: 11419
>> Signed-off-by: Tomasz Lis <tomasz.lis at intel.com>
>> ---
>> drivers/gpu/drm/i915/i915_drv.h | 1 +
>> drivers/gpu/drm/i915/i915_gem_context.c | 41
>> +++++++++++++++++++++++++++
>> drivers/gpu/drm/i915/i915_gem_context.h | 6 ++++
>> drivers/gpu/drm/i915/intel_lrc.c | 49
>> +++++++++++++++++++++++++++++++++
>> drivers/gpu/drm/i915/intel_lrc.h | 4 +++
>> include/uapi/drm/i915_drm.h | 6 ++++
>> 6 files changed, 107 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h
>> b/drivers/gpu/drm/i915/i915_drv.h
>> index 09ab124..7d4bbd5 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private
>> *dev_priv)
>> #define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap &
>> EDRAM_ENABLED))
>> #define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
>> IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
>> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>
> Reading the documentation it seems that the bit you want to set is
> gone in ICL/Gen11.
> Maybe limit this to >= 9 && < 11?
Icelake actually has the bit as well, just the address is different.
I will add its support as a separate patch as soon as the change which
defines ICL_HDC_CHICKEN0 is accepted.
But in the current form - you are right, ICL is not supported.
I will update the condition.
-Tomasz
>
> Cheers,
>
> -
> Lionel
>
>> #define HWS_NEEDS_PHYSICAL(dev_priv)
>> ((dev_priv)->info.hws_needs_physical)
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c
>> b/drivers/gpu/drm/i915/i915_gem_context.c
>> index b10770c..6db352e 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -711,6 +711,26 @@ static bool client_is_banned(struct
>> drm_i915_file_private *file_priv)
>> return atomic_read(&file_priv->ban_score) >=
>> I915_CLIENT_SCORE_BANNED;
>> }
>> +static int i915_gem_context_set_data_port_coherent(struct
>> i915_gem_context *ctx)
>> +{
>> + int ret;
>> +
>> + ret = intel_lr_context_modify_data_port_coherency(ctx, true);
>> + if (!ret)
>> + __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> + return ret;
>> +}
>> +
>> +static int i915_gem_context_clear_data_port_coherent(struct
>> i915_gem_context *ctx)
>> +{
>> + int ret;
>> +
>> + ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>> + if (!ret)
>> + __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> + return ret;
>> +}
>> +
>> int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>> struct drm_file *file)
>> {
>> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct
>> drm_device *dev, void *data,
>> int i915_gem_context_getparam_ioctl(struct drm_device *dev, void
>> *data,
>> struct drm_file *file)
>> {
>> + struct drm_i915_private *dev_priv = to_i915(dev);
>> struct drm_i915_file_private *file_priv = file->driver_priv;
>> struct drm_i915_gem_context_param *args = data;
>> struct i915_gem_context *ctx;
>> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct
>> drm_device *dev, void *data,
>> case I915_CONTEXT_PARAM_PRIORITY:
>> args->value = ctx->sched.priority;
>> break;
>> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> + if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>> + ret = -ENODEV;
>> + else
>> + args->value = i915_gem_context_is_data_port_coherent(ctx);
>> + break;
>> default:
>> ret = -EINVAL;
>> break;
>> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct
>> drm_device *dev, void *data,
>> int i915_gem_context_setparam_ioctl(struct drm_device *dev, void
>> *data,
>> struct drm_file *file)
>> {
>> + struct drm_i915_private *dev_priv = to_i915(dev);
>> struct drm_i915_file_private *file_priv = file->driver_priv;
>> struct drm_i915_gem_context_param *args = data;
>> struct i915_gem_context *ctx;
>> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct
>> drm_device *dev, void *data,
>> }
>> break;
>> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> + if (args->size)
>> + ret = -EINVAL;
>> + else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>> + ret = -ENODEV;
>> + else if (args->value == 1)
>> + ret = i915_gem_context_set_data_port_coherent(ctx);
>> + else if (args->value == 0)
>> + ret = i915_gem_context_clear_data_port_coherent(ctx);
>> + else
>> + ret = -EINVAL;
>> + break;
>> +
>> default:
>> ret = -EINVAL;
>> break;
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h
>> b/drivers/gpu/drm/i915/i915_gem_context.h
>> index b116e49..e8ccb70 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.h
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
>> @@ -126,6 +126,7 @@ struct i915_gem_context {
>> #define CONTEXT_BANNABLE 3
>> #define CONTEXT_BANNED 4
>> #define CONTEXT_FORCE_SINGLE_SUBMISSION 5
>> +#define CONTEXT_DATA_PORT_COHERENT 6
>> /**
>> * @hw_id: - unique identifier for the context
>> @@ -257,6 +258,11 @@ static inline void
>> i915_gem_context_set_force_single_submission(struct i915_gem_
>> __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>> }
>> +static inline bool i915_gem_context_is_data_port_coherent(struct
>> i915_gem_context *ctx)
>> +{
>> + return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +}
>> +
>> static inline bool i915_gem_context_is_default(const struct
>> i915_gem_context *c)
>> {
>> return c->user_handle == DEFAULT_CONTEXT_HANDLE;
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c
>> b/drivers/gpu/drm/i915/intel_lrc.c
>> index ab89dab..1f037e3 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct
>> i915_gem_context *ctx,
>> ce->lrc_desc = desc;
>> }
>> +static int emit_set_data_port_coherency(struct i915_request *req,
>> bool enable)
>> +{
>> + u32 *cs;
>> + i915_reg_t reg;
>> +
>> + GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>> + GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>> +
>> + cs = intel_ring_begin(req, 4);
>> + if (IS_ERR(cs))
>> + return PTR_ERR(cs);
>> +
>> + if (INTEL_GEN(req->i915) >= 10)
>> + reg = CNL_HDC_CHICKEN0;
>> + else
>> + reg = HDC_CHICKEN0;
>> +
>> + *cs++ = MI_LOAD_REGISTER_IMM(1);
>> + *cs++ = i915_mmio_reg_offset(reg);
>> + /* Enabling coherency means disabling the bit which forces it
>> off */
>> + if (enable)
>> + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>> + else
>> + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>> + *cs++ = MI_NOOP;
>> +
>> + intel_ring_advance(req, cs);
>> +
>> + return 0;
>> +}
>> +
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context
>> *ctx,
>> + bool enable)
>> +{
>> + struct i915_request *req;
>> + int ret;
>> +
>> + req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
>> + if (IS_ERR(req))
>> + return PTR_ERR(req);
>> +
>> + ret = emit_set_data_port_coherency(req, enable);
>> +
>> + i915_request_add(req);
>> +
>> + return ret;
>> +}
>> +
>> static struct i915_priolist *
>> lookup_priolist(struct intel_engine_cs *engine, int prio)
>> {
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h
>> b/drivers/gpu/drm/i915/intel_lrc.h
>> index 1593194..f6965ae 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>> void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context
>> *ctx,
>> + bool enable);
>> +
>> #endif /* _INTEL_LRC_H_ */
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index 7f5634c..e677bea 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
>> #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
>> #define I915_CONTEXT_DEFAULT_PRIORITY 0
>> #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
>> +/*
>> + * When data port level coherency is enabled, the GPU will update
>> memory
>> + * buffers shared with CPU, by forcing internal cache units to send
>> memory
>> + * writes to real RAM faster. Keeping such coherency has performance
>> cost.
>> + */
>> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
>> __u64 value;
>> };
>
>
More information about the Intel-gfx
mailing list