[Intel-gfx] [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
Lis, Tomasz
tomasz.lis at intel.com
Mon Jul 9 15:21:12 UTC 2018
On 2018-07-09 16:24, Lionel Landwerlin wrote:
> On 09/07/18 15:03, Lis, Tomasz wrote:
>>
>>
>> On 2018-07-09 15:48, Lionel Landwerlin wrote:
>>> On 09/07/18 14:20, Tomasz Lis wrote:
>>>> The patch adds a parameter to control the data port coherency
>>>> functionality
>>>> on a per-context level. When the IOCTL is called, a command to
>>>> switch data
>>>> port coherency state is added to the ordered list. All prior
>>>> requests are
>>>> executed on old coherency settings, and all exec requests after the
>>>> IOCTL
>>>> will use new settings.
>>>>
>>>> Rationale:
>>>>
>>>> The OpenCL driver develpers requested a functionality to control cache
>>>> coherency at data port level. Keeping the coherency at that level
>>>> is disabled
>>>> by default due to its performance costs. OpenCL driver is planning to
>>>> enable it for a small subset of submissions, when such
>>>> functionality is
>>>> required. Below are answers to basic question explaining background
>>>> of the functionality and reasoning for the proposed implementation:
>>>>
>>>> 1. Why do we need a coherency enable/disable switch for memory that
>>>> is shared
>>>> between CPU and GEN (GPU)?
>>>>
>>>> Memory coherency between CPU and GEN, while being a great feature
>>>> that enables
>>>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN
>>>> architecture, adds
>>>> overhead related to tracking (snooping) memory inside different
>>>> cache units
>>>> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
>>>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence
>>>> require
>>>> memory coherency between CPU and GPU). The goal of coherency
>>>> enable/disable
>>>> switch is to remove overhead of memory coherency when memory
>>>> coherency is not
>>>> needed.
>>>>
>>>> 2. Why do we need a global coherency switch?
>>>>
>>>> In order to support I/O commands from within EUs (Execution Units),
>>>> Intel GEN
>>>> ISA (GEN Instruction Set Assembly) contains dedicated "send"
>>>> instructions.
>>>> These send instructions provide several addressing models. One of
>>>> these
>>>> addressing models (named "stateless") provides most flexible I/O
>>>> using plain
>>>> virtual addresses (as opposed to buffer_handle+offset models). This
>>>> "stateless"
>>>> model is similar to regular memory load/store operations available
>>>> on typical
>>>> CPUs. Since this model provides I/O using arbitrary virtual
>>>> addresses, it
>>>> enables algorithmic designs that are based on pointer-to-pointer
>>>> (e.g. buffer
>>>> of pointers) concepts. For instance, it allows creating tree-like data
>>>> structures such as:
>>>> ________________
>>>> | NODE1 |
>>>> | uint64_t data |
>>>> +----------------|
>>>> | NODE* | NODE*|
>>>> +--------+-------+
>>>> / \
>>>> ________________/ \________________
>>>> | NODE2 | | NODE3 |
>>>> | uint64_t data | | uint64_t data |
>>>> +----------------| +----------------|
>>>> | NODE* | NODE*| | NODE* | NODE*|
>>>> +--------+-------+ +--------+-------+
>>>>
>>>> Please note that pointers inside such structures can point to
>>>> memory locations
>>>> in different OCL allocations - e.g. NODE1 and NODE2 can reside in
>>>> one OCL
>>>> allocation while NODE3 resides in a completely separate OCL
>>>> allocation.
>>>> Additionally, such pointers can be shared with CPU (i.e. using SVM
>>>> - Shared
>>>> Virtual Memory feature). Using pointers from different allocations
>>>> doesn't
>>>> affect the stateless addressing model which even allows scattered
>>>> reading from
>>>> different allocations at the same time (i.e. by utilizing
>>>> SIMD-nature of send
>>>> instructions).
>>>>
>>>> When it comes to coherency programming, send instructions in
>>>> stateless model
>>>> can be encoded (at ISA level) to either use or disable coherency.
>>>> However, for
>>>> generic OCL applications (such as example with tree-like data
>>>> structure), OCL
>>>> compiler is not able to determine origin of memory pointed to by an
>>>> arbitrary
>>>> pointer - i.e. is not able to track given pointer back to a specific
>>>> allocation. As such, it's not able to decide whether coherency is
>>>> needed or not
>>>> for specific pointer (or for specific I/O instruction). As a
>>>> result, compiler
>>>> encodes all stateless sends as coherent (doing otherwise would lead to
>>>> functional issues resulting from data corruption). Please note that
>>>> it would be
>>>> possible to workaround this (e.g. based on allocations map and
>>>> pointer bounds
>>>> checking prior to each I/O instruction) but the performance cost of
>>>> such
>>>> workaround would be many times greater than the cost of keeping
>>>> coherency
>>>> always enabled. As such, enabling/disabling memory coherency at GEN
>>>> ISA level
>>>> is not feasible and alternative method is needed.
>>>>
>>>> Such alternative solution is to have a global coherency switch that
>>>> allows
>>>> disabling coherency for single (though entire) GPU submission. This is
>>>> beneficial because this way we:
>>>> * can enable (and pay for) coherency only in submissions that
>>>> actually need
>>>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER
>>>> resources)
>>>> * don't care about coherency at GEN ISA granularity (no performance
>>>> impact)
>>>>
>>>> 3. Will coherency switch be used frequently?
>>>>
>>>> There are scenarios that will require frequent toggling of the
>>>> coherency
>>>> switch.
>>>> E.g. an application has two OCL compute kernels: kern_master and
>>>> kern_worker.
>>>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>>>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>>>> computational work that needs to be executed. kern_master analyzes
>>>> incoming
>>>> work descriptors and populates a plain OCL buffer (non-fine-grain)
>>>> with payload
>>>> for kern_worker. Once kern_master is done, kern_worker kicks-in and
>>>> processes
>>>> the payload that kern_master produced. These two kernels work in a
>>>> loop, one
>>>> after another. Since only kern_master requires coherency,
>>>> kern_worker should
>>>> not be forced to pay for it. This means that we need to have the
>>>> ability to
>>>> toggle coherency switch on or off per each GPU submission:
>>>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker ->
>>>> (ENABLE
>>>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>>>
>>>> v2: Fixed compilation warning.
>>>> v3: Refactored the patch to add IOCTL instead of exec flag.
>>>> v4: Renamed and documented the API flag. Used strict values.
>>>> Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>>>> Introduced a macro for checking whether hardware supports the
>>>> feature.
>>>>
>>>> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>>>> Cc: Michal Winiarski <michal.winiarski at intel.com>
>>>>
>>>> Bspec: 11419
>>>> Signed-off-by: Tomasz Lis <tomasz.lis at intel.com>
>>>> ---
>>>> drivers/gpu/drm/i915/i915_drv.h | 1 +
>>>> drivers/gpu/drm/i915/i915_gem_context.c | 41
>>>> +++++++++++++++++++++++++++
>>>> drivers/gpu/drm/i915/i915_gem_context.h | 6 ++++
>>>> drivers/gpu/drm/i915/intel_lrc.c | 49
>>>> +++++++++++++++++++++++++++++++++
>>>> drivers/gpu/drm/i915/intel_lrc.h | 4 +++
>>>> include/uapi/drm/i915_drm.h | 6 ++++
>>>> 6 files changed, 107 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_drv.h
>>>> b/drivers/gpu/drm/i915/i915_drv.h
>>>> index 09ab124..7d4bbd5 100644
>>>> --- a/drivers/gpu/drm/i915/i915_drv.h
>>>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>>>> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private
>>>> *dev_priv)
>>>> #define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap &
>>>> EDRAM_ENABLED))
>>>> #define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
>>>> IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
>>>> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>>>
>>> Reading the documentation it seems that the bit you want to set is
>>> gone in ICL/Gen11.
>>> Maybe limit this to >= 9 && < 11?
>> Icelake actually has the bit as well, just the address is different.
>> I will add its support as a separate patch as soon as the change
>> which defines ICL_HDC_CHICKEN0 is accepted.
>> But in the current form - you are right, ICL is not supported.
>> I will update the condition.
>> -Tomasz
>
> Just out of curiosity, what address is ICL_HD_CHICKEN0 at?
It was defined as _MMIO(0xE5F4). But now I see it is renamed to
ICL_HDC_MODE, and already on the tip.
Bspec: 19175
Wow, looks like I can include the gen11 support already. Will add in
next version.
Thank you!
>
> Thanks,
>
> -
> Lionel
>
>>>
>>> Cheers,
>>>
>>> -
>>> Lionel
>>>
>>>> #define HWS_NEEDS_PHYSICAL(dev_priv)
>>>> ((dev_priv)->info.hws_needs_physical)
>>>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c
>>>> b/drivers/gpu/drm/i915/i915_gem_context.c
>>>> index b10770c..6db352e 100644
>>>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>>>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>>>> @@ -711,6 +711,26 @@ static bool client_is_banned(struct
>>>> drm_i915_file_private *file_priv)
>>>> return atomic_read(&file_priv->ban_score) >=
>>>> I915_CLIENT_SCORE_BANNED;
>>>> }
>>>> +static int i915_gem_context_set_data_port_coherent(struct
>>>> i915_gem_context *ctx)
>>>> +{
>>>> + int ret;
>>>> +
>>>> + ret = intel_lr_context_modify_data_port_coherency(ctx, true);
>>>> + if (!ret)
>>>> + __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>>> + return ret;
>>>> +}
>>>> +
>>>> +static int i915_gem_context_clear_data_port_coherent(struct
>>>> i915_gem_context *ctx)
>>>> +{
>>>> + int ret;
>>>> +
>>>> + ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>>>> + if (!ret)
>>>> + __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>>> + return ret;
>>>> +}
>>>> +
>>>> int i915_gem_context_create_ioctl(struct drm_device *dev, void
>>>> *data,
>>>> struct drm_file *file)
>>>> {
>>>> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct
>>>> drm_device *dev, void *data,
>>>> int i915_gem_context_getparam_ioctl(struct drm_device *dev, void
>>>> *data,
>>>> struct drm_file *file)
>>>> {
>>>> + struct drm_i915_private *dev_priv = to_i915(dev);
>>>> struct drm_i915_file_private *file_priv = file->driver_priv;
>>>> struct drm_i915_gem_context_param *args = data;
>>>> struct i915_gem_context *ctx;
>>>> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct
>>>> drm_device *dev, void *data,
>>>> case I915_CONTEXT_PARAM_PRIORITY:
>>>> args->value = ctx->sched.priority;
>>>> break;
>>>> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>>>> + if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>>>> + ret = -ENODEV;
>>>> + else
>>>> + args->value =
>>>> i915_gem_context_is_data_port_coherent(ctx);
>>>> + break;
>>>> default:
>>>> ret = -EINVAL;
>>>> break;
>>>> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct
>>>> drm_device *dev, void *data,
>>>> int i915_gem_context_setparam_ioctl(struct drm_device *dev, void
>>>> *data,
>>>> struct drm_file *file)
>>>> {
>>>> + struct drm_i915_private *dev_priv = to_i915(dev);
>>>> struct drm_i915_file_private *file_priv = file->driver_priv;
>>>> struct drm_i915_gem_context_param *args = data;
>>>> struct i915_gem_context *ctx;
>>>> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct
>>>> drm_device *dev, void *data,
>>>> }
>>>> break;
>>>> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>>>> + if (args->size)
>>>> + ret = -EINVAL;
>>>> + else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>>>> + ret = -ENODEV;
>>>> + else if (args->value == 1)
>>>> + ret = i915_gem_context_set_data_port_coherent(ctx);
>>>> + else if (args->value == 0)
>>>> + ret = i915_gem_context_clear_data_port_coherent(ctx);
>>>> + else
>>>> + ret = -EINVAL;
>>>> + break;
>>>> +
>>>> default:
>>>> ret = -EINVAL;
>>>> break;
>>>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h
>>>> b/drivers/gpu/drm/i915/i915_gem_context.h
>>>> index b116e49..e8ccb70 100644
>>>> --- a/drivers/gpu/drm/i915/i915_gem_context.h
>>>> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
>>>> @@ -126,6 +126,7 @@ struct i915_gem_context {
>>>> #define CONTEXT_BANNABLE 3
>>>> #define CONTEXT_BANNED 4
>>>> #define CONTEXT_FORCE_SINGLE_SUBMISSION 5
>>>> +#define CONTEXT_DATA_PORT_COHERENT 6
>>>> /**
>>>> * @hw_id: - unique identifier for the context
>>>> @@ -257,6 +258,11 @@ static inline void
>>>> i915_gem_context_set_force_single_submission(struct i915_gem_
>>>> __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>>>> }
>>>> +static inline bool i915_gem_context_is_data_port_coherent(struct
>>>> i915_gem_context *ctx)
>>>> +{
>>>> + return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>>> +}
>>>> +
>>>> static inline bool i915_gem_context_is_default(const struct
>>>> i915_gem_context *c)
>>>> {
>>>> return c->user_handle == DEFAULT_CONTEXT_HANDLE;
>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c
>>>> b/drivers/gpu/drm/i915/intel_lrc.c
>>>> index ab89dab..1f037e3 100644
>>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>>>> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct
>>>> i915_gem_context *ctx,
>>>> ce->lrc_desc = desc;
>>>> }
>>>> +static int emit_set_data_port_coherency(struct i915_request
>>>> *req, bool enable)
>>>> +{
>>>> + u32 *cs;
>>>> + i915_reg_t reg;
>>>> +
>>>> + GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>>>> + GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>>>> +
>>>> + cs = intel_ring_begin(req, 4);
>>>> + if (IS_ERR(cs))
>>>> + return PTR_ERR(cs);
>>>> +
>>>> + if (INTEL_GEN(req->i915) >= 10)
>>>> + reg = CNL_HDC_CHICKEN0;
>>>> + else
>>>> + reg = HDC_CHICKEN0;
>>>> +
>>>> + *cs++ = MI_LOAD_REGISTER_IMM(1);
>>>> + *cs++ = i915_mmio_reg_offset(reg);
>>>> + /* Enabling coherency means disabling the bit which forces it
>>>> off */
>>>> + if (enable)
>>>> + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>>>> + else
>>>> + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>>>> + *cs++ = MI_NOOP;
>>>> +
>>>> + intel_ring_advance(req, cs);
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +int
>>>> +intel_lr_context_modify_data_port_coherency(struct
>>>> i915_gem_context *ctx,
>>>> + bool enable)
>>>> +{
>>>> + struct i915_request *req;
>>>> + int ret;
>>>> +
>>>> + req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
>>>> + if (IS_ERR(req))
>>>> + return PTR_ERR(req);
>>>> +
>>>> + ret = emit_set_data_port_coherency(req, enable);
>>>> +
>>>> + i915_request_add(req);
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> static struct i915_priolist *
>>>> lookup_priolist(struct intel_engine_cs *engine, int prio)
>>>> {
>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h
>>>> b/drivers/gpu/drm/i915/intel_lrc.h
>>>> index 1593194..f6965ae 100644
>>>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>>>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>>> void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>>> +int
>>>> +intel_lr_context_modify_data_port_coherency(struct
>>>> i915_gem_context *ctx,
>>>> + bool enable);
>>>> +
>>>> #endif /* _INTEL_LRC_H_ */
>>>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>>>> index 7f5634c..e677bea 100644
>>>> --- a/include/uapi/drm/i915_drm.h
>>>> +++ b/include/uapi/drm/i915_drm.h
>>>> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
>>>> #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
>>>> #define I915_CONTEXT_DEFAULT_PRIORITY 0
>>>> #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
>>>> +/*
>>>> + * When data port level coherency is enabled, the GPU will update
>>>> memory
>>>> + * buffers shared with CPU, by forcing internal cache units to
>>>> send memory
>>>> + * writes to real RAM faster. Keeping such coherency has
>>>> performance cost.
>>>> + */
>>>> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
>>>> __u64 value;
>>>> };
>>>
>>>
>>
>>
>
More information about the Intel-gfx
mailing list