[Intel-gfx] [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.

Mon Jul 9 14:24:01 UTC 2018

On 09/07/18 15:03, Lis, Tomasz wrote:
>
>
> On 2018-07-09 15:48, Lionel Landwerlin wrote:
>> On 09/07/18 14:20, Tomasz Lis wrote:
>>> The patch adds a parameter to control the data port coherency 
>>> functionality
>>> on a per-context level. When the IOCTL is called, a command to 
>>> switch data
>>> port coherency state is added to the ordered list. All prior 
>>> requests are
>>> executed on old coherency settings, and all exec requests after the 
>>> IOCTL
>>> will use new settings.
>>>
>>> Rationale:
>>>
>>> The OpenCL driver develpers requested a functionality to control cache
>>> coherency at data port level. Keeping the coherency at that level is 
>>> disabled
>>> by default due to its performance costs. OpenCL driver is planning to
>>> enable it for a small subset of submissions, when such functionality is
>>> required. Below are answers to basic question explaining background
>>> of the functionality and reasoning for the proposed implementation:
>>>
>>> 1. Why do we need a coherency enable/disable switch for memory that 
>>> is shared
>>> between CPU and GEN (GPU)?
>>>
>>> Memory coherency between CPU and GEN, while being a great feature 
>>> that enables
>>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN 
>>> architecture, adds
>>> overhead related to tracking (snooping) memory inside different 
>>> cache units
>>> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
>>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence 
>>> require
>>> memory coherency between CPU and GPU). The goal of coherency 
>>> enable/disable
>>> switch is to remove overhead of memory coherency when memory 
>>> coherency is not
>>> needed.
>>>
>>> 2. Why do we need a global coherency switch?
>>>
>>> In order to support I/O commands from within EUs (Execution Units), 
>>> Intel GEN
>>> ISA (GEN Instruction Set Assembly) contains dedicated "send" 
>>> instructions.
>>> These send instructions provide several addressing models. One of these
>>> addressing models (named "stateless") provides most flexible I/O 
>>> using plain
>>> virtual addresses (as opposed to buffer_handle+offset models). This 
>>> "stateless"
>>> model is similar to regular memory load/store operations available 
>>> on typical
>>> CPUs. Since this model provides I/O using arbitrary virtual 
>>> addresses, it
>>> enables algorithmic designs that are based on pointer-to-pointer 
>>> (e.g. buffer
>>> of pointers) concepts. For instance, it allows creating tree-like data
>>> structures such as:
>>>                     ________________
>>>                    |      NODE1     |
>>>                    | uint64_t data  |
>>>                    +----------------|
>>>                    | NODE*  |  NODE*|
>>>                    +--------+-------+
>>>                      /              \
>>>     ________________/                \________________
>>>    |      NODE2     |                |      NODE3     |
>>>    | uint64_t data  |                | uint64_t data  |
>>>    +----------------|                +----------------|
>>>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>>>    +--------+-------+                +--------+-------+
>>>
>>> Please note that pointers inside such structures can point to memory 
>>> locations
>>> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in 
>>> one OCL
>>> allocation while NODE3 resides in a completely separate OCL allocation.
>>> Additionally, such pointers can be shared with CPU (i.e. using SVM - 
>>> Shared
>>> Virtual Memory feature). Using pointers from different allocations 
>>> doesn't
>>> affect the stateless addressing model which even allows scattered 
>>> reading from
>>> different allocations at the same time (i.e. by utilizing 
>>> SIMD-nature of send
>>> instructions).
>>>
>>> When it comes to coherency programming, send instructions in 
>>> stateless model
>>> can be encoded (at ISA level) to either use or disable coherency. 
>>> However, for
>>> generic OCL applications (such as example with tree-like data 
>>> structure), OCL
>>> compiler is not able to determine origin of memory pointed to by an 
>>> arbitrary
>>> pointer - i.e. is not able to track given pointer back to a specific
>>> allocation. As such, it's not able to decide whether coherency is 
>>> needed or not
>>> for specific pointer (or for specific I/O instruction). As a result, 
>>> compiler
>>> encodes all stateless sends as coherent (doing otherwise would lead to
>>> functional issues resulting from data corruption). Please note that 
>>> it would be
>>> possible to workaround this (e.g. based on allocations map and 
>>> pointer bounds
>>> checking prior to each I/O instruction) but the performance cost of 
>>> such
>>> workaround would be many times greater than the cost of keeping 
>>> coherency
>>> always enabled. As such, enabling/disabling memory coherency at GEN 
>>> ISA level
>>> is not feasible and alternative method is needed.
>>>
>>> Such alternative solution is to have a global coherency switch that 
>>> allows
>>> disabling coherency for single (though entire) GPU submission. This is
>>> beneficial because this way we:
>>> * can enable (and pay for) coherency only in submissions that 
>>> actually need
>>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
>>> * don't care about coherency at GEN ISA granularity (no performance 
>>> impact)
>>>
>>> 3. Will coherency switch be used frequently?
>>>
>>> There are scenarios that will require frequent toggling of the 
>>> coherency
>>> switch.
>>> E.g. an application has two OCL compute kernels: kern_master and 
>>> kern_worker.
>>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>>> computational work that needs to be executed. kern_master analyzes 
>>> incoming
>>> work descriptors and populates a plain OCL buffer (non-fine-grain) 
>>> with payload
>>> for kern_worker. Once kern_master is done, kern_worker kicks-in and 
>>> processes
>>> the payload that kern_master produced. These two kernels work in a 
>>> loop, one
>>> after another. Since only kern_master requires coherency, 
>>> kern_worker should
>>> not be forced to pay for it. This means that we need to have the 
>>> ability to
>>> toggle coherency switch on or off per each GPU submission:
>>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> 
>>> (ENABLE
>>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>>
>>> v2: Fixed compilation warning.
>>> v3: Refactored the patch to add IOCTL instead of exec flag.
>>> v4: Renamed and documented the API flag. Used strict values.
>>>      Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>>>      Introduced a macro for checking whether hardware supports the 
>>> feature.
>>>
>>> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>>> Cc: Michal Winiarski <michal.winiarski at intel.com>
>>>
>>> Bspec: 11419
>>> Signed-off-by: Tomasz Lis <tomasz.lis at intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/i915_drv.h         |  1 +
>>>   drivers/gpu/drm/i915/i915_gem_context.c | 41 
>>> +++++++++++++++++++++++++++
>>>   drivers/gpu/drm/i915/i915_gem_context.h |  6 ++++
>>>   drivers/gpu/drm/i915/intel_lrc.c        | 49 
>>> +++++++++++++++++++++++++++++++++
>>>   drivers/gpu/drm/i915/intel_lrc.h        |  4 +++
>>>   include/uapi/drm/i915_drm.h             |  6 ++++
>>>   6 files changed, 107 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_drv.h 
>>> b/drivers/gpu/drm/i915/i915_drv.h
>>> index 09ab124..7d4bbd5 100644
>>> --- a/drivers/gpu/drm/i915/i915_drv.h
>>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>>> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private 
>>> *dev_priv)
>>>   #define HAS_EDRAM(dev_priv)    (!!((dev_priv)->edram_cap & 
>>> EDRAM_ENABLED))
>>>   #define HAS_WT(dev_priv)    ((IS_HASWELL(dev_priv) || \
>>>                    IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
>>> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>>
>> Reading the documentation it seems that the bit you want to set is 
>> gone in ICL/Gen11.
>> Maybe limit this to >= 9 && < 11?
> Icelake actually has the bit as well, just the address is different.
> I will add its support as a separate patch as soon as the change which 
> defines ICL_HDC_CHICKEN0 is accepted.
> But in the current form - you are right, ICL is not supported.
> I will update the condition.
> -Tomasz

Just out of curiosity, what address is ICL_HD_CHICKEN0 at?

Thanks,

-
Lionel

>>
>> Cheers,
>>
>> -
>> Lionel
>>
>>>     #define HWS_NEEDS_PHYSICAL(dev_priv) 
>>> ((dev_priv)->info.hws_needs_physical)
>>>   diff --git a/drivers/gpu/drm/i915/i915_gem_context.c 
>>> b/drivers/gpu/drm/i915/i915_gem_context.c
>>> index b10770c..6db352e 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>>> @@ -711,6 +711,26 @@ static bool client_is_banned(struct 
>>> drm_i915_file_private *file_priv)
>>>       return atomic_read(&file_priv->ban_score) >= 
>>> I915_CLIENT_SCORE_BANNED;
>>>   }
>>>   +static int i915_gem_context_set_data_port_coherent(struct 
>>> i915_gem_context *ctx)
>>> +{
>>> +    int ret;
>>> +
>>> +    ret = intel_lr_context_modify_data_port_coherency(ctx, true);
>>> +    if (!ret)
>>> +        __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>> +    return ret;
>>> +}
>>> +
>>> +static int i915_gem_context_clear_data_port_coherent(struct 
>>> i915_gem_context *ctx)
>>> +{
>>> +    int ret;
>>> +
>>> +    ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>>> +    if (!ret)
>>> +        __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>> +    return ret;
>>> +}
>>> +
>>>   int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>>>                     struct drm_file *file)
>>>   {
>>> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct 
>>> drm_device *dev, void *data,
>>>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void 
>>> *data,
>>>                       struct drm_file *file)
>>>   {
>>> +    struct drm_i915_private *dev_priv = to_i915(dev);
>>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>>       struct drm_i915_gem_context_param *args = data;
>>>       struct i915_gem_context *ctx;
>>> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct 
>>> drm_device *dev, void *data,
>>>       case I915_CONTEXT_PARAM_PRIORITY:
>>>           args->value = ctx->sched.priority;
>>>           break;
>>> +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>>> +        if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>>> +            ret = -ENODEV;
>>> +        else
>>> +            args->value = i915_gem_context_is_data_port_coherent(ctx);
>>> +        break;
>>>       default:
>>>           ret = -EINVAL;
>>>           break;
>>> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct 
>>> drm_device *dev, void *data,
>>>   int i915_gem_context_setparam_ioctl(struct drm_device *dev, void 
>>> *data,
>>>                       struct drm_file *file)
>>>   {
>>> +    struct drm_i915_private *dev_priv = to_i915(dev);
>>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>>       struct drm_i915_gem_context_param *args = data;
>>>       struct i915_gem_context *ctx;
>>> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct 
>>> drm_device *dev, void *data,
>>>           }
>>>           break;
>>>   +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>>> +        if (args->size)
>>> +            ret = -EINVAL;
>>> +        else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>>> +            ret = -ENODEV;
>>> +        else if (args->value == 1)
>>> +            ret = i915_gem_context_set_data_port_coherent(ctx);
>>> +        else if (args->value == 0)
>>> +            ret = i915_gem_context_clear_data_port_coherent(ctx);
>>> +        else
>>> +            ret = -EINVAL;
>>> +        break;
>>> +
>>>       default:
>>>           ret = -EINVAL;
>>>           break;
>>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h 
>>> b/drivers/gpu/drm/i915/i915_gem_context.h
>>> index b116e49..e8ccb70 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem_context.h
>>> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
>>> @@ -126,6 +126,7 @@ struct i915_gem_context {
>>>   #define CONTEXT_BANNABLE        3
>>>   #define CONTEXT_BANNED            4
>>>   #define CONTEXT_FORCE_SINGLE_SUBMISSION    5
>>> +#define CONTEXT_DATA_PORT_COHERENT    6
>>>         /**
>>>        * @hw_id: - unique identifier for the context
>>> @@ -257,6 +258,11 @@ static inline void 
>>> i915_gem_context_set_force_single_submission(struct i915_gem_
>>>       __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>>>   }
>>>   +static inline bool i915_gem_context_is_data_port_coherent(struct 
>>> i915_gem_context *ctx)
>>> +{
>>> +    return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>> +}
>>> +
>>>   static inline bool i915_gem_context_is_default(const struct 
>>> i915_gem_context *c)
>>>   {
>>>       return c->user_handle == DEFAULT_CONTEXT_HANDLE;
>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c 
>>> b/drivers/gpu/drm/i915/intel_lrc.c
>>> index ab89dab..1f037e3 100644
>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>>> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct 
>>> i915_gem_context *ctx,
>>>       ce->lrc_desc = desc;
>>>   }
>>>   +static int emit_set_data_port_coherency(struct i915_request *req, 
>>> bool enable)
>>> +{
>>> +    u32 *cs;
>>> +    i915_reg_t reg;
>>> +
>>> +    GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>>> +    GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>>> +
>>> +    cs = intel_ring_begin(req, 4);
>>> +    if (IS_ERR(cs))
>>> +        return PTR_ERR(cs);
>>> +
>>> +    if (INTEL_GEN(req->i915) >= 10)
>>> +        reg = CNL_HDC_CHICKEN0;
>>> +    else
>>> +        reg = HDC_CHICKEN0;
>>> +
>>> +    *cs++ = MI_LOAD_REGISTER_IMM(1);
>>> +    *cs++ = i915_mmio_reg_offset(reg);
>>> +    /* Enabling coherency means disabling the bit which forces it 
>>> off */
>>> +    if (enable)
>>> +        *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>>> +    else
>>> +        *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>>> +    *cs++ = MI_NOOP;
>>> +
>>> +    intel_ring_advance(req, cs);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +int
>>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context 
>>> *ctx,
>>> +                        bool enable)
>>> +{
>>> +    struct i915_request *req;
>>> +    int ret;
>>> +
>>> +    req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
>>> +    if (IS_ERR(req))
>>> +        return PTR_ERR(req);
>>> +
>>> +    ret = emit_set_data_port_coherency(req, enable);
>>> +
>>> +    i915_request_add(req);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>   static struct i915_priolist *
>>>   lookup_priolist(struct intel_engine_cs *engine, int prio)
>>>   {
>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h 
>>> b/drivers/gpu/drm/i915/intel_lrc.h
>>> index 1593194..f6965ae 100644
>>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>>     void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>>   +int
>>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context 
>>> *ctx,
>>> +                        bool enable);
>>> +
>>>   #endif /* _INTEL_LRC_H_ */
>>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>>> index 7f5634c..e677bea 100644
>>> --- a/include/uapi/drm/i915_drm.h
>>> +++ b/include/uapi/drm/i915_drm.h
>>> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
>>>   #define   I915_CONTEXT_MAX_USER_PRIORITY    1023 /* inclusive */
>>>   #define   I915_CONTEXT_DEFAULT_PRIORITY        0
>>>   #define   I915_CONTEXT_MIN_USER_PRIORITY    -1023 /* inclusive */
>>> +/*
>>> + * When data port level coherency is enabled, the GPU will update 
>>> memory
>>> + * buffers shared with CPU, by forcing internal cache units to send 
>>> memory
>>> + * writes to real RAM faster. Keeping such coherency has 
>>> performance cost.
>>> + */
>>> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY    0x7
>>>       __u64 value;
>>>   };
>>
>>
>
>