[Intel-gfx] [PATCH v5] drm/i915: Add IOCTL Param to control data port coherency.

Fri Jul 13 17:44:18 UTC 2018


On 2018-07-13 12:40, Tvrtko Ursulin wrote:
>
> On 12/07/2018 16:10, Tomasz Lis wrote:
>> The patch adds a parameter to control the data port coherency 
>> functionality
>> on a per-context level. When the IOCTL is called, a command to switch 
>> data
>> port coherency state is added to the ordered list. All prior requests 
>> are
>> executed on old coherency settings, and all exec requests after the 
>> IOCTL
>> will use new settings.
>>
>> Rationale:
>>
>> The OpenCL driver develpers requested a functionality to control cache
>> coherency at data port level. Keeping the coherency at that level is 
>> disabled
>> by default due to its performance costs. OpenCL driver is planning to
>> enable it for a small subset of submissions, when such functionality is
>> required. Below are answers to basic question explaining background
>> of the functionality and reasoning for the proposed implementation:
>>
>> 1. Why do we need a coherency enable/disable switch for memory that 
>> is shared
>> between CPU and GEN (GPU)?
>>
>> Memory coherency between CPU and GEN, while being a great feature 
>> that enables
>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN 
>> architecture, adds
>> overhead related to tracking (snooping) memory inside different cache 
>> units
>> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence 
>> require
>> memory coherency between CPU and GPU). The goal of coherency 
>> enable/disable
>> switch is to remove overhead of memory coherency when memory 
>> coherency is not
>> needed.
>>
>> 2. Why do we need a global coherency switch?
>>
>> In order to support I/O commands from within EUs (Execution Units), 
>> Intel GEN
>> ISA (GEN Instruction Set Assembly) contains dedicated "send" 
>> instructions.
>> These send instructions provide several addressing models. One of these
>> addressing models (named "stateless") provides most flexible I/O 
>> using plain
>> virtual addresses (as opposed to buffer_handle+offset models). This 
>> "stateless"
>> model is similar to regular memory load/store operations available on 
>> typical
>> CPUs. Since this model provides I/O using arbitrary virtual 
>> addresses, it
>> enables algorithmic designs that are based on pointer-to-pointer 
>> (e.g. buffer
>> of pointers) concepts. For instance, it allows creating tree-like data
>> structures such as:
>>                     ________________
>>                    |      NODE1     |
>>                    | uint64_t data  |
>>                    +----------------|
>>                    | NODE*  |  NODE*|
>>                    +--------+-------+
>>                      /              \
>>     ________________/                \________________
>>    |      NODE2     |                |      NODE3     |
>>    | uint64_t data  |                | uint64_t data  |
>>    +----------------|                +----------------|
>>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>>    +--------+-------+                +--------+-------+
>>
>> Please note that pointers inside such structures can point to memory 
>> locations
>> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in 
>> one OCL
>> allocation while NODE3 resides in a completely separate OCL allocation.
>> Additionally, such pointers can be shared with CPU (i.e. using SVM - 
>> Shared
>> Virtual Memory feature). Using pointers from different allocations 
>> doesn't
>> affect the stateless addressing model which even allows scattered 
>> reading from
>> different allocations at the same time (i.e. by utilizing SIMD-nature 
>> of send
>> instructions).
>>
>> When it comes to coherency programming, send instructions in 
>> stateless model
>> can be encoded (at ISA level) to either use or disable coherency. 
>> However, for
>> generic OCL applications (such as example with tree-like data 
>> structure), OCL
>> compiler is not able to determine origin of memory pointed to by an 
>> arbitrary
>> pointer - i.e. is not able to track given pointer back to a specific
>> allocation. As such, it's not able to decide whether coherency is 
>> needed or not
>> for specific pointer (or for specific I/O instruction). As a result, 
>> compiler
>> encodes all stateless sends as coherent (doing otherwise would lead to
>> functional issues resulting from data corruption). Please note that 
>> it would be
>> possible to workaround this (e.g. based on allocations map and 
>> pointer bounds
>> checking prior to each I/O instruction) but the performance cost of such
>> workaround would be many times greater than the cost of keeping 
>> coherency
>> always enabled. As such, enabling/disabling memory coherency at GEN 
>> ISA level
>> is not feasible and alternative method is needed.
>>
>> Such alternative solution is to have a global coherency switch that 
>> allows
>> disabling coherency for single (though entire) GPU submission. This is
>> beneficial because this way we:
>> * can enable (and pay for) coherency only in submissions that 
>> actually need
>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
>> * don't care about coherency at GEN ISA granularity (no performance 
>> impact)
>>
>> 3. Will coherency switch be used frequently?
>>
>> There are scenarios that will require frequent toggling of the coherency
>> switch.
>> E.g. an application has two OCL compute kernels: kern_master and 
>> kern_worker.
>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>> computational work that needs to be executed. kern_master analyzes 
>> incoming
>> work descriptors and populates a plain OCL buffer (non-fine-grain) 
>> with payload
>> for kern_worker. Once kern_master is done, kern_worker kicks-in and 
>> processes
>> the payload that kern_master produced. These two kernels work in a 
>> loop, one
>> after another. Since only kern_master requires coherency, kern_worker 
>> should
>> not be forced to pay for it. This means that we need to have the 
>> ability to
>> toggle coherency switch on or off per each GPU submission:
>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> 
>> (ENABLE
>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>
>> v2: Fixed compilation warning.
>> v3: Refactored the patch to add IOCTL instead of exec flag.
>> v4: Renamed and documented the API flag. Used strict values.
>>      Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>>      Introduced a macro for checking whether hardware supports the 
>> feature.
>> v5: Renamed some locals. Made the flag write to be lazy.
>>      Updated comments to remove misconceptions. Added gen11 support.
>>
>> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>> Cc: Michal Winiarski <michal.winiarski at intel.com>
>>
>> Bspec: 11419
>> Bspec: 19175
>> Signed-off-by: Tomasz Lis <tomasz.lis at intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_drv.h            |  1 +
>>   drivers/gpu/drm/i915/i915_gem_context.c    | 29 +++++++++++++---
>>   drivers/gpu/drm/i915/i915_gem_context.h    | 17 +++++++++
>>   drivers/gpu/drm/i915/i915_gem_execbuffer.c |  6 ++++
>>   drivers/gpu/drm/i915/intel_lrc.c           | 55 
>> ++++++++++++++++++++++++++++++
>>   drivers/gpu/drm/i915/intel_lrc.h           |  4 +++
>>   include/uapi/drm/i915_drm.h                |  7 ++++
>>   7 files changed, 115 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h 
>> b/drivers/gpu/drm/i915/i915_drv.h
>> index 01dd298..73192e1 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private 
>> *dev_priv)
>>   #define HAS_EDRAM(dev_priv)    (!!((dev_priv)->edram_cap & 
>> EDRAM_ENABLED))
>>   #define HAS_WT(dev_priv)    ((IS_HASWELL(dev_priv) || \
>>                    IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
>> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>>     #define HWS_NEEDS_PHYSICAL(dev_priv) 
>> ((dev_priv)->info.hws_needs_physical)
>>   diff --git a/drivers/gpu/drm/i915/i915_gem_context.c 
>> b/drivers/gpu/drm/i915/i915_gem_context.c
>> index b10770c..b5b63ac 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -784,6 +784,7 @@ int i915_gem_context_destroy_ioctl(struct 
>> drm_device *dev, void *data,
>>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void 
>> *data,
>>                       struct drm_file *file)
>>   {
>> +    struct drm_i915_private *i915 = to_i915(dev);
>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>       struct drm_i915_gem_context_param *args = data;
>>       struct i915_gem_context *ctx;
>> @@ -804,10 +805,10 @@ int i915_gem_context_getparam_ioctl(struct 
>> drm_device *dev, void *data,
>>       case I915_CONTEXT_PARAM_GTT_SIZE:
>>           if (ctx->ppgtt)
>>               args->value = ctx->ppgtt->vm.total;
>> -        else if (to_i915(dev)->mm.aliasing_ppgtt)
>> -            args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
>> +        else if (i915->mm.aliasing_ppgtt)
>> +            args->value = i915->mm.aliasing_ppgtt->vm.total;
>>           else
>> -            args->value = to_i915(dev)->ggtt.vm.total;
>> +            args->value = i915->ggtt.vm.total;
>>           break;
>>       case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
>>           args->value = i915_gem_context_no_error_capture(ctx);
>> @@ -818,6 +819,12 @@ int i915_gem_context_getparam_ioctl(struct 
>> drm_device *dev, void *data,
>>       case I915_CONTEXT_PARAM_PRIORITY:
>>           args->value = ctx->sched.priority;
>>           break;
>> +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> +        if (!HAS_DATA_PORT_COHERENCY(i915))
>> +            ret = -ENODEV;
>> +        else
>> +            args->value = 
>> i915_gem_context_is_data_port_coherent_requested(ctx);
>
> Feels a bit like overly long name so maybe drop the _requested suffix 
> but a suggestion only.
I was considering this as well; will do.
>
>> +        break;
>>       default:
>>           ret = -EINVAL;
>>           break;
>> @@ -830,6 +837,7 @@ int i915_gem_context_getparam_ioctl(struct 
>> drm_device *dev, void *data,
>>   int i915_gem_context_setparam_ioctl(struct drm_device *dev, void 
>> *data,
>>                       struct drm_file *file)
>>   {
>> +    struct drm_i915_private *i915 = to_i915(dev);
>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>       struct drm_i915_gem_context_param *args = data;
>>       struct i915_gem_context *ctx;
>> @@ -880,7 +888,7 @@ int i915_gem_context_setparam_ioctl(struct 
>> drm_device *dev, void *data,
>>                 if (args->size)
>>                   ret = -EINVAL;
>> -            else if (!(to_i915(dev)->caps.scheduler & 
>> I915_SCHEDULER_CAP_PRIORITY))
>> +            else if (!(i915->caps.scheduler & 
>> I915_SCHEDULER_CAP_PRIORITY))
>>                   ret = -ENODEV;
>>               else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
>>                    priority < I915_CONTEXT_MIN_USER_PRIORITY)
>> @@ -893,6 +901,19 @@ int i915_gem_context_setparam_ioctl(struct 
>> drm_device *dev, void *data,
>>           }
>>           break;
>>   +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> +        if (args->size)
>> +            ret = -EINVAL;
>> +        else if (!HAS_DATA_PORT_COHERENCY(i915))
>> +            ret = -ENODEV;
>> +        else if (args->value == 1)
>> + i915_gem_context_set_data_port_coherent_requested(ctx);
>> +        else if (args->value == 0)
>> + i915_gem_context_clear_data_port_coherent_requested(ctx);
>> +        else
>> +            ret = -EINVAL;
>> +        break;
>> +
>>       default:
>>           ret = -EINVAL;
>>           break;
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h 
>> b/drivers/gpu/drm/i915/i915_gem_context.h
>> index b116e49..826af84 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.h
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
>> @@ -126,6 +126,8 @@ struct i915_gem_context {
>>   #define CONTEXT_BANNABLE        3
>>   #define CONTEXT_BANNED            4
>>   #define CONTEXT_FORCE_SINGLE_SUBMISSION    5
>> +#define CONTEXT_DATA_PORT_COHERENT_REQUESTED    6
>> +#define CONTEXT_DATA_PORT_COHERENT_ACTIVE    7
>>         /**
>>        * @hw_id: - unique identifier for the context
>> @@ -257,6 +259,21 @@ static inline void 
>> i915_gem_context_set_force_single_submission(struct i915_gem_
>>       __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>>   }
>>   +static inline bool 
>> i915_gem_context_is_data_port_coherent_requested(struct 
>> i915_gem_context *ctx)
>> +{
>> +    return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
>> +}
>> +
>> +static inline void 
>> i915_gem_context_set_data_port_coherent_requested(struct 
>> i915_gem_context *ctx)
>> +{
>> +    __set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
>> +}
>> +
>> +static inline void 
>> i915_gem_context_clear_data_port_coherent_requested(struct 
>> i915_gem_context *ctx)
>> +{
>> +    __clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
>> +}
>> +
>>   static inline bool i915_gem_context_is_default(const struct 
>> i915_gem_context *c)
>>   {
>>       return c->user_handle == DEFAULT_CONTEXT_HANDLE;
>> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c 
>> b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
>> index 3f0c612..64a7cd4 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
>> @@ -2361,6 +2361,12 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>>           goto err_batch_unpin;
>>       }
>>   +    /* Emit the switch of data port coherency state if needed */
>> +    err = intel_lr_context_modify_data_port_coherency(eb.request,
>> + i915_gem_context_is_data_port_coherent_requested(eb.ctx));
>> +    if (GEM_WARN_ON(err))
>> +        DRM_DEBUG("Data Port Coherency toggle failed, keeping old 
>> setting.\n");
>
> I think we should propagate the error to userspace here. By the virtue 
> of MIN_SPACE_FOR_ADD_REQUEST* we guarantee there must be space for 
> request emission.
>
> GEM_WARN_ON is therefore okay to let us know we got the value of 
> MIN_SPACE_FOR_ADD_REQUEST wrong. Just remove the "keeping old setting" 
> from the debug message.
ack
>
> * Having looked at the commit which last increased 
> MIN_SPACE_FOR_ADD_REQUEST I suspect the current value is large enough 
> for this addition and that we could probably look at decreasing it. It 
> is a manual process though so not straightforward.
>
> But also since this is >= GEN9 code I think it needs to be done 
> deeper. Like in the backend layer sounds right to me.
>
> Maybe intel_lrc.c/gen8_emit_flush_render in the EMIT_INVALIDATE mode? 
> That is the request preamble dealing with invalidating caches so 
> modifying cache coherency mode there as well sounds like a fit to me.
>
Agreed. Will move.
>> +
>>       if (in_fence) {
>>           err = i915_request_await_dma_fence(eb.request, in_fence);
>>           if (err < 0)
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c 
>> b/drivers/gpu/drm/i915/intel_lrc.c
>> index 35d37af..fcee03d 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -259,6 +259,61 @@ intel_lr_context_descriptor_update(struct 
>> i915_gem_context *ctx,
>>       ce->lrc_desc = desc;
>>   }
>>   +static int emit_set_data_port_coherency(struct i915_request *rq, 
>> bool enable)
>> +{
>> +    u32 *cs;
>> +    i915_reg_t reg;
>> +
>> +    GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
>> +    GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
>> +
>> +    cs = intel_ring_begin(rq, 4);
>> +    if (IS_ERR(cs))
>> +        return PTR_ERR(cs);
>> +
>> +    if (INTEL_GEN(rq->i915) >= 11)
>> +        reg = ICL_HDC_MODE;
>> +    else if (INTEL_GEN(rq->i915) >= 10)
>> +        reg = CNL_HDC_CHICKEN0;
>> +    else
>> +        reg = HDC_CHICKEN0;
>> +
>> +    *cs++ = MI_LOAD_REGISTER_IMM(1);
>> +    *cs++ = i915_mmio_reg_offset(reg);
>> +    /* Enabling coherency means disabling the bit which forces it 
>> off */
>> +    if (enable)
>> +        *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>> +    else
>> +        *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>> +    *cs++ = MI_NOOP;
>> +
>> +    intel_ring_advance(rq, cs);
>> +
>> +    return 0;
>> +}
>> +
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_request *rq,
>> +                        bool enable)
>> +{
>> +    struct i915_gem_context *ctx = rq->gem_context;
>> +    int ret;
>> +
>
> I'd put a lockdep_assert_held on struct_mutex here to mark it up for 
> the future.
ok, will do.
>
>> +    if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == 
>> enable)
>
> You don't need to pass in enable to this function since it can figure 
> out what to do from the flags on its own:
>
>     if ((ctx->flags & REQUESTED) == (ctx->flags & ACTIVE))
>         return 0;
>
> After which functions should proabbly be renamed to 
> intel_lr_context_update_data_port_coherency?
>
ack
>> +        return 0;
>> +
>> +    ret = emit_set_data_port_coherency(rq, enable);
>
> And then:
>
>     ..(rq, ctx->flags & REQUESTED)
ok, I will use a local though.
>
>> +
>> +    if (!ret) {
>> +        if (enable)
>> +            __set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
>> +        else
>> +            __clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, 
>> &ctx->flags);
>> +    }
>> +
>> +       return ret;
>> +}
>> +
>>   static struct i915_priolist *
>>   lookup_priolist(struct intel_engine_cs *engine, int prio)
>>   {
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h 
>> b/drivers/gpu/drm/i915/intel_lrc.h
>> index 1593194..20e8664 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>     void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>   +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_request *rq,
>> +                        bool enable);
>> +
>>   #endif /* _INTEL_LRC_H_ */
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index 7f5634c..0a4e31f 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1456,6 +1456,13 @@ struct drm_i915_gem_context_param {
>>   #define   I915_CONTEXT_MAX_USER_PRIORITY    1023 /* inclusive */
>>   #define   I915_CONTEXT_DEFAULT_PRIORITY        0
>>   #define   I915_CONTEXT_MIN_USER_PRIORITY    -1023 /* inclusive */
>> +/*
>> + * When data port level coherency is enabled, the GPU will update 
>> memory
>> + * buffers shared with CPU, by forcing internal cache units to send 
>> memory
>> + * writes to higher level caches faster. Enabling data port 
>> coherency has
>> + * performance cost.
>
> "has _a_ performance cost" I think but not a native speaker so might 
> be wrong.
Agreed.
Will send the update as soon as it's tested.
>
>> + */
>> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY    0x7
>>       __u64 value;
>>   };
>>
>
> Regards,
>
> Tvrtko