[Intel-gfx] [PATCH v5 08/10] drm/i915/perf: allow holding preemption on filtered ctx

Thu Jun 27 14:27:40 UTC 2019

On 27/06/2019 12:53, Chris Wilson wrote:
> Quoting Lionel Landwerlin (2019-06-27 09:00:43)
>> We would like to make use of perf in Vulkan. The Vulkan API is much
>> lower level than OpenGL, with applications directly exposed to the
>> concept of command buffers (pretty much equivalent to our batch
>> buffers). In Vulkan, queries are always limited in scope to a command
>> buffer. In OpenGL, the lack of command buffer concept meant that
>> queries' duration could span multiple command buffers.
>>
>> With that restriction gone in Vulkan, we would like to simplify
>> measuring performance just by measuring the deltas between the counter
>> snapshots written by 2 MI_RECORD_PERF_COUNT commands, rather than the
>> more complex scheme we currently have in the GL driver, using 2
>> MI_RECORD_PERF_COUNT commands and doing some post processing on the
>> stream of OA reports, coming from the global OA buffer, to remove any
>> unrelated deltas in between the 2 MI_RECORD_PERF_COUNT.
>>
>> Disabling preemption only apply to a single context with which want to
>> query performance counters for and is considered a privileged
>> operation, by default protected by CAP_SYS_ADMIN. It is possible to
>> enable it for a normal user by disabling the paranoid stream setting.
>>
>> v2: Store preemption setting in intel_context (Chris)
>>
>> v3: Use priorities to avoid preemption rather than the HW mechanism
>>
>> v4: Just modify the port priority reporting function
>>
>> Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin at intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_lrc.c         |  7 ++++++-
>>   drivers/gpu/drm/i915/i915_drv.c             |  2 +-
>>   drivers/gpu/drm/i915/i915_drv.h             | 13 ++++++++++++
>>   drivers/gpu/drm/i915/i915_perf.c            | 22 +++++++++++++++++++--
>>   drivers/gpu/drm/i915/i915_priolist_types.h  |  7 +++++++
>>   drivers/gpu/drm/i915/i915_request.c         |  1 +
>>   drivers/gpu/drm/i915/i915_request.h         |  1 +
>>   drivers/gpu/drm/i915/intel_guc_submission.c | 12 ++++++++++-
>>   include/uapi/drm/i915_drm.h                 | 11 +++++++++++
>>   9 files changed, 71 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
>> index a6ea468986d1..52f4d69cdb2f 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>> @@ -256,7 +256,12 @@ static inline int rq_prio(const struct i915_request *rq)
>>   
>>   static int effective_prio(const struct i915_request *rq)
>>   {
>> -       int prio = rq_prio(rq);
>> +       int prio;
>> +
>> +       if (rq->has_perf)
>> +               prio = I915_USER_PRIORITY(I915_PRIORITY_PERF);
>> +       else
>> +               prio = rq_prio(rq);
> This doesn't cover everything. You will get timesliced now.
>
> But I do think this is the safest approach that evades all the PI sanity
> checks and possible deadlocks.
>
> There is also the problem where we check the priority on the last
> request in the queue, but preemption may need to be disabled at the head
> of the queue (the executing request). Is that a concern? I was thinking
> of marking the context as non-preemptible until after the oa is removed.

What do you mean by timesliced?

I don't really have a good way of working around the global counters so 
whatever you think is best to allow queries in batch buffers to complete.

>
>>          /*
>>           * On unwinding the active request, we give it a priority bump
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index ef54cf1c54a5..68e29b1ed1c8 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -484,7 +484,7 @@ static int i915_getparam_ioctl(struct drm_device *dev, void *data,
>>                  value = INTEL_INFO(dev_priv)->has_coherent_ggtt;
>>                  break;
>>          case I915_PARAM_PERF_REVISION:
>> -               value = 1;
>> +               value = 2;
>>                  break;
>>          case I915_PARAM_HAS_EXEC_PERF_CONFIG:
>>                  /* Obviously requires perf support. */
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index 51b103493623..f1b8a0ada986 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -1232,6 +1232,19 @@ struct i915_perf_stream {
>>           */
>>          bool enabled;
>>   
>> +       /**
>> +        * @hold_preemption: Whether preemption is put on hold for command
>> +        * submissions done on the @ctx. This is useful for some drivers that
>> +        * cannot easily post process the OA buffer context to subtract delta
>> +        * of performance counters not associated with @ctx.
>> +        */
>> +       bool hold_preemption;
>> +
>> +       /**
>> +        * @last_config_request
>> +        */
>> +       struct i915_request *last_config_request;
>> +
>>          /**
>>           * @ops: The callbacks providing the implementation of this specific
>>           * type of configured stream.
>> diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
>> index 34d74fa64c74..bf4f5fee6764 100644
>> --- a/drivers/gpu/drm/i915/i915_perf.c
>> +++ b/drivers/gpu/drm/i915/i915_perf.c
>> @@ -343,6 +343,8 @@ static const struct i915_oa_format gen8_plus_oa_formats[I915_OA_FORMAT_MAX] = {
>>    * struct perf_open_properties - for validated properties given to open a stream
>>    * @sample_flags: `DRM_I915_PERF_PROP_SAMPLE_*` properties are tracked as flags
>>    * @single_context: Whether a single or all gpu contexts should be monitored
>> + * @hold_preemption: Whether the preemption is disabled for the filtered
>> + *                   context
>>    * @ctx_handle: A gem ctx handle for use with @single_context
>>    * @metrics_set: An ID for an OA unit metric set advertised via sysfs
>>    * @oa_format: An OA unit HW report format
>> @@ -357,6 +359,7 @@ struct perf_open_properties {
>>          u32 sample_flags;
>>   
>>          u64 single_context:1;
>> +       u64 hold_preemption:1;
>>          u64 ctx_handle;
>>   
>>          /* OA sampling state */
>> @@ -2352,6 +2355,8 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
>>          stream->sample_flags |= SAMPLE_OA_REPORT;
>>          stream->sample_size += format_size;
>>   
>> +       stream->hold_preemption = props->hold_preemption;
>> +
>>          dev_priv->perf.oa.oa_buffer.format_size = format_size;
>>          if (WARN_ON(dev_priv->perf.oa.oa_buffer.format_size == 0))
>>                  return -EINVAL;
>> @@ -2890,6 +2895,15 @@ i915_perf_open_ioctl_locked(struct drm_i915_private *dev_priv,
>>                  }
>>          }
>>   
>> +       if (props->hold_preemption) {
>> +               if (!props->single_context) {
>> +                       DRM_DEBUG("preemption disable with no context\n");
>> +                       ret = -EINVAL;
>> +                       goto err;
>> +               }
>> +               privileged_op = true;
>> +       }
>> +
>>          /*
>>           * On Haswell the OA unit supports clock gating off for a specific
>>           * context and in this mode there's no visibility of metrics for the
>> @@ -2904,8 +2918,9 @@ i915_perf_open_ioctl_locked(struct drm_i915_private *dev_priv,
>>           * MI_REPORT_PERF_COUNT commands and so consider it a privileged op to
>>           * enable the OA unit by default.
>>           */
>> -       if (IS_HASWELL(dev_priv) && specific_ctx)
>> +       if (IS_HASWELL(dev_priv) && specific_ctx && !props->hold_preemption) {
>>                  privileged_op = false;
>> +       }
>>   
>>          /* Similar to perf's kernel.perf_paranoid_cpu sysctl option
>>           * we check a dev.i915.perf_stream_paranoid sysctl option
>> @@ -2914,7 +2929,7 @@ i915_perf_open_ioctl_locked(struct drm_i915_private *dev_priv,
>>           */
>>          if (privileged_op &&
>>              i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) {
>> -               DRM_DEBUG("Insufficient privileges to open system-wide i915 perf stream\n");
>> +               DRM_DEBUG("Insufficient privileges to open i915 perf stream\n");
>>                  ret = -EACCES;
>>                  goto err_ctx;
>>          }
>> @@ -3106,6 +3121,9 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
>>                          props->oa_periodic = true;
>>                          props->oa_period_exponent = value;
>>                          break;
>> +               case DRM_I915_PERF_PROP_HOLD_PREEMPTION:
>> +                       props->hold_preemption = value != 0 ? 1 : 0;
> !!value.

Done.

>
>> +                       break;
>>                  case DRM_I915_PERF_PROP_MAX:
>>                          MISSING_CASE(id);
>>                          return -EINVAL;
>> diff --git a/drivers/gpu/drm/i915/i915_priolist_types.h b/drivers/gpu/drm/i915/i915_priolist_types.h
>> index 49709de69875..fc1c1fe7ce58 100644
>> --- a/drivers/gpu/drm/i915/i915_priolist_types.h
>> +++ b/drivers/gpu/drm/i915/i915_priolist_types.h
>> @@ -17,6 +17,13 @@ enum {
>>          I915_PRIORITY_NORMAL = I915_CONTEXT_DEFAULT_PRIORITY,
>>          I915_PRIORITY_MAX = I915_CONTEXT_MAX_USER_PRIORITY + 1,
>>   
>> +       /* Requests containing performance queries must not be preempted by
>> +        * another context. They get scheduled with their default priority and
>> +        * once they reach the execlist ports we bump them to
>> +        * I915_PRIORITY_PERF so that they stick to the HW until they finish.
>> +        */
>> +       I915_PRIORITY_PERF = I915_CONTEXT_MAX_USER_PRIORITY + 2,
> Just let it be next, one above max?

Done.

>
>> +
>>          I915_PRIORITY_INVALID = INT_MIN
>>   };
>>   
>> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
>> index 5ff87c4a0cd5..44b4b2bfedf9 100644
>> --- a/drivers/gpu/drm/i915/i915_request.c
>> +++ b/drivers/gpu/drm/i915/i915_request.c
>> @@ -685,6 +685,7 @@ __i915_request_create(struct intel_context *ce, gfp_t gfp)
>>          rq->batch = NULL;
>>          rq->capture_list = NULL;
>>          rq->waitboost = false;
>> +       rq->has_perf = false;
>>          rq->execution_mask = ALL_ENGINES;
>>   
>>          INIT_LIST_HEAD(&rq->active_list);
>> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
>> index b58ceef92e20..21be6b44602d 100644
>> --- a/drivers/gpu/drm/i915/i915_request.h
>> +++ b/drivers/gpu/drm/i915/i915_request.h
>> @@ -217,6 +217,7 @@ struct i915_request {
>>          unsigned long emitted_jiffies;
>>   
>>          bool waitboost;
>> +       bool has_perf;
> 2 bools, probably time for a flags if we stick with this approach.
>
> Also note previous patch didn't compile but rq->perf was out of
> sequence.
> -Chris
>
Done.