[Intel-gfx] [PATCH v6 12/15] drm/i915/perf: Add OA unit support for Gen 8+

Tue Apr 25 17:47:49 UTC 2017

On 25 April 2017 at 18:10, Lionel Landwerlin
<lionel.g.landwerlin at intel.com> wrote:
> On 25/04/17 09:42, Matthew Auld wrote:
>>
>> On 24 April 2017 at 19:49, Lionel Landwerlin
>> <lionel.g.landwerlin at intel.com> wrote:
>>>
>>> From: Robert Bragg <robert at sixbynine.org>
>>>
>>> Enables access to OA unit metrics for BDW, CHV, SKL and BXT which all
>>> share (more-or-less) the same OA unit design.
>>>
>>> Of particular note in comparison to Haswell: some OA unit HW config
>>> state has become per-context state and as a consequence it is somewhat
>>> more complicated to manage synchronous state changes from the cpu while
>>> there's no guarantee of what context (if any) is currently actively
>>> running on the gpu.
>>>
>>> The periodic sampling frequency which can be particularly useful for
>>> system-wide analysis (as opposed to command stream synchronised
>>> MI_REPORT_PERF_COUNT commands) is perhaps the most surprising state to
>>> have become per-context save and restored (while the OABUFFER
>>> destination is still a shared, system-wide resource).
>>>
>>> This support for gen8+ takes care to consider a number of timing
>>> challenges involved in synchronously updating per-context state
>>> primarily by programming all config state from the cpu and updating all
>>> current and saved contexts synchronously while the OA unit is still
>>> disabled.
>>>
>>> The driver intentionally avoids depending on command streamer
>>> programming to update OA state considering the lack of synchronization
>>> between the automatic loading of OACTXCONTROL state (that includes the
>>> periodic sampling state and enable state) on context restore and the
>>> parsing of any general purpose BB the driver can control. I.e. this
>>> implementation is careful to avoid the possibility of a context restore
>>> temporarily enabling any out-of-date periodic sampling state. In
>>> addition to the risk of transiently-out-of-date state being loaded
>>> automatically; there are also internal HW latencies involved in the
>>> loading of MUX configurations which would be difficult to account for
>>> from the command streamer (and we only want to enable the unit when once
>>> the MUX configuration is complete).
>>>
>>> Since the Gen8+ OA unit design no longer supports clock gating the unit
>>> off for a single given context (which effectively stopped any progress
>>> of counters while any other context was running) and instead supports
>>> tagging OA reports with a context ID for filtering on the CPU, it means
>>> we can no longer hide the system-wide progress of counters from a
>>> non-privileged application only interested in metrics for its own
>>> context. Although we could theoretically try and subtract the progress
>>> of other contexts before forwarding reports via read() we aren't in a
>>> position to filter reports captured via MI_REPORT_PERF_COUNT commands.
>>> As a result, for Gen8+, we always require the
>>> dev.i915.perf_stream_paranoid to be unset for any access to OA metrics
>>> if not root.
>>>
>>> v5: Drain submitted requests when enabling metric set to ensure no
>>>      lite-restore erases the context image we just updated (Lionel)
>>>
>>> v6: In addition to drain, switch to kernel context & update all
>>>      context in place (Chris)
>>>
>>> Signed-off-by: Robert Bragg <robert at sixbynine.org>
>>> Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin at intel.com>
>>> Reviewed-by: Matthew Auld <matthew.auld at intel.com> \o/
>>> ---
>>>   drivers/gpu/drm/i915/i915_drv.h  |  45 +-
>>>   drivers/gpu/drm/i915/i915_perf.c | 957
>>> ++++++++++++++++++++++++++++++++++++---
>>>   drivers/gpu/drm/i915/i915_reg.h  |  22 +
>>>   drivers/gpu/drm/i915/intel_lrc.c |   2 +
>>>   include/uapi/drm/i915_drm.h      |  19 +-
>>>   5 files changed, 952 insertions(+), 93 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_drv.h
>>> b/drivers/gpu/drm/i915/i915_drv.h
>>> index ffa1fc5eddfd..676b1227067c 100644
>>> --- a/drivers/gpu/drm/i915/i915_drv.h
>>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>>> @@ -2064,9 +2064,17 @@ struct i915_oa_ops {
>>>          void (*init_oa_buffer)(struct drm_i915_private *dev_priv);
>>>
>>>          /**
>>> -        * @enable_metric_set: Applies any MUX configuration to set up
>>> the
>>> -        * Boolean and Custom (B/C) counters that are part of the counter
>>> -        * reports being sampled. May apply system constraints such as
>>> +        * @select_metric_set: The auto generated code that checks
>>> whether a
>>> +        * requested OA config is applicable to the system and if so sets
>>> up
>>> +        * the mux, oa and flex eu register config pointers according to
>>> the
>>> +        * current dev_priv->perf.oa.metrics_set.
>>> +        */
>>> +       int (*select_metric_set)(struct drm_i915_private *dev_priv);
>>> +
>>> +       /**
>>> +        * @enable_metric_set: Selects and applies any MUX configuration
>>> to set
>>> +        * up the Boolean and Custom (B/C) counters that are part of the
>>> +        * counter reports being sampled. May apply system constraints
>>> such as
>>>           * disabling EU clock gating as required.
>>>           */
>>>          int (*enable_metric_set)(struct drm_i915_private *dev_priv);
>>> @@ -2097,20 +2105,13 @@ struct i915_oa_ops {
>>>                      size_t *offset);
>>>
>>>          /**
>>> -        * @oa_buffer_check: Check for OA buffer data + update tail
>>> -        *
>>> -        * This is either called via fops or the poll check hrtimer
>>> (atomic
>>> -        * ctx) without any locks taken.
>>> +        * @oa_hw_tail_read: read the OA tail pointer register
>>>           *
>>> -        * It's safe to read OA config state here unlocked, assuming that
>>> this
>>> -        * is only called while the stream is enabled, while the global
>>> OA
>>> -        * configuration can't be modified.
>>> -        *
>>> -        * Efficiency is more important than avoiding some false
>>> positives
>>> -        * here, which will be handled gracefully - likely resulting in
>>> an
>>> -        * %EAGAIN error for userspace.
>>> +        * In particular this enables us to share all the fiddly code for
>>> +        * handling the OA unit tail pointer race that affects multiple
>>> +        * generations.
>>>           */
>>> -       bool (*oa_buffer_check)(struct drm_i915_private *dev_priv);
>>> +       u32 (*oa_hw_tail_read)(struct drm_i915_private *dev_priv);
>>>   };
>>>
>>>   struct intel_cdclk_state {
>>> @@ -2472,6 +2473,7 @@ struct drm_i915_private {
>>>                          struct {
>>>                                  struct i915_vma *vma;
>>>                                  u8 *vaddr;
>>> +                               u32 last_ctx_id;
>>>                                  int format;
>>>                                  int format_size;
>>>
>>> @@ -2541,6 +2543,14 @@ struct drm_i915_private {
>>>                          } oa_buffer;
>>>
>>>                          u32 gen7_latched_oastatus1;
>>> +                       u32 ctx_oactxctrl_offset;
>>> +                       u32 ctx_flexeu0_offset;
>>> +
>>> +                       /* The RPT_ID/reason field for Gen8+ includes a
>>> bit
>>> +                        * to determine if the CTX ID in the report is
>>> valid
>>> +                        * but the specific bit differs between Gen 8 and
>>> 9
>>> +                        */
>>> +                       u32 gen8_valid_ctx_bit;
>>>
>>>                          struct i915_oa_ops ops;
>>>                          const struct i915_oa_format *oa_formats;
>>> @@ -2851,6 +2861,8 @@ intel_info(const struct drm_i915_private *dev_priv)
>>>   #define IS_KBL_ULX(dev_priv)   (INTEL_DEVID(dev_priv) == 0x590E || \
>>>                                   INTEL_DEVID(dev_priv) == 0x5915 || \
>>>                                   INTEL_DEVID(dev_priv) == 0x591E)
>>> +#define IS_SKL_GT2(dev_priv)   (IS_SKYLAKE(dev_priv) && \
>>> +                                (INTEL_DEVID(dev_priv) & 0x00F0) ==
>>> 0x0010)
>>>   #define IS_SKL_GT3(dev_priv)   (IS_SKYLAKE(dev_priv) && \
>>>                                   (INTEL_DEVID(dev_priv) & 0x00F0) ==
>>> 0x0020)
>>>   #define IS_SKL_GT4(dev_priv)   (IS_SKYLAKE(dev_priv) && \
>>> @@ -3615,6 +3627,9 @@ i915_gem_context_lookup_timeline(struct
>>> i915_gem_context *ctx,
>>>
>>>   int i915_perf_open_ioctl(struct drm_device *dev, void *data,
>>>                           struct drm_file *file);
>>> +void i915_oa_init_reg_state(struct intel_engine_cs *engine,
>>> +                           struct i915_gem_context *ctx,
>>> +                           uint32_t *reg_state);
>>>
>>>   /* i915_gem_evict.c */
>>>   int __must_check i915_gem_evict_something(struct i915_address_space
>>> *vm,
>>> diff --git a/drivers/gpu/drm/i915/i915_perf.c
>>> b/drivers/gpu/drm/i915/i915_perf.c
>>> index 3277a52ce98e..f2688dee5a03 100644
>>> --- a/drivers/gpu/drm/i915/i915_perf.c
>>> +++ b/drivers/gpu/drm/i915/i915_perf.c
>>> @@ -196,6 +196,12 @@
>>>
>>>   #include "i915_drv.h"
>>>   #include "i915_oa_hsw.h"
>>> +#include "i915_oa_bdw.h"
>>> +#include "i915_oa_chv.h"
>>> +#include "i915_oa_sklgt2.h"
>>> +#include "i915_oa_sklgt3.h"
>>> +#include "i915_oa_sklgt4.h"
>>> +#include "i915_oa_bxt.h"
>>>
>>>   /* HW requires this to be a power of two, between 128k and 16M, though
>>> driver
>>>    * is currently generally designed assuming the largest 16M size is
>>> used such
>>> @@ -215,7 +221,7 @@
>>>    *
>>>    * Although this can be observed explicitly while copying reports to
>>> userspace
>>>    * by checking for a zeroed report-id field in tail reports, we want to
>>> account
>>> - * for this earlier, as part of the _oa_buffer_check to avoid lots of
>>> redundant
>>> + * for this earlier, as part of the oa_buffer_check to avoid lots of
>>> redundant
>>>    * read() attempts.
>>>    *
>>>    * In effect we define a tail pointer for reading that lags the real
>>> tail
>>> @@ -237,7 +243,7 @@
>>>    * indicates that an updated tail pointer is needed.
>>>    *
>>>    * Most of the implementation details for this workaround are in
>>> - * gen7_oa_buffer_check_unlocked() and gen7_appand_oa_reports()
>>> + * oa_buffer_check_unlocked() and _append_oa_reports()
>>>    *
>>>    * Note for posterity: previously the driver used to define an
>>> effective tail
>>>    * pointer that lagged the real pointer by a 'tail margin' measured in
>>> bytes
>>> @@ -272,6 +278,13 @@ static u32 i915_perf_stream_paranoid = true;
>>>
>>>   #define INVALID_CTX_ID 0xffffffff
>>>
>>> +/* On Gen8+ automatically triggered OA reports include a 'reason'
>>> field... */
>>> +#define OAREPORT_REASON_MASK           0x3f
>>> +#define OAREPORT_REASON_SHIFT          19
>>> +#define OAREPORT_REASON_TIMER          (1<<0)
>>> +#define OAREPORT_REASON_CTX_SWITCH     (1<<3)
>>> +#define OAREPORT_REASON_CLK_RATIO      (1<<5)
>>> +
>>>
>>>   /* For sysctl proc_dointvec_minmax of i915_oa_max_sample_rate
>>>    *
>>> @@ -303,6 +316,13 @@ static struct i915_oa_format
>>> hsw_oa_formats[I915_OA_FORMAT_MAX] = {
>>>          [I915_OA_FORMAT_C4_B8]      = { 7, 64 },
>>>   };
>>>
>>> +static struct i915_oa_format gen8_plus_oa_formats[I915_OA_FORMAT_MAX] =
>>> {
>>> +       [I915_OA_FORMAT_A12]                = { 0, 64 },
>>> +       [I915_OA_FORMAT_A12_B8_C8]          = { 2, 128 },
>>> +       [I915_OA_FORMAT_A32u40_A4u32_B8_C8] = { 5, 256 },
>>> +       [I915_OA_FORMAT_C4_B8]              = { 7, 64 },
>>> +};
>>> +
>>>   #define SAMPLE_OA_REPORT      (1<<0)
>>>
>>>   /**
>>> @@ -332,8 +352,20 @@ struct perf_open_properties {
>>>          int oa_period_exponent;
>>>   };
>>>
>>> +static u32 gen8_oa_hw_tail_read(struct drm_i915_private *dev_priv)
>>> +{
>>> +       return I915_READ(GEN8_OATAILPTR) & GEN8_OATAILPTR_MASK;
>>> +}
>>> +
>>> +static u32 gen7_oa_hw_tail_read(struct drm_i915_private *dev_priv)
>>> +{
>>> +       u32 oastatus1 = I915_READ(GEN7_OASTATUS1);
>>> +
>>> +       return oastatus1 & GEN7_OASTATUS1_TAIL_MASK;
>>> +}
>>> +
>>>   /**
>>> - * gen7_oa_buffer_check_unlocked - check for data and update tail ptr
>>> state
>>> + * oa_buffer_check_unlocked - check for data and update tail ptr state
>>>    * @dev_priv: i915 device instance
>>>    *
>>>    * This is either called via fops (for blocking reads in user ctx) or
>>> the poll
>>> @@ -356,12 +388,11 @@ struct perf_open_properties {
>>>    *
>>>    * Returns: %true if the OA buffer contains data, else %false
>>>    */
>>> -static bool gen7_oa_buffer_check_unlocked(struct drm_i915_private
>>> *dev_priv)
>>> +static bool oa_buffer_check_unlocked(struct drm_i915_private *dev_priv)
>>>   {
>>>          int report_size = dev_priv->perf.oa.oa_buffer.format_size;
>>>          unsigned long flags;
>>>          unsigned int aged_idx;
>>> -       u32 oastatus1;
>>>          u32 head, hw_tail, aged_tail, aging_tail;
>>>          u64 now;
>>>
>>> @@ -381,8 +412,7 @@ static bool gen7_oa_buffer_check_unlocked(struct
>>> drm_i915_private *dev_priv)
>>>          aged_tail = dev_priv->perf.oa.oa_buffer.tails[aged_idx].offset;
>>>          aging_tail =
>>> dev_priv->perf.oa.oa_buffer.tails[!aged_idx].offset;
>>>
>>> -       oastatus1 = I915_READ(GEN7_OASTATUS1);
>>> -       hw_tail = oastatus1 & GEN7_OASTATUS1_TAIL_MASK;
>>> +       hw_tail = dev_priv->perf.oa.ops.oa_hw_tail_read(dev_priv);
>>>
>>>          /* The tail pointer increases in 64 byte increments,
>>>           * not in report_size steps...
>>> @@ -404,6 +434,7 @@ static bool gen7_oa_buffer_check_unlocked(struct
>>> drm_i915_private *dev_priv)
>>>          if (aging_tail != INVALID_TAIL_PTR &&
>>>              ((now - dev_priv->perf.oa.oa_buffer.aging_timestamp) >
>>>               OA_TAIL_MARGIN_NSEC)) {
>>> +
>>>                  aged_idx ^= 1;
>>>                  dev_priv->perf.oa.oa_buffer.aged_tail_idx = aged_idx;
>>>
>>> @@ -553,6 +584,287 @@ static int append_oa_sample(struct i915_perf_stream
>>> *stream,
>>>    *
>>>    * Returns: 0 on success, negative error code on failure.
>>>    */
>>> +static int gen8_append_oa_reports(struct i915_perf_stream *stream,
>>> +                                 char __user *buf,
>>> +                                 size_t count,
>>> +                                 size_t *offset)
>>> +{
>>> +       struct drm_i915_private *dev_priv = stream->dev_priv;
>>> +       int report_size = dev_priv->perf.oa.oa_buffer.format_size;
>>> +       u8 *oa_buf_base = dev_priv->perf.oa.oa_buffer.vaddr;
>>> +       u32 gtt_offset =
>>> i915_ggtt_offset(dev_priv->perf.oa.oa_buffer.vma);
>>> +       u32 mask = (OA_BUFFER_SIZE - 1);
>>> +       size_t start_offset = *offset;
>>> +       unsigned long flags;
>>> +       unsigned int aged_tail_idx;
>>> +       u32 head, tail;
>>> +       u32 taken;
>>> +       int ret = 0;
>>> +
>>> +       if (WARN_ON(!stream->enabled))
>>> +               return -EIO;
>>> +
>>> +       spin_lock_irqsave(&dev_priv->perf.oa.oa_buffer.ptr_lock, flags);
>>> +
>>> +       head = dev_priv->perf.oa.oa_buffer.head;
>>> +       aged_tail_idx = dev_priv->perf.oa.oa_buffer.aged_tail_idx;
>>> +       tail = dev_priv->perf.oa.oa_buffer.tails[aged_tail_idx].offset;
>>> +
>>> +       spin_unlock_irqrestore(&dev_priv->perf.oa.oa_buffer.ptr_lock,
>>> flags);
>>> +
>>> +       /* An invalid tail pointer here means we're still waiting for the
>>> poll
>>> +        * hrtimer callback to give us a pointer
>>> +        */
>>> +       if (tail == INVALID_TAIL_PTR)
>>> +               return -EAGAIN;
>>> +
>>> +       /* NB: oa_buffer.head/tail include the gtt_offset which we don't
>>> want
>>> +        * while indexing relative to oa_buf_base.
>>> +        */
>>> +       head -= gtt_offset;
>>> +       tail -= gtt_offset;
>>> +
>>> +       /* An out of bounds or misaligned head or tail pointer implies a
>>> driver
>>> +        * bug since we validate + align the tail pointers we read from
>>> the
>>> +        * hardware and we are in full control of the head pointer which
>>> should
>>> +        * only be incremented by multiples of the report size (notably
>>> also
>>> +        * all a power of two).
>>> +        */
>>> +       if (WARN_ONCE(head > OA_BUFFER_SIZE || head % report_size ||
>>> +                     tail > OA_BUFFER_SIZE || tail % report_size,
>>> +                     "Inconsistent OA buffer pointers: head = %u, tail =
>>> %u\n",
>>> +                     head, tail))
>>> +               return -EIO;
>>> +
>>> +
>>> +       for (/* none */;
>>> +            (taken = OA_TAKEN(tail, head));
>>> +            head = (head + report_size) & mask) {
>>> +               u8 *report = oa_buf_base + head;
>>> +               u32 *report32 = (void *)report;
>>> +               u32 ctx_id;
>>> +               u32 reason;
>>> +
>>> +               /* All the report sizes factor neatly into the buffer
>>> +                * size so we never expect to see a report split
>>> +                * between the beginning and end of the buffer.
>>> +                *
>>> +                * Given the initial alignment check a misalignment
>>> +                * here would imply a driver bug that would result
>>> +                * in an overrun.
>>> +                */
>>> +               if (WARN_ON((OA_BUFFER_SIZE - head) < report_size)) {
>>> +                       DRM_ERROR("Spurious OA head ptr: non-integral
>>> report offset\n");
>>> +                       break;
>>> +               }
>>> +
>>> +               /* The reason field includes flags identifying what
>>> +                * triggered this specific report (mostly timer
>>> +                * triggered or e.g. due to a context switch).
>>> +                *
>>> +                * This field is never expected to be zero so we can
>>> +                * check that the report isn't invalid before copying
>>> +                * it to userspace...
>>> +                */
>>> +               reason = ((report32[0] >> OAREPORT_REASON_SHIFT) &
>>> +                         OAREPORT_REASON_MASK);
>>> +               if (reason == 0) {
>>> +                       if
>>> (__ratelimit(&dev_priv->perf.oa.spurious_report_rs))
>>> +                               DRM_NOTE("Skipping spurious, invalid OA
>>> report\n");
>>> +                       continue;
>>> +               }
>>> +
>>> +               /* XXX: Just keep the lower 21 bits for now since I'm not
>>> +                * entirely sure if the HW touches any of the higher bits
>>> in
>>> +                * this field
>>> +                */
>>> +               ctx_id = report32[2] & 0x1fffff;
>>> +
>>> +               /* Squash whatever is in the CTX_ID field if it's marked
>>> as
>>> +                * invalid to be sure we avoid false-positive,
>>> single-context
>>> +                * filtering below...
>>> +                *
>>> +                * Note: that we don't clear the valid_ctx_bit so
>>> userspace can
>>> +                * understand that the ID has been squashed by the
>>> kernel.
>>> +                */
>>> +               if (!(report32[0] &
>>> dev_priv->perf.oa.gen8_valid_ctx_bit))
>>> +                       ctx_id = report32[2] = INVALID_CTX_ID;
>>> +
>>> +               /* NB: For Gen 8 the OA unit no longer supports clock
>>> gating
>>> +                * off for a specific context and the kernel can't
>>> securely
>>> +                * stop the counters from updating as system-wide /
>>> global
>>> +                * values.
>>> +                *
>>> +                * Automatic reports now include a context ID so reports
>>> can be
>>> +                * filtered on the cpu but it's not worth trying to
>>> +                * automatically subtract/hide counter progress for other
>>> +                * contexts while filtering since we can't stop userspace
>>> +                * issuing MI_REPORT_PERF_COUNT commands which would
>>> still
>>> +                * provide a side-band view of the real values.
>>> +                *
>>> +                * To allow userspace (such as
>>> Mesa/GL_INTEL_performance_query)
>>> +                * to normalize counters for a single filtered context
>>> then it
>>> +                * needs be forwarded bookend context-switch reports so
>>> that it
>>> +                * can track switches in between MI_REPORT_PERF_COUNT
>>> commands
>>> +                * and can itself subtract/ignore the progress of
>>> counters
>>> +                * associated with other contexts. Note that the hardware
>>> +                * automatically triggers reports when switching to a new
>>> +                * context which are tagged with the ID of the newly
>>> active
>>> +                * context. To avoid the complexity (and likely
>>> fragility) of
>>> +                * reading ahead while parsing reports to try and
>>> minimize
>>> +                * forwarding redundant context switch reports (i.e.
>>> between
>>> +                * other, unrelated contexts) we simply elect to forward
>>> them
>>> +                * all.
>>> +                *
>>> +                * We don't rely solely on the reason field to identify
>>> context
>>> +                * switches since it's not-uncommon for periodic samples
>>> to
>>> +                * identify a switch before any 'context switch' report.
>>> +                */
>>> +               if (!dev_priv->perf.oa.exclusive_stream->ctx ||
>>> +                   dev_priv->perf.oa.specific_ctx_id == ctx_id ||
>>> +                   (dev_priv->perf.oa.oa_buffer.last_ctx_id ==
>>> +                    dev_priv->perf.oa.specific_ctx_id) ||
>>> +                   reason & OAREPORT_REASON_CTX_SWITCH) {
>>> +
>>> +                       /* While filtering for a single context we avoid
>>> +                        * leaking the IDs of other contexts.
>>> +                        */
>>> +                       if (dev_priv->perf.oa.exclusive_stream->ctx &&
>>> +                           dev_priv->perf.oa.specific_ctx_id != ctx_id)
>>> {
>>> +                               report32[2] = INVALID_CTX_ID;
>>> +                       }
>>> +
>>> +                       ret = append_oa_sample(stream, buf, count,
>>> offset,
>>> +                                              report);
>>> +                       if (ret)
>>> +                               break;
>>> +
>>> +                       dev_priv->perf.oa.oa_buffer.last_ctx_id = ctx_id;
>>> +               }
>>> +
>>> +               /* The above reason field sanity check is based on
>>> +                * the assumption that the OA buffer is initially
>>> +                * zeroed and we reset the field after copying so the
>>> +                * check is still meaningful once old reports start
>>> +                * being overwritten.
>>> +                */
>>> +               report32[0] = 0;
>>> +       }
>>> +
>>> +       if (start_offset != *offset) {
>>> +               spin_lock_irqsave(&dev_priv->perf.oa.oa_buffer.ptr_lock,
>>> flags);
>>> +
>>> +               /* We removed the gtt_offset for the copy loop above,
>>> indexing
>>> +                * relative to oa_buf_base so put back here...
>>> +                */
>>> +               head += gtt_offset;
>>> +
>>> +               I915_WRITE(GEN8_OAHEADPTR, head & GEN8_OAHEADPTR_MASK);
>>> +               dev_priv->perf.oa.oa_buffer.head = head;
>>> +
>>> +
>>> spin_unlock_irqrestore(&dev_priv->perf.oa.oa_buffer.ptr_lock, flags);
>>> +       }
>>> +
>>> +       return ret;
>>> +}
>>> +
>>> +/**
>>> + * gen8_oa_read - copy status records then buffered OA reports
>>> + * @stream: An i915-perf stream opened for OA metrics
>>> + * @buf: destination buffer given by userspace
>>> + * @count: the number of bytes userspace wants to read
>>> + * @offset: (inout): the current position for writing into @buf
>>> + *
>>> + * Checks OA unit status registers and if necessary appends
>>> corresponding
>>> + * status records for userspace (such as for a buffer full condition)
>>> and then
>>> + * initiate appending any buffered OA reports.
>>> + *
>>> + * Updates @offset according to the number of bytes successfully copied
>>> into
>>> + * the userspace buffer.
>>> + *
>>> + * NB: some data may be successfully copied to the userspace buffer
>>> + * even if an error is returned, and this is reflected in the
>>> + * updated @offset.
>>> + *
>>> + * Returns: zero on success or a negative error code
>>> + */
>>> +static int gen8_oa_read(struct i915_perf_stream *stream,
>>> +                       char __user *buf,
>>> +                       size_t count,
>>> +                       size_t *offset)
>>> +{
>>> +       struct drm_i915_private *dev_priv = stream->dev_priv;
>>> +       u32 oastatus;
>>> +       int ret;
>>> +
>>> +       if (WARN_ON(!dev_priv->perf.oa.oa_buffer.vaddr))
>>> +               return -EIO;
>>> +
>>> +       oastatus = I915_READ(GEN8_OASTATUS);
>>> +
>>> +       /* We treat OABUFFER_OVERFLOW as a significant error:
>>> +        *
>>> +        * Although theoretically we could handle this more gracefully
>>> +        * sometimes, some Gens don't correctly suppress certain
>>> +        * automatically triggered reports in this condition and so we
>>> +        * have to assume that old reports are now being trampled
>>> +        * over.
>>> +        *
>>> +        * Considering how we don't currently give userspace control
>>> +        * over the OA buffer size and always configure a large 16MB
>>> +        * buffer, then a buffer overflow does anyway likely indicate
>>> +        * that something has gone quite badly wrong.
>>> +        */
>>> +       if (oastatus & GEN8_OASTATUS_OABUFFER_OVERFLOW) {
>>> +               ret = append_oa_status(stream, buf, count, offset,
>>> +
>>> DRM_I915_PERF_RECORD_OA_BUFFER_LOST);
>>> +               if (ret)
>>> +                       return ret;
>>> +
>>> +               DRM_DEBUG("OA buffer overflow (exponent = %d): force
>>> restart\n",
>>> +                         dev_priv->perf.oa.period_exponent);
>>> +
>>> +               dev_priv->perf.oa.ops.oa_disable(dev_priv);
>>> +               dev_priv->perf.oa.ops.oa_enable(dev_priv);
>>> +
>>> +               /* Note: .oa_enable() is expected to re-init the oabuffer
>>> +                * and reset GEN8_OASTATUS for us
>>> +                */
>>> +               oastatus = I915_READ(GEN8_OASTATUS);
>>> +       }
>>> +
>>> +       if (oastatus & GEN8_OASTATUS_REPORT_LOST) {
>>> +               ret = append_oa_status(stream, buf, count, offset,
>>> +
>>> DRM_I915_PERF_RECORD_OA_REPORT_LOST);
>>> +               if (ret)
>>> +                       return ret;
>>> +               I915_WRITE(GEN8_OASTATUS,
>>> +                          oastatus & ~GEN8_OASTATUS_REPORT_LOST);
>>> +       }
>>> +
>>> +       return gen8_append_oa_reports(stream, buf, count, offset);
>>> +}
>>> +
>>> +/**
>>> + * Copies all buffered OA reports into userspace read() buffer.
>>> + * @stream: An i915-perf stream opened for OA metrics
>>> + * @buf: destination buffer given by userspace
>>> + * @count: the number of bytes userspace wants to read
>>> + * @offset: (inout): the current position for writing into @buf
>>> + *
>>> + * Notably any error condition resulting in a short read (-%ENOSPC or
>>> + * -%EFAULT) will be returned even though one or more records may
>>> + * have been successfully copied. In this case it's up to the caller
>>> + * to decide if the error should be squashed before returning to
>>> + * userspace.
>>> + *
>>> + * Note: reports are consumed from the head, and appended to the
>>> + * tail, so the tail chases the head?... If you think that's mad
>>> + * and back-to-front you're not alone, but this follows the
>>> + * Gen PRM naming convention.
>>> + *
>>> + * Returns: 0 on success, negative error code on failure.
>>> + */
>>>   static int gen7_append_oa_reports(struct i915_perf_stream *stream,
>>>                                    char __user *buf,
>>>                                    size_t count,
>>> @@ -733,7 +1045,8 @@ static int gen7_oa_read(struct i915_perf_stream
>>> *stream,
>>>                  if (ret)
>>>                          return ret;
>>>
>>> -               DRM_DEBUG("OA buffer overflow: force restart\n");
>>> +               DRM_DEBUG("OA buffer overflow (exponent = %d): force
>>> restart\n",
>>> +                         dev_priv->perf.oa.period_exponent);
>>>
>>>                  dev_priv->perf.oa.ops.oa_disable(dev_priv);
>>>                  dev_priv->perf.oa.ops.oa_enable(dev_priv);
>>> @@ -776,7 +1089,7 @@ static int i915_oa_wait_unlocked(struct
>>> i915_perf_stream *stream)
>>>                  return -EIO;
>>>
>>>          return wait_event_interruptible(dev_priv->perf.oa.poll_wq,
>>> -
>>> dev_priv->perf.oa.ops.oa_buffer_check(dev_priv));
>>> +
>>> oa_buffer_check_unlocked(dev_priv));
>>>   }
>>>
>>>   /**
>>> @@ -833,33 +1146,37 @@ static int i915_oa_read(struct i915_perf_stream
>>> *stream,
>>>   static int oa_get_render_ctx_id(struct i915_perf_stream *stream)
>>>   {
>>>          struct drm_i915_private *dev_priv = stream->dev_priv;
>>> -       struct intel_engine_cs *engine = dev_priv->engine[RCS];
>>> -       int ret;
>>>
>>> -       ret = i915_mutex_lock_interruptible(&dev_priv->drm);
>>> -       if (ret)
>>> -               return ret;
>>> +       if (i915.enable_execlists)
>>> +               dev_priv->perf.oa.specific_ctx_id = stream->ctx->hw_id;
>>> +       else {
>>> +               struct intel_engine_cs *engine = dev_priv->engine[RCS];
>>> +               int ret;
>>>
>>> -       /* As the ID is the gtt offset of the context's vma we pin
>>> -        * the vma to ensure the ID remains fixed.
>>> -        *
>>> -        * NB: implied RCS engine...
>>> -        */
>>> -       ret = engine->context_pin(engine, stream->ctx);
>>> -       if (ret)
>>> -               goto unlock;
>>> +               ret = i915_mutex_lock_interruptible(&dev_priv->drm);
>>> +               if (ret)
>>> +                       return ret;
>>>
>>> -       /* Explicitly track the ID (instead of calling i915_ggtt_offset()
>>> -        * on the fly) considering the difference with gen8+ and
>>> -        * execlists
>>> -        */
>>> -       dev_priv->perf.oa.specific_ctx_id =
>>> -               i915_ggtt_offset(stream->ctx->engine[engine->id].state);
>>> +               /* As the ID is the gtt offset of the context's vma we
>>> pin
>>> +                * the vma to ensure the ID remains fixed.
>>> +                */
>>> +               ret = engine->context_pin(engine, stream->ctx);
>>> +               if (ret) {
>>> +                       mutex_unlock(&dev_priv->drm.struct_mutex);
>>> +                       return ret;
>>> +               }
>>>
>>> -unlock:
>>> -       mutex_unlock(&dev_priv->drm.struct_mutex);
>>> +               /* Explicitly track the ID (instead of calling
>>> +                * i915_ggtt_offset() on the fly) considering the
>>> difference
>>> +                * with gen8+ and execlists
>>> +                */
>>> +               dev_priv->perf.oa.specific_ctx_id =
>>> +
>>> i915_ggtt_offset(stream->ctx->engine[engine->id].state);
>>>
>>> -       return ret;
>>> +               mutex_unlock(&dev_priv->drm.struct_mutex);
>>> +       }
>>> +
>>> +       return 0;
>>>   }
>>>
>>>   /**
>>> @@ -874,12 +1191,16 @@ static void oa_put_render_ctx_id(struct
>>> i915_perf_stream *stream)
>>>          struct drm_i915_private *dev_priv = stream->dev_priv;
>>>          struct intel_engine_cs *engine = dev_priv->engine[RCS];
>>>
>>> -       mutex_lock(&dev_priv->drm.struct_mutex);
>>> +       if (i915.enable_execlists) {
>>> +               dev_priv->perf.oa.specific_ctx_id = INVALID_CTX_ID;
>>> +       } else {
>>> +               mutex_lock(&dev_priv->drm.struct_mutex);
>>>
>>> -       dev_priv->perf.oa.specific_ctx_id = INVALID_CTX_ID;
>>> -       engine->context_unpin(engine, stream->ctx);
>>> +               dev_priv->perf.oa.specific_ctx_id = INVALID_CTX_ID;
>>> +               engine->context_unpin(engine, stream->ctx);
>>>
>>> -       mutex_unlock(&dev_priv->drm.struct_mutex);
>>> +               mutex_unlock(&dev_priv->drm.struct_mutex);
>>> +       }
>>>   }
>>>
>>>   static void
>>> @@ -969,6 +1290,61 @@ static void gen7_init_oa_buffer(struct
>>> drm_i915_private *dev_priv)
>>>          dev_priv->perf.oa.pollin = false;
>>>   }
>>>
>>> +static void gen8_init_oa_buffer(struct drm_i915_private *dev_priv)
>>> +{
>>> +       u32 gtt_offset =
>>> i915_ggtt_offset(dev_priv->perf.oa.oa_buffer.vma);
>>> +       unsigned long flags;
>>> +
>>> +       spin_lock_irqsave(&dev_priv->perf.oa.oa_buffer.ptr_lock, flags);
>>> +
>>> +       I915_WRITE(GEN8_OASTATUS, 0);
>>> +       I915_WRITE(GEN8_OAHEADPTR, gtt_offset);
>>> +       dev_priv->perf.oa.oa_buffer.head = gtt_offset;
>>> +
>>> +       I915_WRITE(GEN8_OABUFFER_UDW, 0);
>>> +
>>> +       /* PRM says:
>>> +        *
>>> +        *  "This MMIO must be set before the OATAILPTR
>>> +        *  register and after the OAHEADPTR register. This is
>>> +        *  to enable proper functionality of the overflow
>>> +        *  bit."
>>> +        */
>>> +       I915_WRITE(GEN8_OABUFFER, gtt_offset |
>>> +                  OABUFFER_SIZE_16M | OA_MEM_SELECT_GGTT);
>>> +       I915_WRITE(GEN8_OATAILPTR, gtt_offset & GEN8_OATAILPTR_MASK);
>>> +
>>> +       /* Mark that we need updated tail pointers to read from... */
>>> +       dev_priv->perf.oa.oa_buffer.tails[0].offset = INVALID_TAIL_PTR;
>>> +       dev_priv->perf.oa.oa_buffer.tails[1].offset = INVALID_TAIL_PTR;
>>> +
>>> +       /* Reset state used to recognise context switches, affecting
>>> which
>>> +        * reports we will forward to userspace while filtering for a
>>> single
>>> +        * context.
>>> +        */
>>> +       dev_priv->perf.oa.oa_buffer.last_ctx_id = INVALID_CTX_ID;
>>> +
>>> +       spin_unlock_irqrestore(&dev_priv->perf.oa.oa_buffer.ptr_lock,
>>> flags);
>>> +
>>> +       /* NB: although the OA buffer will initially be allocated
>>> +        * zeroed via shmfs (and so this memset is redundant when
>>> +        * first allocating), we may re-init the OA buffer, either
>>> +        * when re-enabling a stream or in error/reset paths.
>>> +        *
>>> +        * The reason we clear the buffer for each re-init is for the
>>> +        * sanity check in gen8_append_oa_reports() that looks at the
>>> +        * reason field to make sure it's non-zero which relies on
>>> +        * the assumption that new reports are being written to zeroed
>>> +        * memory...
>>> +        */
>>> +       memset(dev_priv->perf.oa.oa_buffer.vaddr, 0, OA_BUFFER_SIZE);
>>> +
>>> +       /* Maybe make ->pollin per-stream state if we support multiple
>>> +        * concurrent streams in the future.
>>> +        */
>>> +       dev_priv->perf.oa.pollin = false;
>>> +}
>>> +
>>>   static int alloc_oa_buffer(struct drm_i915_private *dev_priv)
>>>   {
>>>          struct drm_i915_gem_object *bo;
>>> @@ -1113,6 +1489,280 @@ static void hsw_disable_metric_set(struct
>>> drm_i915_private *dev_priv)
>>>                                        ~GT_NOA_ENABLE));
>>>   }
>>>
>>> +/*
>>> + * From Broadwell PRM, 3D-Media-GPGPU -> Register State Context
>>> + *
>>> + * MMIO reads or writes to any of the registers listed in the
>>> + * “Register State Context image” subsections through HOST/IA
>>> + * MMIO interface for debug purposes must follow the steps below:
>>> + *
>>> + * - SW should set the Force Wakeup bit to prevent GT from entering C6.
>>> + * - Write 0x2050[31:0] = 0x00010001 (disable sequence).
>>> + * - Disable IDLE messaging in CS (Write 0x2050[31:0] = 0x00010001).
>>> + * - BDW:  Poll/Wait for register bits of 0x22AC[6:0] turn to 0x30
>>> value.
>>> + * - SKL+: Poll/Wait for register bits of 0x22A4[6:0] turn to 0x30
>>> value.
>>> + * - Read/Write to desired MMIO registers.
>>> + * - Enable IDLE messaging in CS (Write 0x2050[31:0] = 0x00010000).
>>> + * - Force Wakeup bit should be reset to enable C6 entry.
>>> + *
>>> + * XXX: don't nest or overlap calls to this API, it has no ref
>>> + * counting to track how many entities require the RCS to be
>>> + * blocked from being idle.
>>> + */
>>> +static int gen8_begin_ctx_mmio(struct drm_i915_private *dev_priv)
>>> +{
>>> +       i915_reg_t fsm_reg = INTEL_GEN(dev_priv) > 8 ?
>>> +               GEN9_RCS_FE_FSM2 : GEN6_RCS_PWR_FSM;
>>> +       int ret = 0;
>>> +
>>> +       /* There's only no active context while idle in execlist mode
>>> +        * (though we shouldn't be using this in any other case)
>>> +        */
>>> +       if (WARN_ON(!i915.enable_execlists))
>>> +               return -EINVAL;
>>> +
>>> +       intel_uncore_forcewake_get(dev_priv, FORCEWAKE_RENDER);
>>> +
>>> +       I915_WRITE(GEN6_RC_SLEEP_PSMI_CONTROL,
>>> +                  _MASKED_BIT_ENABLE(GEN6_PSMI_SLEEP_MSG_DISABLE));
>>> +
>>> +       /* Note: we don't currently have a good handle on the maximum
>>> +        * latency for this wake up so while we only need to hold rcs
>>> +        * busy from process context we at least keep the waiting
>>> +        * interruptible...
>>> +        */
>>> +       ret = intel_wait_for_register(dev_priv, fsm_reg, 0x3f, 0x30, 50);
>>> +       if (ret == -ETIMEDOUT)
>>> +               DRM_ERROR("Timeout waiting to be able to begin per-ctx
>>> mmio\n");
>>> +
>>> +       if (ret)
>>> +               intel_uncore_forcewake_put(dev_priv, FORCEWAKE_RENDER);
>>> +
>>> +       return ret;
>>> +}
>>> +
>>> +static void gen8_end_ctx_mmio(struct drm_i915_private *dev_priv)
>>> +{
>>> +       if (WARN_ON(!i915.enable_execlists))
>>> +               return;
>>> +
>>> +       I915_WRITE(GEN6_RC_SLEEP_PSMI_CONTROL,
>>> +                  _MASKED_BIT_DISABLE(GEN6_PSMI_SLEEP_MSG_DISABLE));
>>> +       intel_uncore_forcewake_put(dev_priv, FORCEWAKE_RENDER);
>>> +}
>>> +
>>> +/* NB: It must always remain pointer safe to run this even if the OA
>>> unit
>>> + * has been disabled.
>>> + *
>>> + * It's fine to put out-of-date values into these per-context registers
>>> + * in the case that the OA unit has been disabled.
>>> + */
>>> +static void gen8_update_reg_state_unlocked(struct i915_gem_context *ctx,
>>> +                                          u32 *reg_state)
>>> +{
>>> +       struct drm_i915_private *dev_priv = ctx->i915;
>>> +       const struct i915_oa_reg *flex_regs =
>>> dev_priv->perf.oa.flex_regs;
>>> +       int n_flex_regs = dev_priv->perf.oa.flex_regs_len;
>>> +       u32 ctx_oactxctrl = dev_priv->perf.oa.ctx_oactxctrl_offset;
>>> +       u32 ctx_flexeu0 = dev_priv->perf.oa.ctx_flexeu0_offset;
>>> +       /* The MMIO offsets for Flex EU registers aren't contiguous */
>>> +       u32 flex_mmio[] = {
>>> +               i915_mmio_reg_offset(EU_PERF_CNTL0),
>>> +               i915_mmio_reg_offset(EU_PERF_CNTL1),
>>> +               i915_mmio_reg_offset(EU_PERF_CNTL2),
>>> +               i915_mmio_reg_offset(EU_PERF_CNTL3),
>>> +               i915_mmio_reg_offset(EU_PERF_CNTL4),
>>> +               i915_mmio_reg_offset(EU_PERF_CNTL5),
>>> +               i915_mmio_reg_offset(EU_PERF_CNTL6),
>>> +       };
>>> +       int i;
>>> +
>>> +       reg_state[ctx_oactxctrl] =
>>> i915_mmio_reg_offset(GEN8_OACTXCONTROL);
>>> +       reg_state[ctx_oactxctrl+1] = (dev_priv->perf.oa.period_exponent
>>> <<
>>> +                                     GEN8_OA_TIMER_PERIOD_SHIFT) |
>>> +                                    (dev_priv->perf.oa.periodic ?
>>> +                                     GEN8_OA_TIMER_ENABLE : 0) |
>>> +                                    GEN8_OA_COUNTER_RESUME;
>>> +
>>> +       for (i = 0; i < ARRAY_SIZE(flex_mmio); i++) {
>>> +               u32 state_offset = ctx_flexeu0 + i * 2;
>>> +               u32 mmio = flex_mmio[i];
>>> +
>>> +               /* This arbitrary default will select the 'EU FPU0
>>> Pipeline
>>> +                * Active' event. In the future it's anticipated that
>>> there
>>> +                * will be an explicit 'No Event' we can select, but not
>>> yet...
>>> +                */
>>> +               u32 value = 0;
>>> +               int j;
>>> +
>>> +               for (j = 0; j < n_flex_regs; j++) {
>>> +                       if (i915_mmio_reg_offset(flex_regs[j].addr) ==
>>> mmio) {
>>> +                               value = flex_regs[j].value;
>>> +                               break;
>>> +                       }
>>> +               }
>>> +
>>> +               reg_state[state_offset] = mmio;
>>> +               reg_state[state_offset+1] = value;
>>> +       }
>>> +}
>>> +
>>> +/* Manages updating the per-context aspects of the OA stream
>>> + * configuration across all contexts.
>>> + *
>>> + * The awkward consideration here is that OACTXCONTROL controls the
>>> + * exponent for periodic sampling which is primarily used for system
>>> + * wide profiling where we'd like a consistent sampling period even in
>>> + * the face of context switches.
>>> + *
>>> + * Our approach of updating the register state context (as opposed to
>>> + * say using a workaround batch buffer) ensures that the hardware
>>> + * won't automatically reload an out-of-date timer exponent even
>>> + * transiently before a WA BB could be parsed.
>>> + *
>>> + * This function needs to:
>>> + * - Ensure the currently running context's per-context OA state is
>>> + *   updated
>>> + * - Ensure that all existing contexts will have the correct per-context
>>> + *   OA state if they are scheduled for use.
>>> + * - Ensure any new contexts will be initialized with the correct
>>> + *   per-context OA state.
>>> + *
>>> + * Note: it's only the RCS/Render context that has any OA state.
>>> + */
>>> +static int gen8_configure_all_contexts(struct drm_i915_private
>>> *dev_priv)
>>> +{
>>> +       struct i915_gem_context *ctx;
>>> +       int ret;
>>> +
>>> +       ret = i915_mutex_lock_interruptible(&dev_priv->drm);
>>> +       if (ret)
>>> +               return ret;
>>> +
>>> +       /* Switch away from any user context.  */
>>> +       ret = i915_gem_switch_to_kernel_context(dev_priv);
>>> +       if (ret)
>>
>>   mutex_unlock(&dev_priv->drm.struct_mutex);
>
>
> Duh! Thanks!
>
>>
>>> +               return ret;
>>> +
>>> +       /* The OA register config is setup through the context image.
>>> This image
>>> +        * might be written to by the GPU on context switch (in
>>> particular on
>>> +        * lite-restore). This means we can't safely update a context's
>>> image,
>>> +        * if this context is scheduled/submitted to run on the GPU.
>>> +        *
>>> +        * We could emit the OA register config through the batch buffer
>>> but
>>> +        * this might leave small interval of time where the OA unit is
>>> +        * configured at an invalid sampling period.
>>> +        *
>>> +        * So far the best way to work around this issue seems to be
>>> draining
>>> +        * the GPU from any submitted work.
>>> +        */
>>> +       ret = i915_gem_wait_for_idle(dev_priv,
>>> +                                    I915_WAIT_INTERRUPTIBLE |
>>> +                                    I915_WAIT_LOCKED);
>>> +       if (ret) {
>>> +               mutex_unlock(&dev_priv->drm.struct_mutex);
>>> +               return ret;
>>> +       }
>>> +
>>> +       /* Update all contexts now that we've stalled the submission. */
>>> +       list_for_each_entry(ctx, &dev_priv->context_list, link) {
>>> +               if (!ctx->engine[RCS].initialised)
>>> +                       continue;
>>> +
>>> +               gen8_update_reg_state_unlocked(ctx,
>>> +
>>> ctx->engine[RCS].lrc_reg_state);
>>> +       }
>>> +
>>> +       mutex_unlock(&dev_priv->drm.struct_mutex);
>>> +
>>> +       /* Now update the current context.
>>> +        *
>>> +        * Note: Using MMIO to update per-context registers requires
>>> +        * some extra care...
>>> +        */
>>> +       ret = gen8_begin_ctx_mmio(dev_priv);
>>> +       if (ret) {
>>> +               DRM_ERROR("Failed to bring RCS out of idle to update
>>> current ctx OA state\n");
>>> +               return ret;
>>> +       }
>>
>> So do we still need the mmio for the current context, given that we
>> now idle everything and then update the reg state?
>
>
> I would say yes, because we don't have any guarantee that another context is
> going to be scheduled after that.
> Yet we want the config to be applied before we return.

Okay, feel free to keep my r-b then.