[Intel-gfx] [PATCH 02/10] drm/i915/pmu: Expose a PMU interface for perf queries
Chris Wilson
chris at chris-wilson.co.uk
Fri Sep 29 12:46:00 UTC 2017
Quoting Tvrtko Ursulin (2017-09-29 13:34:52)
> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>
> From: Chris Wilson <chris at chris-wilson.co.uk>
> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> From: Dmitry Rogozhkin <dmitry.v.rogozhkin at intel.com>
>
> The first goal is to be able to measure GPU (and invidual ring) busyness
> without having to poll registers from userspace. (Which not only incurs
> holding the forcewake lock indefinitely, perturbing the system, but also
> runs the risk of hanging the machine.) As an alternative we can use the
> perf event counter interface to sample the ring registers periodically
> and send those results to userspace.
>
> Functionality we are exporting to userspace is via the existing perf PMU
> API and can be exercised via the existing tools. For example:
>
> perf stat -a -e i915/rcs0-busy/ -I 1000
>
> Will print the render engine busynnes once per second. All the performance
> counters can be enumerated (perf list) and have their unit of measure
> correctly reported in sysfs.
>
> v1-v2 (Chris Wilson):
>
> v2: Use a common timer for the ring sampling.
>
> v3: (Tvrtko Ursulin)
> * Decouple uAPI from i915 engine ids.
> * Complete uAPI defines.
> * Refactor some code to helpers for clarity.
> * Skip sampling disabled engines.
> * Expose counters in sysfs.
> * Pass in fake regs to avoid null ptr deref in perf core.
> * Convert to class/instance uAPI.
> * Use shared driver code for rc6 residency, power and frequency.
>
> v4: (Dmitry Rogozhkin)
> * Register PMU with .task_ctx_nr=perf_invalid_context
> * Expose cpumask for the PMU with the single CPU in the mask
> * Properly support pmu->stop(): it should call pmu->read()
> * Properly support pmu->del(): it should call stop(event, PERF_EF_UPDATE)
> * Introduce refcounting of event subscriptions.
> * Make pmu.busy_stats a refcounter to avoid busy stats going away
> with some deleted event.
> * Expose cpumask for i915 PMU to avoid multiple events creation of
> the same type followed by counter aggregation by perf-stat.
> * Track CPUs getting online/offline to migrate perf context. If (likely)
> cpumask will initially set CPU0, CONFIG_BOOTPARAM_HOTPLUG_CPU0 will be
> needed to see effect of CPU status tracking.
> * End result is that only global events are supported and perf stat
> works correctly.
> * Deny perf driver level sampling - it is prohibited for uncore PMU.
>
> v5: (Tvrtko Ursulin)
>
> * Don't hardcode number of engine samplers.
> * Rewrite event ref-counting for correctness and simplicity.
> * Store initial counter value when starting already enabled events
> to correctly report values to all listeners.
> * Fix RC6 residency readout.
> * Comments, GPL header.
>
> v6:
> * Add missing entry to v4 changelog.
> * Fix accounting in CPU hotplug case by copying the approach from
> arch/x86/events/intel/cstate.c. (Dmitry Rogozhkin)
>
> v7:
> * Log failure message only on failure.
> * Remove CPU hotplug notification state on unregister.
>
> v8:
> * Fix error unwind on failed registration.
> * Checkpatch cleanup.
>
> v9:
> * Drop the energy metric, it is available via intel_rapl_perf.
> (Ville Syrjälä)
> * Use HAS_RC6(p). (Chris Wilson)
> * Handle unsupported non-engine events. (Dmitry Rogozhkin)
> * Rebase for intel_rc6_residency_ns needing caller managed
> runtime pm.
> * Drop HAS_RC6 checks from the read callback since creating those
> events will be rejected at init time already.
> * Add counter units to sysfs so perf stat output is nicer.
> * Cleanup the attribute tables for brevity and readability.
>
> v10:
> * Fixed queued accounting.
>
> v11:
> * Move intel_engine_lookup_user to intel_engine_cs.c
> * Commit update. (Joonas Lahtinen)
>
> v12:
> * More accurate sampling. (Chris Wilson)
> * Store and report frequency in MHz for better usability from
> perf stat.
> * Removed metrics: queued, interrupts, rc6 counters.
> * Sample engine busyness based on seqno difference only
> for less MMIO (and forcewake) on all platforms. (Chris Wilson)
>
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin at intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin at intel.com>
> Cc: Peter Zijlstra <peterz at infradead.org>
> ---
> +static void
> +update_sample(struct i915_pmu_sample *sample, u32 unit, u32 val)
> +{
> + /*
> + * Since we are doing stohastical sampling for these counter,
stochastic
> + * average the delta with the previous value for better accuracy.
> + */
> + sample->cur += div_u64((u64)(sample->prev + val) * unit, 2);
div_u64(mul_32_32(sample->prev + val, unit), 2)
> + sample->prev = val;
> +}
> +
> +static void engines_sample(struct drm_i915_private *dev_priv)
> +{
> + struct intel_engine_cs *engine;
> + enum intel_engine_id id;
> + bool fw = false;
> +
> + if ((dev_priv->pmu.enable & ENGINE_SAMPLE_MASK) == 0)
> + return;
> +
> + if (!dev_priv->gt.awake)
> + return;
> +
> + if (!intel_runtime_pm_get_if_in_use(dev_priv))
> + return;
> +
> + for_each_engine(engine, dev_priv, id) {
> + u32 current_seqno = intel_engine_get_seqno(engine);
> + u32 last_seqno = intel_engine_last_submit(engine);
> + u32 val;
> +
> + val = !i915_seqno_passed(current_seqno, last_seqno);
> +
> + update_sample(&engine->pmu.sample[I915_SAMPLE_BUSY], PERIOD,
> + val);
Personally I think
update_sample(&engine->pmu.sample[I915_SAMPLE_BUSY],
PERIOD, val);
carries better visual weighting.
> +
> + if (val && (engine->pmu.enable &
> + (BIT(I915_SAMPLE_WAIT) | BIT(I915_SAMPLE_SEMA)))) {
> + fw = grab_forcewake(dev_priv, fw);
> +
> + val = I915_READ_FW(RING_CTL(engine->mmio_base));
> + } else {
> + val = 0;
> + }
> +
> + update_sample(&engine->pmu.sample[I915_SAMPLE_WAIT], PERIOD,
> + !!(val & RING_WAIT));
> +
> + update_sample(&engine->pmu.sample[I915_SAMPLE_SEMA], PERIOD,
> + !!(val & RING_WAIT_SEMAPHORE));
> + }
> +
> + if (fw)
> + intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
> +
> + intel_runtime_pm_put(dev_priv);
Ok.
> +}
> +
> +static void frequency_sample(struct drm_i915_private *dev_priv)
> +{
> + if (dev_priv->pmu.enable &
> + config_enabled_mask(I915_PMU_ACTUAL_FREQUENCY)) {
> + u32 val;
> +
> + val = dev_priv->rps.cur_freq;
> + if (dev_priv->gt.awake &&
> + intel_runtime_pm_get_if_in_use(dev_priv)) {
> + val = intel_get_cagf(dev_priv,
> + I915_READ_NOTRACE(GEN6_RPSTAT1));
> + intel_runtime_pm_put(dev_priv);
> + }
> +
> + update_sample(&dev_priv->pmu.sample[__I915_SAMPLE_FREQ_ACT], 1,
> + intel_gpu_freq(dev_priv, val));
> + }
> +
> + if (dev_priv->pmu.enable &
> + config_enabled_mask(I915_PMU_REQUESTED_FREQUENCY)) {
> + update_sample(&dev_priv->pmu.sample[__I915_SAMPLE_FREQ_REQ], 1,
> + intel_gpu_freq(dev_priv, dev_priv->rps.cur_freq));
> + }
Ok.
> +}
Looks good as the introductory set of counters. I trust the improvements
to the perf_event integration are good (as you can tell from the earlier
patch, I have no idea what I'm doing there ;)
Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>
-Chris
More information about the Intel-gfx
mailing list