[Intel-gfx] [PATCH 3/8] drm/i915/pmu: Expose a PMU interface for perf queries

Chris Wilson chris at chris-wilson.co.uk
Tue Sep 26 13:00:59 UTC 2017


Quoting Tvrtko Ursulin (2017-09-26 13:28:03)
> On 25/09/2017 18:37, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2017-09-25 16:15:38)
> >> +       for_each_engine(engine, dev_priv, id) {
> >> +               u32 current_seqno = intel_engine_get_seqno(engine);
> >> +               u32 last_seqno = intel_engine_last_submit(engine);
> >> +               u32 enable = engine->pmu.enable;
> >> +
> >> +               if (i915_seqno_passed(current_seqno, last_seqno))
> >> +                       continue;
> >> +
> >> +               if ((enable & BIT(I915_SAMPLE_QUEUED)) &&
> >> +                   (last_seqno - 1 > current_seqno))
> > 
> > Hmm, this isn't queued. This is more than one rq submitted to hw.
> 
> Yes, that's what I thought we want from the PMU perspective. Maybe 
> rename to runnable if queued is not the clearest?

It's not runnable either. Everything on the execlists queue, or on hw
for legacy, is runnable. That count is what I think we want for
userspace.

> > Do you want total inflight (including waiting-for-fences)? That would be
> > engine->timeline->inflight_seqnos (additional problem in its laziness).
> > 
> > Hard to distinguish all other cases, but last - current is susceptible
> > to not treating all ready requests as being queued. 
> 
> Hm how?

Those not yet assigned a seqno, for starters (a request only gets its
global seqno when it is submitted to hw).

> 
> >And miscounts on
> > legacy across semaphores (which you've excluded for execlists).
> > 
> > Aiui you want
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> > index 4eb1a76731b2..19b8b31afbbc 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_request.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> > @@ -477,6 +477,7 @@ void __i915_gem_request_submit(struct drm_i915_gem_request *request)
> >          engine->emit_breadcrumb(request,
> >                                  request->ring->vaddr + request->postfix);
> >   
> > +       atomic_dec(&request->engine->queued);
> >          spin_lock(&request->timeline->lock);
> >          list_move_tail(&request->link, &timeline->requests);
> >          spin_unlock(&request->timeline->lock);
> > @@ -525,6 +526,7 @@ void __i915_gem_request_unsubmit(struct drm_i915_gem_request *request)
> >          spin_lock(&timeline->lock);
> >          list_move(&request->link, &timeline->requests);
> >          spin_unlock(&timeline->lock);
> > +       atomic_inc(&request->engine->queued);
> >   
> >          /* We don't need to wake_up any waiters on request->execute, they
> >           * will get woken by any other event or us re-adding this request
> > @@ -556,6 +558,7 @@ submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
> >          switch (state) {
> >          case FENCE_COMPLETE:
> >                  trace_i915_gem_request_submit(request);
> > +               atomic_inc(&request->engine->queued);
> >                  request->engine->submit_request(request);
> 
> Or have the counter normal unsigned int under the engine timeline lock, 
> moving into the execlists_submit_request etc?
> 
> >                  break;
> >   
> > 
> > That excludes the requests actually in the hw queue, just those in the sw
> > queue. Or both.
> 
> This looks good to me, I was just reluctant to suggest new software 
> tracking. Happy to go with this and define the metric to mean this?

I'm not convinced it's the metric we want either! :( Going back to the
wsim angle, I do think we want queue-depth / loadavg, the count of all
runnable tasks including those passed to hw. So what I think we want is
load/runnable = (last_seqno - current_seqno) + atomic_read(&engine->queued)

We can do the averaging trick the same way, so that we have a consistent
metric over the sample period; it will just be allowed to exceed 100%.

Then since you already have current_seqno around, SAMPLE_BUSY becomes
current_seqno != last_seqno. Forcewake avoidance for everyone, \o/
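
Roughly, an untested sketch (reusing engine->pmu.enable/sample[] and PERIOD
from your patch and the engine->queued counter from my diff above; field
names may well need adjusting):

	static void engines_sample(struct drm_i915_private *dev_priv)
	{
		struct intel_engine_cs *engine;
		enum intel_engine_id id;

		for_each_engine(engine, dev_priv, id) {
			u32 current_seqno = intel_engine_get_seqno(engine);
			u32 last_seqno = intel_engine_last_submit(engine);
			u32 enable = engine->pmu.enable;

			/* Busy whenever hw has not caught up with the last submission. */
			if ((enable & BIT(I915_SAMPLE_BUSY)) &&
			    current_seqno != last_seqno)
				engine->pmu.sample[I915_SAMPLE_BUSY] += PERIOD;

			/* loadavg-style count: requests on hw plus those still in the sw queue. */
			if (enable & BIT(I915_SAMPLE_QUEUED))
				engine->pmu.sample[I915_SAMPLE_QUEUED] +=
					((last_seqno - current_seqno) +
					 atomic_read(&engine->queued)) * PERIOD;
		}
	}

No register reads, so no forcewake required for either sample.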

> >> +static void frequency_sample(struct drm_i915_private *dev_priv)
> >> +{
> >> +       if (dev_priv->pmu.enable &
> >> +           config_enabled_mask(I915_PMU_ACTUAL_FREQUENCY)) {
> >> +               u64 val;
> >> +
> >> +               val = dev_priv->rps.cur_freq;
> >> +               if (dev_priv->gt.awake &&
> >> +                   intel_runtime_pm_get_if_in_use(dev_priv)) {
> >> +                       val = intel_get_cagf(dev_priv,
> >> +                                            I915_READ_NOTRACE(GEN6_RPSTAT1));
> >> +                       intel_runtime_pm_put(dev_priv);
> >> +               }
> >> +               val = intel_gpu_freq(dev_priv, val);
> >> +               dev_priv->pmu.sample[__I915_SAMPLE_FREQ_ACT] += val * PERIOD;
> >> +       }
> >> +
> >> +       if (dev_priv->pmu.enable &
> >> +           config_enabled_mask(I915_PMU_REQUESTED_FREQUENCY)) {
> >> +               u64 val = intel_gpu_freq(dev_priv, dev_priv->rps.cur_freq);
> >> +
> >> +               dev_priv->pmu.sample[__I915_SAMPLE_FREQ_REQ] += val * PERIOD;
> >> +       }
> > 
> > Ok. Do we need a comment to explain the importance of averaging here?
> > Should we be doing trapezium integration?
> 
> After reading up on it, do you mean increasing the accuracy by something 
> like sample[] += val * PERIOD + ((val * PERIOD - sample[]) * PERIOD / 2) ?

Hmm, I haven't seen that formulation before (it feels like using the
cumulative total like that is wrong). I was just thinking of doing

sample[] += (prev + val) * PERIOD / 2;

which is no better in the long run, so the improvement is only at the
endpoints.
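
i.e. something like this for the REQUESTED_FREQUENCY branch (sketch only;
pmu.freq_prev[] is a made-up field to remember the previous sample):

	if (dev_priv->pmu.enable &
	    config_enabled_mask(I915_PMU_REQUESTED_FREQUENCY)) {
		u64 val = intel_gpu_freq(dev_priv, dev_priv->rps.cur_freq);
		u64 prev = dev_priv->pmu.freq_prev[__I915_SAMPLE_FREQ_REQ];

		/* Trapezium rule: integrate the average of the two endpoints. */
		dev_priv->pmu.sample[__I915_SAMPLE_FREQ_REQ] +=
			(prev + val) * PERIOD / 2;
		dev_priv->pmu.freq_prev[__I915_SAMPLE_FREQ_REQ] = val;
	}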

> I am not sure if it makes sense here since our freq changes are much 
> slower and less granular than the sampling period.

What's the current sampling frequency? At one point I was just using
200Hz, at which point rps starts operating at around the same frequency.
 
> But it sounds like a good idea for other samplers, no? They are 
> potentially much more granular than the sampling period.

An unhelpful digression, sorry.

> >> +               ret = 0;
> >> +               switch (event->attr.config) {
> >> +               case I915_PMU_INTERRUPTS:
> >> +                       break;
> >> +               case I915_PMU_ACTUAL_FREQUENCY:
> >> +                       if (IS_VALLEYVIEW(i915) || IS_CHERRYVIEW(i915))
> >> +                                /* Requires a mutex for sampling! */
> >> +                               ret = -ENODEV;
> >> +               case I915_PMU_REQUESTED_FREQUENCY:
> >> +                       if (INTEL_GEN(i915) < 6)
> >> +                               ret = -ENODEV;
> >> +                       break;
> >> +               case I915_PMU_RC6_RESIDENCY:
> >> +                       if (!HAS_RC6(i915))
> >> +                               ret = -ENODEV;
> >> +                       break;
> >> +               case I915_PMU_RC6p_RESIDENCY:
> >> +               case I915_PMU_RC6pp_RESIDENCY:
> >> +                       if (!HAS_RC6p(i915))
> >> +                               ret = -ENODEV;
> > 
> > I must introduce you to my HAS_RC6pp() patch.
> 
> Can I have a reminder please? You have a patch which adds it? I thought 
> from the code that where we have RC6p we also have RC6pp; is that wrong?

https://patchwork.freedesktop.org/patch/146444/

We don't have any hw where RC6pp was validated.

> >> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> >> index fe25a01c81f2..682abafecc8c 100644
> >> --- a/include/uapi/drm/i915_drm.h
> >> +++ b/include/uapi/drm/i915_drm.h
> >> @@ -86,6 +86,63 @@ enum i915_mocs_table_index {
> >>          I915_MOCS_CACHED,
> >>   };
> > 
> > Hmm, how about uapi/drm/i915_perf.h? (Or i915_pmu.h)
> > We are not adding traditional drm-ioctl functionality (except for some
> > common enums). But I guess it is not huge, so no big concern. Just
> > thinking that we essentially copy this into igt, so we should really
> > make it a copy!
> 
> No objections from me. Do you definitely want it?

I wouldn't say no. I wouldn't suggest it must be done either. :)

> >> +#define I915_PMU_RC6_RESIDENCY         __I915_PMU_OTHER(3)
> >> +#define I915_PMU_RC6p_RESIDENCY                __I915_PMU_OTHER(4)
> >> +#define I915_PMU_RC6pp_RESIDENCY       __I915_PMU_OTHER(5)
> >> +
> 
> Would we want to add some more counters?

At a reasonable frequency; as you say, almost every new feature will want
some counter or two. No more pushing stats into debugfs that no one
cares about (not looking at the guc)!
 
> I915_PMU_ENGINE_PREEMPT - number of preemption events on engine
> I915_PMU_ENGINE_RESCHEDULE - number of requests which needed to be moved 
> in the tree due to a new request arriving
> 
> Or leave all this for later, since you also had some ideas on what to add.

Keep a second patch around to add a counter or two, so we can check how
painful it would be to add more -- and hence minimise that pain now.
-Chris

