[Intel-gfx] [PATCH] drm/i915/pmu: Do not assume fixed hrtimer period

Chris Wilson chris at chris-wilson.co.uk
Tue Jun 5 11:42:34 UTC 2018


Quoting Tvrtko Ursulin (2018-06-05 12:35:36)
> 
> On 05/06/2018 10:06, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-06-05 09:51:01)
> >>
> >> On 04/06/2018 16:11, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2018-06-04 16:01:26)
> >>>>
> >>>> On 04/06/2018 14:03, Chris Wilson wrote:
> >>>>> Quoting Chris Wilson (2018-06-04 13:59:12)
> >>>>>> Quoting Tvrtko Ursulin (2018-06-04 13:52:39)
> >>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> >>>>>>>
> >>>>>>> As Chris has discovered on his Ivybridge, and later automated test runs
> >>>>>>> have confirmed, on most of our platforms hrtimer faced with heavy GPU load
> >>>>>>> ca occasionally become sufficiently imprecise to affect PMU sampling
> >>>>>>
> >>>>>> s/ca/can/
> >>>>>>
> >>>>>>> calculations.
> >>>>>>>
> >>>>>>> This means we cannot assume sampling frequency is what we asked for, but
> >>>>>>> we need to measure the interval ourselves.
> >>>>>>>
> >>>>>>> This patch is similar to Chris' original proposal for per-engine counters,
> >>>>>>> but instead of introducing a new set to work around the problem with
> >>>>>>> frequency sampling, it swaps around the way internal frequency accounting
> >>>>>>> is done. Instead of accumulating current frequency and dividing by
> >>>>>>> sampling frequency on readout, it accumulates frequency scaled by each
> >>>>>>> period.
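
Roughly, the two accounting schemes side by side (a standalone sketch with
made-up names, not the actual i915_pmu.c code):

/* Standalone sketch, made-up names, not the i915_pmu.c implementation. */
#include <stdint.h>

struct freq_sample {
        uint64_t cur;           /* accumulator */
        uint64_t elapsed_ns;    /* total time actually covered */
};

/* Old scheme: s->cur += freq_mhz on each tick, readout divides by the
 * nominal tick count, so any time a delayed timer missed is silently lost. */

/* New scheme: weight each sample by the period actually measured since the
 * previous tick, so a late tick contributes proportionally more. */
static void sample_measured(struct freq_sample *s, uint32_t freq_mhz,
                            uint64_t period_ns)
{
        s->cur += (uint64_t)freq_mhz * period_ns;
        s->elapsed_ns += period_ns;
}

/* Readout: average MHz over whatever time the samples really covered. */
static uint64_t read_measured(const struct freq_sample *s)
{
        return s->elapsed_ns ? s->cur / s->elapsed_ns : 0;
}
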
> >>>>>>
> >>>>>> My ulterior motive failed to win favour ;)
> >>>>>>     
> >>>>>>> Testcase: igt/perf_pmu/*busy* # snb, ivb, hsw
> >>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> >>>>>>> Suggested-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>>>>>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> >>>>>> Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>>>>
> >>>>> I should point out that even with this fix (or rather my patch), we can
> >>>>
> >>>> I did not mean to steal it,
> >>>
> >>> Don't mind, I view such patches as "this here is a problem, and what I
> >>> think can be done". I'm quite happy to be told "nope, the problem
> >>> is..."; the end result is we fix and move on.
> >>>
> >>>> just that you seemed very uncompromising on
> >>>> the new counters approach. If you prefer you can respin your patch with
> >>>> this approach for frequency counters, it is fine by me.
> >>>
> >>> I'm just not convinced yet by the frequency sampler, and I still feel
> >>> that we would be better off with cycles. I just haven't found a
> >>> convincing argument ;)
> >>
> >> Just imagine the .unit file has Mcycles instead of MHz in it? :)
> >>
> >> Since we went with this when adding it initially, I think we should
> >> exercise restraint in deprecating and adding so quickly.
> >>
> >>>>> still see errors of 25-30us, enough to fail the test. However, without
> >>>>> the fix our errors can be orders of magnitude worse (e.g. reporting 80us
> >>>>> busy instead of 500us).
> >>>>
> >>>> Yeah I guess if the timer is delayed it is delayed and all samples just
> >>>> won't be there. I don't see a way to fix that elegantly. Even practically.
> >>>
> >>> Right, but it's not the case that we aren't sampling. If the timer was
> >>> delayed entirely, then we would see 0 (start_val == end_val). I'm not
> >>> sure exactly what goes wrong, but it probably is related to the timer
> >>> being unreliable. (It's just that in this case we are sampling a square
> >>> wave, so really hard to mess up!)
> >>
> >> Hm maybe I am not completely understanding what you are seeing. I
> >> thought that multiple timer overruns (aka delay by multiple periods)
> >> explain everything.
> >>
> >> Then even with this patch, if they happen to happen (!) at the end of a
> >> sampling period (as observed from userspace), the large-ish chunk of
> >> samples (depending on the total sampling duration) may be missing
> >> (pending), and so still be observed as a failure.
> > 
> > Ah, but with the variable period timer, we can force the timer to run
> > before we report the sample (perf_event_read). That should still be inside
> > the busy batch, and it still has the same inaccuracy (~28us for a small
> > sample population; measuring the tail is on my wishlist):
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
> > index 34cc6f6..8578a43 100644
> > --- a/drivers/gpu/drm/i915/i915_pmu.c
> > +++ b/drivers/gpu/drm/i915/i915_pmu.c
> > @@ -534,6 +534,12 @@ static u64 __i915_pmu_event_read(struct perf_event *event)
> >                          val = engine->pmu.sample[sample].cur;
> >                  }
> >          } else {
> > +               if (hrtimer_cancel(&i915->pmu.timer)) {
> > +                       i915_sample(&i915->pmu.timer);
> > +                       hrtimer_start_expires(&i915->pmu.timer,
> > +                                             HRTIMER_MODE_ABS);
> > +               }
> > +
> >                  switch (event->attr.config) {
> >                  case I915_PMU_ACTUAL_FREQUENCY:
> >                          val =
> > 
> > I haven't thought enough about why it still has the tail, nor measured
> > the distribution of errors. I just wanted to warn that we aren't quite
> > out of the mess yet, but with the variable period we do at least stop
> > drowning!
> >   
> >> The difference after this patch is that when the timer eventually fires,
> >> the counter will account for the delay, so subsequent queries will be fine.
> >>
> >> If readout waited for at least one real sampling tick, that would be
> >> solved, but it would bring several other undesirable consequences. At the
> >> very least it would slow down readout to 200Hz, but I don't know off the
> >> top of my head whether it would cause unresolvable deadlock scenarios, or
> >> if it is even
> >> technically possible within the perf framework. (Since I don't think we
> >> should do it, I don't even want to start thinking about it.)
> > 
> > The above should do what you have in mind, I think.
> 
> I think you got the direct sample criterion reversed - it should be if
> the timer wasn't running. If it was, it means it has just exited, so the
> sampling is already pretty correct.

Note it returns 1 if the timer was queued or active, 0 if the timer
wasn't, iirc; it's hrtimer_try_to_cancel() that has the more complex
interface. (It should always be true along this path if we are using
the timer.)
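
For reference, a rough sketch of the intended flow (not part of the patch;
flush_sample_timer() is a made-up name, i915_sample() is the existing hrtimer
callback in i915_pmu.c):

#include <linux/hrtimer.h>

static enum hrtimer_restart i915_sample(struct hrtimer *hrtimer); /* existing callback */

/*
 * hrtimer_cancel() waits for a running callback to finish and returns 1 if
 * the timer was queued or active, 0 if it was not.  hrtimer_try_to_cancel()
 * is the richer interface: 0 = inactive, 1 = queued and now cancelled,
 * -1 = callback currently running and not cancellable.
 */
static void flush_sample_timer(struct hrtimer *timer)
{
        if (hrtimer_cancel(timer)) {
                i915_sample(timer);     /* take the pending sample by hand */
                hrtimer_start_expires(timer, HRTIMER_MODE_ABS); /* re-arm */
        }
}
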

> Also I was thinking of limiting this to only when the timer was delayed,
> so as to avoid many readers messing with a single timer a lot. In other
> words we keep accepting the sampling error (0.5% I think for a steady
> input signal) and only improve the timer delay scenario.

You mean just: if time_between_samples < 10us, return - something like the
sketch below. It didn't seem worth it. There are still a number of issues
with the timer granularity that make chasing edge cases ... disappointing.
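
For completeness, a rough sketch of that guard (hypothetical names; timer_last
is assumed to be the timestamp recorded by the sampling callback):

#include <linux/ktime.h>
#include <linux/timekeeping.h>
#include <linux/types.h>

/* Hypothetical helper, not part of the patch: only force a sample when the
 * last one looks stale, so frequent readers keep the ordinary ~0.5% sampling
 * error instead of constantly cancelling and re-arming the timer. */
static bool sample_is_stale(ktime_t timer_last)
{
        return ktime_sub(ktime_get(), timer_last) > 10 * NSEC_PER_USEC;
}
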
-Chris

