[Intel-gfx] [PATCH] drm/i915/pmu: Do not assume fixed hrtimer period
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Tue Jun 5 11:35:36 UTC 2018
On 05/06/2018 10:06, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-06-05 09:51:01)
>>
>> On 04/06/2018 16:11, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2018-06-04 16:01:26)
>>>>
>>>> On 04/06/2018 14:03, Chris Wilson wrote:
>>>>> Quoting Chris Wilson (2018-06-04 13:59:12)
>>>>>> Quoting Tvrtko Ursulin (2018-06-04 13:52:39)
>>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>>>>
>>>>>>> As Chris has discovered on his Ivybridge, and later automated test runs
>>>>>>> have confirmed, on most of our platforms hrtimer faced with heavy GPU load
>>>>>>> ca occasionally become sufficiently imprecise to affect PMU sampling
>>>>>>
>>>>>> s/ca/can/
>>>>>>
>>>>>>> calculations.
>>>>>>>
>>>>>>> This means we cannot assume sampling frequency is what we asked for, but
>>>>>>> we need to measure the interval ourselves.
>>>>>>>
>>>>>>> This patch is similar to Chris' original proposal for per-engine counters,
>>>>>>> but instead of introducing a new set to work around the problem with
>>>>>>> frequency sampling, it swaps around the way internal frequency accounting
>>>>>>> is done. Instead of accumulating current frequency and dividing by
>>>>>>> sampling frequency on readout, it accumulates frequency scaled by each
>>>>>>> period.
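To make the accounting change concrete, here is a toy user-space model
of the two schemes (all names and numbers are illustrative, not the
actual i915 fields). A tick that arrives late while the GPU clock has
changed skews the fixed-period average but not the time-weighted one:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Three sampling ticks; the third fires 15ms late instead of
	 * the nominal 5ms, after the GPU clock dropped to 500 MHz. */
	uint64_t freq_mhz[]  = { 1000, 1000, 500 };
	uint64_t period_ms[] = { 5, 5, 15 };
	uint64_t sum_fixed = 0, sum_weighted = 0, elapsed_ms = 0;
	int i;

	for (i = 0; i < 3; i++) {
		sum_fixed += freq_mhz[i];                   /* old scheme */
		sum_weighted += freq_mhz[i] * period_ms[i]; /* new scheme */
		elapsed_ms += period_ms[i];
	}

	/* Old: divide by tick count - assumes every period was equal. */
	printf("fixed-period avg:    %llu MHz\n",
	       (unsigned long long)(sum_fixed / 3));             /* 833 */
	/* New: divide by measured elapsed time - timer drift cancels. */
	printf("measured-period avg: %llu MHz\n",
	       (unsigned long long)(sum_weighted / elapsed_ms)); /* 700 */
	return 0;
}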
>>>>>>
>>>>>> My ulterior motive failed to win favour ;)
>>>>>>
>>>>>>> Testcase: igt/perf_pmu/*busy* # snb, ivb, hsw
>>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>>>> Suggested-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>>>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>>>>>> Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>>
>>>>> I should point out that even with this fix (or rather my patch), we can
>>>>
>>>> I did not mean to steal it,
>>>
>>> Don't mind, I view such patches as "this here is a problem, and what I
>>> think can be done". I'm quite happy to be told "nope, the problem
>>> is..."; the end result is we fix and move on.
>>>
>>>> just that you seemed very uncompromising on
>>>> the new counters approach. If you prefer you can respin your patch with
>>>> this approach for frequency counters, it is fine by me.
>>>
>>> I'm just not convinced yet by the frequency sampler, and I still feel
>>> that we would be better off with cycles. I just haven't found a
>>> convincing argument ;)
>>
>> Just imagine the .unit file has Mcycles instead of MHz in it? :)
>>
>> Since we went with this when adding it initially, I think we should
>> exercise restraint with deprecating and adding so quickly.
>>
>>>>> still see errors of 25-30us, enough to fail the test. However, without
>>>>> the fix our errors can be orders of magnitude worse (e.g. reporting 80us
>>>>> busy instead of 500us).
>>>>
>>>> Yeah I guess if the timer is delayed it is delayed and all samples just
>>>> won't be there. I don't see a way to fix that elegantly. Even practically.
>>>
>>> Right, but it's not the case that we aren't sampling. If the timer was
>>> delayed entirely, then we would see 0 (start_val == end_val). I'm not
>>> sure exactly what goes wrong, but it probably is related to the timer
>>> being unreliable. (It's just that in this case we are sampling a square
>>> wave, so really hard to mess up!)
>>
>> Hm, maybe I am not completely understanding what you are seeing. I
>> thought that multiple timer overruns (aka delay by multiple periods)
>> explain everything.
>>
>> Then even with this patch, if they happen to happen (!) at the end of a
>> sampling period (as observed from userspace), a large-ish chunk of
>> samples (depending on the total sampling duration) may be missing
>> (pending), and so still be observed as a failure.
>
> Ah, but with the variable period timer, we can force the timer to run
> before we report the sample (perf_event_read). That should still be inside
> the busy batch, and it still has the same inaccuracy (~28us, for a small
> sample population; measuring the tail is on my wishlist):
>
> diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
> index 34cc6f6..8578a43 100644
> --- a/drivers/gpu/drm/i915/i915_pmu.c
> +++ b/drivers/gpu/drm/i915/i915_pmu.c
> @@ -534,6 +534,12 @@ static u64 __i915_pmu_event_read(struct perf_event *event)
>  			val = engine->pmu.sample[sample].cur;
>  		}
>  	} else {
> +		if (hrtimer_cancel(&i915->pmu.timer)) {
> +			i915_sample(&i915->pmu.timer);
> +			hrtimer_start_expires(&i915->pmu.timer,
> +					      HRTIMER_MODE_ABS);
> +		}
> +
>  		switch (event->attr.config) {
>  		case I915_PMU_ACTUAL_FREQUENCY:
>  			val =
>
> I haven't thought enough about why it still has the tail, nor measured
> the distribution of errors. I just wanted to warn that we aren't quite
> out of the mess yet, but with the variable period we do at least stop
> drowning!
>
>> The difference after this patch is that, when the timer eventually fires,
>> the counter will account for the delay, so subsequent queries will be fine.
>>
>> If readout waited for at least one real sampling tick, that would be
>> solved, but with several other undesirable consequences. At the least it
>> would slow down readout to 200Hz, but I don't know off the top of my head
>> if it wouldn't cause unresolvable deadlock scenarios, or if it is even
>> technically possible within the perf framework. (Since I don't think we
>> should do it, I don't even want to start thinking about it.)
>
> The above should do what you have in mind, I think.
I think you got the direct sampling criteria reversed - it should be if
the timer wasn't running. If it was, it means it has just exited, so the
sampling is already pretty correct.
Also, I was thinking of limiting this to only when the timer was delayed,
so as to avoid many readers messing with a single timer a lot. In other
words, we keep accepting the sampling error (0.5% I think for a steady
input signal) and only improve the timer delay scenario.
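Something along these lines, as a rough sketch only - the overdue check
using hrtimer_active()/hrtimer_get_expires() against ktime_get() is my
assumption of how to detect a delayed timer, untested:

	if (hrtimer_active(&i915->pmu.timer) &&
	    ktime_before(hrtimer_get_expires(&i915->pmu.timer),
			 ktime_get()) &&
	    hrtimer_cancel(&i915->pmu.timer)) {
		/* Timer is overdue - fold the missed period in now. */
		i915_sample(&i915->pmu.timer);
		hrtimer_start_expires(&i915->pmu.timer, HRTIMER_MODE_ABS);
	}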
I'll send a patch so we can discuss it in more detail.
Regards,
Tvrtko