[Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

Eero Tamminen eero.t.tamminen at intel.com
Mon Apr 16 14:04:00 UTC 2018


Hi,

On 14.04.2018 07:01, Srinivas Pandruvada wrote:
> Hi Francisco,
> 
> [...]
> 
>> Are you no longer interested in improving those aspects of the
>> non-HWP governor?  Is it that you're planning to delete it and move
>> back to a generic cpufreq governor for non-HWP platforms in the near
>> future?
> 
> Yes, that is the plan for Atom platforms, which are the only non-HWP
> platforms till now.  You have to show a good gain in performance and
> performance/watt to carry and maintain such a big change, so we have
> to see your performance and power numbers.

For the active cases, see the links at the beginning / bottom of this
mail thread.  Francisco provided performance results for more than 100
benchmarks.


On this side of the Atlantic, we've been testing different versions of
the patchset over the past few months with more than 50 Linux 3D
benchmarks on 6 different platforms.

On Geminilake and a few BXT configurations (where 3D benchmarks are TDP
limited), the performance of many tests improves by 5-15%, including
complex ones.  More importantly, there were no regressions.

(You can see details + links to more info in Jira ticket VIZ-12078.)

*In (fully) TDP limited cases, power usage (obviously) stays the same,
so performance/watt improvements can be derived directly from the
measured performance improvements.*
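
(To make the arithmetic explicit -- a minimal sketch with made-up
numbers, not measured data: if package power is pinned at the TDP limit
both with and without the patchset, the relative performance/watt gain
equals the relative performance gain.)

    # Hedged illustration with hypothetical numbers (not measured data):
    tdp_w      = 10.0      # package power, same before and after (TDP limited)
    fps_before = 40.0      # hypothetical benchmark score without the patchset
    fps_after  = 44.0      # hypothetical score with the patchset
    perf_gain          = fps_after / fps_before - 1.0
    perf_per_watt_gain = (fps_after / tdp_w) / (fps_before / tdp_w) - 1.0
    assert abs(perf_gain - perf_per_watt_gain) < 1e-9   # both +10%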


We also have data for earlier platforms from slightly older versions of
the patchset, but on those it didn't have any significant impact on
performance.

I think the main reason for this is that the BYT & BSW NUCs we have
only have room for a single memory module.  Without a dual memory
channel configuration, benchmarks are too memory-bottlenecked to
utilize the GPU enough to make things TDP limited on those platforms.

However, now that I look at the old BYT & BSW data (for the few
benchmarks which improved most on BXT & GLK), I see that there's a
reduction in CPU power utilization according to RAPL, at least on BSW.
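
(For reference, this is roughly how we sample package power with RAPL
-- a minimal sketch assuming the Linux powercap sysfs interface; the
intel-rapl domain path can differ between systems, and counter
wraparound is ignored for brevity:)

    import time

    RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"   # path is an assumption

    def read_energy_uj():
        with open(RAPL) as f:
            return int(f.read())

    def average_package_watts(seconds=10.0):
        start = read_energy_uj()
        time.sleep(seconds)
        end = read_energy_uj()
        return (end - start) / 1e6 / seconds   # microjoules -> watts

    print("average package power: %.2f W" % average_package_watts())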


	- Eero


>>> This will benefit all architectures including x86 + non-i915.
>>>
>>
>> The current design encourages re-use of the IO utilization statistic
>> (see PATCH 1) by other governors as a mechanism driving the trade-off
>> between energy efficiency and responsiveness based on whether the
>> system is close to CPU-bound, in whatever way is applicable to each
>> governor (e.g. it would make sense for it to be hooked up to the EPP
>> preference knob in the case of the intel_pstate HWP governor, which
>> would allow it to achieve better energy efficiency in IO-bound
>> situations just like this series does for non-HWP parts).  There's
>> nothing really x86- nor i915-specific about it.
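
(A rough sketch of that idea in Python -- purely illustrative, not the
patchset's code nor the actual HWP/EPP interface; the threshold values
are made up:)

    def energy_perf_bias(io_active_fraction, cpu_utilization):
        # Close to CPU-bound: favor responsiveness / raw performance.
        if cpu_utilization > 0.9 or io_active_fraction < 0.2:
            return 0.0                    # 0.0 = full performance preference
        # Clearly IO-bound: a lower CPU frequency can likely sustain the
        # same IO throughput, so lean toward energy efficiency.
        return min(1.0, io_active_fraction)  # 1.0 = full efficiency preference

    print(energy_perf_bias(io_active_fraction=0.8, cpu_utilization=0.3))  # 0.8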
>>
>>> BTW intel_pstate can be driven by the schedutil governor (passive
>>> mode), so if you prove benefits on Broxton, this can be a default.
>>> As before:
>>> - No regression to idle power at all.  This is more important than
>>>   benchmarks
>>> - Not just score, performance/watt is important
>>>
>>
>> Is schedutil actually on par with the intel_pstate non-HWP governor
>> as of today, according to these metrics and the overall benchmark
>> numbers?
> Yes, except for a few cases.  I have not tested recently, so it may
> be better now.
> 
> Thanks,
> Srinivas
> 
> 
>>> Thanks,
>>> Srinivas
>>>
>>>
>>>>> controller does, even though the frequent IO waits may actually
>>>>> be an indication that the system is IO-bound (which means that
>>>>> the large energy usage increase may not be translated in any
>>>>> performance benefit in practice, not to speak of performance being
>>>>> impacted negatively in TDP-bound scenarios like GPU rendering).
>>>>>
>>>>> Regarding run-time complexity, I haven't observed this governor to
>>>>> be measurably more computationally intensive than the present
>>>>> one.  It's a bunch more instructions indeed, but still within the
>>>>> same ballpark as the current governor.  The average increase in
>>>>> CPU utilization on my BXT with this series is less than 0.03%
>>>>> (sampled via ftrace for v1, I can repeat the measurement for the
>>>>> v2 I have in the works, though I don't expect the result to be
>>>>> substantially different).  If this is a problem for you there are
>>>>> several optimization opportunities that would cut down the number
>>>>> of CPU cycles get_target_pstate_lp() takes to execute by a large
>>>>> percent (most of the optimization ideas I can think of right now
>>>>> though would come at some accuracy/maintainability/debuggability
>>>>> cost, but may still be worth pursuing), but the computational
>>>>> overhead is low enough at this point that the impact on any
>>>>> benchmark or real workload would be orders of magnitude lower than
>>>>> its variance, which makes it kind of difficult to keep the
>>>>> discussion data-driven [as possibly any performance optimization
>>>>> discussion should ever be ;)].
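
(For anyone wanting to reproduce that kind of overhead measurement, a
minimal sketch using the ftrace function profiler -- it assumes debugfs
is mounted at /sys/kernel/debug, that CONFIG_FUNCTION_PROFILER is
enabled, and that the governor entry point is named
get_target_pstate_lp() as in this series:)

    import glob

    TRACING = "/sys/kernel/debug/tracing"

    def profile_function(name="get_target_pstate_lp"):
        # Restrict profiling to the governor entry point, then enable
        # the ftrace function profiler.
        open(TRACING + "/set_ftrace_filter", "w").write(name + "\n")
        open(TRACING + "/function_profile_enabled", "w").write("1\n")
        input("run your workload, then press enter...")
        open(TRACING + "/function_profile_enabled", "w").write("0\n")
        # Per-CPU hit counts and accumulated time end up in trace_stat/.
        for path in sorted(glob.glob(TRACING + "/trace_stat/function*")):
            print(open(path).read())

    profile_function()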
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Srinivas
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> [Absolute benchmark results are unfortunately omitted from
>>>>>>>> this letter due to company policies, but the percent change and
>>>>>>>> Student's T p-value are included above and in the referenced
>>>>>>>> benchmark results]
>>>>>>>>
>>>>>>>> The most obvious impact of this series will likely be the
>>>>>>>> overall improvement in graphics performance on systems with an
>>>>>>>> IGP integrated into the processor package (though for the
>>>>>>>> moment this is only enabled on BXT+), because the TDP budget
>>>>>>>> shared among CPU and GPU can frequently become a limiting
>>>>>>>> factor in low-power devices.  On heavily TDP-bound devices this
>>>>>>>> series improves performance of virtually any non-trivial
>>>>>>>> graphics rendering by a significant amount (of the order of the
>>>>>>>> energy efficiency improvement for that workload assuming the
>>>>>>>> optimization didn't cause it to become non-TDP-bound).
>>>>>>>>
>>>>>>>> See [1]-[5] for detailed numbers including various graphics
>>>>>>>> benchmarks and a sample of the Phoronix daily-system-tracker.
>>>>>>>> Some popular graphics benchmarks like GfxBench gl_manhattan31
>>>>>>>> and gl_4 improve between 5% and 11% on our systems.  The exact
>>>>>>>> improvement can vary substantially between systems (compare the
>>>>>>>> benchmark results from the two different J3455 systems [1] and
>>>>>>>> [3]) due to a number of factors, including the ratio between
>>>>>>>> CPU and GPU processing power, the behavior of the userspace
>>>>>>>> graphics driver, the windowing system and resolution, the BIOS
>>>>>>>> (which has an influence on the package TDP), the thermal
>>>>>>>> characteristics of the system, etc.
>>>>>>>>
>>>>>>>> Unigine Valley and Heaven improve by a similar factor on some
>>>>>>>> systems (see the J3455 results [1]), but on others the
>>>>>>>> improvement is lower because the benchmark fails to fully
>>>>>>>> utilize the GPU, which causes the heuristic to remain in
>>>>>>>> low-latency state for longer, which leaves a reduced TDP budget
>>>>>>>> available to the GPU, which prevents performance from
>>>>>>>> increasing further.  This can be avoided by using the
>>>>>>>> alternative heuristic parameters suggested in the commit
>>>>>>>> message of PATCH 8, which provide a lower IO utilization
>>>>>>>> threshold and hysteresis for the controller to attempt to save
>>>>>>>> energy.  I'm not proposing those for upstream (yet) because
>>>>>>>> they would also increase the risk for latency-sensitive
>>>>>>>> IO-heavy workloads to regress (like SynMark2 OglTerrainFly* and
>>>>>>>> some arguably poorly designed IPC-bound X11 benchmarks).
>>>>>>>>
>>>>>>>> Discrete graphics aren't likely to experience that much of a
>>>>>>>> visible improvement from this, even though many non-IGP
>>>>>>>> workloads *could* benefit by reducing the system's energy usage
>>>>>>>> while the discrete GPU (or really, any other IO device) becomes
>>>>>>>> a bottleneck, but this is not attempted in this series, since
>>>>>>>> that would involve making an energy efficiency/latency
>>>>>>>> trade-off that only the maintainers of the respective drivers
>>>>>>>> are in a position to make.  The cpufreq interface introduced in
>>>>>>>> PATCH 1 to achieve this is left as an opt-in for that reason,
>>>>>>>> only the i915 DRM driver is hooked up since it will get the
>>>>>>>> most direct pay-off due to the increased energy budget
>>>>>>>> available to the GPU, but other power-hungry third-party
>>>>>>>> gadgets built into the same package (*cough* AMD *cough* Mali
>>>>>>>> *cough* PowerVR *cough*) may be able to benefit from this
>>>>>>>> interface eventually by instrumenting the driver in a similar
>>>>>>>> way.
>>>>>>>>
>>>>>>>> The cpufreq interface is not exclusively tied to the
>>>>>>>> intel_pstate driver, because other governors can make use of
>>>>>>>> the statistic calculated as result to avoid over-optimizing for
>>>>>>>> latency in scenarios where a lower frequency would be able to
>>>>>>>> achieve similar throughput while using less energy.  The
>>>>>>>> interpretation of this statistic relies on the observation that
>>>>>>>> for as long as the system is CPU-bound, any IO load occurring
>>>>>>>> as a result of the execution of a program will scale roughly
>>>>>>>> linearly with the clock frequency the program is run at, so
>>>>>>>> (assuming that the CPU has enough processing power) a point
>>>>>>>> will be reached at which the program won't be able to execute
>>>>>>>> faster with increasing CPU frequency because the throughput
>>>>>>>> limits of some device will have been attained.  Increasing
>>>>>>>> frequencies past that point only pessimizes energy usage for no
>>>>>>>> real benefit -- The optimal behavior is for the CPU to lock to
>>>>>>>> the minimum frequency that is able to keep the IO devices
>>>>>>>> involved fully utilized (assuming we are past the
>>>>>>>> maximum-efficiency inflection point of the CPU's
>>>>>>>> power-to-frequency curve), which is roughly the goal of this
>>>>>>>> series.
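
(A toy model of that reasoning -- not the actual controller, and the
numbers are made up: while CPU-bound, IO throughput scales roughly
linearly with CPU frequency until the device limit is reached, and the
lowest frequency that already saturates the device is the efficient
operating point:)

    def io_throughput(freq_mhz, mb_per_mhz=0.5, device_limit_mbs=600.0):
        # Linear in CPU frequency while CPU-bound, capped by the IO device.
        return min(freq_mhz * mb_per_mhz, device_limit_mbs)

    freqs = range(400, 2401, 100)
    peak = io_throughput(max(freqs))
    # Lowest frequency that already keeps the IO device fully utilized:
    optimal = next(f for f in freqs if io_throughput(f) >= peak)
    print(optimal)   # 1200 (MHz) with these made-up numbers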
>>>>>>>>
>>>>>>>> PELT could be a useful extension for this model since its
>>>>>>>> largely heuristic assumptions would become more accurate if the
>>>>>>>> IO and CPU load could be tracked separately for each scheduling
>>>>>>>> entity, but this is not attempted in this series because the
>>>>>>>> additional complexity and computational cost of such an
>>>>>>>> approach is hard to justify at this stage, particularly since
>>>>>>>> the current governor has similar limitations.
>>>>>>>>
>>>>>>>> Various frequency and step-function response graphs are
>>>>>>>> available in [6]-[9] for comparison (obtained empirically on a
>>>>>>>> BXT J3455 system).  The response curves for the low-latency and
>>>>>>>> low-power states of the heuristic are shown separately -- As
>>>>>>>> you can see they roughly bracket the frequency response curve
>>>>>>>> of the current governor.  The step response of the aggressive
>>>>>>>> heuristic is within a single update period (even though it's
>>>>>>>> not quite obvious from the graph with the levels of zoom
>>>>>>>> provided).  I'll attach benchmark results from a slower but
>>>>>>>> non-TDP-limited machine (which means there will be no TDP
>>>>>>>> budget increase that could possibly mask a performance
>>>>>>>> regression of other kind) as soon as they come out.
>>>>>>>>
>>>>>>>> Thanks to Eero and Valtteri for testing a number of
>>>>>>>> intermediate revisions of this series (and there were quite a
>>>>>>>> few of them) in more than half a dozen systems, they helped
>>>>>>>> spot quite a few issues of earlier versions of this heuristic.
>>>>>>>>
>>>>>>>> [PATCH 1/9] cpufreq: Implement infrastructure keeping track of aggregated IO active time.
>>>>>>>> [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with core_funcs"
>>>>>>>> [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long names"
>>>>>>>> [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()"
>>>>>>>> [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs"
>>>>>>>> [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass filtering controller for small core.
>>>>>>>> [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller based on ACPI FADT profile.
>>>>>>>> [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller parameters via debugfs.
>>>>>>>> [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.
>>>>>>>>
>>>>>>>> [1] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455.log
>>>>>>>> [2] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-per-watt-comparison-J3455.log
>>>>>>>> [3] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455-1.log
>>>>>>>> [4] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J4205.log
>>>>>>>> [5] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J5005.log
>>>>>>>> [6] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-magnitude-comparison.svg
>>>>>>>> [7] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-phase-comparison.svg
>>>>>>>> [8] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-1.svg
>>>>>>>> [9] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-2.svg
>>>>>
>>>>> _______________________________________________
>>>>> Intel-gfx mailing list
>>>>> Intel-gfx at lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx


