[Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
Srinivas Pandruvada
srinivas.pandruvada at linux.intel.com
Wed Apr 11 03:14:34 UTC 2018
On Tue, 2018-04-10 at 15:28 -0700, Francisco Jerez wrote:
> Francisco Jerez <currojerez at riseup.net> writes:
>
[...]
> In case anyone is wondering what's going on: Srinivas pointed me at a
> larger idle power usage increase off-list, ultimately caused by the
> low-latency heuristic as discussed in the paragraph above. I have a v2
> of PATCH 6 that gives the controller a third response curve, roughly
> intermediate between the low-latency and low-power states of this
> revision, which avoids the energy usage increase expected for v1 while
> C0 residency is low (e.g. during idle). The low-latency behavior of
> this revision is still going to be available based on a heuristic (in
> particular when a realtime-priority task is scheduled). We're carrying
> out some additional testing; I'll post the code here eventually.
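[Editor's note: for readers unfamiliar with the "response curve" terminology used above, the controller states can be thought of as a first-order exponential low-pass filter over the utilization signal, with a different gain per state. The sketch below is purely illustrative and is not the actual intel_pstate code; the function name and gain values are invented for this example.]

```c
#include <assert.h>

/*
 * Illustrative only -- not code from the series.  A gain of 1.0 makes
 * the filter track its input immediately (low-latency behavior), a
 * small gain averages over many update periods (low-power behavior),
 * and an intermediate gain sits roughly between the two, which is the
 * idea behind the "third response curve" mentioned above.
 */
static double lp_filter(double prev, double sample, double gain)
{
	return prev + gain * (sample - prev);
}
```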
Please also try the schedutil governor. There is a frequency-invariance
patch which I can send you (it will eventually be pushed by Peter). We
want to avoid adding complexity to intel_pstate for non-HWP
power-sensitive platforms as far as possible.
Thanks,
Srinivas
>
> > [Absolute benchmark results are unfortunately omitted from this
> > letter due to company policies, but the percent change and
> > Student's t-test p-value are included above and in the referenced
> > benchmark results.]
> >
> > The most obvious impact of this series will likely be the overall
> > improvement in graphics performance on systems with an IGP integrated
> > into the processor package (though for the moment this is only
> > enabled on BXT+), because the TDP budget shared between CPU and GPU
> > can frequently become a limiting factor in low-power devices. On
> > heavily TDP-bound devices this series improves the performance of
> > virtually any non-trivial graphics rendering by a significant amount
> > (on the order of the energy efficiency improvement for that workload,
> > assuming the optimization didn't cause it to become non-TDP-bound).
> >
> > See [1]-[5] for detailed numbers, including various graphics
> > benchmarks and a sample of the Phoronix daily-system-tracker. Some
> > popular graphics benchmarks like GfxBench gl_manhattan31 and gl_4
> > improve between 5% and 11% on our systems. The exact improvement can
> > vary substantially between systems (compare the benchmark results
> > from the two different J3455 systems, [1] and [3]) due to a number
> > of factors, including the ratio between CPU and GPU processing
> > power, the behavior of the userspace graphics driver, the windowing
> > system and resolution, the BIOS (which has an influence on the
> > package TDP), the thermal characteristics of the system, etc.
> >
> > Unigine Valley and Heaven improve by a similar factor on some
> > systems (see the J3455 results [1]). On others the improvement is
> > lower because the benchmark fails to fully utilize the GPU, which
> > keeps the heuristic in its low-latency state for longer; that leaves
> > a reduced TDP budget available to the GPU and prevents performance
> > from increasing further. This can be avoided by using the
> > alternative heuristic parameters suggested in the commit message of
> > PATCH 8, which provide a lower IO utilization threshold and
> > hysteresis for the controller to attempt to save energy. I'm not
> > proposing those for upstream (yet) because they would also increase
> > the risk of regressions in latency-sensitive IO-heavy workloads
> > (like SynMark2 OglTerrainFly* and some arguably poorly designed
> > IPC-bound X11 benchmarks).
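[Editor's note: the "threshold and hysteresis" mentioned above work roughly as sketched below. This is an invented illustration, not the PATCH 8 code; the enum, function name, and threshold values are all hypothetical.]

```c
/*
 * Hypothetical hysteresis sketch: the controller only changes state
 * when IO utilization crosses one of two separated thresholds, so
 * small fluctuations around a single cutoff don't make it flip back
 * and forth between states on every update.  Names and values are
 * illustrative only.
 */
enum lp_mode { LP_LATENCY, LP_POWER };

static enum lp_mode lp_update_mode(enum lp_mode cur, int io_util_pct)
{
	if (cur == LP_LATENCY && io_util_pct > 80)
		return LP_POWER;	/* enter the power-saving state */
	if (cur == LP_POWER && io_util_pct < 60)
		return LP_LATENCY;	/* leave only well below the entry threshold */
	return cur;			/* inside the hysteresis band: hold state */
}
```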
> >
> > Discrete graphics aren't likely to see much of a visible improvement
> > from this, even though many non-IGP workloads *could* benefit from a
> > reduction in the system's energy usage while the discrete GPU (or
> > really, any other IO device) is the bottleneck. That is not
> > attempted in this series, since it would involve making an energy
> > efficiency/latency trade-off that only the maintainers of the
> > respective drivers are in a position to make. For that reason the
> > cpufreq interface introduced in PATCH 1 is left as an opt-in: only
> > the i915 DRM driver is hooked up, since it will get the most direct
> > pay-off due to the increased energy budget available to the GPU, but
> > other power-hungry third-party gadgets built into the same package
> > (*cough* AMD *cough* Mali *cough* PowerVR *cough*) may eventually be
> > able to benefit from this interface by instrumenting their drivers
> > in a similar way.
> >
> > The cpufreq interface is not exclusively tied to the intel_pstate
> > driver: other governors can make use of the resulting statistic to
> > avoid over-optimizing for latency in scenarios where a lower
> > frequency could achieve similar throughput while using less energy.
> > The interpretation of this statistic relies on the observation that,
> > for as long as the system is CPU-bound, any IO load occurring as a
> > result of the execution of a program will scale roughly linearly
> > with the clock frequency the program runs at, so (assuming the CPU
> > has enough processing power) a point will be reached at which the
> > program can't execute any faster with increasing CPU frequency,
> > because the throughput limits of some device will have been
> > attained. Increasing frequencies past that point only pessimizes
> > energy usage for no real benefit -- the optimal behavior is for the
> > CPU to lock to the minimum frequency able to keep the IO devices
> > involved fully utilized (assuming we are past the maximum-efficiency
> > inflection point of the CPU's power-to-frequency curve), which is
> > roughly the goal of this series.
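[Editor's note: the frequency-selection rule described above can be sketched roughly as follows. This is a hypothetical illustration, not code from the series; the struct, function, and threshold values are invented for this example.]

```c
/*
 * Hypothetical sketch of the behavior described above: while some IO
 * device is saturated (high IO-active-time ratio) and the CPU has
 * headroom, step the requested frequency down toward the lowest value
 * that keeps the device fully utilized; when the workload becomes
 * CPU-bound again, ramp back up quickly.  All names and numbers here
 * are illustrative, not taken from the actual patches.
 */
struct lp_state {
	int freq_min;	/* lowest available frequency (kHz) */
	int freq_max;	/* highest available frequency (kHz) */
	int freq_cur;	/* currently requested frequency (kHz) */
};

/* io_active_pct: fraction of the sampling period an IO device was busy. */
static int lp_next_freq(struct lp_state *s, int io_active_pct, int cpu_busy_pct)
{
	if (io_active_pct > 95 && cpu_busy_pct < 90) {
		/* IO-bound: step down while the device stays saturated. */
		s->freq_cur -= (s->freq_max - s->freq_min) / 8;
		if (s->freq_cur < s->freq_min)
			s->freq_cur = s->freq_min;
	} else if (cpu_busy_pct > 90) {
		/* CPU-bound: latency matters, return to full speed. */
		s->freq_cur = s->freq_max;
	}
	return s->freq_cur;
}
```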
> >
> > PELT could be a useful extension for this model, since its largely
> > heuristic assumptions would become more accurate if the IO and CPU
> > load could be tracked separately for each scheduling entity, but
> > this is not attempted in this series because the additional
> > complexity and computational cost of such an approach are hard to
> > justify at this stage, particularly since the current governor has
> > similar limitations.
> >
> > Various frequency and step-function response graphs are available in
> > [6]-[9] for comparison (obtained empirically on a BXT J3455 system).
> > The response curves for the low-latency and low-power states of the
> > heuristic are shown separately -- as you can see, they roughly
> > bracket the frequency response curve of the current governor. The
> > step response of the aggressive heuristic settles within a single
> > update period (even though that's not quite obvious from the graph
> > at the levels of zoom provided). I'll attach benchmark results from
> > a slower but non-TDP-limited machine (meaning there will be no TDP
> > budget increase that could mask a performance regression of another
> > kind) as soon as they come out.
> >
> > Thanks to Eero and Valtteri for testing a number of intermediate
> > revisions of this series (and there were quite a few of them) on
> > more than half a dozen systems; they helped spot quite a few issues
> > in earlier versions of this heuristic.
> >
> > [PATCH 1/9] cpufreq: Implement infrastructure keeping track of aggregated IO active time.
> > [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with core_funcs"
> > [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long names"
> > [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()"
> > [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs"
> > [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass filtering controller for small core.
> > [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller based on ACPI FADT profile.
> > [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller parameters via debugfs.
> > [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.
> >
> > [1] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455.log
> > [2] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-per-watt-comparison-J3455.log
> > [3] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455-1.log
> > [4] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J4205.log
> > [5] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J5005.log
> > [6] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-magnitude-comparison.svg
> > [7] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-phase-comparison.svg
> > [8] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-1.svg
> > [9] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-2.svg