[Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

Wed Mar 28 06:38:36 UTC 2018

This series attempts to solve an energy efficiency problem of the
current active-mode non-HWP governor of the intel_pstate driver used
for the most part on low-power platforms.  Under heavy IO load the
current controller tends to increase frequencies to the maximum turbo
P-state, partly due to IO wait boosting, partly due to the roughly
flat frequency response curve of the current controller (see
[6]), which causes it to ramp frequencies up and down repeatedly for
any oscillating workload (think of graphics, audio or disk IO when any
of them becomes a bottleneck), severely increasing energy usage
relative to a (throughput-wise equivalent) controller able to provide
the same average frequency without fluctuation.  The core energy
efficiency improvement has been observed to be of the order of 20% via
RAPL, but it's expected to vary substantially between workloads (see
perf-per-watt comparison [2]).
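
As a rough illustration of why fluctuation costs energy, the sketch
below compares an oscillating frequency trace against a steady one
with the same average.  The cubic power model (power proportional to
f^3, i.e. voltage scaling roughly with frequency) and all the numbers
are assumptions made for the example, not measurements from this
series.

```python
import math

def power(freq_ghz, k=1.0, exponent=3.0):
    # Assumed model: dynamic power grows as k * f^exponent.
    return k * freq_ghz ** exponent

def energy(freqs_ghz, dt_s=1.0):
    # Energy of a trace sampled at fixed intervals of dt_s seconds.
    return sum(power(f) * dt_s for f in freqs_ghz)

def mean(values):
    return sum(values) / len(values)

# Hypothetical traces with the same average frequency (1.6 GHz).
oscillating = [0.8, 2.4] * 4
steady = [1.6] * 8

ratio = energy(oscillating) / energy(steady)
print(f"oscillating trace uses {ratio:.2f}x the energy of the steady one")
```

Because the assumed power curve is convex, any trace that fluctuates
around an average uses more energy than one that sits at that average,
and the penalty grows with the amplitude of the oscillation.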

One might expect that this could come at some cost in terms of system
responsiveness, but the governor implemented in PATCH 6 has a variable
response curve controlled by a heuristic that keeps the controller in
a low-latency state unless the system is under heavy IO load for an
extended period of time.  The low-latency behavior is actually
significantly more aggressive than the current governor, allowing it
to achieve better throughput in some scenarios where the load
ping-pongs between the CPU and some IO device (see PATCH 6 for more of
the rationale).  The controller offers lower latency than the
upstream one, particularly while C0 residency is low (which by itself
helps mitigate the increased energy usage while in C0).  However,
under certain conditions the low-latency heuristic may increase power
consumption (see the perf-per-watt comparison [2]; the apparent
regressions are correlated with an increase in performance in the same
benchmark due to the use of the low-latency heuristic).  If this is a
problem, a different trade-off between latency and energy usage
shouldn't be difficult to achieve, but it will come at a performance
cost in some cases.  I couldn't observe a statistically significant
increase in idle power consumption due to this behavior (on BXT
J3455):

package-0 RAPL (W):    XXXXXX ±0.14% x8 ->     XXXXXX ±0.15% x9         d=-0.04% ±0.14%      p=61.73%

[Absolute benchmark results are unfortunately omitted from this letter
due to company policies, but the percent change and Student's t-test
p-value are included above and in the referenced benchmark results]

The most obvious impact of this series will likely be the overall
improvement in graphics performance on systems with an IGP integrated
into the processor package (though for the moment this is only enabled
on BXT+), because the TDP budget shared among CPU and GPU can
frequently become a limiting factor in low-power devices.  On heavily
TDP-bound devices this series improves performance of virtually any
non-trivial graphics rendering by a significant amount (of the order
of the energy efficiency improvement for that workload assuming the
optimization didn't cause it to become non-TDP-bound).

See [1]-[5] for detailed numbers including various graphics benchmarks
and a sample of the Phoronix daily-system-tracker.  Some popular
graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve
between 5% and 11% on our systems.  The exact improvement can vary
substantially between systems (compare the benchmark results from the
two different J3455 systems [1] and [3]) due to a number of factors,
including the ratio between CPU and GPU processing power, the behavior
of the userspace graphics driver, the windowing system and resolution,
the BIOS (which has an influence on the package TDP), the thermal
characteristics of the system, etc.

Unigine Valley and Heaven improve by a similar factor on some systems
(see the J3455 results [1]), but on others the improvement is lower
because the benchmark fails to fully utilize the GPU.  That causes the
heuristic to remain in the low-latency state for longer, which leaves
a reduced TDP budget available to the GPU and prevents performance
from increasing further.  This can be avoided by using the alternative
heuristic parameters suggested in the commit message of PATCH 8, which
provide a lower IO utilization threshold and hysteresis for the
controller to attempt to save energy.  I'm not proposing those for
upstream (yet) because they would also increase the risk of
regressions in latency-sensitive IO-heavy workloads (like SynMark2
OglTerrainFly* and some arguably poorly designed IPC-bound X11
benchmarks).
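
To make the effect of the IO utilization threshold concrete, here is
a minimal sketch of a two-state selector.  It is NOT the logic from
PATCH 6 or the parameters from PATCH 8: the class name, the threshold
values and the "hold" count (a crude stand-in for the hysteresis) are
all hypothetical.

```python
class ModeHeuristic:
    """Hypothetical selector between low-latency and low-power modes."""

    def __init__(self, threshold, hold_samples):
        self.threshold = threshold        # IO utilization in [0, 1]
        self.hold_samples = hold_samples  # consecutive samples required
        self.streak = 0
        self.mode = "low-latency"

    def update(self, io_utilization):
        # Count consecutive samples at or above the threshold and only
        # switch to low-power once the load has been sustained.
        if io_utilization >= self.threshold:
            self.streak += 1
        else:
            self.streak = 0
        if self.streak >= self.hold_samples:
            self.mode = "low-power"
        else:
            self.mode = "low-latency"
        return self.mode

# A benchmark that keeps the GPU only ~70% busy (made-up trace).
trace = [0.7] * 5

strict = ModeHeuristic(threshold=0.8, hold_samples=3)
relaxed = ModeHeuristic(threshold=0.6, hold_samples=3)
strict_modes = [strict.update(u) for u in trace]
relaxed_modes = [relaxed.update(u) for u in trace]
print(strict_modes[-1], relaxed_modes[-1])
```

With the stricter threshold the selector never leaves the low-latency
state for this trace, while the lower threshold lets it start saving
energy after three samples.  The same change would also move a
latency-sensitive workload hovering around 70% utilization into the
low-power state.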

Discrete graphics aren't likely to see much of a visible improvement
from this.  Many non-IGP workloads *could* benefit from a reduction of
the system's energy usage while the discrete GPU (or really, any other
IO device) is a bottleneck, but this is not attempted in this series,
since that would involve making an energy efficiency/latency trade-off
that only the maintainers of the respective drivers are in a position
to make.  The cpufreq interface introduced in PATCH 1 to achieve this
is left as an opt-in for that reason.  Only the i915 DRM driver is
hooked up, since it will get the most direct pay-off due to the
increased energy budget available to the GPU, but other power-hungry
third-party gadgets built into the same package (*cough* AMD *cough*
Mali *cough* PowerVR *cough*) may be able to benefit from this
interface eventually by instrumenting the driver in a similar way.

The cpufreq interface is not exclusively tied to the intel_pstate
driver, because other governors can make use of the statistic
calculated as a result to avoid over-optimizing for latency in scenarios
where a lower frequency would be able to achieve similar throughput
while using less energy.  The interpretation of this statistic relies
on the observation that for as long as the system is CPU-bound, any IO
load occurring as a result of the execution of a program will scale
roughly linearly with the clock frequency the program is run at, so
(assuming that the CPU has enough processing power) a point will be
reached at which the program won't be able to execute faster with
increasing CPU frequency because the throughput limits of some device
will have been attained.  Increasing frequencies past that point only
pessimizes energy usage for no real benefit -- The optimal behavior is
for the CPU to lock to the minimum frequency that is able to keep the
IO devices involved fully utilized (assuming we are past the
maximum-efficiency inflection point of the CPU's power-to-frequency
curve), which is roughly the goal of this series.
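
The model described above can be sketched as follows.  The linear
CPU-bound scaling and the hard IO throughput limit come from the
paragraph above; the function names, the list of available
frequencies and the numbers are assumptions made for the example, and
the sketch ignores the caveat about the maximum-efficiency inflection
point.

```python
def throughput(freq_mhz, work_per_mcycle, io_limit):
    # CPU-bound region: work scales linearly with frequency.
    # IO-bound region: work is capped by the device's limit.
    return min(freq_mhz * work_per_mcycle, io_limit)

def min_saturating_freq(freqs_mhz, work_per_mcycle, io_limit):
    # Lowest available frequency that reaches the best throughput
    # achievable with any of the available frequencies.
    best = max(throughput(f, work_per_mcycle, io_limit) for f in freqs_mhz)
    return min(f for f in freqs_mhz
               if throughput(f, work_per_mcycle, io_limit) >= best)

# Hypothetical P-states (MHz) and workload parameters.
freqs = [800, 1100, 1500, 1900, 2300]
io_bound = min_saturating_freq(freqs, work_per_mcycle=2, io_limit=3000)
cpu_bound = min_saturating_freq(freqs, work_per_mcycle=2, io_limit=10**9)
print(io_bound, cpu_bound)
```

In the IO-bound case every frequency from 1500 MHz up yields the same
throughput, so anything above 1500 MHz only costs energy.  In the
CPU-bound case the limit is never reached and the highest frequency
remains the one that maximizes throughput.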

PELT could be a useful extension for this model, since the model's
largely heuristic assumptions would become more accurate if the IO and
CPU load could be tracked separately for each scheduling entity, but
this is not attempted in this series because the additional complexity
and computational cost of such an approach are hard to justify at this
stage, particularly since the current governor has similar
limitations.

Various frequency and step-function response graphs are available in
[6]-[9] for comparison (obtained empirically on a BXT J3455 system).
The response curves for the low-latency and low-power states of the
heuristic are shown separately -- As you can see they roughly bracket
the frequency response curve of the current governor.  The step
response of the aggressive heuristic is within a single update period
(even though it's not quite obvious from the graph with the levels of
zoom provided).  I'll attach benchmark results from a slower but
non-TDP-limited machine (which means there will be no TDP budget
increase that could possibly mask a performance regression of another
kind) as soon as they come out.

Thanks to Eero and Valtteri for testing a number of intermediate
revisions of this series (and there were quite a few of them) on more
than half a dozen systems.  They helped spot quite a few issues in
earlier versions of this heuristic.

[PATCH 1/9] cpufreq: Implement infrastructure keeping track of aggregated IO active time.
[PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with core_funcs"
[PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long names"
[PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()"
[PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs"
[PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass filtering controller for small core.
[PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller based on ACPI FADT profile.
[PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller parameters via debugfs.
[PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.

[1] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455.log
[2] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-per-watt-comparison-J3455.log
[3] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455-1.log
[4] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J4205.log
[5] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J5005.log
[6] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-magnitude-comparison.svg
[7] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-phase-comparison.svg
[8] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-1.svg
[9] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-2.svg