[Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
Francisco Jerez
currojerez at riseup.net
Wed Apr 11 16:10:32 UTC 2018
Hi Srinivas,
Srinivas Pandruvada <srinivas.pandruvada at linux.intel.com> writes:
> On Tue, 2018-04-10 at 15:28 -0700, Francisco Jerez wrote:
>> Francisco Jerez <currojerez at riseup.net> writes:
>>
> [...]
>
>
>> In case anyone is wondering what's going on: Srinivas pointed me at a
>> larger idle power usage increase off-list, ultimately caused by the
>> low-latency heuristic discussed in the paragraph above.  I have a v2
>> of PATCH 6 that gives the controller a third response curve roughly
>> intermediate between the low-latency and low-power states of this
>> revision, which avoids the energy usage increase expected for v1
>> while C0 residency is low (e.g. during idle).  The low-latency
>> behavior of this revision will still be available based on a
>> heuristic (in particular when a realtime-priority task is
>> scheduled).  We're carrying out some additional testing; I'll post
>> the code here eventually.
>
> Please also try the schedutil governor.  There is a
> frequency-invariance patch, which I can send you (it will eventually
> be pushed by Peter).  We want to avoid adding complexity to
> intel_pstate for non-HWP power-sensitive platforms as far as possible.
>
Unfortunately the schedutil governor (whether frequency-invariant or
not) has the exact same energy efficiency issues as the present
intel_pstate non-HWP governor.  Its response is severely underdamped,
leading to energy-inefficient behavior for any oscillating
non-CPU-bound workload.  To exacerbate that problem, the frequency is
maxed out on frequent IO waiting just like the current intel_pstate
CPU-load controller does, even though frequent IO waits may actually
be an indication that the system is IO-bound (which means that the
large increase in energy usage may not translate into any performance
benefit in practice, not to speak of performance being impacted
negatively in TDP-bound scenarios like GPU rendering).
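
To give an idea of the kind of damping I mean, here is a minimal,
self-contained sketch of a first-order low-pass (EWMA) filter applied
to the utilization signal before it is turned into a frequency
request.  All names and the fixed-point format are made up for
illustration; the actual controller in PATCH 6 is more elaborate and
varies the cutoff frequency depending on the heuristic state:

#include <stdint.h>

#define LP_FRAC_BITS 8
#define LP_ONE (1 << LP_FRAC_BITS)

struct lp_filter {
        uint32_t avg_util;  /* filtered utilization, LP_FRAC_BITS fixed point */
        uint32_t gain;      /* in [0, LP_ONE]; lower means stronger damping */
};

/* Feed one raw utilization sample in [0, LP_ONE] into the filter. */
static uint32_t lp_filter_update(struct lp_filter *f, uint32_t util)
{
        int32_t err = (int32_t)util - (int32_t)f->avg_util;

        /* avg += gain * (util - avg), in fixed point.  A gain of
         * LP_ONE degenerates into the undamped behavior criticized
         * above; lower gains trade response latency for stability. */
        f->avg_util += (int32_t)(((int64_t)f->gain * err) >> LP_FRAC_BITS);

        return f->avg_util;
}

/* Map the filtered utilization to a frequency in [min_freq, max_freq]. */
static uint32_t lp_target_freq(const struct lp_filter *f,
                               uint32_t min_freq, uint32_t max_freq)
{
        uint64_t delta = (uint64_t)f->avg_util * (max_freq - min_freq);

        return min_freq + (uint32_t)(delta >> LP_FRAC_BITS);
}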
Regarding run-time complexity, I haven't observed this governor to be
measurably more computationally intensive than the present one.  It
does execute a bunch more instructions, but still within the same
ballpark as the current governor.  The average increase in CPU
utilization on my BXT with this series is less than 0.03% (sampled via
ftrace for v1; I can repeat the measurement for the v2 I have in the
works, though I don't expect the result to be substantially
different).  If this is a problem for you, there are several
optimization opportunities that would cut down the number of CPU
cycles get_target_pstate_lp() takes to execute by a large percentage.
Most of the optimization ideas I can think of right now would come at
some accuracy, maintainability or debuggability cost (but may still be
worth pursuing).  Either way, the computational overhead is low enough
at this point that its impact on any benchmark or real workload would
be orders of magnitude lower than the variance, which makes it kind of
difficult to keep the discussion data-driven [as possibly any
performance optimization discussion should ever be ;)].
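
For what it's worth, that kind of overhead can also be cross-checked
in-kernel without ftrace.  A rough sketch only: the wrapper and
counters below are made up, the signature of get_target_pstate_lp() is
assumed, and the 0.03% figure above was obtained with ftrace:

#include <linux/ktime.h>

struct cpudata;                                 /* from intel_pstate.c */
int get_target_pstate_lp(struct cpudata *cpu);  /* from PATCH 6, signature assumed */

static u64 lp_time_total_ns;
static u64 lp_samples;

/* Timed wrapper around the controller, for sampling its cost. */
static int get_target_pstate_lp_timed(struct cpudata *cpu)
{
        ktime_t t0 = ktime_get();
        int pstate = get_target_pstate_lp(cpu);

        lp_time_total_ns += ktime_to_ns(ktime_sub(ktime_get(), t0));
        lp_samples++;

        return pstate;
}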
>
> Thanks,
> Srinivas
>
>
>
>>
>> > [Absolute benchmark results are unfortunately omitted from this
>> > letter due to company policies, but the percent change and
>> > Student's T p-value are included above and in the referenced
>> > benchmark results]
>> >
>> > The most obvious impact of this series will likely be the overall
>> > improvement in graphics performance on systems with an IGP
>> > integrated into the processor package (though for the moment this
>> > is only enabled on BXT+), because the TDP budget shared among CPU
>> > and GPU can frequently become a limiting factor in low-power
>> > devices.  On heavily TDP-bound devices this series improves the
>> > performance of virtually any non-trivial graphics rendering by a
>> > significant amount (on the order of the energy efficiency
>> > improvement for that workload, assuming the optimization didn't
>> > cause it to become non-TDP-bound).
>> >
>> > See [1]-[5] for detailed numbers including various graphics
>> > benchmarks and a sample of the Phoronix daily-system-tracker.
>> > Some popular graphics benchmarks like GfxBench gl_manhattan31 and
>> > gl_4 improve between 5% and 11% on our systems.  The exact
>> > improvement can vary substantially between systems (compare the
>> > benchmark results from the two different J3455 systems [1] and
>> > [3]) due to a number of factors, including the ratio between CPU
>> > and GPU processing power, the behavior of the userspace graphics
>> > driver, the windowing system and resolution, the BIOS (which has
>> > an influence on the package TDP), the thermal characteristics of
>> > the system, etc.
>> >
>> > Unigine Valley and Heaven improve by a similar factor on some
>> > systems (see the J3455 results [1]).  On others the improvement is
>> > lower because the benchmark fails to fully utilize the GPU: the
>> > heuristic then remains in the low-latency state for longer, which
>> > leaves a reduced TDP budget available to the GPU and prevents
>> > performance from increasing further.  This can be avoided by using
>> > the alternative heuristic parameters suggested in the commit
>> > message of PATCH 8, which provide a lower IO utilization threshold
>> > and hysteresis for the controller to attempt to save energy.  I'm
>> > not proposing those for upstream (yet) because they would also
>> > increase the risk of regressing latency-sensitive IO-heavy
>> > workloads (like SynMark2 OglTerrainFly* and some arguably poorly
>> > designed IPC-bound X11 benchmarks).
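>> >
>> > To make the hysteresis idea concrete, here is a minimal sketch
>> > with entirely made-up names and thresholds (the real, tunable
>> > parameters are the ones exposed via debugfs in PATCH 8):
>> >
>> > #include <stdbool.h>
>> > #include <stdint.h>
>> >
>> > struct lp_heuristic {
>> >         bool low_power;  /* energy-saving vs. low-latency state */
>> > };
>> >
>> > static bool lp_update_state(struct lp_heuristic *h,
>> >                             uint32_t io_util_pct)
>> > {
>> >         /* Made-up thresholds; exit below enter gives hysteresis. */
>> >         const uint32_t enter_low_power = 75;  /* % IO utilization */
>> >         const uint32_t exit_low_power = 50;
>> >
>> >         if (!h->low_power && io_util_pct >= enter_low_power)
>> >                 h->low_power = true;   /* looks IO-bound, save energy */
>> >         else if (h->low_power && io_util_pct < exit_low_power)
>> >                 h->low_power = false;  /* CPU-bound again, favor latency */
>> >
>> >         return h->low_power;
>> > }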
>> >
>> > Discrete graphics aren't likely to experience that much of a
>> > visible improvement from this, even though many non-IGP workloads
>> > *could* benefit by reducing the system's energy usage while the
>> > discrete GPU (or really, any other IO device) becomes a
>> > bottleneck.  That is not attempted in this series, since it would
>> > involve making an energy efficiency/latency trade-off that only
>> > the maintainers of the respective drivers are in a position to
>> > make.  The cpufreq interface introduced in PATCH 1 to achieve this
>> > is left as an opt-in for that reason: only the i915 DRM driver is
>> > hooked up, since it will get the most direct pay-off due to the
>> > increased energy budget available to the GPU, but other
>> > power-hungry third-party gadgets built into the same package
>> > (*cough* AMD *cough* Mali *cough* PowerVR *cough*) may be able to
>> > benefit from this interface eventually by instrumenting the driver
>> > in a similar way.
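>> >
>> > For illustration, instrumenting such a gadget's driver might look
>> > roughly like the sketch below.  The helper names and the device
>> > structure are made up; the actual entry points are the ones
>> > defined in PATCH 1, and PATCH 9 shows the real hook-up for i915:
>> >
>> > /* Hypothetical stand-ins for the interface added in PATCH 1: */
>> > void cpufreq_io_active_begin(int cpu);
>> > void cpufreq_io_active_end(int cpu);
>> >
>> > struct gadget_device {
>> >         int cpu;  /* CPU whose cpufreq policy the gadget reports to */
>> > };
>> >
>> > static void gadget_submit_work(struct gadget_device *gdev)
>> > {
>> >         /* Report the start of a stretch of IO (e.g. rendering)... */
>> >         cpufreq_io_active_begin(gdev->cpu);
>> >         /* ...then kick the hardware here. */
>> > }
>> >
>> > static void gadget_work_complete(struct gadget_device *gdev)
>> > {
>> >         /* ...and its completion, so the governor's aggregated IO
>> >          * active time stays accurate. */
>> >         cpufreq_io_active_end(gdev->cpu);
>> > }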
>> >
>> > The cpufreq interface is not exclusively tied to the intel_pstate
>> > driver: other governors can make use of the statistic calculated
>> > as a result in order to avoid over-optimizing for latency in
>> > scenarios where a lower frequency would achieve similar throughput
>> > while using less energy.  The interpretation of this statistic
>> > relies on the observation that, for as long as the system is
>> > CPU-bound, any IO load occurring as a result of the execution of a
>> > program will scale roughly linearly with the clock frequency the
>> > program is run at.  So (assuming the CPU has enough processing
>> > power) a point will be reached at which the program won't be able
>> > to execute faster with increasing CPU frequency, because the
>> > throughput limits of some device will have been attained.
>> > Increasing frequencies past that point only pessimizes energy
>> > usage for no real benefit -- the optimal behavior is for the CPU
>> > to lock to the minimum frequency that is able to keep the IO
>> > devices involved fully utilized (assuming we are past the
>> > maximum-efficiency inflection point of the CPU's
>> > power-to-frequency curve), which is roughly the goal of this
>> > series.
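>> >
>> > Numerically: if the bottleneck device is observed to be active for
>> > a fraction u0 of the time at the current frequency f0, linear
>> > scaling predicts it saturates near f0 / u0, which is the lowest
>> > frequency that keeps it fully utilized.  A toy version of that
>> > calculation (the function name is made up; the real controller is
>> > in PATCH 6):
>> >
>> > /*
>> >  * Lowest frequency expected to keep the bottleneck IO device
>> >  * fully utilized, under the linear-scaling assumption above.
>> >  */
>> > static unsigned int lp_saturation_freq(unsigned int cur_freq,
>> >                                        unsigned int io_util_pct)
>> > {
>> >         if (!io_util_pct)
>> >                 return cur_freq;
>> >
>> >         /* Solve u(f) = u0 * f / f0 = 100% for f. */
>> >         return (unsigned int)((unsigned long long)cur_freq * 100 /
>> >                               io_util_pct);
>> > }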
>> >
>> > PELT could be a useful extension for this model, since its largely
>> > heuristic assumptions would become more accurate if the IO and CPU
>> > load could be tracked separately for each scheduling entity, but
>> > this is not attempted in this series because the additional
>> > complexity and computational cost of such an approach are hard to
>> > justify at this stage, particularly since the current governor has
>> > similar limitations.
>> >
>> > Various frequency and step-function response graphs are available
>> > in [6]-[9] for comparison (obtained empirically on a BXT J3455
>> > system).  The response curves for the low-latency and low-power
>> > states of the heuristic are shown separately -- as you can see,
>> > they roughly bracket the frequency response curve of the current
>> > governor.  The step response of the aggressive heuristic is within
>> > a single update period (even though that's not quite obvious from
>> > the graph at the levels of zoom provided).  I'll attach benchmark
>> > results from a slower but non-TDP-limited machine (which means
>> > there will be no TDP budget increase that could possibly mask a
>> > performance regression of another kind) as soon as they come out.
>> >
>> > Thanks to Eero and Valtteri for testing a number of intermediate
>> > revisions of this series (and there were quite a few of them) on
>> > more than half a dozen systems; they helped spot quite a few
>> > issues in earlier versions of this heuristic.
>> >
>> > [PATCH 1/9] cpufreq: Implement infrastructure keeping track of aggregated IO active time.
>> > [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with core_funcs"
>> > [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long names"
>> > [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()"
>> > [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs"
>> > [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass filtering controller for small core.
>> > [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller based on ACPI FADT profile.
>> > [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller parameters via debugfs.
>> > [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.
>> >
>> > [1] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455.log
>> > [2] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-per-watt-comparison-J3455.log
>> > [3] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455-1.log
>> > [4] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J4205.log
>> > [5] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J5005.log
>> > [6] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-magnitude-comparison.svg
>> > [7] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-phase-comparison.svg
>> > [8] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-1.svg
>> > [9] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-2.svg