[Intel-gfx] [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2).

Tue Mar 24 20:03:48 UTC 2020

On Tue, 2020-03-24 at 12:16 -0700, Francisco Jerez wrote:
> Francisco Jerez <currojerez at riseup.net> writes:
> 
> > "Pandruvada, Srinivas" <srinivas.pandruvada at intel.com> writes:
> > 
> > > Hi Francisco,
> > > 
> > > On Tue, 2020-03-10 at 14:41 -0700, Francisco Jerez wrote:
> > > > This is my second take on improving the energy efficiency of the
> > > > intel_pstate driver under IO-bound conditions.  The problem and
> > > > approach to solve it are roughly the same as in my previous series
> > > > [1] at a high level:
> > > > 
> > > > In IO-bound scenarios (by definition) the throughput of the system
> > > > doesn't improve with increasing CPU frequency beyond the threshold
> > > > value at which the IO device becomes the bottleneck, however with
> > > > the current governors (whether HWP is in use or not) the CPU
> > > > frequency tends to oscillate with the load, often with an amplitude
> > > > far into the turbo range, leading to severely reduced energy
> > > > efficiency, which is particularly problematic when a limited TDP
> > > > budget is shared among a number of cores running some multithreaded
> > > > workload, or among a CPU core and an integrated GPU.
> > > > 
> > > > Improving the energy efficiency of the CPU improves the throughput
> > > > of the system in such TDP-limited conditions.  See [4] for some
> > > > preliminary benchmark results from a Razer Blade Stealth 13 Late
> > > > 2019/LY320 laptop with an Intel ICL processor and integrated
> > > > graphics, including throughput results that range up to a ~15%
> > > > improvement and performance-per-watt results up to a ~43%
> > > > improvement (estimated via RAPL).  Particularly the throughput
> > > > results may vary substantially from one platform to another
> > > > depending on the TDP budget and the balance of load between CPU and
> > > > GPU.
> > > > 
> > > 
> > > Your patches change the EPP to 0, whether intentionally or not.  We
> > > know that all energy optimizations are disabled with that setting.
> > > This test was done on an ICL system.
> > > 
> > 
> > Hmm, that's bad, and fully unintentional.  It's probably a side effect
> > of intel_pstate_reset_vlp() running before intel_pstate_hwp_set(),
> > which could cause it to use an uninitialized value of hwp_req_cached
> > (zero?).  I'll fix it in v3.  Thanks a lot for pointing this out.
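
The ordering problem described above can be illustrated with a toy
model.  This is only a sketch under assumptions, not the driver code:
the class, its methods and the read-modify-write behaviour are
hypothetical stand-ins for intel_pstate_reset_vlp(),
intel_pstate_hwp_set() and hwp_req_cached, and the min/max values are
simply borrowed from the register dumps quoted further down in this
thread.  Under those assumptions, running the reset step first leaves
EPP at 0, which is consistent with the reported values.

```python
EPP_SHIFT = 24
EPP_MASK = 0xFF << EPP_SHIFT


class ToyCpu:
    """Hypothetical model of one CPU's HWP request state (not kernel code)."""

    def __init__(self, boot_epp=0x80):
        # Register as left by boot: EPP programmed, no min/max limits yet.
        self.msr_hwp_request = boot_epp << EPP_SHIFT
        # Cached copy that nothing has initialised yet.
        self.hwp_req_cached = 0

    def hwp_set(self, min_perf, max_perf):
        # Assumed behaviour: read-modify-write that keeps whatever EPP
        # is currently in the register, then refreshes the cached copy.
        value = self.msr_hwp_request & EPP_MASK
        value |= (min_perf & 0xFF) | ((max_perf & 0xFF) << 8)
        self.msr_hwp_request = value
        self.hwp_req_cached = value

    def reset_vlp(self, min_perf, max_perf):
        # Assumed behaviour: builds the request from the cached copy
        # instead of from the register.
        value = self.hwp_req_cached & EPP_MASK
        value |= (min_perf & 0xFF) | ((max_perf & 0xFF) << 8)
        self.msr_hwp_request = value


def buggy_boot():
    """reset_vlp() runs before hwp_set(): the cached value is still 0."""
    cpu = ToyCpu()
    cpu.reset_vlp(4, 4)
    after_reset = cpu.msr_hwp_request
    cpu.hwp_set(4, 0x27)
    return after_reset, cpu.msr_hwp_request


def fixed_boot():
    """hwp_set() runs first, so the cached value carries the boot EPP."""
    cpu = ToyCpu()
    cpu.hwp_set(4, 0x27)
    cpu.reset_vlp(4, 4)
    return cpu.msr_hwp_request


if __name__ == "__main__":
    print([hex(v) for v in buggy_boot()])
    print(hex(fixed_boot()))
```

In this model the fix is purely a matter of call order (or of
initialising the cached copy from the register before first use); how
v3 actually addresses it is not specified here.
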
> > 
> 
> Sigh.  That means that the performance results I got were inadvertently
> obtained while using an EPP setting of "performance" (!).  That's
> unlikely to be the case in most systems but still kind of meaningful.
We know that "performance" mode is not great for workloads which
depend on some power sharing.

Thanks,
Srinivas 

> Need to get updated performance numbers with EPP=0x80 -- The larger up
> to ~40% energy efficiency improvements still seem to be visible
> regardless, but the throughput benefit is likely to be lower than with
> EPP=0.
> 
> > > Without your patches, on top of linux-next, EPP = 0x80:
> > > $ sudo rdmsr -a 0x774
> > > 80002704
> > > 80002704
> > > 80002704
> > > 80002704
> > > 80002704
> > > 80002704
> > > 80002704
> > > 80002704
> > > 
> > > 
> > > After your patches, EPP = 0:
> > > 
> > > $ sudo rdmsr -a 0x774
> > > 2704
> > > 2704
> > > 2704
> > > 2704
> > > 2704
> > > 2704
> > > 2704
> > > 2704
> > > 
> > > I added some prints.  Basically, you change the EPP at startup, before
> > > the regular HWP request update path runs, and that path then updates
> > > on top of your value, so the boot-up EPP is overwritten.
> > > 
> > > 
> > > [    5.867476] intel_pstate_reset_vlp hwp_req cached:0
> > > [    5.872426] intel_pstate_reset_vlp hwp_req:404
> > > [    5.881645] intel_pstate_reset_vlp hwp_req cached:0
> > > [    5.886634] intel_pstate_reset_vlp hwp_req:404
> > > [    5.895819] intel_pstate_reset_vlp hwp_req cached:0
> > > [    5.900958] intel_pstate_reset_vlp hwp_req:404
> > > [    5.910321] intel_pstate_reset_vlp hwp_req cached:0
> > > [    5.915406] intel_pstate_reset_vlp hwp_req:404
> > > [    5.924623] intel_pstate_reset_vlp hwp_req cached:0
> > > [    5.929564] intel_pstate_reset_vlp hwp_req:404
> > > [    5.944039] intel_pstate_reset_vlp hwp_req cached:0
> > > [    5.951672] intel_pstate_reset_vlp hwp_req:404
> > > [    5.966157] intel_pstate_reset_vlp hwp_req cached:0
> > > [    5.973808] intel_pstate_reset_vlp hwp_req:404
> > > [    5.988223] intel_pstate_reset_vlp hwp_req cached:0
> > > [    5.995823] intel_pstate_reset_vlp hwp_req:404
> > > [    6.010062] intel_pstate: HWP enabled
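
For readers decoding the values above by hand: MSR 0x774 is
IA32_HWP_REQUEST, whose low 32 bits hold the minimum performance
(bits 7:0), maximum performance (15:8), desired performance (23:16)
and energy-performance preference (31:24).  The sketch below splits
the quoted values into those fields.  The helper names are made up
for illustration, and it assumes that the "hwp_req:404" debug print
is hexadecimal, which the log itself does not state.

```python
from typing import NamedTuple


class HwpRequest(NamedTuple):
    min_perf: int
    max_perf: int
    desired_perf: int
    epp: int


def decode_hwp_request(value: int) -> HwpRequest:
    """Split the low 32 bits of an IA32_HWP_REQUEST value into fields."""
    return HwpRequest(
        min_perf=value & 0xFF,
        max_perf=(value >> 8) & 0xFF,
        desired_perf=(value >> 16) & 0xFF,
        epp=(value >> 24) & 0xFF,
    )


if __name__ == "__main__":
    # Values copied from the rdmsr output and the debug prints above.
    for raw in ("80002704", "2704", "404"):
        req = decode_hwp_request(int(raw, 16))
        print(raw, {name: hex(field) for name, field in req._asdict().items()})
```

Read this way, the only difference between the two rdmsr dumps is the
EPP byte (0x80 before the patches, 0 after), while min and max stay
at 0x04 and 0x27.
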
> > > 
> > > Thanks,
> > > Srinivas
> > > 
> > > 
> > > 
> > > > One of the main differences relative to my previous version is that
> > > > the trade-off between energy efficiency and frequency ramp-up
> > > > latency is now exposed to device drivers through a new PM QoS class
> > > > [It would make sense to expose it to userspace too eventually but
> > > > that's beyond the purpose of this series].  The new PM QoS class
> > > > provides a latency target to CPUFREQ governors which gives them
> > > > permission to filter out CPU frequency oscillations with a period
> > > > significantly shorter than the specified target, whenever doing so
> > > > leads to improved energy efficiency.
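
To make the "filter out oscillations with a period significantly
shorter than the target" idea concrete, here is a generic first-order
low-pass filter applied to a square-wave frequency demand.  It is not
the VLP controller from the series, only an illustration of the
principle; the 50 ms time constant, the 1.0/3.0 GHz levels and all
function names are arbitrary assumptions.

```python
def low_pass(samples, dt_ms, tau_ms):
    """First-order (exponential) low-pass filter with time constant tau_ms."""
    alpha = dt_ms / (tau_ms + dt_ms)
    filtered = []
    y = samples[0]
    for x in samples:
        y += alpha * (x - y)
        filtered.append(y)
    return filtered


def square_wave(period_ms, total_ms, low, high, dt_ms=1):
    """Demand that alternates between high and low every half period."""
    return [
        high if (t % period_ms) < period_ms // 2 else low
        for t in range(0, total_ms, dt_ms)
    ]


def steady_state_ripple(period_ms, tau_ms, total_ms=5000, dt_ms=1):
    """Peak-to-peak swing of the filtered demand over the last period."""
    demand = square_wave(period_ms, total_ms, low=1.0, high=3.0, dt_ms=dt_ms)
    filtered = low_pass(demand, dt_ms, tau_ms)
    tail = filtered[-(period_ms // dt_ms):]
    return max(tail) - min(tail)


if __name__ == "__main__":
    # Oscillation much faster than the 50 ms time constant: mostly removed.
    print("10 ms period:  ", round(steady_state_ripple(10, 50), 3), "GHz swing")
    # Oscillation much slower than the time constant: passes through.
    print("1000 ms period:", round(steady_state_ripple(1000, 50), 3), "GHz swing")
```

With these numbers the fast oscillation is flattened to a small
fraction of its original 2 GHz swing, while the slow one is tracked
almost fully, which is the behaviour the latency target is meant to
permit.
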
> > > > 
> > > > This series takes advantage of the new PM QoS class from the i915
> > > > driver whenever the driver determines that the GPU has become a
> > > > bottleneck for an extended period of time.  At that point it places
> > > > a PM QoS ramp-up latency target which causes CPUFREQ to limit the
> > > > CPU to a reasonably energy-efficient frequency able to at least
> > > > achieve the required amount of work in a time window approximately
> > > > equal to the ramp-up latency target (since any longer-term energy
> > > > efficiency optimization would potentially violate the latency
> > > > target).  This seems more effective than clamping the CPU frequency
> > > > to a fixed value directly from various subsystems, since the CPU is
> > > > a shared resource, so the frequency bound needs to consider the load
> > > > and latency requirements of all independent workloads running on the
> > > > same CPU core in order to avoid performance degradation in a
> > > > multitasking, possibly virtualized environment.
> > > > 
> > > > The main limitation of this PM QoS approach is that whenever
> > > > multiple clients request different ramp-up latency targets, only the
> > > > strictest (lowest latency) one will apply system-wide, potentially
> > > > leading to suboptimal energy efficiency for the less
> > > > latency-sensitive clients, (though it won't artificially limit the
> > > > CPU throughput of the most latency-sensitive clients as a result of
> > > > the PM QoS requests placed by less latency-sensitive ones).  In
> > > > order to address this limitation I'm working on a more complicated
> > > > solution which integrates with the task scheduler in order to
> > > > provide response latency control with process granularity (pretty
> > > > much in the spirit of PELT).  One of the alternatives Rafael and I
> > > > were discussing was to expose that through a third cgroup clamp on
> > > > top of the MIN and MAX utilization clamps, but I'm open to any other
> > > > possibilities regarding what the interface should look like.  Either
> > > > way the current (scheduling-unaware) PM QoS-based interface should
> > > > provide most of the benefit except in heavily multitasking
> > > > environments.
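
The "strictest request wins" aggregation described above amounts to
taking the minimum over all outstanding latency targets.  A minimal
sketch, assuming targets expressed in nanoseconds; the function name,
the "audio" client and the numbers are hypothetical and not taken
from the series:

```python
from typing import Dict, Optional


def effective_latency_target_ns(requests: Dict[str, int]) -> Optional[int]:
    """Return the system-wide target: the lowest latency any client asked for.

    Returns None when no client has placed a request, i.e. no target applies.
    """
    if not requests:
        return None
    return min(requests.values())


if __name__ == "__main__":
    requests = {
        "i915": 50_000_000,   # GPU-bound client tolerating 50 ms ramp-up
        "audio": 2_000_000,   # hypothetical latency-sensitive client, 2 ms
    }
    print(effective_latency_target_ns(requests))
```

Here the 2 ms request overrides the 50 ms one for the whole system,
so the GPU-bound client loses most of its potential energy saving,
which is the limitation the per-task approach is meant to remove.
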
> > > > 
> > > > A branch with this series in testable form can be found here [2],
> > > > based on linux-next from a few days ago.  Another important
> > > > difference with respect to my previous revision is that the present
> > > > one targets HWP systems (though for the moment it's only enabled by
> > > > default on ICL, even though that can be overridden through the
> > > > kernel command line).  I have WIP code that uses the same governor
> > > > in order to provide a similar benefit on non-HWP systems (like my
> > > > previous revision), which can be found in this branch for reference
> > > > [3] -- I'm planning to finish that up and send it as follow-up to
> > > > this series assuming people are happy with the overall approach.
> > > > 
> > > > Thanks in advance for any review feedback and test reports.
> > > > 
> > > > [PATCH 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS limit.
> > > > [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
> > > > [PATCH 03/10] OPTIONAL: drm/i915: Expose PM QoS control parameters via debugfs.
> > > > [PATCH 04/10] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs"
> > > > [PATCH 05/10] cpufreq: intel_pstate: Implement VLP controller statistics and status calculation.
> > > > [PATCH 06/10] cpufreq: intel_pstate: Implement VLP controller target P-state range estimation.
> > > > [PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller for HWP parts.
> > > > [PATCH 08/10] cpufreq: intel_pstate: Enable VLP controller based on ACPI FADT profile and CPUID.
> > > > [PATCH 09/10] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP controller status.
> > > > [PATCH 10/10] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller parameters via debugfs.
> > > > 
> > > > [1] https://marc.info/?l=linux-pm&m=152221943320908&w=2
> > > > [2] https://github.com/curro/linux/commits/intel_pstate-vlp-v2-hwp-only
> > > > [3] https://github.com/curro/linux/commits/intel_pstate-vlp-v2
> > > > [4] http://people.freedesktop.org/~currojerez/intel_pstate-vlp-v2/benchmark-comparison-ICL.log
> > > > 