[Intel-gfx] [PATCH v7 0/3] Dynamic EU configuration of Slice/Sub-slice/EU

Francisco Jerez currojerez at riseup.net
Sun Mar 15 00:12:19 UTC 2020


srinivasan.s at intel.com writes:

> From: Srinivasan S <srinivasan.s at intel.com>
>
>       drm/i915: Context aware user agnostic EU/Slice/Sub-slice control within kernel
>
> This patch set reduces GPU power consumption on Linux-kernel-based OSes such
> as Chromium OS and Ubuntu. The measured power savings follow.
>
> Power savings on the GLK-GT1 Bobba platform running Chrome OS:
> ------------------------|----------------------|
> App / KPI               | % Power Benefit      |
> ------------------------|----------------------|
> Hangout call (20 min)   | 1.8%                 |
> Youtube 4K VPB          | 14.13%               |
> WebGL Aquarium          | 13.76%               |
> Unity3D                 | 6.78%                |
> ------------------------|----------------------|
> Chrome PLT              | Battery life improves|
>                         | by ~45 minutes       |
> ------------------------|----------------------|
>
> Power savings on KBL-GT3 running Android and Ubuntu (Linux):
> ------------------------|----------------------|
> App / KPI               | % Power Benefit      |
>                         |----------|-----------|
>                         | Android  | Ubuntu    |
> ------------------------|----------|-----------|
> 3D Mark (Ice Storm)     | 2.30%    | N.A.      |
> TRex on screen          | 2.49%    | 2.97%     |
> Manhattan on screen     | 3.11%    | 4.90%     |
> Carchase on screen      | N.A.     | 5.06%     |
> AnTuTu 6.1.4            | 3.42%    | N.A.      |
> SynMark2                | N.A.     | 1.7%      |
> ------------------------|----------|-----------|
>

Did you get any performance (e.g. FPS) measurements from those
test-cases?  There is considerable potential for this feature to
constrain the GPU throughput inadvertently, which could lead to an
apparent reduction in power usage that is not accompanied by any
improvement in energy efficiency.  In fact, AFAIUI there is some
potential for this feature to *decrease* the energy efficiency of the
system: if the GPU would have been able to keep all EUs busy at a lower
frequency, but the parallelism constraint forces it to run at a higher
frequency above RPe in order to achieve the same throughput, then due
to the convexity of the power curve of the EU we have:

  P(k * f) > k * P(f)

Where 'k' is the ratio between the EU parallelism without and with SSEU
control, and f > RPe is the original GPU frequency without SSEU control.
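
Just to make the shape of that concrete with made-up numbers: assuming
a roughly cubic per-EU power curve, P(f) = c * f^3 (what you'd expect
if voltage scales about linearly with frequency), and k = 2, i.e. SSEU
control halving the usable parallelism:

  P(2 * f) = 8 * c * f^3 = 8 * P(f) > 2 * P(f)

So even though only half the EUs are powered, each active EU burns ~8x
the power, leaving the GPU at roughly 4x the total power for the same
throughput, and hence roughly 4x the energy for the same amount of work.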

In scenarios like that we *might* seem to be using less power with SSEU
control if the workload is running longer, but it would end up using
more energy overall by the time it completes, so it would be good to
have some performance-per-watt numbers to make sure that's not
happening.
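
As a toy example: a workload that draws 10 W for 100 s consumes 1 kJ,
while the same workload throttled to 8 W but stretched out to 150 s
consumes 1.2 kJ -- a 20% apparent power saving that is actually a 20%
energy regression.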

> We have also observed that GPU core residency improves by 1.035%.
>
> Technical Insights of the patch:
> The current GPU configuration code for i915 does not allow us to change the
> EU/Slice/Sub-slice configuration dynamically. It is done only once, when the
> context is created.
>
> While a particular graphics application is running, if we examine the command
> requests coming from user space, we observe that the command density is not
> consistent. This means there is scope to change the graphics configuration
> dynamically even while a context is actively running. This patch series
> proposes a solution: determine the pending load of all active contexts at a
> given time and, based on that, dynamically reconfigure the graphics settings
> of each context.
>
> We use an hrtimer (high-resolution timer) in the i915 driver to get a callback
> every few milliseconds (the timer period can be configured through debugfs;
> the default is '0', meaning the timer is disabled, i.e. the original system
> without any intervention). In the timer callback we examine the pending
> commands for a context in the queue; essentially, we intercept them before
> they are executed by the GPU and update the context with the required number
> of EUs.
>

Given that the EU configuration update is synchronous with command
submission, do you really need a timer?  It sounds like it would incur
less CPU overhead to adjust the EU count on demand, whenever the
request counter reaches or drops below a threshold, instead of polling
some CPU-side data structure.
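
Roughly something like the following (a hypothetical sketch, to be
clear -- none of these names exist in i915 today, and the watermarks
are made up):

  /* Hypothetical per-context load tracking, not actual i915 code. */
  struct ctx_load {
          unsigned int pending;   /* requests submitted but not retired */
          bool boosted;           /* wide SSEU config currently selected */
  };

  enum sseu_config { MIN_SSEU_CONFIG, FULL_SSEU_CONFIG };

  /* Assumed helper that queues the SSEU rewrite of the context image. */
  static void ctx_queue_sseu_update(struct ctx_load *load,
                                    enum sseu_config cfg);

  #define PENDING_HIGH_WM 8       /* made-up thresholds */
  #define PENDING_LOW_WM  2

  /*
   * Called with delta = +1 on submission and delta = -1 on retirement,
   * so the SSEU update is triggered exactly when the counter crosses a
   * watermark rather than by polling from an hrtimer callback.
   */
  static void ctx_load_update(struct ctx_load *load, int delta)
  {
          load->pending += delta;

          if (!load->boosted && load->pending >= PENDING_HIGH_WM) {
                  load->boosted = true;
                  ctx_queue_sseu_update(load, FULL_SSEU_CONFIG);
          } else if (load->boosted && load->pending <= PENDING_LOW_WM) {
                  load->boosted = false;
                  ctx_queue_sseu_update(load, MIN_SSEU_CONFIG);
          }
  }

The hysteresis between the two watermarks is only there to avoid
flip-flopping the context image on every submission.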

> Two questions: how did we arrive at the right timer value, and what is the
> right number of EUs? For the former, we relied on empirical data that achieved
> the best performance at the least power. For the latter, we roughly categorized
> the number of EUs logically based on the platform. We then compare the number
> of pending commands against a particular threshold and set the number of EUs
> accordingly via a context update. That threshold is likewise based on
> experiments and findings. If the GPU is able to keep up with the CPU there are
> typically no pending commands, and the EU configuration remains unchanged. If
> there are more pending commands, we reprogram the context with a higher number
> of EUs. Please note that we are changing the EU count even while the context is
> running, by examining the pending commands every 'x' milliseconds.
>

I have doubts that the number of requests pending execution is a
particularly reliable indicator of the optimal number of EUs the
workload needs enabled, for starters because the execlists submission
code seems to be able to merge multiple requests into the same port, so
there might appear to be zero pending commands even if the GPU has a
backlog of several seconds' or minutes' worth of work.

But even if you were using an accurate measure of the GPU load, would
that really be a good indicator of whether the GPU would run more
efficiently with more or fewer EUs enabled?  I can think of many
scenarios where a short-lived GPU request would consume less energy and
complete faster while running with all EUs enabled (e.g. if it actually
has enough parallelism to take advantage of all EUs in the system).
Conversely I can think of some scenarios where a long-running GPU
request would benefit from SSEU control (e.g. a poorly parallelizable
but heavy 3D geometry pipeline or GPGPU workload).  The former seems
more worrying than the latter since it could lead to performance or
energy efficiency regressions.

IOW it seems to me that the optimal number of EUs enabled is more a
function of the internal parallelism constraints of each request than
of the overall GPU load.  You should be able to get some understanding
of that by e.g. calculating the average number of threads loaded, based
on the EU SPM counters, but unfortunately the ones you'd need are only
available on TGL+ IIRC.  On earlier platforms you should be able to
achieve the same thing by sampling some FLEXEU counters, but you'd
likely have to mess with the mux configuration, which would interfere
with OA sampling -- however, it sounds like this feature may have to be
disabled anytime OA is active anyway, so that may not be a problem
after all?
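
The kind of derived metric I have in mind would be something like this
(again a hypothetical sketch -- the counter plumbing is elided since
it is platform-specific, and none of these names exist in the driver):

  #include <linux/math64.h>

  /* Accumulated over a sampling window from the EU counters. */
  struct eu_occupancy_sample {
          u64 threads_loaded_cycles; /* sum over cycles of threads loaded */
          u64 cycles;                /* EU cycles elapsed in the window */
  };

  /* Average EU thread occupancy over the window, in percent. */
  static unsigned int
  eu_occupancy_pct(const struct eu_occupancy_sample *s,
                   unsigned int threads_per_eu, unsigned int num_eus)
  {
          u64 max = s->cycles * threads_per_eu * num_eus;

          return max ? div64_u64(100 * s->threads_loaded_cycles, max) : 0;
  }

A context whose requests consistently report an occupancy close to 100%
has enough parallelism to make use of the full EU complement, while one
that doesn't would be a candidate for a narrower SSEU configuration,
independently of how many requests happen to be sitting in the queue.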

Regards,
Francisco.

> Srinivasan S (3):
>   drm/i915: Get active pending request for given context
>   drm/i915: set optimum eu/slice/sub-slice configuration based on load
>     type
>   drm/i915: Predictive governor to control slice/subslice/eu
>
>  drivers/gpu/drm/i915/Makefile                     |   1 +
>  drivers/gpu/drm/i915/gem/i915_gem_context.c       |  20 +++++
>  drivers/gpu/drm/i915/gem/i915_gem_context.h       |   2 +
>  drivers/gpu/drm/i915/gem/i915_gem_context_types.h |  38 ++++++++
>  drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c    |   1 +
>  drivers/gpu/drm/i915/gt/intel_deu.c               | 104 ++++++++++++++++++++++
>  drivers/gpu/drm/i915/gt/intel_deu.h               |  31 +++++++
>  drivers/gpu/drm/i915/gt/intel_lrc.c               |  44 ++++++++-
>  drivers/gpu/drm/i915/i915_drv.h                   |   6 ++
>  drivers/gpu/drm/i915/i915_gem.c                   |   4 +
>  drivers/gpu/drm/i915/i915_params.c                |   4 +
>  drivers/gpu/drm/i915/i915_params.h                |   1 +
>  drivers/gpu/drm/i915/intel_device_info.c          |  74 ++++++++++++++-
>  13 files changed, 325 insertions(+), 5 deletions(-)
>  create mode 100644 drivers/gpu/drm/i915/gt/intel_deu.c
>  create mode 100644 drivers/gpu/drm/i915/gt/intel_deu.h
>
> -- 
> 2.7.4
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx