[Intel-gfx] [PATCH v7 0/3] Dynamic EU configuration of Slice/Sub-slice/EU
Lionel Landwerlin
lionel.g.landwerlin at intel.com
Sun Mar 15 22:30:54 UTC 2020
On 15/03/2020 20:08, Francisco Jerez wrote:
> Lionel Landwerlin <lionel.g.landwerlin at intel.com> writes:
>
>> On 15/03/2020 02:12, Francisco Jerez wrote:
>>> srinivasan.s at intel.com writes:
>>>
>>>> From: Srinivasan S <srinivasan.s at intel.com>
>>>>
>>>> drm/i915: Context aware user agnostic EU/Slice/Sub-slice control within kernel
>>>>
>>>> This patch set improves GPU power consumption on Linux-based OSes such as
>>>> Chromium OS, Ubuntu, etc. The power savings are as follows.
>>>>
>>>> Power savings on GLK-GT1 Bobba platform running on Chrome OS.
>>>> ------------------------|----------------------|
>>>> App / KPI               | Power Benefit (%)    |
>>>> ------------------------|----------------------|
>>>> Hangout call, 20 minute | 1.8%                 |
>>>> YouTube 4K VPB          | 14.13%               |
>>>> WebGL Aquarium          | 13.76%               |
>>>> Unity3D                 | 6.78%                |
>>>> ------------------------|----------------------|
>>>> Chrome PLT              | Battery life improves|
>>>>                         | by ~45 minutes       |
>>>> ------------------------|----------------------|
>>>>
>>>> Power savings on KBL-GT3 running on Android and Ubuntu (Linux).
>>>> ------------------------|----------------------|
>>>> App / KPI               |  Power Benefit (%)   |
>>>>                         |----------|-----------|
>>>>                         | Android  | Ubuntu    |
>>>> ------------------------|----------|-----------|
>>>> 3DMark (Ice Storm)      | 2.30%    | N.A.      |
>>>> T-Rex Onscreen          | 2.49%    | 2.97%     |
>>>> Manhattan Onscreen      | 3.11%    | 4.90%     |
>>>> Car Chase Onscreen      | N.A.     | 5.06%     |
>>>> AnTuTu 6.1.4            | 3.42%    | N.A.      |
>>>> SynMark2                | N.A.     | 1.7%      |
>>>> ------------------------|----------|-----------|
>>>>
>>> Did you get any performance (e.g. FPS) measurements from those
>>> test-cases? There is quite some potential for this feature to constrain
>>> the GPU throughput inadvertently, which could lead to an apparent
>>> reduction in power usage not accompanied by an improvement in energy
>>> efficiency -- In fact AFAIUI there is some potential for this feature to
>>> *decrease* the energy efficiency of the system if the GPU would have
>>> been able to keep all EUs busy at a lower frequency, but the parallelism
>>> constraint forces it to run at a higher frequency above RPe in order to
>>> achieve the same throughput, because due to the convexity of the power
>>> curve of the EU we have:
>>>
>>> P(k * f) > k * P(f)
>>>
>>> Where 'k' is the ratio between the EU parallelism without and with SSEU
>>> control, and f > RPe is the original GPU frequency without SSEU control.
>>>
>>> In scenarios like that we *might* seem to be using less power with SSEU
>>> control if the workload is running longer, but it would end up using
>>> more energy overall by the time it completes, so it would be good to
>>> have some performance-per-watt numbers to make sure that's not
>>> happening.
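>>>
>>> As a purely illustrative numeric check (the cubic power model is an
>>> assumption; only its convexity matters): take P(f) = c * f^3 and
>>> k = 2, i.e. SSEU control halves the EU parallelism, so the remaining
>>> EUs must run at 2 * f to sustain the same throughput. Then
>>>
>>>     P(2 * f) = c * (2 * f)^3 = 8 * P(f) > 2 * P(f)
>>>
>>> and since only half as many EUs are active, the constrained
>>> configuration still draws 8 / 2 = 4 times the power of leaving all
>>> EUs enabled at f, for the same work done.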
>>>
>>>> We have also observed that GPU core residency improves by 1.035%.
>>>>
>>>> Technical Insights of the patch:
>>>> The current GPU configuration code for i915 does not allow us to change the
>>>> EU/Slice/Sub-slice configuration dynamically. It is done only once, when the
>>>> context is created.
>>>>
>>>> While a particular graphics application is running, if we examine the command
>>>> requests from user space, we observe that the command density is not
>>>> consistent. This means there is scope to change the graphics configuration
>>>> dynamically even while the context is actively running. This patch series
>>>> proposes a solution: determine the pending load across all active contexts at
>>>> a given time and, based on that, dynamically reconfigure the graphics
>>>> configuration for each context.
>>>>
>>>> We use an hrtimer (high-resolution timer) in the i915 kernel driver to get a
>>>> callback every few milliseconds (the timer period can be configured through
>>>> debugfs; the default is '0', meaning the timer is disabled, i.e. the original
>>>> system without any intervention). In the timer callback, we examine the
>>>> pending commands for a context in the queue; essentially, we intercept them
>>>> before they are executed by the GPU and update the context with the required
>>>> number of EUs.
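>>>>
>>>> For illustration, a minimal sketch of what such a polling callback
>>>> could look like inside the driver (the deu_* helpers and fields are
>>>> hypothetical, not the actual patch code):
>>>>
>>>> #include <linux/hrtimer.h>
>>>> #include <linux/ktime.h>
>>>>
>>>> /* Hypothetical polling callback; deu_* names are illustrative. */
>>>> static enum hrtimer_restart deu_timer_cb(struct hrtimer *t)
>>>> {
>>>>         struct drm_i915_private *i915 =
>>>>                 container_of(t, struct drm_i915_private, deu.timer);
>>>>
>>>>         /* Sample the pending-request backlog and reprogram contexts. */
>>>>         deu_update_contexts(i915, deu_pending_requests(i915));
>>>>
>>>>         /* Re-arm with the debugfs-configurable period. */
>>>>         hrtimer_forward_now(t, ms_to_ktime(i915->deu.interval_ms));
>>>>         return HRTIMER_RESTART;
>>>> }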
>>>>
>>> Given that the EU configuration update is synchronous with command
>>> submission, do you really need a timer? It sounds like it would be less
>>> CPU overhead to adjust the EU count on demand whenever the counter
>>> reaches or drops below the threshold instead of polling some CPU-side
>>> data structure.
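>>>
>>> E.g., something event-driven along these lines (purely a sketch with
>>> hypothetical names), called from the submit/retire paths whenever the
>>> pending count changes:
>>>
>>> static void deu_pending_changed(struct drm_i915_private *i915,
>>>                                 unsigned int pending)
>>> {
>>>         bool above = pending > i915->deu.threshold;
>>>
>>>         /* Reprogram only on threshold crossings; no polling. */
>>>         if (above != i915->deu.above_threshold) {
>>>                 i915->deu.above_threshold = above;
>>>                 deu_update_contexts(i915, pending);
>>>         }
>>> }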
>>>
>>>> Two questions: how did we arrive at the right timer value, and what is the
>>>> right number of EUs? For the former, we relied on empirical data to achieve
>>>> the best performance at the least power. For the latter, we roughly
>>>> categorized the number of EUs logically based on the platform. We compare the
>>>> number of pending commands against a particular threshold and then set the
>>>> number of EUs accordingly via a context update. That threshold is also based
>>>> on experiments and findings. If the GPU is able to keep up with the CPU,
>>>> there are typically no pending commands and the EU configuration remains
>>>> unchanged. If there are more pending commands, we reprogram the context with
>>>> a higher number of EUs. Please note that we change the EU configuration even
>>>> while the context is running, by examining the pending commands every 'x'
>>>> milliseconds.
>>>>
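>>>> A rough sketch of that load classification (the threshold value and
>>>> the names are illustrative, not the actual patch code):
>>>>
>>>> /* Hypothetical mapping from backlog depth to a load type that in
>>>>  * turn selects a slice/sub-slice/EU configuration per platform. */
>>>> #define DEU_PENDING_THRESHOLD	2	/* empirically chosen; illustrative */
>>>>
>>>> enum gem_load_type {
>>>>         LOAD_TYPE_LOW,          /* GPU keeping up: config unchanged */
>>>>         LOAD_TYPE_MEDIUM,
>>>>         LOAD_TYPE_HIGH,         /* backlog building: enable more EUs */
>>>> };
>>>>
>>>> static enum gem_load_type deu_classify(unsigned int pending)
>>>> {
>>>>         if (pending == 0)
>>>>                 return LOAD_TYPE_LOW;
>>>>         if (pending <= DEU_PENDING_THRESHOLD)
>>>>                 return LOAD_TYPE_MEDIUM;
>>>>         return LOAD_TYPE_HIGH;
>>>> }
>>>>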
>>> I have doubts that the number of requests pending execution is a
>>> particularly reliable indicator of the optimal number of EUs the
>>> workload needs enabled, for starters because the execlists submission
>>> code seems to be able to merge multiple requests into the same port, so
>>> there might seem to be zero pending commands even if the GPU has a
>>> backlog of several seconds or minutes worth of work.
>>>
>>> But even if you were using an accurate measure of the GPU load, would
>>> that really be a good indicator of whether the GPU would run more
>>> efficiently with more or fewer EUs enabled? I can think of many
>>> scenarios where a short-lived GPU request would consume less energy and
>>> complete faster while running with all EUs enabled (e.g. if it actually
>>> has enough parallelism to take advantage of all EUs in the system).
>>> Conversely I can think of some scenarios where a long-running GPU
>>> request would benefit from SSEU control (e.g. a poorly parallelizable
>>> but heavy 3D geometry pipeline or GPGPU workload). The former seems
>>> more worrying than the latter since it could lead to performance or
>>> energy efficiency regressions.
>>>
>>> IOW it seems to me that the optimal number of EUs enabled is more of a
>>> function of the internal parallelism constraints of each request rather
>>> than of the overall GPU load. You should be able to get some
>>> understanding of that by e.g. calculating the number of threads loaded
>>> on the average based on the EU SPM counters, but unfortunately the ones
>>> you'd need are only available on TGL+ IIRC. On earlier platforms you
>>> should be able to achieve the same thing by sampling some FLEXEU
>>> counters, but you'd likely have to mess with the mux configuration which
>>> would interfere with OA sampling -- However it sounds like this feature
>>> may have to be disabled anytime OA is active anyway so that may not be a
>>> problem after all?
>>
>> FLEXEU has to be configured on all contexts but does not need the mux
>> configuration.
>>
> They have a sort of mux controlled through the EU_PERF_CNT_CTL*
> registers that have to be set up correctly for each counter to count the
> right event, which would certainly interfere with userspace using OA to
> gather EU metrics.
Maybe we're not talking about the same mux then :)
>
>> I think this feature would have to be shut off every time you end up using
>> OA from userspace, though.
>>
> Yeah, that's probably necessary one way or another.
>
>> -Lionel
>>
>>
>>> Regards,
>>> Francisco.
>>>
>>>> Srinivasan S (3):
>>>> drm/i915: Get active pending request for given context
>>>> drm/i915: set optimum eu/slice/sub-slice configuration based on load
>>>> type
>>>> drm/i915: Predictive governor to control slice/subslice/eu
>>>>
>>>> drivers/gpu/drm/i915/Makefile | 1 +
>>>> drivers/gpu/drm/i915/gem/i915_gem_context.c | 20 +++++
>>>> drivers/gpu/drm/i915/gem/i915_gem_context.h | 2 +
>>>> drivers/gpu/drm/i915/gem/i915_gem_context_types.h | 38 ++++++++
>>>> drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 1 +
>>>> drivers/gpu/drm/i915/gt/intel_deu.c | 104 ++++++++++++++++++++++
>>>> drivers/gpu/drm/i915/gt/intel_deu.h | 31 +++++++
>>>> drivers/gpu/drm/i915/gt/intel_lrc.c | 44 ++++++++-
>>>> drivers/gpu/drm/i915/i915_drv.h | 6 ++
>>>> drivers/gpu/drm/i915/i915_gem.c | 4 +
>>>> drivers/gpu/drm/i915/i915_params.c | 4 +
>>>> drivers/gpu/drm/i915/i915_params.h | 1 +
>>>> drivers/gpu/drm/i915/intel_device_info.c | 74 ++++++++++++++-
>>>> 13 files changed, 325 insertions(+), 5 deletions(-)
>>>> create mode 100644 drivers/gpu/drm/i915/gt/intel_deu.c
>>>> create mode 100644 drivers/gpu/drm/i915/gt/intel_deu.h
>>>>
>>>> --
>>>> 2.7.4
>>>>
>>>> _______________________________________________
>>>> Intel-gfx mailing list
>>>> Intel-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>>>>