[Intel-gfx] [PATCH 3/5] drm/i915: Increase busyspin limit before a context-switch
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Thu Aug 2 14:40:27 UTC 2018
On 28/07/2018 17:46, Chris Wilson wrote:
> Looking at the distribution of i915_wait_request for a set of GL
What was the set?
> benchmarks, we see:
>
> broadwell# python bcc/tools/funclatency.py -u i915_wait_request
>      usecs               : count     distribution
>          0 -> 1          : 29184    |****************************************|
>          2 -> 3          : 5767     |*******                                 |
>          4 -> 7          : 3000     |****                                    |
>          8 -> 15         : 491      |                                        |
>         16 -> 31         : 140      |                                        |
>         32 -> 63         : 203      |                                        |
>         64 -> 127        : 543      |                                        |
>        128 -> 255        : 881      |*                                       |
>        256 -> 511        : 1209     |*                                       |
>        512 -> 1023       : 1739     |**                                      |
>       1024 -> 2047       : 22855    |*******************************         |
>       2048 -> 4095       : 1725     |**                                      |
>       4096 -> 8191       : 5813     |*******                                 |
>       8192 -> 16383      : 5348     |*******                                 |
>      16384 -> 32767      : 1000     |*                                       |
>      32768 -> 65535      : 4400     |******                                  |
>      65536 -> 131071     : 296      |                                        |
>     131072 -> 262143     : 225      |                                        |
>     262144 -> 524287     : 4        |                                        |
>     524288 -> 1048575    : 1        |                                        |
>    1048576 -> 2097151    : 1        |                                        |
>    2097152 -> 4194303    : 1        |                                        |
>
> broxton# python bcc/tools/funclatency.py -u i915_wait_request
>      usecs               : count     distribution
>          0 -> 1          : 5523     |*************************************   |
>          2 -> 3          : 1340     |*********                               |
>          4 -> 7          : 2100     |**************                          |
>          8 -> 15         : 755      |*****                                   |
>         16 -> 31         : 211      |*                                       |
>         32 -> 63         : 53       |                                        |
>         64 -> 127        : 71       |                                        |
>        128 -> 255        : 113      |                                        |
>        256 -> 511        : 262      |*                                       |
>        512 -> 1023       : 358      |**                                      |
>       1024 -> 2047       : 1105     |*******                                 |
>       2048 -> 4095       : 848      |*****                                   |
>       4096 -> 8191       : 1295     |********                                |
>       8192 -> 16383      : 5894     |****************************************|
>      16384 -> 32767      : 4270     |****************************            |
>      32768 -> 65535      : 5622     |**************************************  |
>      65536 -> 131071     : 306      |**                                      |
>     131072 -> 262143     : 50       |                                        |
>     262144 -> 524287     : 76       |                                        |
>     524288 -> 1048575    : 34       |                                        |
>    1048576 -> 2097151    : 0        |                                        |
>    2097152 -> 4194303    : 1        |                                        |
>
> Picking 20us for the context-switch busyspin has the dual advantage of
> catching the most frequent short waits while avoiding the cost of a context
> switch. 20us is a typical latency of 2 context-switches, i.e. the cost
> of taking the sleep, without the secondary effects of cache flushing.
>
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Sagar Kamble <sagar.a.kamble at intel.com>
> Cc: Eero Tamminen <eero.t.tamminen at intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Cc: Ben Widawsky <ben at bwidawsk.net>
> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> Cc: Michał Winiarski <michal.winiarski at intel.com>
> ---
> drivers/gpu/drm/i915/Kconfig.profile | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> index 63cb744d920d..de394dea4a14 100644
> --- a/drivers/gpu/drm/i915/Kconfig.profile
> +++ b/drivers/gpu/drm/i915/Kconfig.profile
> @@ -14,7 +14,7 @@ config DRM_I915_SPIN_REQUEST_IRQ
>
> config DRM_I915_SPIN_REQUEST_CS
> int
> - default 2 # microseconds
> + default 20 # microseconds
> help
> After sleeping for a request (GPU operation) to complete, we will
> be woken up on the completion of every request prior to the one
>
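Just for context for anyone reading along: the value being bumped bounds a
spin-before-sleep wait, roughly like the standalone sketch below. This is a
simplified userspace illustration only, not the actual i915 wait path, and
all of the names in it are made up:

/*
 * Standalone sketch of the spin-before-sleep idea that the
 * DRM_I915_SPIN_REQUEST_CS budget tunes. Illustrative userspace code
 * only, not the i915 wait loop; all names here are invented.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static atomic_bool request_done; /* stands in for the request breadcrumb */

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Poll for up to spin_us microseconds before giving up. */
static bool busyspin(unsigned int spin_us)
{
	uint64_t deadline = now_ns() + (uint64_t)spin_us * 1000;

	do {
		if (atomic_load(&request_done))
			return true;
	} while (now_ns() < deadline);

	return false;
}

static void wait_for_request(unsigned int spin_us)
{
	/* Short spin first: catches the cheap sub-20us completions. */
	if (busyspin(spin_us))
		return;

	/* Otherwise pay for the sleep; the driver waits on the CS interrupt. */
	while (!atomic_load(&request_done)) {
		struct timespec backoff = { .tv_nsec = 100000 }; /* 100us */

		nanosleep(&backoff, NULL);
	}
}

int main(void)
{
	atomic_store(&request_done, true); /* pretend the GPU already finished */
	wait_for_request(20);              /* 20us budget, as in the patch */
	printf("request completed\n");
	return 0;
}

The histograms above are what tell us how long that first polling loop is
worth running before we give up and sleep.
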
I'd be more tempted to pick 10us given the histograms. It would avoid
wasting cycles on Broadwell and keep the majority of the benefit on Broxton.
However, it also raises the question: do we perhaps want to initialize this
per-platform at runtime? That would open up the possibility of auto-tuning
it, if the goal is to eliminate the low end of the histogram.
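Something along these lines is what I have in mind, purely as a sketch; the
enum, struct and per-platform values below are invented for illustration and
are not a proposal for the actual numbers:

/*
 * Sketch only: per-platform selection of the busyspin budget at runtime
 * instead of a single compile-time Kconfig default. All names and values
 * here are illustrative.
 */
enum example_platform { PLATFORM_BDW, PLATFORM_BXT, PLATFORM_OTHER };

struct spin_tuning {
	unsigned int spin_request_cs_us; /* busyspin budget in microseconds */
};

static unsigned int default_spin_request_cs_us(enum example_platform p)
{
	switch (p) {
	case PLATFORM_BDW:
		return 10; /* short waits dominate in the Broadwell histogram */
	case PLATFORM_BXT:
		return 20; /* Broxton sees longer waits before completion */
	default:
		return 2;  /* conservative fallback, the old default */
	}
}

static void init_spin_tuning(struct spin_tuning *t, enum example_platform p)
{
	t->spin_request_cs_us = default_spin_request_cs_us(p);
}

The values themselves would of course have to come from measurements rather
than from my guesses above.
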
Also, please add to the commit message what effect this has on the
benchmarks, in terms of performance and/or power (perf/Watt or similar).
Regards,
Tvrtko