[Intel-gfx] [PATCH 4/5] drm/i915: Disable semaphore busywaits on saturated systems
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Fri May 3 13:25:47 UTC 2019
On 29/04/2019 19:00, Chris Wilson wrote:
> Asking the GPU to busywait on a memory address (perhaps not
> unexpectedly, in hindsight, for a shared memory system) leads to bus
> contention that affects CPU programs trying to concurrently access
> memory. This can manifest as a drop in transcode throughput on highly
> over-saturated workloads.
>
> The only clue offered by perf is that the bus-cycles (perf stat -e
> bus-cycles) jumped by 50% when enabling semaphores. This corresponds
> with extra CPU active cycles being attributed to intel_idle's mwait.
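
For anyone following along, the busywait in question is the
MI_SEMAPHORE_WAIT poll we emit for gen8+ semaphores. Roughly, as an
untested sketch condensed from emit_semaphore_wait() (not part of this
patch; hwsp_offset stands in for the computed HWSP address):

        /*
         * GPU-side busywait: keep re-reading the signaler's HWSP until
         * the seqno stored there reaches the awaited value. Every
         * re-read is a memory transaction competing with the CPU for
         * bus bandwidth.
         */
        *cs++ = MI_SEMAPHORE_WAIT |
                MI_SEMAPHORE_GLOBAL_GTT |
                MI_SEMAPHORE_POLL |             /* poll the address... */
                MI_SEMAPHORE_SAD_GTE_SDD;       /* ...until *addr >= seqno */
        *cs++ = from->fence.seqno;
        *cs++ = hwsp_offset;
        *cs++ = 0;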
>
> This patch introduces a heuristic to try and detect when more than one
> client is submitting to the GPU, pushing it into an oversaturated
> state. As we already keep track of when the semaphores are signaled, we
> can inspect their state on submitting the busywait batch and, if we
> planned to use a semaphore but were too late, conclude that the GPU is
> overloaded and stop using semaphores for future requests. In practice,
> this means we optimistically try to use semaphores for the first frame
> of a transcode job split over multiple engines, fail if there are
> multiple clients active, and continue not to use semaphores for the
> subsequent frames in the sequence. Periodically, we optimistically try
> to switch semaphores back on whenever the client waits to catch up with
> the transcode results.
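
If I read the hunks below correctly, the heuristic boils down to three
touch points (paraphrasing the diff, with ce == rq->hw_context):

        /* __i915_request_submit(): the signaler had already completed,
         * so the semaphore would only have spun; mark us saturated. */
        if (rq->sched.semaphores && i915_sw_fence_signaled(&rq->semaphore))
                ce->saturated |= rq->sched.semaphores;

        /* emit_semaphore_wait(): while saturated, skip the busywait and
         * fall back to a plain fence wait on the CPU side. */
        if ((to->sched.semaphores | ce->saturated) & from->engine->mask)
                return i915_sw_fence_await_dma_fence(&to->submit,
                                                     &from->fence, 0,
                                                     I915_FENCE_GFP);

        /* intel_context_exit_engine(): idle again, so optimistically
         * re-arm semaphores for the next submission. */
        ce->saturated = 0;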
>
> With 1 client, on Broxton J3455, with the relative fps normalized by %cpu:
>
> x no semaphores
> + drm-tip
> * throttle
> +------------------------------------------------------------------------+
> | * |
> | *+ |
> | **+ |
> | **+ x |
> | x * +**+ x |
> | x x * * +***x xx |
> | x x * * *+***x *x |
> | x x* + * * *****x *x x |
> | + x xx+x* + *** * ********* x * |
> | + x xx+x* * *** +** ********* xx * |
> | * + ++++* + x*x****+*+* ***+*************+x* * |
> |*+ +** *+ + +* + *++****** *xxx**********x***+*****************+*++ *|
> | |__________A_____M_____| |
> | |_______________A____M_________| |
> | |____________A___M________| |
> +------------------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x 120       2.60475       3.50941       3.31123     3.2143953    0.21117399
> + 120        2.3826       3.57077       3.25101     3.1414161    0.28146407
> Difference at 95.0% confidence
>         -0.0729792 +/- 0.0629585
>         -2.27039% +/- 1.95864%
>         (Student's t, pooled s = 0.248814)
> * 120       2.35536       3.66713        3.2849     3.2059917    0.24618565
> No difference proven at 95.0% confidence
>
> With 10 clients over-saturating the pipeline:
>
> x no semaphores
> + drm-tip
> * patch
> +------------------------------------------------------------------------+
> | ++ ** |
> | ++ ** |
> | ++ ** |
> | ++ ** |
> | ++ xx *** |
> | ++ xx *** |
> | ++ xxx*** |
> | ++ xxx*** |
> | +++ xxx*** |
> | +++ xx**** |
> | +++ xx**** |
> | +++ xx**** |
> | +++ xx**** |
> | ++++ xx**** |
> | +++++ xx**** |
> | +++++ x x****** |
> | ++++++ xxx******* |
> | ++++++ xxx******* |
> | ++++++ xxx******* |
> | ++++++ xx******** |
> | ++++++ xxxx******** |
> | ++++++ xxxx******** |
> | ++++++++ xxxxx********* |
> |+ + + + ++++++++ xxx*xx**********x* *|
> | |__A__| |
> | |__AM__| |
> | |__A_| |
> +------------------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x 120       2.47855        2.8972       2.72376     2.7193402   0.074604933
> + 120       1.17367       1.77459       1.71977     1.6966782   0.085850697
> Difference at 95.0% confidence
>         -1.02266 +/- 0.0203502
>         -37.607% +/- 0.748352%
>         (Student's t, pooled s = 0.0804246)
> * 120       2.57868       3.00821       2.80142     2.7923878   0.058646477
> Difference at 95.0% confidence
>         0.0730476 +/- 0.0169791
>         2.68622% +/- 0.624383%
>         (Student's t, pooled s = 0.0671018)
>
> This indicates that we have recovered the regression from enabling
> semaphores on this saturated setup, with a hint towards an overall
> improvement.
>
> Very similar results, though of smaller magnitude, are observed on both
> Skylake (gt2) and Kabylake (gt4). This may be due to the reduced impact
> of bus-cycles: where we see a 50% hit on Broxton, it is only 10% on the
> big core, in this particular test.
>
> One observation to make here is that for a greedy client trying to
> maximise its own throughput, using semaphores is the right choice. It
> is only in the holistic system-wide view, where the semaphores of one
> client impact another and reduce the overall throughput, that we would
> choose to disable semaphores.
>
> The most noticeable negative impact this has is on the no-op
> microbenchmarks, which are also very notable for having no CPU bus
> load. In particular, this increases the runtime and energy consumption
> of gem_exec_whisper.
>
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin at intel.com>
> Cc: Dmitry Ermilov <dmitry.ermilov at intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_context.c       |  2 ++
>  drivers/gpu/drm/i915/gt/intel_context_types.h |  3 ++
>  drivers/gpu/drm/i915/i915_request.c           | 28 ++++++++++++++++++-
>  3 files changed, 32 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index 1f1761fc6597..5b31e1e05ddd 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -116,6 +116,7 @@ intel_context_init(struct intel_context *ce,
>  	ce->engine = engine;
>  	ce->ops = engine->cops;
>  	ce->sseu = engine->sseu;
> +	ce->saturated = 0;
>  
>  	INIT_LIST_HEAD(&ce->signal_link);
>  	INIT_LIST_HEAD(&ce->signals);
> @@ -158,6 +159,7 @@ void intel_context_enter_engine(struct intel_context *ce)
>  
>  void intel_context_exit_engine(struct intel_context *ce)
>  {
> +	ce->saturated = 0;
>  	intel_engine_pm_put(ce->engine);
>  }
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index d5a7dbd0daee..963a312430e6 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -13,6 +13,7 @@
>  #include <linux/types.h>
>  
>  #include "i915_active_types.h"
> +#include "intel_engine_types.h"
>  #include "intel_sseu.h"
>  
>  struct i915_gem_context;
> @@ -52,6 +53,8 @@ struct intel_context {
>  	atomic_t pin_count;
>  	struct mutex pin_mutex; /* guards pinning and associated on-gpuing */
>  
> +	intel_engine_mask_t saturated; /* submitting semaphores too late? */
> +
>  	/**
>  	 * active_tracker: Active tracker for the external rq activity
>  	 * on this intel_context object.
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 8cb3ed5531e3..2d429967f403 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -410,6 +410,26 @@ void __i915_request_submit(struct i915_request *request)
>  	if (i915_gem_context_is_banned(request->gem_context))
>  		i915_request_skip(request, -EIO);
>  
> +	/*
> +	 * Are we using semaphores when the gpu is already saturated?
> +	 *
> +	 * Using semaphores incurs a cost in having the GPU poll a
> +	 * memory location, busywaiting for it to change. The continual
> +	 * memory reads can have a noticeable impact on the rest of the
> +	 * system with the extra bus traffic, stalling the cpu as it too
> +	 * tries to access memory across the bus (perf stat -e bus-cycles).
> +	 *
> +	 * If we installed a semaphore on this request and we only submit
> +	 * the request after the signaler completed, that indicates the
> +	 * system is overloaded and using semaphores at this time only
> +	 * increases the amount of work we are doing. If so, we disable
> +	 * further use of semaphores until we are idle again, whence we
> +	 * optimistically try again.
> +	 */
> +	if (request->sched.semaphores &&
> +	    i915_sw_fence_signaled(&request->semaphore))
> +		request->hw_context->saturated |= request->sched.semaphores;
> +
>  	/* We may be recursing from the signal callback of another i915 fence */
>  	spin_lock_nested(&request->lock, SINGLE_DEPTH_NESTING);
>  
> @@ -785,6 +805,12 @@ i915_request_await_start(struct i915_request *rq, struct i915_request *signal)
>  					     I915_FENCE_GFP);
>  }
>  
> +static intel_engine_mask_t
> +already_busywaiting(struct i915_request *rq)
> +{
> +	return rq->sched.semaphores | rq->hw_context->saturated;
> +}
> +
>  static int
>  emit_semaphore_wait(struct i915_request *to,
>  		    struct i915_request *from,
> @@ -798,7 +824,7 @@ emit_semaphore_wait(struct i915_request *to,
>  	GEM_BUG_ON(INTEL_GEN(to->i915) < 8);
>  
>  	/* Just emit the first semaphore we see as request space is limited. */
> -	if (to->sched.semaphores & from->engine->mask)
> +	if (already_busywaiting(to) & from->engine->mask)
>  		return i915_sw_fence_await_dma_fence(&to->submit,
>  						     &from->fence, 0,
>  						     I915_FENCE_GFP);
>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
Regards,
Tvrtko