[Intel-gfx] [PATCH 4/5] drm/i915: Disable semaphore busywaits on saturated systems
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Fri May 3 13:25:47 UTC 2019
On 29/04/2019 19:00, Chris Wilson wrote:
> Asking the GPU to busywait on a memory address (perhaps not
> unexpectedly, in hindsight, for a shared memory system) leads to bus
> contention that affects CPU programs trying to concurrently access
> memory. This can manifest as a drop in transcode throughput on highly
> over-saturated workloads.
>
> The only clue offered by perf is that the bus-cycles (perf stat -e
> bus-cycles) jumped by 50% when enabling semaphores. This corresponds
> with extra CPU active cycles being attributed to intel_idle's mwait.
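
For anyone following along, the busywait in question is the
MI_SEMAPHORE_WAIT poll we emit for gen8+ semaphores. Roughly, as an
untested sketch condensed from emit_semaphore_wait() (not part of this
patch; hwsp_offset stands in for the computed HWSP address):

        /*
         * GPU-side busywait: keep re-reading the signaler's HWSP until
         * the seqno stored there reaches the awaited value. Every
         * re-read is a memory transaction competing with the CPU for
         * bus bandwidth.
         */
        *cs++ = MI_SEMAPHORE_WAIT |
                MI_SEMAPHORE_GLOBAL_GTT |
                MI_SEMAPHORE_POLL |             /* poll the address... */
                MI_SEMAPHORE_SAD_GTE_SDD;       /* ...until *addr >= seqno */
        *cs++ = from->fence.seqno;
        *cs++ = hwsp_offset;
        *cs++ = 0;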
>
> This patch introduces a heuristic to try and detect when more than one
> client is submitting to the GPU, pushing it into an oversaturated
> state. As we already keep track of when the semaphores are signaled, we
> can inspect their state on submitting the busywait batch and, if we
> planned to use a semaphore but were too late, conclude that the GPU is
> overloaded and stop using semaphores for future requests. In practice,
> this means we optimistically try to use semaphores for the first frame
> of a transcode job split over multiple engines, fail if there are
> multiple clients active, and continue not to use semaphores for the
> subsequent frames in the sequence. Periodically, we optimistically try
> to switch semaphores back on whenever the client waits to catch up with
> the transcode results.
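
If I read the hunks below correctly, the heuristic boils down to three
touch points (paraphrasing the diff, with ce == rq->hw_context):

        /* __i915_request_submit(): the signaler had already completed,
         * so the semaphore would only have spun; mark us saturated. */
        if (rq->sched.semaphores && i915_sw_fence_signaled(&rq->semaphore))
                ce->saturated |= rq->sched.semaphores;

        /* emit_semaphore_wait(): while saturated, skip the busywait and
         * fall back to a plain fence wait on the CPU side. */
        if ((to->sched.semaphores | ce->saturated) & from->engine->mask)
                return i915_sw_fence_await_dma_fence(&to->submit,
                                                     &from->fence, 0,
                                                     I915_FENCE_GFP);

        /* intel_context_exit_engine(): idle again, so optimistically
         * re-arm semaphores for the next submission. */
        ce->saturated = 0;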
>
> With 1 client, on Broxton J3455, with the relative fps normalized by %cpu:
>
> x no semaphores
> + drm-tip
> * throttle
> +------------------------------------------------------------------------+
> | * |
> | *+ |
> | **+ |
> | **+ x |
> | x * +**+ x |
> | x x * * +***x xx |
> | x x * * *+***x *x |
> | x x* + * * *****x *x x |
> | + x xx+x* + *** * ********* x * |
> | + x xx+x* * *** +** ********* xx * |
> | * + ++++* + x*x****+*+* ***+*************+x* * |
> |*+ +** *+ + +* + *++****** *xxx**********x***+*****************+*++ *|
> | |__________A_____M_____| |
> | |_______________A____M_________| |
> | |____________A___M________| |
> +------------------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x 120       2.60475       3.50941       3.31123     3.2143953    0.21117399
> + 120        2.3826       3.57077       3.25101     3.1414161    0.28146407
> Difference at 95.0% confidence
>         -0.0729792 +/- 0.0629585
>         -2.27039% +/- 1.95864%
>         (Student's t, pooled s = 0.248814)
> * 120       2.35536       3.66713        3.2849     3.2059917    0.24618565
> No difference proven at 95.0% confidence
>
> With 10 clients over-saturating the pipeline:
>
> x no semaphores
> + drm-tip
> * patch
> +------------------------------------------------------------------------+
> | ++ ** |
> | ++ ** |
> | ++ ** |
> | ++ ** |
> | ++ xx *** |
> | ++ xx *** |
> | ++ xxx*** |
> | ++ xxx*** |
> | +++ xxx*** |
> | +++ xx**** |
> | +++ xx**** |
> | +++ xx**** |
> | +++ xx**** |
> | ++++ xx**** |
> | +++++ xx**** |
> | +++++ x x****** |
> | ++++++ xxx******* |
> | ++++++ xxx******* |
> | ++++++ xxx******* |
> | ++++++ xx******** |
> | ++++++ xxxx******** |
> | ++++++ xxxx******** |
> | ++++++++ xxxxx********* |
> |+ + + + ++++++++ xxx*xx**********x* *|
> | |__A__| |
> | |__AM__| |
> | |__A_| |
> +------------------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x 120       2.47855        2.8972       2.72376     2.7193402   0.074604933
> + 120       1.17367       1.77459       1.71977     1.6966782   0.085850697
> Difference at 95.0% confidence
>         -1.02266 +/- 0.0203502
>         -37.607% +/- 0.748352%
>         (Student's t, pooled s = 0.0804246)
> * 120       2.57868       3.00821       2.80142     2.7923878   0.058646477
> Difference at 95.0% confidence
>         0.0730476 +/- 0.0169791
>         2.68622% +/- 0.624383%
>         (Student's t, pooled s = 0.0671018)
>
> This indicates that we have recovered the regression from enabling
> semaphores on this saturated setup, with a hint towards an overall
> improvement.
>
> Very similar results, though of smaller magnitude, are observed on both
> Skylake (gt2) and Kabylake (gt4). This may be due to the reduced impact
> of bus-cycles: where we see a 50% hit on Broxton, it is only 10% on the
> big core, in this particular test.
>
> One observation to make here is that for a greedy client trying to
> maximise its own throughput, using semaphores is the right choice. It
> is only in the holistic system-wide view, where the semaphores of one
> client impact another and reduce the overall throughput, that we would
> choose to disable semaphores.
>
> The most noticeable negative impact this has is on the no-op
> microbenchmarks, which are also very notable for having no CPU bus
> load. In particular, this increases the runtime and energy consumption
> of gem_exec_whisper.
>
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin at intel.com>
> Cc: Dmitry Ermilov <dmitry.ermilov at intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_context.c       |  2 ++
>  drivers/gpu/drm/i915/gt/intel_context_types.h |  3 ++
>  drivers/gpu/drm/i915/i915_request.c           | 28 ++++++++++++++++++-
>  3 files changed, 32 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index 1f1761fc6597..5b31e1e05ddd 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -116,6 +116,7 @@ intel_context_init(struct intel_context *ce,
>  	ce->engine = engine;
>  	ce->ops = engine->cops;
>  	ce->sseu = engine->sseu;
> +	ce->saturated = 0;
>  
>  	INIT_LIST_HEAD(&ce->signal_link);
>  	INIT_LIST_HEAD(&ce->signals);
> @@ -158,6 +159,7 @@ void intel_context_enter_engine(struct intel_context *ce)
>  
>  void intel_context_exit_engine(struct intel_context *ce)
>  {
> +	ce->saturated = 0;
>  	intel_engine_pm_put(ce->engine);
>  }
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index d5a7dbd0daee..963a312430e6 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -13,6 +13,7 @@
>  #include <linux/types.h>
>  
>  #include "i915_active_types.h"
> +#include "intel_engine_types.h"
>  #include "intel_sseu.h"
>  
>  struct i915_gem_context;
> @@ -52,6 +53,8 @@ struct intel_context {
>  	atomic_t pin_count;
>  	struct mutex pin_mutex; /* guards pinning and associated on-gpuing */
>  
> +	intel_engine_mask_t saturated; /* submitting semaphores too late? */
> +
>  	/**
>  	 * active_tracker: Active tracker for the external rq activity
>  	 * on this intel_context object.
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 8cb3ed5531e3..2d429967f403 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -410,6 +410,26 @@ void __i915_request_submit(struct i915_request *request)
>  	if (i915_gem_context_is_banned(request->gem_context))
>  		i915_request_skip(request, -EIO);
>  
> +	/*
> +	 * Are we using semaphores when the gpu is already saturated?
> +	 *
> +	 * Using semaphores incurs a cost in having the GPU poll a
> +	 * memory location, busywaiting for it to change. The continual
> +	 * memory reads can have a noticeable impact on the rest of the
> +	 * system with the extra bus traffic, stalling the cpu as it too
> +	 * tries to access memory across the bus (perf stat -e bus-cycles).
> +	 *
> +	 * If we installed a semaphore on this request and we only submit
> +	 * the request after the signaler completed, that indicates the
> +	 * system is overloaded and using semaphores at this time only
> +	 * increases the amount of work we are doing. If so, we disable
> +	 * further use of semaphores until we are idle again, whence we
> +	 * optimistically try again.
> +	 */
> +	if (request->sched.semaphores &&
> +	    i915_sw_fence_signaled(&request->semaphore))
> +		request->hw_context->saturated |= request->sched.semaphores;
> +
>  	/* We may be recursing from the signal callback of another i915 fence */
>  	spin_lock_nested(&request->lock, SINGLE_DEPTH_NESTING);
>  
> @@ -785,6 +805,12 @@ i915_request_await_start(struct i915_request *rq, struct i915_request *signal)
>  					     I915_FENCE_GFP);
>  }
>  
> +static intel_engine_mask_t
> +already_busywaiting(struct i915_request *rq)
> +{
> +	return rq->sched.semaphores | rq->hw_context->saturated;
> +}
> +
>  static int
>  emit_semaphore_wait(struct i915_request *to,
>  		    struct i915_request *from,
> @@ -798,7 +824,7 @@ emit_semaphore_wait(struct i915_request *to,
>  	GEM_BUG_ON(INTEL_GEN(to->i915) < 8);
>  
>  	/* Just emit the first semaphore we see as request space is limited. */
> -	if (to->sched.semaphores & from->engine->mask)
> +	if (already_busywaiting(to) & from->engine->mask)
>  		return i915_sw_fence_await_dma_fence(&to->submit,
>  						     &from->fence, 0,
>  						     I915_FENCE_GFP);
>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
Regards,
Tvrtko