[Intel-gfx] [PATCH v3] drm/i915: Move CSB MMIO reads out of the execlists lock

Chris Wilson chris at chris-wilson.co.uk
Thu Mar 17 13:14:41 UTC 2016


On Thu, Mar 17, 2016 at 12:59:46PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> 
> By reading the CSB (slow MMIO accesses) into a temporary local
> buffer we can shorten the time for which the execlist lock is
> held.
> 
> Main advantage is that during heavy batch buffer submission we
> reduce the execlist lock contention, which should decrease the
> latency and CPU usage between the submitting userspace process
> and interrupt handling.
> 
> Downside is that we need to grab and release the forcewake twice,
> but as the numbers below show this is completely hidden by the
> primary gains.
> 
> Testing with "gem_latency -n 100" (submit batch buffers with a
> hundred nops each) shows more than a doubling of the throughput
> and more than a halving of the dispatch latency, overall latency
> and CPU time spent in the submitting process.
> 
> Submitting empty batches ("gem_latency -n 0") does not seem
> significantly affected by this change, with throughput and CPU
> time improving by half a percent and overall latency worsening
> by the same amount.
> 
> The above tests were done over a hundred runs on a big core Broadwell.
> 
> v2:
>   * Overflow protection to local CSB buffer.
>   * Use closer dev_priv in execlists_submit_requests. (Chris Wilson)
> 
> v3: Rebase.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 77 ++++++++++++++++++++--------------------
>  1 file changed, 38 insertions(+), 39 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index f72782200226..c6b3a9d19af2 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -416,15 +416,23 @@ static void execlists_update_context(struct drm_i915_gem_request *rq)
>  static void execlists_submit_requests(struct drm_i915_gem_request *rq0,
>  				      struct drm_i915_gem_request *rq1)
>  {
> +	struct drm_i915_private *dev_priv = rq0->i915;
> +
>  	execlists_update_context(rq0);
>  
>  	if (rq1)
>  		execlists_update_context(rq1);
>  
> +	spin_lock(&dev_priv->uncore.lock);
> +	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
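
For reference, the overall shape the patch gives the CSB handling is
roughly the following. This is a condensed sketch, not the actual hunk:
the uncore lock and forcewake calls mirror the quoted
execlists_submit_requests() hunk, while csb_entries_pending(),
read_csb_dword0/1() and process_csb_event() are placeholder names
standing in for the real register reads and bookkeeping, using the
handler's existing dev_priv/ring locals.

	u32 csb[GEN8_CSB_ENTRIES][2];
	unsigned int count = 0, i;

	/*
	 * Phase 1: drain the context status buffer. The slow MMIO reads
	 * happen under the uncore lock + forcewake, but not under the
	 * execlist lock.
	 */
	spin_lock(&dev_priv->uncore.lock);
	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);

	while (csb_entries_pending(ring) && count < GEN8_CSB_ENTRIES) {
		csb[count][0] = read_csb_dword0(ring, count); /* placeholder */
		csb[count][1] = read_csb_dword1(ring, count); /* placeholder */
		count++;
	}

	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
	spin_unlock(&dev_priv->uncore.lock);

	/*
	 * Phase 2: only the cheap bookkeeping on the local copy runs
	 * under the contended execlist lock.
	 */
	spin_lock(&ring->execlist_lock);
	for (i = 0; i < count; i++)
		process_csb_event(ring, csb[i][0], csb[i][1]); /* placeholder */
	spin_unlock(&ring->execlist_lock);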

My only problem with this is that it puts the onus on the caller to
remember to disable irqs first. (That would be a silent conflict with
the patch to move the submission to a kthread for example.)
What's the cost of being upfront with spin_lock_irqsave() here?
/* BUG_ON(!irqs_disabled());  */ ?
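
Spelled out, the two options would be something like (again only a
sketch, with the same caveats as above):

	/* Option A: be explicit about the irq state and pay for the save/restore. */
	unsigned long flags;

	spin_lock_irqsave(&dev_priv->uncore.lock, flags);
	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
	/* ... CSB reads ... */
	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
	spin_unlock_irqrestore(&dev_priv->uncore.lock, flags);

	/* Option B: keep the plain spin_lock() but assert/document the context. */
	WARN_ON_ONCE(!irqs_disabled());
	spin_lock(&dev_priv->uncore.lock);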

I'm quite happy to go with the comment here, and since it doesn't make
the lockups any worse,
Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

