[Intel-gfx] [PATCH 1/5] drm/i915: Expose the busyspin durations for i915_wait_request

Thu Aug 2 14:15:35 UTC 2018

On 28/07/2018 17:46, Chris Wilson wrote:
> An interesting discussion regarding "hybrid interrupt polling" for NVMe
> came to the conclusion that the ideal busyspin before sleeping was half
> of the expected request latency (and better if it was already halfway
> through that request). This suggested that we too should look again at
> our tradeoff between spinning and waiting. Currently, our spin simply
> tries to hide the cost of enabling the interrupt, which is good to avoid
> penalising nop requests (i.e. test throughput) and not much else.
> Studying real world workloads suggests that a spin of upto 500us can
> dramatically boost performance, but the suggestion is that this is not
> from avoiding interrupt latency per-se, but from secondary effects of
> sleeping such as allowing the CPU reduce cstate and context switch away.
> 
> In a truly hybrid interrupt polling scheme, we would aim to sleep until
> just before the request completed and then wake up in advance of the
> interrupt and do a quick poll to handle completion. This is tricky for
> ourselves at the moment as we are not recording request times, and since
> we allow preemption, our requests are not on as a nicely ordered
> timeline as IO. However, the idea is interesting, for it will certainly
> help us decide when busyspinning is worthwhile.
> 
> v2: Expose the spin setting via Kconfig options for easier adjustment
> and testing.
> v3: Don't get caught sneaking in a change to the busyspin parameters.
> v4: Explain more about the "hybrid interrupt polling" scheme that we
> want to migrate towards.
> 
> Suggested-by: Sagar Kamble <sagar.a.kamble at intel.com>
> References: http://events.linuxfoundation.org/sites/events/files/slides/lemoal-nvme-polling-vault-2017-final_0.pdf
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Sagar Kamble <sagar.a.kamble at intel.com>
> Cc: Eero Tamminen <eero.t.tamminen at intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Cc: Ben Widawsky <ben at bwidawsk.net>
> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> Cc: Michał Winiarski <michal.winiarski at intel.com>
> Reviewed-by: Sagar Kamble <sagar.a.kamble at intel.com>
> ---
>   drivers/gpu/drm/i915/Kconfig         |  6 +++++
>   drivers/gpu/drm/i915/Kconfig.profile | 26 +++++++++++++++++++
>   drivers/gpu/drm/i915/i915_request.c  | 39 +++++++++++++++++++++++++---
>   3 files changed, 67 insertions(+), 4 deletions(-)
>   create mode 100644 drivers/gpu/drm/i915/Kconfig.profile
> 
> diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
> index 5c607f2c707b..387613f29cb0 100644
> --- a/drivers/gpu/drm/i915/Kconfig
> +++ b/drivers/gpu/drm/i915/Kconfig
> @@ -132,3 +132,9 @@ depends on DRM_I915
>   depends on EXPERT
>   source drivers/gpu/drm/i915/Kconfig.debug
>   endmenu
> +
> +menu "drm/i915 Profile Guided Optimisation"

Or something like a more generic drm/i915 Manual Tuning Options so it 
sounds less like an automated thing where you can feed the results of 
something straight in?

> +	visible if EXPERT
> +	depends on DRM_I915
> +	source drivers/gpu/drm/i915/Kconfig.profile
> +endmenu
> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> new file mode 100644
> index 000000000000..8a230eeb98df
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/Kconfig.profile
> @@ -0,0 +1,26 @@
> +config DRM_I915_SPIN_REQUEST_IRQ
> +	int
> +	default 5 # microseconds
> +	help
> +	  Before sleeping waiting for a request (GPU operation) to complete,
> +	  we may spend some time polling for its completion. As the IRQ may
> +	  take a non-negligible time to setup, we do a short spin first to
> +	  check if the request will complete in the time it would have taken
> +	  us to enable the interrupt.
> +
> +	  May be 0 to disable the initial spin. In practice, we estimate
> +	  the cost of enabling the interrupt (if currently disabled) to be
> +	  a few microseconds.
> +
> +config DRM_I915_SPIN_REQUEST_CS
> +	int
> +	default 2 # microseconds
> +	help
> +	  After sleeping for a request (GPU operation) to complete, we will
> +	  be woken up on the completion of every request prior to the one
> +	  being waited on. For very short requests, going back to sleep and
> +	  be woken up again may add considerably to the wakeup latency. To
> +	  avoid incurring extra latency from the scheduler, we may choose to
> +	  spin prior to sleeping again.
> +
> +	  May be 0 to disable spinning after being woken.
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 5c2c93cbab12..ca25c53f8daa 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1336,8 +1336,32 @@ long i915_request_wait(struct i915_request *rq,
>   	GEM_BUG_ON(!intel_wait_has_seqno(&wait));
>   	GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
>   
> -	/* Optimistic short spin before touching IRQs */
> -	if (__i915_spin_request(rq, wait.seqno, state, 5))
> +	/*
> +	 * Optimistic spin before touching IRQs.
> +	 *
> +	 * We may use a rather large value here to offset the penalty of
> +	 * switching away from the active task. Frequently, the client will
> +	 * wait upon an old swapbuffer to throttle itself to remain within a
> +	 * frame of the gpu. If the client is running in lockstep with the gpu,
> +	 * then it should not be waiting long at all, and a sleep now will incur
> +	 * extra scheduler latency in producing the next frame. To try to
> +	 * avoid adding the cost of enabling/disabling the interrupt to the
> +	 * short wait, we first spin to see if the request would have completed
> +	 * in the time taken to setup the interrupt.
> +	 *
> +	 * We need upto 5us to enable the irq, and upto 20us to hide the
> +	 * scheduler latency of a context switch, ignoring the secondary
> +	 * impacts from a context switch such as cache eviction.
> +	 *
> +	 * The scheme used for low-latency IO is called "hybrid interrupt
> +	 * polling". The suggestion there is to sleep until just before you
> +	 * expect to be woken by the device interrupt and then poll for its
> +	 * completion. That requires having a good predictor for the request
> +	 * duration, which we currently lack.
> +	 */
> +	if (CONFIG_DRM_I915_SPIN_REQUEST_IRQ &&
> +	    __i915_spin_request(rq, wait.seqno, state,
> +				CONFIG_DRM_I915_SPIN_REQUEST_IRQ))
>   		goto complete;
>   
>   	set_current_state(state);
> @@ -1396,8 +1420,15 @@ long i915_request_wait(struct i915_request *rq,
>   		    __i915_wait_request_check_and_reset(rq))
>   			continue;
>   
> -		/* Only spin if we know the GPU is processing this request */
> -		if (__i915_spin_request(rq, wait.seqno, state, 2))
> +		/*
> +		 * A quick spin now we are on the CPU to offset the cost of
> +		 * context switching away (and so spin for roughly the same as
> +		 * the scheduler latency). We only spin if we know the GPU is
> +		 * processing this request, and so likely to finish shortly.
> +		 */
> +		if (CONFIG_DRM_I915_SPIN_REQUEST_CS &&
> +		    __i915_spin_request(rq, wait.seqno, state,
> +					CONFIG_DRM_I915_SPIN_REQUEST_CS))
>   			break;
>   
>   		if (!intel_wait_check_request(&wait, rq)) {
> 

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>

We could add PMU metrics to count hit/missed busy spins so profile 
guided bit becomes more direct?

Regards,

Tvrtko