[Intel-gfx] [PATCH 02/15] drm/i915: Limit the busy wait on requests to 10us not 10ms!

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Mon Nov 30 02:02:30 PST 2015



On 29/11/15 08:48, Chris Wilson wrote:
> When waiting for high frequency requests, the finite amount of time
> required to set up the irq and wait upon it limits the response rate. By
> busywaiting on the request completion for a short while we can service
> the high frequency waits as quickly as possible. However, if it is a slow
> request, we want to sleep as quickly as possible. The tradeoff between
> waiting and sleeping is roughly the time it takes to sleep on a request,
> on the order of a microsecond. Based on measurements of synchronous
> workloads from across big core and little atom, I have set the limit for
> busywaiting as 10 microseconds. In most of the synchronous cases, we can
> reduce the limit to as little as 2 microseconds, but that leaves
> quite a few test cases regressing by factors of 3 and more.
>
> The code currently uses the jiffie clock, but that is far too coarse (on
> the order of 10 milliseconds) and results in poor interactivity as the
> CPU ends up being hogged by slow requests. To get microsecond resolution
> we need to use a high resolution timer, the cheapest of which is polling
> local_clock(), but that is only valid on the same CPU. If we switch CPUs
> because the task was preempted, we can also use that as an indicator that
> the system is too busy to waste cycles on spinning and we should sleep
> instead.
>
> __i915_spin_request was introduced in
> commit 2def4ad99befa25775dd2f714fdd4d92faec6e34 [v4.2]
> Author: Chris Wilson <chris at chris-wilson.co.uk>
> Date:   Tue Apr 7 16:20:41 2015 +0100
>
>       drm/i915: Optimistically spin for the request completion
>
> v2: Drop full u64 for unsigned long - the timer is 32bit wraparound safe,
> so we can use native register sizes on smaller architectures. Mention
> the approximate microseconds units for elapsed time and add some extra
> comments describing the reason for busywaiting.
>
> v3: Raise the limit to 10us
>
> Reported-by: Jens Axboe <axboe at kernel.dk>
> Link: https://lkml.org/lkml/2015/11/12/621
> Cc; "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin at intel.com>
> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
> Cc: Eero Tamminen <eero.t.tamminen at intel.com>
> Cc: "Rantala, Valtteri" <valtteri.rantala at intel.com>
> Cc: stable at vger.kernel.org
> ---
>   drivers/gpu/drm/i915/i915_gem.c | 47 +++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 45 insertions(+), 2 deletions(-)
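
The shape of the change is easy to sketch outside the kernel: poll a
completion flag against a high resolution clock for a ~10us budget, and
treat running out of budget as the cue to fall back to the (far more
expensive) blocking wait. A rough userspace analogue - monotonic_us()
and spin_for_completion() below are made-up names for illustration, not
part of the patch:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Current monotonic time, in microseconds. */
static unsigned long monotonic_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000ul + ts.tv_nsec / 1000;
}

/*
 * Busywait on a completion flag for at most limit_us microseconds.
 * Returns true if the flag was observed set within the budget; false
 * means the caller should fall back to a blocking wait (the irq path
 * in the patch).
 */
static bool spin_for_completion(atomic_bool *done, unsigned long limit_us)
{
	unsigned long timeout = monotonic_us() + limit_us;

	while (!atomic_load_explicit(done, memory_order_acquire)) {
		if (monotonic_us() > timeout)
			return false;
	}

	return true;
}

int main(void)
{
	atomic_bool done = false;

	/* Nothing ever completes this, so the ~10us budget expires and
	 * the "go to sleep instead" path is taken. */
	if (!spin_for_completion(&done, 10))
		puts("budget expired, would sleep on the irq now");

	return 0;
}

__i915_spin_request() in the hunk below has the same structure, with the
extra bail-outs (need_resched(), pending signals, CPU migration) that
only matter in the kernel.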

Again,

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>

Regards,

Tvrtko

> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 87fc34f5899f..bad112abb16b 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -1146,14 +1146,57 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
>   	return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
>   }
>
> +static unsigned long local_clock_us(unsigned *cpu)
> +{
> +	unsigned long t;
> +
> +	/* Cheaply and approximately convert from nanoseconds to microseconds.
> +	 * The result and subsequent calculations are also defined in the same
> +	 * approximate microseconds units. The principal source of timing
> +	 * error here is from the simple truncation.
> +	 *
> +	 * Note that local_clock() is only defined wrt the current CPU;
> +	 * the comparisons are no longer valid if we switch CPUs. Instead of
> +	 * blocking preemption for the entire busywait, we can detect the CPU
> +	 * switch and use that as an indicator of system load and a reason
> +	 * stop busywaiting, see busywait_stop().
> +	 */
> +	*cpu = get_cpu();
> +	t = local_clock() >> 10;
> +	put_cpu();
> +
> +	return t;
> +}
> +
> +static bool busywait_stop(unsigned long timeout, unsigned cpu)
> +{
> +	unsigned this_cpu;
> +
> +	if (time_after(local_clock_us(&this_cpu), timeout))
> +		return true;
> +
> +	return this_cpu != cpu;
> +}
> +
>   static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
>   {
>   	unsigned long timeout;
> +	unsigned cpu;
> +
> +	/* When waiting for high frequency requests, e.g. during synchronous
> +	 * rendering split between the CPU and GPU, the finite amount of time
> +	 * required to set up the irq and wait upon it limits the response
> +	 * rate. By busywaiting on the request completion for a short while we
> +	 * can service the high frequency waits as quickly as possible. However,
> +	 * if it is a slow request, we want to sleep as quickly as possible.
> +	 * The tradeoff between waiting and sleeping is roughly the time it
> +	 * takes to sleep on a request, on the order of a microsecond.
> +	 */
>
>   	if (i915_gem_request_get_ring(req)->irq_refcount)
>   		return -EBUSY;
>
> -	timeout = jiffies + 1;
> +	timeout = local_clock_us(&cpu) + 10;
>   	while (!need_resched()) {
>   		if (i915_gem_request_completed(req, true))
>   			return 0;
> @@ -1161,7 +1204,7 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
>   		if (signal_pending_state(state, current))
>   			break;
>
> -		if (time_after_eq(jiffies, timeout))
> +		if (busywait_stop(timeout, cpu))
>   			break;
>
>   		cpu_relax_lowlatency();
>
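
On the v2/v3 points about units and wraparound, two details worth
spelling out: local_clock_us() divides nanoseconds by 1024 (>> 10)
rather than 1000, so each approximate "microsecond" is really 1.024us -
harmless against a 10us budget - and time_after() stays correct across
an unsigned long wrap because it is an unsigned subtraction interpreted
as a signed value. A quick standalone check (time_after_ul() here is a
local stand-in for the kernel macro, not kernel code):

#include <stdio.h>
#include <time.h>

/* Stand-in for the kernel's time_after(): true if a is after b,
 * correct even when the counter has wrapped around. */
#define time_after_ul(a, b)	((long)((b) - (a)) < 0)

int main(void)
{
	struct timespec ts;
	unsigned long ns, approx_us, exact_us;
	unsigned long timeout, now;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	ns = ts.tv_sec * 1000000000ul + ts.tv_nsec;

	approx_us = ns >> 10;	/* the local_clock_us() approximation */
	exact_us = ns / 1000;	/* ~2.4% larger than approx_us */
	printf("approx %lu us, exact %lu us\n", approx_us, exact_us);

	/* A timeout taken just before the counter wraps is still seen
	 * as expired just after the wrap. */
	timeout = (unsigned long)-5;
	now = 3;
	printf("expired across wrap: %d\n", time_after_ul(now, timeout));

	return 0;
}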

