[Intel-gfx] [PATCH 05/20] drm/i915: Extending i915_gem_check_wedge to check engine reset in progress

Wed Jan 13 12:49:39 PST 2016

On Wed, Jan 13, 2016 at 05:28:17PM +0000, Arun Siluvery wrote:
> From: Tomas Elf <tomas.elf at intel.com>
> 
> i915_gem_wedge now returns a non-zero result in three different cases:
> 
> 1. Legacy: A hang has been detected and full GPU reset is in progress.
> 
> 2. Per-engine recovery:
> 
> 	a. A single engine reference can be passed to the function, in which
> 	case only that engine will be checked. If that particular engine is
> 	detected to be hung and is to be reset this will yield a non-zero
> 	result but not if reset is in progress for any other engine.
> 
> 	b. No engine reference is passed to the function, in which case all
> 	engines are checked for ongoing per-engine hang recovery.
> 
> Also, i915_wait_request was updated to take advantage of this new
> functionality. This is important since the TDR hang recovery mechanism needs a
> way to force waiting threads that hold the struct_mutex to give up the
> struct_mutex and try again after the hang recovery has completed. If
> i915_wait_request does not take per-engine hang recovery into account there is
> no way for a waiting thread to know that a per-engine recovery is about to
> happen and that it needs to back off.
> 
> Signed-off-by: Tomas Elf <tomas.elf at intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery at intel.com>
> Signed-off-by: Ian Lister <ian.lister at intel.com>
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h         |  3 +-
>  drivers/gpu/drm/i915/i915_gem.c         | 60 +++++++++++++++++++++++++++------
>  drivers/gpu/drm/i915/intel_lrc.c        |  4 +--
>  drivers/gpu/drm/i915/intel_ringbuffer.c |  4 ++-
>  4 files changed, 56 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 85cf692..5be7d3e 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -3033,7 +3033,8 @@ i915_gem_find_active_request(struct intel_engine_cs *ring);
>  
>  bool i915_gem_retire_requests(struct drm_device *dev);
>  void i915_gem_retire_requests_ring(struct intel_engine_cs *ring);
> -int __must_check i915_gem_check_wedge(struct i915_gpu_error *error,
> +int __must_check i915_gem_check_wedge(struct drm_i915_private *dev_priv,
> +				      struct intel_engine_cs *engine,
>  				      bool interruptible);
>  
>  static inline bool i915_reset_in_progress(struct i915_gpu_error *error)
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index e3cfed2..e6eb45d 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -80,12 +80,38 @@ static void i915_gem_info_remove_obj(struct drm_i915_private *dev_priv,
>  	spin_unlock(&dev_priv->mm.object_stat_lock);
>  }
>  
> +static inline int
> +i915_engine_reset_in_progress(struct drm_i915_private *dev_priv,
> +	struct intel_engine_cs *engine)
> +{
> +	int ret = 0;
> +
> +	if (engine) {
> +		ret = !!(atomic_read(&dev_priv->ring[engine->id].hangcheck.flags)
> +			& I915_ENGINE_RESET_IN_PROGRESS);
> +	} else {
> +		int i;
> +
> +		for (i = 0; i < I915_NUM_RINGS; i++)
> +			if (atomic_read(&dev_priv->ring[i].hangcheck.flags)
> +				& I915_ENGINE_RESET_IN_PROGRESS) {
> +
> +				ret = 1;
> +				break;
> +			}
> +	}

Since this side will be called far more often than the writer, could you
not make this more convenient for the reader and move it to a global set
of flags in dev_priv->gpu_error?

To avoid regressing on the EIO front, the waiter sequence should look
like

if (req->reset_counter != i915_reset_counter(&req->i915->gpu_error))
	return 0;

if (flags & LOCKED && i915_engine_reset_in_process(&req->i915->gpu_error, req->engine))
	return -EAGAIN;

Oh, and don't add a second boolean to __i915_wait_request, just
transform the first into flags.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre