[Intel-gfx] [RFC 05/11] drm/i915: Extending i915_gem_check_wedge to check engine reset in progress

Tue Jun 9 04:11:06 PDT 2015

On Mon, Jun 08, 2015 at 06:03:23PM +0100, Tomas Elf wrote:
> @@ -1239,11 +1271,17 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
>  
>  		/* We need to check whether any gpu reset happened in between
>  		 * the caller grabbing the seqno and now ... */
> -		if (reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) {
> +		reset_in_progress =
> +			i915_gem_check_wedge(ring->dev->dev_private, NULL, interruptible);
> +
> +		if ((reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) ||
> +		     reset_in_progress) {
> +
>  			/* ... but upgrade the -EAGAIN to an -EIO if the gpu
>  			 * is truely gone. */
> -			ret = i915_gem_check_wedge(&dev_priv->gpu_error, interruptible);
> -			if (ret == 0)
> +			if (reset_in_progress)
> +				ret = reset_in_progress;
> +			else
>  				ret = -EAGAIN;

I've had a bit more time to digest, anything that touches this piece of
code makes me recoil in horror as, as it currently stands, it is already
a buggy piece of code that returns an error in cases where the caller
cannot possibly tolerate an error (and in circumstances where it is not
actually an error at all). This leads to unfortunates WARNs and SIGBUS
which I have been long arguing against and trying to get fixed for about
5 years.

Obviously we need the EAGAIN in order to do the struct_mutex backoff in
the caller, but there are a class of non-blocking __i915_wait_request
users that can just wait until the reset is resolved. Having that
support would simplify a number of cases. What scares me most though is
the possiblity of an EIO being reported here, those are almost entirely
wrong (if the GPU is hung, the request should be aborted and any waits
automatically completed). I'd like to be sure that TDR doesn't make EIO
handling any worse...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre