[Intel-gfx] [PATCH] drm/i915: Resurrect ring kicking for semaphores, selectively

Thu Jan 10 18:43:34 CET 2013

On Thu, Jan 10, 2013 at 02:23:37AM +0000, Chris Wilson wrote:
> Once we thought we got semaphores working, we disabled kicking the ring
> if hangcheck fired whilst waiting upon a ring as it was doing more harm
> than good:
> 
> commit 4e0e90dcb8a7df1229c69e30abebb59b0b3c2a1f
> Author: Daniel Vetter <daniel.vetter at ffwll.ch>
> Date:   Wed Dec 14 13:56:58 2011 +0100
> 
>     drm/i915: kicking rings stuck on semaphores considered harmful
> 
> However, life is never that easy and semaphores are still causing
> problems whereby the value written by one ring (bcs) is not being
> propagated to the waiter (rcs). Thus the waiter never wakes up and we
> declare the GPU hung, which often has unfortunate consequences, even if
> we successfully reset the GPU.
> 
> But the GPU is idle as it has completed the work, just didn't notify its
> clients. So we can detect the incomplete wait during hang check and
> probe the target ring to see if has indeed emitted the breadcrumb seqno
> following the work and then and only then kick the waiter.
> 
> Based on a suggestion by Ben Widawsky.
> 
> References: https://bugs.freedesktop.org/show_bug.cgi?id=54226
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> Cc: Ben Widawsky <ben at bwidawsk.net>
> ---
>  drivers/gpu/drm/i915/i915_irq.c |   34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 39e4177..3cc52b2 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1693,6 +1693,32 @@ static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring, bool *err)
>  	return false;
>  }
>  
> +static bool semaphore_passed(struct intel_ring_buffer *waiter)
> +{
> +	struct drm_i915_private *dev_priv = waiter->dev->dev_private;
> +	int acthd = intel_ring_get_active_head(waiter) & HEAD_ADDR;
> +	struct intel_ring_buffer *signaller;
> +	u32 cmd;
> +
> +	/* ACTHD is likely pointing to the dword after the actual command,
> +	 * so scan backwards until we find the MBOX.
> +	 */
> +#define WAIT (MI_SEMAPHORE_MBOX | MI_SEMAPHORE_COMPARE | MI_SEMAPHORE_REGISTER)
> +	do {
> +		cmd = ioread32(waiter->virtual_start+acthd);
> +		if ((cmd & ~(0x3 << 16)) == WAIT)
> +			break;
> +		acthd -= 4;
> +		if (acthd < 0)
> +			return false;
> +	} while (1);
> +#undef WAIT
> +
> +	signaller = &dev_priv->ring[(waiter->id + (((cmd >> 16) & 3) - 1)) % 3];

I hate you for this line ^^

> +	return i915_seqno_passed(signaller->get_seqno(signaller, false),
> +				 ioread32(waiter->virtual_start+acthd+4)+1);
> +}
> +

For the sake of just testing if this makes the problem go away, this
could have been a lot simpler. Just check the RCS mbox (since in our
discussion, you said all the bugs had been RCS waiting).

i915_seqno_passed(bcs->get_seqno, GEN6_RBSYNC)

>  static bool kick_ring(struct intel_ring_buffer *ring)
>  {
>  	struct drm_device *dev = ring->dev;
> @@ -1704,6 +1730,14 @@ static bool kick_ring(struct intel_ring_buffer *ring)
>  		I915_WRITE_CTL(ring, tmp);
>  		return true;
>  	}
> +	if (INTEL_INFO(dev)->gen >= 6 &&
> +	    tmp & RING_WAIT_SEMAPHORE &&
> +	    semaphore_passed(ring)) {
> +		DRM_ERROR("Kicking stuck semaphore on %s\n",
> +			  ring->name);
> +		I915_WRITE_CTL(ring, tmp);
> +		return true;
> +	}
>  	return false;
>  }

I'd go with just gen == 6, for now. Also, since we know the waiter from
your horrifically complex calculation above, maybe we can get that info in
the DRM_ERROR as well?

On the idea:
Acked-by: Ben Widawsky <ben at bwidawsk.net>

I'll happily take time to turn it into an r-b if it fixes a bug.

-- 
Ben Widawsky, Intel Open Source Technology Center