[Intel-gfx] [PATCH 2/5] drm/i915: don't bail out of intel_wait_ring_buffer too early

Ben Widawsky ben at bwidawsk.net
Mon Oct 31 02:29:13 CET 2011


On Sun, 30 Oct 2011 20:12:09 +0100
Daniel Vetter <daniel.vetter at ffwll.ch> wrote:

> In the pre-gem days with non-existing hangcheck and gpu reset code,
> this timeout of 3 seconds was pretty important to avoid stuck
> processes.
> 
> But now we have the hangcheck code in gem that goes to great length
> to ensure that the gpu is really dead before declaring it wedged.
> 
> So there's no need for this timeout anymore. Actually it's even harmful
> because we can bail out too early (e.g. with xscreensaver slip)
> when running giant batchbuffers. And our code isn't robust enough
> to properly unroll any state-changes, we pretty much rely on the gpu
> reset code cleaning up the mess (like cache tracking, fencing state,
> active list/request tracking, ...).
> 
> With this change intel_begin_ring can only fail when the gpu is
> wedged, and it will return -EAGAIN (like wait_request in case the
> gpu reset is still outstanding).
> 
> v2: Chris Wilson noted that on resume timers aren't running and hence
> we won't ever get kicked out of this loop by the hangcheck code. Use
> an insanely large timeout instead for the HAS_GEM case to prevent
> resume bugs from totally hanging the machine.
> 
> Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>
> Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/intel_ringbuffer.c |   11 ++++++++++-
>  1 files changed, 10 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index ca70e2f..6e28301 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -1119,7 +1119,16 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n)
>  	}
>  
>  	trace_i915_ring_wait_begin(ring);
> -	end = jiffies + 3 * HZ;
> +	if (drm_core_check_feature(dev, DRIVER_GEM))
> +		/* With GEM the hangcheck timer should kick us out of the loop,
> +		 * leaving it early runs the risk of corrupting GEM state (due
> +		 * to running on almost untested codepaths). But on resume
> +		 * timers don't work yet, so prevent a complete hang in that
> +		 * case by choosing an insanely large timeout. */
> +		end = jiffies + 60 * HZ;
> +	else
> +		end = jiffies + 3 * HZ;
> +
>  	do {
>  		ring->head = I915_READ_HEAD(ring);
>  		ring->space = ring_space(ring);

I didn't really check to see if there is actually an issue here, but
instead of 60, do you want to play nice with timeouts such as
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT (ie. the min of all the timeouts and
60)?

Acked-by: Ben Widawsky <ben at bwidawsk.net>



More information about the Intel-gfx mailing list