[Intel-gfx] [PATCH resend v2 3/8] drm/i915: Cope with request list state change during error state capture

Mon Oct 19 09:06:27 PDT 2015

On Mon, Oct 19, 2015 at 03:55:48PM +0100, Tomas Elf wrote:
> Since we're not synchronizing the ring request list during error state capture
> the request list state might change between the time the corresponding error
> request list was allocated and dimensioned to the time when the ring request
> list is actually captured into the error state. If this happens then do an
> early exit and be aware that the captured error state might not be fully
> reliable.
> 
> * v2:
> - Chris Wilson: Removed WARN_ON from size check since having the error state
>   request list and the live driver request list diverge like this is a
>   legitimate behaviour.
> 
> - Tomas Elf: Removed update of num_request field since this made no sense. Just
>   exit and move on.
> 
> * Resend:
> - Responded to the wrong mailthread
> 
> Signed-off-by: Tomas Elf <tomas.elf at intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gpu_error.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 2f04e4f..b08a76b 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1071,6 +1071,18 @@ static void i915_gem_record_rings(struct drm_device *dev,
>  		list_for_each_entry(request, &ring->request_list, list) {
>  			struct drm_i915_error_request *erq;
>  
> +			if (count >= error->ring[i].num_requests) {
> +				/*
> +				 * If the ring request list was changed in
> +				 * between the point where the error request
> +				 * list was created and dimensioned and this
> +				 * point then just exit early to avoid crashes.
> +				 */
> +				DRM_ERROR("Request list changed size since allocation (%u->%u)\n",
> +					error->ring[i].num_requests, count);

The error message simply isn't that interesting. That requests were
added after the GPU hang occurred doesn't affect post-mortem debugging
of the hang; if it were at all interesting, that information should
be stored in the error state itself.
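
As a rough illustration of keeping that information in the error state
instead of the log, here is a minimal standalone sketch. Everything in it
is hypothetical: the simplified structs, the requests_truncated field and
the capture_requests() helper are stand-ins for discussion, not existing
i915 code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the driver structures. */
struct live_request {
	unsigned int seqno;
	struct live_request *next;
};

struct error_request {
	unsigned int seqno;
};

struct error_ring {
	struct error_request *requests;
	unsigned int num_requests;	/* capacity fixed at allocation time */
	bool requests_truncated;	/* hypothetical: live list outgrew the allocation */
};

/*
 * Copy at most num_requests entries from the live list. If the live list
 * holds more entries than were allocated, stop early and note that in the
 * error state rather than printing a message. Returns the number copied.
 */
static unsigned int capture_requests(struct error_ring *er,
				     const struct live_request *head)
{
	const struct live_request *rq;
	unsigned int count = 0;

	er->requests_truncated = false;
	for (rq = head; rq != NULL; rq = rq->next) {
		if (count >= er->num_requests) {
			er->requests_truncated = true;
			break;
		}
		er->requests[count++].seqno = rq->seqno;
	}

	return count;
}
```

Whoever dumps the error state could then print the flag next to the
request list, so the note travels with the capture rather than dmesg.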
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre