[Intel-gfx] [RFC 1/2] drm/i915: Improve record of hung engines in error state

Chris Wilson chris at chris-wilson.co.uk
Wed Nov 4 12:30:55 UTC 2020


Quoting Tvrtko Ursulin (2020-11-04 12:20:42)
> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> 
> Between events which trigger engine and GPU resets and capturing the error
> state we lose information on which engine triggered the reset. Improve
> this by passing in the hung engine mask down to error capture.
> 
> Result is that the list of engines in user visible "GPU HANG: ecode
> <gen>:<engines>:<ecode>, <process>" is now a list of hanging and not just
> active engines. Most importantly the displayed process is now the one
> which was actually hung.

You could also suggest including only the hanging engine in the report,
as that is intended to be the normal means of generating the report.

> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
> index 0220b0992808..3a7ca90a3436 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.h
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
> @@ -59,6 +59,7 @@ struct i915_request_coredump {
>  struct intel_engine_coredump {
>         const struct intel_engine_cs *engine;
>  
> +       bool hung;
>         bool simulated;
>         u32 reset_count;
>  
> @@ -218,8 +219,10 @@ struct drm_i915_error_state_buf {
>  __printf(2, 3)
>  void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...);
>  
> -struct i915_gpu_coredump *i915_gpu_coredump(struct drm_i915_private *i915);
> -void i915_capture_error_state(struct drm_i915_private *i915);
> +struct i915_gpu_coredump *i915_gpu_coredump(struct intel_gt *gt,
> +                                           intel_engine_mask_t engine_mask);
> +void i915_capture_error_state(struct intel_gt *gt,
> +                             intel_engine_mask_t engine_mask);

Don't forget the stubs.
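
For illustration, "the stubs" presumably refers to the no-op inline
fallbacks the header provides when error capture is compiled out
(believed to be guarded by CONFIG_DRM_I915_CAPTURE_ERROR); they would
need the same new signatures. A minimal standalone sketch follows. The
type definitions and the SKETCH_CAPTURE_ERROR guard are placeholders
invented here so it compiles outside the kernel, and the stub for
i915_gpu_coredump() is an assumption, not the actual patch:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver types; not the real definitions. */
struct intel_gt { int id; };
struct i915_gpu_coredump { uint32_t hung_engines; };
typedef uint32_t intel_engine_mask_t;

#ifdef SKETCH_CAPTURE_ERROR /* placeholder for the real Kconfig guard */

/* Capture enabled: real implementations live elsewhere. */
struct i915_gpu_coredump *i915_gpu_coredump(struct intel_gt *gt,
					    intel_engine_mask_t engine_mask);
void i915_capture_error_state(struct intel_gt *gt,
			      intel_engine_mask_t engine_mask);

#else

/* Capture disabled: stubs mirror the new prototypes and do nothing. */
static inline struct i915_gpu_coredump *
i915_gpu_coredump(struct intel_gt *gt, intel_engine_mask_t engine_mask)
{
	(void)gt;
	(void)engine_mask;
	return NULL; /* no coredump is ever produced */
}

static inline void
i915_capture_error_state(struct intel_gt *gt, intel_engine_mask_t engine_mask)
{
	(void)gt;
	(void)engine_mask;
}

#endif
```

Keeping the stub signatures in step with the prototypes means callers
build unchanged whether or not capture is enabled.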
-Chris

