[Intel-xe] [PATCH v5 1/4] drm/xe: Handle errors from various components.

Aravind Iddamsetty aravind.iddamsetty at linux.intel.com
Wed Oct 4 12:07:53 UTC 2023


On 23/08/23 14:28, Himal Prasad Ghimiray wrote:
> The GFX device can generate numbers of classes of error under the new
> infrastructure: correctable, non-fatal, and fatal errors.
>
> The non-fatal and fatal error classes distinguish between levels of
> severity for uncorrectable errors. Driver will only handle logging
> of errors and updating counters from various components within the
> graphics device. Anything more will be handled at system level.
>
> For errors that will route as interrupts, three bits in the Master
> Interrupt Register will be used to convey the class of error.
>
> For each class of error: Determine source of error (IP block) by reading
> the Device Error Source Register (RW1C) that
> corresponds to the class of error being serviced.
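
For reference, the flow described above could be sketched roughly as follows. This is a user-space illustration, not code from the patch: the register is simulated with a plain array, and every name here (err_class, fake_err_src_reg, handle_hw_errors, the one-summary-bit-per-class layout) is a hypothetical stand-in rather than anything taken from the driver or Bspec.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Error classes named in the commit message. */
enum err_class {
	ERR_CORRECTABLE,
	ERR_NONFATAL,
	ERR_FATAL,
	ERR_CLASS_MAX,
};

#define ERR_SRC_BITS 32

/* Hypothetical stand-in for the per-class Device Error Source Registers. */
static uint32_t fake_err_src_reg[ERR_CLASS_MAX];

/* Per-class, per-source-bit counters (this layout is an assumption). */
static unsigned long err_count[ERR_CLASS_MAX][ERR_SRC_BITS];

static uint32_t err_src_read(enum err_class c)
{
	return fake_err_src_reg[c];
}

/* RW1C semantics: writing 1 clears a bit, writing 0 leaves it untouched. */
static void err_src_write(enum err_class c, uint32_t val)
{
	fake_err_src_reg[c] &= ~val;
}

static void handle_error_class(enum err_class c)
{
	uint32_t src;
	unsigned int bit;

	assert(c < ERR_CLASS_MAX);

	src = err_src_read(c);
	if (!src)
		return;

	/* Log and count every source (IP block) that reported an error. */
	for (bit = 0; bit < ERR_SRC_BITS; bit++) {
		if (!(src & (UINT32_C(1) << bit)))
			continue;
		err_count[c][bit]++;
		printf("class %d: error from source bit %u\n", (int)c, bit);
	}

	/* Acknowledge exactly the bits that were serviced. */
	err_src_write(c, src);
}

/* class_bits: one hypothetical summary bit per class, bit n = class n. */
static void handle_hw_errors(uint32_t class_bits)
{
	int c;

	for (c = 0; c < ERR_CLASS_MAX; c++)
		if (class_bits & (UINT32_C(1) << c))
			handle_error_class((enum err_class)c);
}
```

Writing back the value that was read, rather than all ones, avoids clearing a bit that was set between the read and the write.
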
>
> Bspec: 50875, 53073, 53074, 53075
>
> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> Cc: Aravind Iddamsetty <aravind.iddamsetty at intel.com>
> Cc: Matthew Brost <matthew.brost at intel.com>
> Cc: Matt Roper <matthew.d.roper at intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> Cc: Jani Nikula <jani.nikula at intel.com>
> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile                  |   1 +
>  drivers/gpu/drm/xe/regs/xe_regs.h            |   2 +-
>  drivers/gpu/drm/xe/regs/xe_tile_error_regs.h |  15 ++
>  drivers/gpu/drm/xe/xe_device_types.h         |  11 +
>  drivers/gpu/drm/xe/xe_hw_error.c             | 211 +++++++++++++++++++
>  drivers/gpu/drm/xe/xe_hw_error.h             |  64 ++++++
>  drivers/gpu/drm/xe/xe_irq.c                  |   3 +
>  7 files changed, 306 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
>  create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
>  create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h

<snip>

>
> +/* Count of  Correctable and Uncorrectable errors reported on tile */
> +enum xe_tile_hw_errors {
> +	XE_TILE_HW_ERR_GT_FATAL = 0,
> +	XE_TILE_HW_ERR_SGGI_FATAL,
> +	XE_TILE_HW_ERR_DISPLAY_FATAL,
> +	XE_TILE_HW_ERR_SGDI_FATAL,
> +	XE_TILE_HW_ERR_SGLI_FATAL,
> +	XE_TILE_HW_ERR_SGUNIT_FATAL,
> +	XE_TILE_HW_ERR_SGCI_FATAL,
> +	XE_TILE_HW_ERR_GSC_FATAL,
> +	XE_TILE_HW_ERR_SOC_FATAL,
> +	XE_TILE_HW_ERR_MERT_FATAL,
> +	XE_TILE_HW_ERR_SGMI_FATAL,
> +	XE_TILE_HW_ERR_UNKNOWN_FATAL,
> +	XE_TILE_HW_ERR_SGGI_NONFATAL,
> +	XE_TILE_HW_ERR_DISPLAY_NONFATAL,
> +	XE_TILE_HW_ERR_SGDI_NONFATAL,
> +	XE_TILE_HW_ERR_SGLI_NONFATAL,
> +	XE_TILE_HW_ERR_GT_NONFATAL,
> +	XE_TILE_HW_ERR_SGUNIT_NONFATAL,
> +	XE_TILE_HW_ERR_SGCI_NONFATAL,
> +	XE_TILE_HW_ERR_GSC_NONFATAL,
> +	XE_TILE_HW_ERR_SOC_NONFATAL,
> +	XE_TILE_HW_ERR_MERT_NONFATAL,
> +	XE_TILE_HW_ERR_SGMI_NONFATAL,
> +	XE_TILE_HW_ERR_UNKNOWN_NONFATAL,
> +	XE_TILE_HW_ERR_GT_CORR,
> +	XE_TILE_HW_ERR_DISPLAY_CORR,
> +	XE_TILE_HW_ERR_SGUNIT_CORR,
> +	XE_TILE_HW_ERR_GSC_CORR,
> +	XE_TILE_HW_ERR_SOC_CORR,
> +	XE_TILE_HW_ERR_UNKNOWN_CORR,
> +	XE_TILE_HW_ERROR_MAX,
> +};
> +

Some of these defines are not used in any of the platform-specific structures;
let's clean those up. For now, let's add support only for DG2 and PVC. Errors
specific to any other platform can be added along with that platform later.
This applies to all errors in the series.
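
To illustrate the shape this could take, here is a minimal stand-alone sketch in which the enum holds only entries that some platform table refers to (plus the catch-all), and each supported platform gets its own table. The bit positions and the choice of which error belongs to which platform are placeholders made up for the example, not values from Bspec or from the patch; all identifiers are hypothetical.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Only entries referenced by at least one platform table, plus UNKNOWN. */
enum hw_err {
	HW_ERR_GT_FATAL,
	HW_ERR_SGUNIT_FATAL,
	HW_ERR_SOC_FATAL,
	HW_ERR_UNKNOWN_FATAL,
	HW_ERR_MAX,
};

struct err_map {
	unsigned int bit;	/* placeholder bit in the error source register */
	enum hw_err id;
	const char *name;
};

/* Placeholder tables: the entries and bit numbers are illustrative only. */
static const struct err_map dg2_fatal[] = {
	{ 0,  HW_ERR_GT_FATAL,     "GT" },
	{ 12, HW_ERR_SGUNIT_FATAL, "SGUNIT" },
};

static const struct err_map pvc_fatal[] = {
	{ 0,  HW_ERR_GT_FATAL,  "GT" },
	{ 16, HW_ERR_SOC_FATAL, "SOC" },
};

enum platform { PLATFORM_DG2, PLATFORM_PVC, PLATFORM_OTHER };

/* Map a source bit to a counter index; unlisted bits count as UNKNOWN. */
static enum hw_err lookup_fatal(enum platform p, unsigned int bit)
{
	const struct err_map *map = NULL;
	size_t n = 0, i;

	switch (p) {
	case PLATFORM_DG2:
		map = dg2_fatal;
		n = ARRAY_SIZE(dg2_fatal);
		break;
	case PLATFORM_PVC:
		map = pvc_fatal;
		n = ARRAY_SIZE(pvc_fatal);
		break;
	default:
		break;	/* platforms without a table: everything is UNKNOWN */
	}

	for (i = 0; i < n; i++)
		if (map[i].bit == bit)
			return map[i].id;

	return HW_ERR_UNKNOWN_FATAL;
}
```

With this layout, adding a new platform later means adding one table and any enum entries that table actually needs, so no define is left unreferenced.
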

Thanks,
Aravind.

