<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
<div dir="auto">Hi Nikula,</div>
<div dir="auto"><br>
</div>
<div dir="auto">I had missed adding mailing list to reply done to you. </div>
<div dir="auto"><br>
</div>
<div dir="auto">Below mail is  cc addition to ma<span>iling list without change in content sent to you as reply. </span></div>
<div dir="auto"><span><br>
</span></div>
<div dir="auto"><span>BR</span></div>
<div dir="auto"><span>Himal </span></div>
<div id="mail-editor-reference-message-container" dir="auto"><br>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" style="font-size: 11pt;"><strong>From:</strong> Nikula, Jani <jani.nikula@intel.com><br>
<strong>Sent:</strong> Thursday, August 10, 2023 3:47:49 pm<br>
<strong>To:</strong> Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>; intel-xe@lists.freedesktop.org <intel-xe@lists.freedesktop.org><br>
<strong>Subject:</strong> Re: [PATCH v2 1/4] drm/xe: Handle errors from various components.<br>
</div>
<br>
<meta name="Generator" content="Microsoft Exchange Server">
<!-- converted from text --><font size="2"><span style="font-size:11pt;">
<div class="PlainText"><br>
Is there any new content here? This is what it looks like to me:<br>
<br>
<a href="https://lore.kernel.org/r/78b0e2d2-3c4c-484f-bcd4-fa896adef534@intel.com">https://lore.kernel.org/r/78b0e2d2-3c4c-484f-bcd4-fa896adef534@intel.com</a><br>
<br>
BR,<br>
Jani<br>
<br>
On Thu, 10 Aug 2023, "Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com> wrote:<br>
> On 10-08-2023 15:01, Ghimiray, Himal Prasad wrote:<br>
>><br>
>><br>
>> On 10-08-2023 13:24, Jani Nikula wrote:<br>
>>> On Thu, 10 Aug 2023, Himal Prasad Ghimiray<himal.prasad.ghimiray@intel.com>  wrote:<br>
>>>> The GFX device can generate numbers of classes of error under the new<br>
>>>> infrastructure: correctable, non-fatal, and fatal errors.<br>
>>>><br>
>>>> The non-fatal and fatal error classes distinguish between levels of<br>
>>>> severity for uncorrectable errors. Driver will only handle logging<br>
>>>> of errors and updating counters from various components within the<br>
>>>> graphics device. Anything more will be handled at system level.<br>
>>>><br>
>>>> For errors that will route as interrupts, three bits in the Master<br>
>>>> Interrupt Register will be used to convey the class of error.<br>
>>>><br>
>>>> For each class of error: Determine source of error (IP block) by reading<br>
>>>> the Device Error Source Register (RW1C) that<br>
>>>> corresponds to the class of error being serviced.<br>
>>>><br>
>>>> Bspec: 50875, 53073, 53074, 53075<br>
>>>><br>
>>>> Cc: Rodrigo Vivi<rodrigo.vivi@intel.com><br>
>>>> Cc: Aravind Iddamsetty<aravind.iddamsetty@intel.com><br>
>>>> Cc: Matthew Brost<matthew.brost@intel.com><br>
>>>> Cc: Matt Roper<matthew.d.roper@intel.com><br>
>>>> Cc: Joonas Lahtinen<joonas.lahtinen@linux.intel.com><br>
>>>> Cc: Jani Nikula<jani.nikula@intel.com><br>
>>>> Signed-off-by: Himal Prasad Ghimiray<himal.prasad.ghimiray@intel.com><br>
>>>> ---<br>
>>>>   drivers/gpu/drm/xe/regs/xe_regs.h            |   7 +<br>
>>>>   drivers/gpu/drm/xe/regs/xe_tile_error_regs.h | 108 +++++++++<br>
>>>>   drivers/gpu/drm/xe/xe_device_types.h         |   6 +<br>
>>>>   drivers/gpu/drm/xe/xe_irq.c                  | 220 +++++++++++++++++++<br>
>>>>   4 files changed, 341 insertions(+)<br>
>>>>   create mode 100644 drivers/gpu/drm/xe/regs/xe_tile_error_regs.h<br>
>>>><br>
>>>> diff --git a/drivers/gpu/drm/xe/regs/xe_regs.h b/drivers/gpu/drm/xe/regs/xe_regs.h<br>
>>>> index ec45b1ba9db1..9901e55fc89c 100644<br>
>>>> --- a/drivers/gpu/drm/xe/regs/xe_regs.h<br>
>>>> +++ b/drivers/gpu/drm/xe/regs/xe_regs.h<br>
>>>> @@ -87,7 +87,14 @@<br>
>>>>   #define   GU_MISC_IRQ                            REG_BIT(29)<br>
>>>>   #define   DISPLAY_IRQ                            REG_BIT(16)<br>
>>>>   #define   GT_DW_IRQ(x)                           REG_BIT(x)<br>
>>>> +#define   XE_ERROR_IRQ(x)                 REG_BIT(26 + (x))<br>
>>>>   <br>
>>>>   #define PVC_RP_STATE_CAP                 XE_REG(0x281014)<br>
>>>>   <br>
>>>> +enum hardware_error {<br>
>>>> +  HARDWARE_ERROR_CORRECTABLE = 0,<br>
>>>> +  HARDWARE_ERROR_NONFATAL = 1,<br>
>>>> +  HARDWARE_ERROR_FATAL = 2,<br>
>>>> +  HARDWARE_ERROR_MAX,<br>
>>>> +};<br>
>>> This file is about registers. IMO enums belong somewhere else. Define<br>
>>> hardware registers using macros.<br>
>><br>
>> Hmm. Will look for better placeholder for this enum.<br>
>><br>
>>>>   #endif<br>
>>>> diff --git a/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h b/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h<br>
>>>> new file mode 100644<br>
>>>> index 000000000000..fbb794b2f183<br>
>>>> --- /dev/null<br>
>>>> +++ b/drivers/gpu/drm/xe/regs/xe_tile_error_regs.h<br>
>>>> @@ -0,0 +1,108 @@<br>
>>>> +/* SPDX-License-Identifier: MIT */<br>
>>>> +/*<br>
>>>> + * Copyright © 2023 Intel Corporation<br>
>>>> + */<br>
>>>> +#ifndef XE_TILE_ERROR_REGS_H_<br>
>>>> +#define XE_TILE_ERROR_REGS_H_<br>
>>>> +<br>
>>>> +#include <linux/stddef.h><br>
>>>> +<br>
>>>> +#define _DEV_ERR_STAT_NONFATAL                         0x100178<br>
>>>> +#define _DEV_ERR_STAT_CORRECTABLE                      0x10017c<br>
>>>> +#define DEV_ERR_STAT_REG(x)                            XE_REG(_PICK_EVEN((x), \<br>
>>>> +                                                          _DEV_ERR_STAT_CORRECTABLE, \<br>
>>>> +                                                          _DEV_ERR_STAT_NONFATAL))<br>
>>>> +<br>
>>>> +#define  DEV_ERR_STAT_MAX_ERROR_BIT         (21)<br>
>>>> +<br>
>>>> +/* Count of  Correctable and Uncorrectable errors reported on tile */<br>
>>>> +enum xe_tile_hw_errors {g<br>
>>>> +  XE_TILE_HW_ERR_GT_FATAL = 0,<br>
>>>> +  XE_TILE_HW_ERR_SGGI_FATAL,<br>
>>>> +  XE_TILE_HW_ERR_DISPLAY_FATAL,<br>
>>>> +  XE_TILE_HW_ERR_SGDI_FATAL,<br>
>>>> +  XE_TILE_HW_ERR_SGLI_FATAL,<br>
>>>> +  XE_TILE_HW_ERR_SGUNIT_FATAL,<br>
>>>> +  XE_TILE_HW_ERR_SGCI_FATAL,<br>
>>>> +  XE_TILE_HW_ERR_GSC_FATAL,<br>
>>>> +  XE_TILE_HW_ERR_SOC_FATAL,<br>
>>>> +  XE_TILE_HW_ERR_MERT_FATAL,<br>
>>>> +  XE_TILE_HW_ERR_SGMI_FATAL,<br>
>>>> +  XE_TILE_HW_ERR_UNKNOWN_FATAL,<br>
>>>> +  XE_TILE_HW_ERR_SGGI_NONFATAL,<br>
>>>> +  XE_TILE_HW_ERR_DISPLAY_NONFATAL,<br>
>>>> +  XE_TILE_HW_ERR_SGDI_NONFATAL,<br>
>>>> +  XE_TILE_HW_ERR_SGLI_NONFATAL,<br>
>>>> +  XE_TILE_HW_ERR_GT_NONFATAL,<br>
>>>> +  XE_TILE_HW_ERR_SGUNIT_NONFATAL,<br>
>>>> +  XE_TILE_HW_ERR_SGCI_NONFATAL,<br>
>>>> +  XE_TILE_HW_ERR_GSC_NONFATAL,<br>
>>>> +  XE_TILE_HW_ERR_SOC_NONFATAL,<br>
>>>> +  XE_TILE_HW_ERR_MERT_NONFATAL,<br>
>>>> +  XE_TILE_HW_ERR_SGMI_NONFATAL,<br>
>>>> +  XE_TILE_HW_ERR_UNKNOWN_NONFATAL,<br>
>>>> +  XE_TILE_HW_ERR_GT_CORR,<br>
>>>> +  XE_TILE_HW_ERR_DISPLAY_CORR,<br>
>>>> +  XE_TILE_HW_ERR_SGUNIT_CORR,<br>
>>>> +  XE_TILE_HW_ERR_GSC_CORR,<br>
>>>> +  XE_TILE_HW_ERR_SOC_CORR,<br>
>>>> +  XE_TILE_HW_ERR_UNKNOWN_CORR,<br>
>>>> +};<br>
>>> Ditto about enums and regs.<br>
>> Will address.<br>
>>>> +<br>
>>>> +#define XE_TILE_HW_ERROR_MAX (XE_TILE_HW_ERR_UNKNOWN_CORR + 1)<br>
>>> If it's an enum, adding that last in the enum does the trick.<br>
>> Makes sense.<br>
>>>> +<br>
>>>> +#define PVC_DEV_ERR_STAT_FATAL_MASK \<br>
>>>> +          (REG_BIT(0) | \<br>
>>>> +           REG_BIT(1) | \<br>
>>>> +           REG_BIT(8) | \<br>
>>>> +           REG_BIT(9) | \<br>
>>>> +           REG_BIT(13) | \<br>
>>>> +           REG_BIT(16) | \<br>
>>>> +           REG_BIT(20))<br>
>>>> +<br>
>>>> +#define PVC_DEV_ERR_STAT_NONFATAL_MASK \<br>
>>>> +          (REG_BIT(0) | \<br>
>>>> +           REG_BIT(1) | \<br>
>>>> +           REG_BIT(8) | \<br>
>>>> +           REG_BIT(9) | \<br>
>>>> +           REG_BIT(13) | \<br>
>>>> +           REG_BIT(16) | \<br>
>>>> +           REG_BIT(20))<br>
>>>> +<br>
>>>> +#define PVC_DEV_ERR_STAT_CORRECTABLE_MASK \<br>
>>>> +          (REG_BIT(0) | \<br>
>>>> +           REG_BIT(8))<br>
>>>> +<br>
>>>> +#define DG2_DEV_ERR_STAT_FATAL_MASK \<br>
>>>> +          (REG_BIT(0) | \<br>
>>>> +           REG_BIT(4) | \<br>
>>>> +           REG_BIT(8) | \<br>
>>>> +           REG_BIT(12) | \<br>
>>>> +           REG_BIT(16))<br>
>>>> +<br>
>>>> +#define DG2_DEV_ERR_STAT_NONFATAL_MASK \<br>
>>>> +          (REG_BIT(0) | \<br>
>>>> +           REG_BIT(4) | \<br>
>>>> +           REG_BIT(8) | \<br>
>>>> +           REG_BIT(12) | \<br>
>>>> +           REG_BIT(16) | \<br>
>>>> +           REG_BIT(20))<br>
>>>> +<br>
>>>> +#define DG2_DEV_ERR_STAT_CORRECTABLE_MASK \<br>
>>>> +          (REG_BIT(0) | \<br>
>>>> +           REG_BIT(4) | \<br>
>>>> +           REG_BIT(8) | \<br>
>>>> +           REG_BIT(12) | \<br>
>>>> +           REG_BIT(16))<br>
>>> Are the above supposed to match what's in xe_tile_hw_errors? Seems<br>
>>> rather unmaintainable.<br>
>> xe_tile_hw_errors contains superset of applicable platforms and mask determines<br>
>> what are applicable bits for a platform.<br>
>>>> +<br>
>>>> +#define REG_SIZE 32<br>
>>>> +<br>
>>>> +#define xe_tile_log_hw_err(tile, fmt, ...) \<br>
>>>> +  drm_err_ratelimited(&tile_to_xe(tile)->drm, HW_ERR "TILE%d detected " fmt, \<br>
>>>> +                      tile->id, ##__VA_ARGS__)<br>
>>>> +<br>
>>>> +#define xe_tile_log_hw_warn(tile, fmt, ...) \<br>
>>>> +  drm_warn(&tile_to_xe(tile)->drm, HW_ERR "TILE%d detected " fmt, \<br>
>>>> +           tile->id, ##__VA_ARGS__)<br>
>>> Do we really want to keep adding new macros for all possible scenarios<br>
>>> in the driver? This is getting out of hand.<br>
>>><br>
>>> Where's HW_ERR defined?<br>
>> include/linux/printk.h defines HW_ERR.<br>
>>>> +<br>
>>>> +#endif<br>
>>>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h<br>
>>>> index f84ecb976f5d..1335ba74981a 100644<br>
>>>> --- a/drivers/gpu/drm/xe/xe_device_types.h<br>
>>>> +++ b/drivers/gpu/drm/xe/xe_device_types.h<br>
>>>> @@ -16,6 +16,7 @@<br>
>>>>   #include "xe_gt_types.h"<br>
>>>>   #include "xe_platform_types.h"<br>
>>>>   #include "xe_step_types.h"<br>
>>>> +#include "regs/xe_tile_error_regs.h"<br>
>>>>   <br>
>>>>   #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)<br>
>>>>   #include "ext/intel_device_info.h"<br>
>>>> @@ -166,6 +167,11 @@ struct xe_tile {<br>
>>>>   <br>
>>>>     /** @sysfs: sysfs' kobj used by xe_tile_sysfs */<br>
>>>>     struct kobject *sysfs;<br>
>>>> +<br>
>>>> +  /** @tile_hw_errors: hardware errors reported for the tile */<br>
>>>> +  struct tile_hw_errors {<br>
>>>> +          unsigned long hw[XE_TILE_HW_ERROR_MAX];<br>
>>> Even with the documentation comment, I have to look up the source code<br>
>>> to realize this is the *number* of errors for each class.<br>
>>><br>
>>> Maybe "count" is more informative than "hw".<br>
>> Ok.<br>
>>>> +  } errors;<br>
>>>>   };<br>
>>>>   <br>
>>>>   /**<br>
>>>> diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c<br>
>>>> index 2022a5643e01..04a665faea23 100644<br>
>>>> --- a/drivers/gpu/drm/xe/xe_irq.c<br>
>>>> +++ b/drivers/gpu/drm/xe/xe_irq.c<br>
>>>> @@ -362,6 +362,223 @@ static void dg1_intr_enable(struct xe_device *xe, bool stall)<br>
>>>>             xe_mmio_read32(mmio, DG1_MSTR_TILE_INTR);<br>
>>>>   }<br>
>>>>   <br>
>>>> +static const char *<br>
>>>> +hardware_error_type_to_str(const enum hardware_error hw_err)<br>
>>>> +{<br>
>>>> +  switch (hw_err) {<br>
>>>> +  case HARDWARE_ERROR_CORRECTABLE:<br>
>>>> +          return "CORRECTABLE";<br>
>>>> +  case HARDWARE_ERROR_NONFATAL:<br>
>>>> +          return "NONFATAL";<br>
>>>> +  case HARDWARE_ERROR_FATAL:<br>
>>>> +          return "FATAL";<br>
>>>> +  default:<br>
>>>> +          return "UNKNOWN";<br>
>>>> +  }<br>
>>>> +}<br>
>>>> +<br>
>>>> +struct error_msg_counter_pair {<br>
>>>> +  const char *errmsg;<br>
>>>> +  int errcounter;<br>
>>> Counter? Or type/class/whatever?<br>
>>><br>
>>>> +};<br>
>>>> +<br>
>>>> +struct error_msg_counter_pair dev_err_stat_fatal_reg[] = {<br>
>>>> +  {"GT",                  XE_TILE_HW_ERR_GT_FATAL                 /* Bit Pos 0 */},<br>
>>> Does this again tie the enums and the bit positions together, similar to<br>
>>> how the mask macros also do above?<br>
>>><br>
>>> There needs to be a single point of truth for all of this.<br>
>>><br>
>>> I think this needs a redesign.<br>
>><br>
>> As commented above this is array which defines all valid bit positions <br>
>> irrespective of platform. Mask was<br>
>><br>
>> to determine platform specific applicability.<br>
>><br>
>>> BR,<br>
>>> Jani.<br>
>>><br>
>>>> +  {"SGGI Cmd Parity",     XE_TILE_HW_ERR_SGGI_FATAL               /* Bit Pos 1 */},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},<br>
>>>> +  {"DISPLAY",             XE_TILE_HW_ERR_DISPLAY_FATAL            /* Bit Pos 4 */},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},<br>
>>>> +  {"GSC error",           XE_TILE_HW_ERR_GSC_FATAL                /* Bit Pos 8 */},<br>
>>>> +  {"SGLI Cmd Parity",     XE_TILE_HW_ERR_SGLI_FATAL               /* Bit Pos 9 */},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},<br>
>>>> +  {"SGUNIT",              XE_TILE_HW_ERR_SGUNIT_FATAL             /* Bit Pos 12 */},<br>
>>>> +  {"SGCI Cmd Parity",     XE_TILE_HW_ERR_SGCI_FATAL               /* Bit Pos 13 */},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},<br>
>>>> +  {"SOC ERROR",           XE_TILE_HW_ERR_SOC_FATAL                /* Bit Pos 16 */},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_FATAL},<br>
>>>> +  {"MERT Cmd Parity",     XE_TILE_HW_ERR_MERT_FATAL               /* Bit Pos 20 */},<br>
>>>> +};<br>
>>>> +<br>
>>>> +struct error_msg_counter_pair dev_err_stat_nonfatal_reg[] = {<br>
>>>> +  {"GT",                  XE_TILE_HW_ERR_GT_NONFATAL              /* Bit Pos 0 */},<br>
>>>> +  {"SGGI Data Parity",    XE_TILE_HW_ERR_SGGI_NONFATAL            /* Bit Pos 1 */},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},<br>
>>>> +  {"DISPLAY",             XE_TILE_HW_ERR_DISPLAY_NONFATAL         /* Bit Pos 4 */},<br>
>>>> +  {"SGDI Data Parity",    XE_TILE_HW_ERR_SGDI_NONFATAL            /* Bit Pos 5 */},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},<br>
>>>> +  {"GSC",                 XE_TILE_HW_ERR_GSC_NONFATAL             /* Bit Pos 8 */},<br>
>>>> +  {"SGLI Data Parity",    XE_TILE_HW_ERR_SGLI_NONFATAL            /* Bit Pos 9 */},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},<br>
>>>> +  {"SGUNIT",              XE_TILE_HW_ERR_SGUNIT_NONFATAL          /* Bit Pos 12 */},<br>
>>>> +  {"SGCI Data Parity",    XE_TILE_HW_ERR_SGCI_NONFATAL            /* Bit Pos 13 */},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},<br>
>>>> +  {"SOC",                 XE_TILE_HW_ERR_SOC_NONFATAL             /* Bit Pos 16 */},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL         /* Bit Pos 17 */},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_NONFATAL},<br>
>>>> +  {"MERT Data Parity",    XE_TILE_HW_ERR_MERT_NONFATAL            /* Bit Pos 20 */},<br>
>>>> +};<br>
>>>> +<br>
>>>> +struct error_msg_counter_pair dev_err_stat_correctable_reg[] = {<br>
>>>> +  {"GT",                  XE_TILE_HW_ERR_GT_CORR                  /* Bit Pos 0 */},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},<br>
>>>> +  {"DISPLAY",             XE_TILE_HW_ERR_DISPLAY_CORR             /* Bit Pos 4 */},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},<br>
>>>> +  {"GSC",                 XE_TILE_HW_ERR_GSC_CORR                 /* Bit Pos 8 */},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},<br>
>>>> +  {"SGUNIT",              XE_TILE_HW_ERR_SGUNIT_CORR              /* Bit Pos 12 */},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},<br>
>>>> +  {"SOC",                 XE_TILE_HW_ERR_SOC_CORR                 /* Bit Pos 16 */},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},<br>
>>>> +  {"Undefined",           XE_TILE_HW_ERR_UNKNOWN_CORR},<br>
>>>> +};<br>
>>>> +<br>
>>>> +static void update_valid_error_regs(struct xe_device *xe)<br>
>>>> +{<br>
>>>> +  unsigned long mask = 0;<br>
>>>> +<br>
>>>> +  u32 i;<br>
>>>> +<br>
>>>> +  if (xe->info.platform == XE_DG2) {<br>
>>>> +          mask = ~(0 | DG2_DEV_ERR_STAT_FATAL_MASK);<br>
>>>> +          for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)<br>
>>>> +                  dev_err_stat_fatal_reg[i] = (struct error_msg_counter_pair)<br>
>>>> +                  {.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_FATAL};<br>
>>> Nope. For one thing, the arrays really should be static const, placed in<br>
>>> rodata, and not mutable.<br>
>> My Idea of keeping it mutable is to avoid multiple platform specific <br>
>> arrays.<br>
>>> For another, if you have a platform with two or more different devices,<br>
>>> whichever gets probed last clobbers the data.<br>
>><br>
>> Thanks for pointing it out. I had missed this particular scenario.  I <br>
>> believe defining platform specific array is better<br>
>><br>
>> because then we can ensure them to be immutable and we will not be <br>
>> required to have platform specific mask also.<br>
>><br>
>> I believe this will help with maintainability too.<br>
>><br>
>> BR<br>
>><br>
>> Himal Ghimiray<br>
>><br>
>>>> +<br>
>>>> +          mask = ~(0 | DG2_DEV_ERR_STAT_NONFATAL_MASK);<br>
>>>> +          for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)<br>
>>>> +                  dev_err_stat_nonfatal_reg[i] = (struct error_msg_counter_pair)<br>
>>>> +                  {.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_NONFATAL};<br>
>>>> +<br>
>>>> +          mask = ~(0 | DG2_DEV_ERR_STAT_CORRECTABLE_MASK);<br>
>>>> +          for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)<br>
>>>> +                  dev_err_stat_correctable_reg[i] = (struct error_msg_counter_pair)<br>
>>>> +                  {.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_CORR};<br>
>>>> +  } else if (xe->info.platform == XE_PVC) {<br>
>>>> +          mask = ~(0 | PVC_DEV_ERR_STAT_FATAL_MASK);<br>
>>>> +          for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)<br>
>>>> +                  dev_err_stat_fatal_reg[i] = (struct error_msg_counter_pair)<br>
>>>> +                  {.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_FATAL};<br>
>>>> +<br>
>>>> +          mask = ~(0 | PVC_DEV_ERR_STAT_NONFATAL_MASK);<br>
>>>> +          for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)<br>
>>>> +                  dev_err_stat_nonfatal_reg[i] = (struct error_msg_counter_pair)<br>
>>>> +                  {.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_NONFATAL};<br>
>>>> +<br>
>>>> +          mask = ~(0 | PVC_DEV_ERR_STAT_CORRECTABLE_MASK);<br>
>>>> +          for_each_set_bit(i, &mask, DEV_ERR_STAT_MAX_ERROR_BIT)<br>
>>>> +                  dev_err_stat_correctable_reg[i] = (struct error_msg_counter_pair)<br>
>>>> +                  {.errmsg = "Undefined", .errcounter = XE_TILE_HW_ERR_UNKNOWN_CORR};<br>
>>>> +  }<br>
>>>> +}<br>
>>>> +<br>
>>>> +static void<br>
>>>> +xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)<br>
>>>> +{<br>
>>>> +  const char *hw_err_str = hardware_error_type_to_str(hw_err);<br>
>>>> +  struct error_msg_counter_pair *errstat;<br>
>>>> +  unsigned long errsrc;<br>
>>>> +  unsigned long flags;<br>
>>>> +  const char *errmsg;<br>
>>>> +  struct xe_gt *mmio;<br>
>>>> +  u32 counter;<br>
>>>> +  u32 errcntr;<br>
>>>> +  u32 errbit;<br>
>>>> +<br>
>>>> +  switch (hw_err) {<br>
>>>> +  case HARDWARE_ERROR_FATAL:<br>
>>>> +          errstat = (struct error_msg_counter_pair *)dev_err_stat_fatal_reg;<br>
>>> Why the casts?<br>
>>><br>
>>>> +          counter = XE_TILE_HW_ERR_UNKNOWN_FATAL;<br>
>>>> +          break;<br>
>>>> +  case HARDWARE_ERROR_NONFATAL:<br>
>>>> +          errstat = (struct error_msg_counter_pair *)dev_err_stat_nonfatal_reg;<br>
>>>> +          counter = XE_TILE_HW_ERR_UNKNOWN_NONFATAL;<br>
>>>> +          break;<br>
>>>> +  case HARDWARE_ERROR_CORRECTABLE:<br>
>>>> +          errstat = (struct error_msg_counter_pair *)dev_err_stat_correctable_reg;<br>
>>>> +          counter = XE_TILE_HW_ERR_UNKNOWN_CORR;<br>
>>>> +          break;<br>
>>>> +  default:<br>
>>>> +          return;<br>
>>>> +  }<br>
>>>> +<br>
>>>> +  spin_lock_irqsave(&tile_to_xe(tile)->irq.lock, flags);<br>
>>>> +  mmio = tile->primary_gt;<br>
>>>> +  errsrc = xe_mmio_read32(mmio, DEV_ERR_STAT_REG(hw_err));<br>
>>>> +<br>
>>>> +  if (!errsrc) {<br>
>>>> +          xe_tile_log_hw_err(tile, "DEV_ERR_STAT_REG_%s blank!\n", hw_err_str);<br>
>>>> +          goto unlock;<br>
>>>> +  }<br>
>>>> +<br>
>>>> +  for_each_set_bit(errbit, &errsrc, REG_SIZE) {<br>
>>>> +          if (errbit < DEV_ERR_STAT_MAX_ERROR_BIT) {<br>
>>>> +                  errmsg = errstat[errbit].errmsg;<br>
>>>> +                  errcntr = errstat[errbit].errcounter;<br>
>>>> +          } else {<br>
>>>> +                  errmsg = "Undefined";<br>
>>>> +                  errcntr = counter;<br>
>>>> +          }<br>
>>>> +<br>
>>>> +          if (hw_err == HARDWARE_ERROR_CORRECTABLE)<br>
>>>> +                  xe_tile_log_hw_warn(tile, "%s %s error bit[%d] is set\n",<br>
>>>> +                                      errmsg, hw_err_str, errbit);<br>
>>>> +          else<br>
>>>> +                  xe_tile_log_hw_err(tile, "%s %s error bit[%d] is set\n",<br>
>>>> +                                     errmsg, hw_err_str, errbit);<br>
>>>> +<br>
>>>> +          tile->errors.hw[errcntr]++;<br>
>>>> +  }<br>
>>>> +<br>
>>>> +  xe_mmio_write32(mmio, DEV_ERR_STAT_REG(hw_err), errsrc);<br>
>>>> +unlock:<br>
>>>> +  spin_unlock_irqrestore(&tile_to_xe(tile)->irq.lock, flags);<br>
>>>> +}<br>
>>>> +<br>
>>>> +/*<br>
>>>> + * XE Platforms adds three Error bits to the Master Interrupt<br>
>>>> + * Register to support error handling. These three bits are<br>
>>>> + * used to convey the class of error:<br>
>>>> + * FATAL, NONFATAL, or CORRECTABLE.<br>
>>>> + *<br>
>>>> + * To process an interrupt:<br>
>>>> + *       Determine source of error (IP block) by reading<br>
>>>> + *         the Device Error Source Register (RW1C) that<br>
>>>> + *         corresponds to the class of error being serviced<br>
>>>> + *         and log the error.<br>
>>>> + */<br>
>>>> +static void<br>
>>>> +xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)<br>
>>>> +{<br>
>>>> +  enum hardware_error hw_err;<br>
>>>> +<br>
>>>> +  for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++) {<br>
>>>> +          if (master_ctl & XE_ERROR_IRQ(hw_err))<br>
>>>> +                  xe_hw_error_source_handler(tile, hw_err);<br>
>>>> +  }<br>
>>>> +}<br>
>>>> +<br>
>>>>   /*<br>
>>>>    * Top-level interrupt handler for Xe_LP+ and beyond.  These platforms have<br>
>>>>    * a "master tile" interrupt register which must be consulted before the<br>
>>>> @@ -413,6 +630,7 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)<br>
>>>>             xe_mmio_write32(mmio, GFX_MSTR_IRQ, master_ctl);<br>
>>>>   <br>
>>>>             gt_irq_handler(tile, master_ctl, intr_dw, identity);<br>
>>>> +          xe_hw_error_irq_handler(tile, master_ctl);<br>
>>>>   <br>
>>>>             /*<br>
>>>>              * Display interrupts (including display backlight operations<br>
>>>> @@ -561,6 +779,8 @@ int xe_irq_install(struct xe_device *xe)<br>
>>>>             return -EINVAL;<br>
>>>>     }<br>
>>>>   <br>
>>>> +  update_valid_error_regs(xe);<br>
>>>> +<br>
>>>>     xe->irq.enabled = true;<br>
>>>>   <br>
>>>>     xe_irq_reset(xe);<br>
<br>
-- <br>
Jani Nikula, Intel Open Source Graphics Center<br>
</div>
</span></font><br>
</div>
</body>
</html>